Compare commits

..

No commits in common. "c17f76a3e47bbc02aaedf3cc01de1b8d97342a61" and "dcc76eee158df05a4ea763b9c8a36a0c1b51fbba" have entirely different histories.

31 changed files with 274 additions and 904 deletions

View File

@ -6,8 +6,8 @@
# Client Parameters # Client Parameters
These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and These parameters apply only to clients and affect their interaction with
affect their interaction with the cluster. the cluster.
- [client_max_dirty_bytes](#client_max_dirty_bytes) - [client_max_dirty_bytes](#client_max_dirty_bytes)
- [client_max_dirty_ops](#client_max_dirty_ops) - [client_max_dirty_ops](#client_max_dirty_ops)

View File

@ -6,7 +6,7 @@
# Параметры клиентского кода # Параметры клиентского кода
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
затрагивают логику их работы с кластером. затрагивают логику их работы с кластером.
- [client_max_dirty_bytes](#client_max_dirty_bytes) - [client_max_dirty_bytes](#client_max_dirty_bytes)

View File

@ -19,7 +19,6 @@ them, even without restarting by updating configuration in etcd.
- [autosync_interval](#autosync_interval) - [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes) - [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth) - [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch) - [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch) - [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly) - [readonly](#readonly)
@ -52,13 +51,6 @@ them, even without restarting by updating configuration in etcd.
- [scrub_list_limit](#scrub_list_limit) - [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best) - [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce) - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
## etcd_report_interval ## etcd_report_interval
@ -143,24 +135,12 @@ operations before issuing an fsync operation internally.
## recovery_queue_depth ## recovery_queue_depth
- Type: integer - Type: integer
- Default: 1 - Default: 4
- Can be changed online: yes - Can be changed online: yes
Maximum recovery and rebalance operations initiated by each OSD in parallel. Maximum recovery operations per one primary OSD at any given moment of time.
Note that each OSD talks to a lot of other OSDs so actual number of parallel Currently it's the only parameter available to tune the speed or recovery
recovery operations per each OSD is greater than just recovery_queue_depth. and rebalancing, but it's planned to implement more.
Increasing this parameter can speedup recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
## recovery_sleep_us
- Type: microseconds
- Default: 0
- Can be changed online: yes
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
## recovery_pg_switch ## recovery_pg_switch
@ -528,81 +508,3 @@ the variant with most available equal copies is correct. For example, if
you have 3 replicas and 1 of them differs, this one is considered to be you have 3 replicas and 1 of them differs, this one is considered to be
corrupted. But if there is no "best" version with more copies than all corrupted. But if there is no "best" version with more copies than all
others have then the object is also marked as inconsistent. others have then the object is also marked as inconsistent.
## recovery_tune_interval
- Type: seconds
- Default: 1
- Can be changed online: yes
Interval at which OSD re-considers client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. OSD calculates desired
recovery utilization from client utilization using linear interpolation
and auto-tunes recovery operation delay to make actual recovery utilization
match desired.
This allows to reduce recovery/rebalance impact on client operations. It is
of course impossible to remove it completely, but it should become adequate.
In some tests rebalance could earlier drop client write speed from 1.5 GB/s
to 50-100 MB/s, with default auto-tuning settings it now only reduces
to ~1 GB/s.
## recovery_tune_util_low
- Type: number
- Default: 0.1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
## recovery_tune_util_high
- Type: number
- Default: 1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Type: number
- Default: 0
- Can be changed online: yes
Client utilization considered "low".
## recovery_tune_client_util_high
- Type: number
- Default: 0.5
- Can be changed online: yes
Client utilization considered "high".
## recovery_tune_agg_interval
- Type: integer
- Default: 10
- Can be changed online: yes
The number of last auto-tuning iterations to use for calculating the
delay as average. Lower values result in quicker response to client
load change, higher values result in more stable delay. Default value of 10
is usually fine.
## recovery_tune_sleep_min_us
- Type: microseconds
- Default: 10
- Can be changed online: yes
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.

View File

@ -20,7 +20,6 @@
- [autosync_interval](#autosync_interval) - [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes) - [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth) - [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch) - [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch) - [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly) - [readonly](#readonly)
@ -53,13 +52,6 @@
- [scrub_list_limit](#scrub_list_limit) - [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best) - [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce) - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
## etcd_report_interval ## etcd_report_interval
@ -146,25 +138,13 @@ OSD, чтобы успевать очищать журнал - без них OSD
## recovery_queue_depth ## recovery_queue_depth
- Тип: целое число - Тип: целое число
- Значение по умолчанию: 1 - Значение по умолчанию: 4
- Можно менять на лету: да - Можно менять на лету: да
Максимальное число параллельных операций восстановления, инициируемых одним Максимальное число операций восстановления на одном первичном OSD в любой
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с момент времени. На данный момент единственный параметр, который можно менять
многими другими OSD, так что на практике параллелизм восстановления больше, для ускорения или замедления восстановления и перебалансировки данных, но
чем просто recovery_queue_depth. Увеличение значения этого параметра может в планах реализация других параметров.
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
разрешает это или если он отключён.
## recovery_sleep_us
- Тип: микросекунды
- Значение по умолчанию: 0
- Можно менять на лету: да
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
## recovery_pg_switch ## recovery_pg_switch
@ -555,83 +535,3 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
считается некорректной. Однако, если "лучшую" версию с числом доступных считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным. маркируется неконсистентным.
## recovery_tune_interval
- Тип: секунды
- Значение по умолчанию: 1
- Можно менять на лету: да
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
## recovery_tune_util_low
- Тип: число
- Значение по умолчанию: 0.1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
## recovery_tune_util_high
- Тип: число
- Значение по умолчанию: 1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Тип: число
- Значение по умолчанию: 0
- Можно менять на лету: да
Клиентская утилизация, которая считается "низкой".
## recovery_tune_client_util_high
- Тип: число
- Значение по умолчанию: 0.5
- Можно менять на лету: да
Клиентская утилизация, которая считается "высокой".
## recovery_tune_agg_interval
- Тип: целое число
- Значение по умолчанию: 10
- Можно менять на лету: да
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
## recovery_tune_sleep_min_us
- Тип: микросекунды
- Значение по умолчанию: 10
- Можно менять на лету: да
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.

View File

@ -38,7 +38,6 @@ const types = {
bool: 'boolean', bool: 'boolean',
int: 'integer', int: 'integer',
sec: 'seconds', sec: 'seconds',
float: 'number',
ms: 'milliseconds', ms: 'milliseconds',
us: 'microseconds', us: 'microseconds',
}, },
@ -47,7 +46,6 @@ const types = {
bool: 'булево (да/нет)', bool: 'булево (да/нет)',
int: 'целое число', int: 'целое число',
sec: 'секунды', sec: 'секунды',
float: 'число',
ms: 'миллисекунды', ms: 'миллисекунды',
us: 'микросекунды', us: 'микросекунды',
}, },

View File

@ -107,29 +107,17 @@
принудительной отправкой fsync-а. принудительной отправкой fsync-а.
- name: recovery_queue_depth - name: recovery_queue_depth
type: int type: int
default: 1 default: 4
online: true online: true
info: | info: |
Maximum recovery and rebalance operations initiated by each OSD in parallel. Maximum recovery operations per one primary OSD at any given moment of time.
Note that each OSD talks to a lot of other OSDs so actual number of parallel Currently it's the only parameter available to tune the speed or recovery
recovery operations per each OSD is greater than just recovery_queue_depth. and rebalancing, but it's planned to implement more.
Increasing this parameter can speedup recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
info_ru: | info_ru: |
Максимальное число параллельных операций восстановления, инициируемых одним Максимальное число операций восстановления на одном первичном OSD в любой
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с момент времени. На данный момент единственный параметр, который можно менять
многими другими OSD, так что на практике параллелизм восстановления больше, для ускорения или замедления восстановления и перебалансировки данных, но
чем просто recovery_queue_depth. Увеличение значения этого параметра может в планах реализация других параметров.
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
разрешает это или если он отключён.
- name: recovery_sleep_us
type: us
default: 0
online: true
info: |
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
- name: recovery_pg_switch - name: recovery_pg_switch
type: int type: int
default: 128 default: 128
@ -638,101 +626,3 @@
считается некорректной. Однако, если "лучшую" версию с числом доступных считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным. маркируется неконсистентным.
- name: recovery_tune_interval
type: sec
default: 1
online: true
info: |
Interval at which OSD re-considers client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. OSD calculates desired
recovery utilization from client utilization using linear interpolation
and auto-tunes recovery operation delay to make actual recovery utilization
match desired.
This allows to reduce recovery/rebalance impact on client operations. It is
of course impossible to remove it completely, but it should become adequate.
In some tests rebalance could earlier drop client write speed from 1.5 GB/s
to 50-100 MB/s, with default auto-tuning settings it now only reduces
to ~1 GB/s.
info_ru: |
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
- name: recovery_tune_util_low
type: float
default: 0.1
online: true
info: |
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
- name: recovery_tune_util_high
type: float
default: 1
online: true
info: |
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
- name: recovery_tune_client_util_low
type: float
default: 0
online: true
info: Client utilization considered "low".
info_ru: Клиентская утилизация, которая считается "низкой".
- name: recovery_tune_client_util_high
type: float
default: 0.5
online: true
info: Client utilization considered "high".
info_ru: Клиентская утилизация, которая считается "высокой".
- name: recovery_tune_agg_interval
type: int
default: 10
online: true
info: |
The number of last auto-tuning iterations to use for calculating the
delay as average. Lower values result in quicker response to client
load change, higher values result in more stable delay. Default value of 10
is usually fine.
info_ru: |
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
- name: recovery_tune_sleep_min_us
type: us
default: 10
online: true
info: |
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.
info_ru: |
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.

View File

@ -32,7 +32,6 @@
- [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies) - [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.en.md#data_csum_type) - [Checksums](../config/layout-osd.en.md#data_csum_type)
- [Client write-back cache](../config/client.en.md#client_enable_writeback) - [Client write-back cache](../config/client.en.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
## Plugins and tools ## Plugins and tools

View File

@ -34,7 +34,6 @@
- [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий) - [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
- [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type) - [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
- [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback) - [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
- [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)
## Драйверы и инструменты ## Драйверы и инструменты

View File

@ -3,7 +3,6 @@
module.exports = { module.exports = {
scale_pg_count, scale_pg_count,
scale_pg_history,
}; };
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg) function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
@ -44,18 +43,16 @@ function finish_pg_history(merged_history)
merged_history.all_peers = Object.values(merged_history.all_peers); merged_history.all_peers = Object.values(merged_history.all_peers);
} }
function scale_pg_history(prev_pg_history, prev_pgs, new_pgs) function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{ {
const new_pg_history = []; const old_pg_count = real_prev_pgs.length;
const old_pg_count = prev_pgs.length;
const new_pg_count = new_pgs.length;
// Add all possibly intersecting PGs to the history of new PGs // Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count)) if (!(new_pg_count % old_pg_count))
{ {
// New PG count is a multiple of old PG count // New PG count is a multiple of old PG count
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
{ {
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count); add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
finish_pg_history(new_pg_history[i]); finish_pg_history(new_pg_history[i]);
} }
} }
@ -67,7 +64,7 @@ function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
{ {
for (let j = 0; j < mul; j++) for (let j = 0; j < mul; j++)
{ {
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count); add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
} }
finish_pg_history(new_pg_history[i]); finish_pg_history(new_pg_history[i]);
} }
@ -79,7 +76,7 @@ function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
let merged_history = {}; let merged_history = {};
for (let i = 0; i < old_pg_count; i++) for (let i = 0; i < old_pg_count; i++)
{ {
add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i); add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
} }
finish_pg_history(merged_history[1]); finish_pg_history(merged_history[1]);
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
@ -92,12 +89,6 @@ function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
{ {
new_pg_history[i] = null; new_pg_history[i] = null;
} }
return new_pg_history;
}
function scale_pg_count(prev_pgs, new_pg_count)
{
const old_pg_count = prev_pgs.length;
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (prev_pgs.length < new_pg_count) if (prev_pgs.length < new_pg_count)
{ {

View File

@ -59,7 +59,6 @@ const etcd_tree = {
etcd_mon_timeout: 1000, // ms. min: 0 etcd_mon_timeout: 1000, // ms. min: 0
etcd_mon_retries: 5, // min: 0 etcd_mon_retries: 5, // min: 0
mon_change_timeout: 1000, // ms. min: 100 mon_change_timeout: 1000, // ms. min: 100
mon_retry_change_timeout: 50, // ms. min: 10
mon_stats_timeout: 1000, // ms. min: 100 mon_stats_timeout: 1000, // ms. min: 100
osd_out_time: 600, // seconds. min: 0 osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... }, placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
@ -111,15 +110,7 @@ const etcd_tree = {
autosync_interval: 5, autosync_interval: 5,
autosync_writes: 128, autosync_writes: 128,
client_queue_depth: 128, // unused client_queue_depth: 128, // unused
recovery_queue_depth: 1, recovery_queue_depth: 4,
recovery_sleep_us: 0,
recovery_tune_util_low: 0.1,
recovery_tune_client_util_low: 0,
recovery_tune_util_high: 1.0,
recovery_tune_client_util_high: 0.5,
recovery_tune_interval: 1,
recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
recovery_tune_sleep_min_us: 10, // 10 microseconds
recovery_pg_switch: 128, recovery_pg_switch: 128,
recovery_sync_batch: 16, recovery_sync_batch: 16,
no_recovery: false, no_recovery: false,
@ -499,11 +490,6 @@ class Mon
{ {
this.config.mon_change_timeout = 100; this.config.mon_change_timeout = 100;
} }
this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
if (this.config.mon_retry_change_timeout < 50)
{
this.config.mon_retry_change_timeout = 50;
}
this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000; this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
if (this.config.mon_stats_timeout < 100) if (this.config.mon_stats_timeout < 100)
{ {
@ -1236,89 +1222,6 @@ class Mon
return aff_osds; return aff_osds;
} }
async generate_pool_pgs(pool_id, osd_tree, levels)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
return null;
}
let pool_tree = osd_tree[pool_cfg.root_node || ''];
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
this.filter_osds_by_block_layout(
pool_tree,
pool_cfg.block_size || this.config.block_size || 131072,
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
);
// First try last_clean_pgs to minimize data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
}
if (!prev_pgs.length)
{
// Fall back to config/pgs if it's empty
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
}
}
const old_pg_count = prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
ordered: pool_cfg.scheme != 'replicated',
};
let optimize_result;
// Re-shuffle PGs if config/pgs.hash is empty
if (old_pg_count > 0 && this.state.config.pgs.hash)
{
if (prev_pgs.length != pool_cfg.pg_count)
{
// Scale PG count
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
// because last_clean_pgs may still contain the old number of PGs
PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
}
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
return {
pool_id,
pgs: optimize_result.int_pgs,
stats: {
total_raw_tb: optimize_result.space,
pg_real_size: pg_effsize || pool_cfg.pg_size,
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
},
};
}
async recheck_pgs() async recheck_pgs()
{ {
if (this.recheck_pgs_active) if (this.recheck_pgs_active)
@ -1333,47 +1236,158 @@ class Mon
const { up_osds, levels, osd_tree } = this.get_osd_tree(); const { up_osds, levels, osd_tree } = this.get_osd_tree();
const tree_cfg = { const tree_cfg = {
osd_tree, osd_tree,
levels,
pools: this.state.config.pools, pools: this.state.config.pools,
}; };
const tree_hash = sha1hex(stableStringify(tree_cfg)); const tree_hash = sha1hex(stableStringify(tree_cfg));
if (this.state.config.pgs.hash != tree_hash) if (this.state.config.pgs.hash != tree_hash)
{ {
// Something has changed // Something has changed
console.log('Pool configuration or OSD tree changed, re-optimizing'); const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
// First re-optimize PGs, but don't look at history yet const etcd_request = { compare: [], success: [] };
const optimize_results = await Promise.all(Object.keys(this.state.config.pools) for (const pool_id in (this.state.config.pgs||{}).items||{})
.map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
// Then apply the modification in the form of an optimistic transaction,
// each time considering new pg/history modifications (OSDs modify it during rebalance)
while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
{ {
console.log( if (!this.state.config.pools[pool_id])
'Someone changed PG configuration while we also tried to change it.'+
' Retrying in '+this.config.mon_retry_change_timeout+' ms'
);
// Failed to apply - parallel change detected. Wait a bit and retry
const old_rev = this.etcd_watch_revision;
while (this.etcd_watch_revision === old_rev)
{ {
await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout)); // Pool deleted. Delete all PGs, but first stop them.
if (!await this.stop_all_pgs(pool_id))
{
this.recheck_pgs_active = false;
this.schedule_recheck();
return;
}
const prev_pgs = [];
for (const pg in this.state.config.pgs.items[pool_id]||{})
{
prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
}
// Also delete pool statistics
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
} }
const new_ot = this.get_osd_tree();
const new_tcfg = {
osd_tree: new_ot.osd_tree,
levels: new_ot.levels,
pools: this.state.config.pools,
};
if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
{
// Configuration actually changed, restart from the beginning
this.recheck_pgs_active = false;
setImmediate(() => this.recheck_pgs().catch(this.die));
return;
}
// Configuration didn't change, PG history probably changed, so just retry
} }
console.log('PG configuration successfully changed'); for (const pool_id in this.state.config.pools)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
continue;
}
let pool_tree = osd_tree[pool_cfg.root_node || ''];
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
this.filter_osds_by_block_layout(
pool_tree,
pool_cfg.block_size || this.config.block_size || 131072,
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
);
// These are for the purpose of building history.osd_sets
const real_prev_pgs = [];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
}
// And these are for the purpose of minimizing data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
}
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
const old_pg_count = real_prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
ordered: pool_cfg.scheme != 'replicated',
};
let optimize_result;
if (old_pg_count > 0)
{
if (old_pg_count != pool_cfg.pg_count)
{
// PG count changed. Need to bring all PGs down.
if (!await this.stop_all_pgs(pool_id))
{
this.recheck_pgs_active = false;
this.schedule_recheck();
return;
}
}
if (prev_pgs.length != pool_cfg.pg_count)
{
// Scale PG count
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
// because last_clean_pgs may still contain the old number of PGs
const new_pg_history = [];
PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
pg_history = new_pg_history;
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
}
if (!this.state.config.pgs.hash)
{
// Re-shuffle PGs
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
else
{
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
if (old_pg_count != optimize_result.int_pgs.length)
{
console.log(
`PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
);
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
}
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
this.state.pool.stats[pool_id] = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
total_raw_tb: optimize_result.space,
pg_real_size: pg_effsize || pool_cfg.pg_size,
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
}
new_config_pgs.hash = tree_hash;
await this.save_pg_config(new_config_pgs, etcd_request);
} }
else else
{ {
@ -1420,81 +1434,8 @@ class Mon
this.recheck_pgs_active = false; this.recheck_pgs_active = false;
} }
async apply_pool_pgs(results, up_osds, osd_tree, tree_hash) async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
{ {
for (const pool_id in (this.state.config.pgs||{}).items||{})
{
// We should stop all PGs when deleting a pool or changing its PG count
if (!this.state.config.pools[pool_id] ||
this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
{
if (!await this.stop_all_pgs(pool_id))
{
return false;
}
}
}
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
const etcd_request = { compare: [], success: [] };
for (const pool_id in (new_config_pgs||{}).items||{})
{
if (!this.state.config.pools[pool_id])
{
const prev_pgs = [];
for (const pg in new_config_pgs.items[pool_id]||{})
{
prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
}
// Also delete pool statistics
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
}
}
for (const pool_res of results)
{
const pool_id = pool_res.pool_id;
const pool_cfg = this.state.config.pools[pool_id];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
}
const real_prev_pgs = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
}
if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
{
console.log(
`Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
);
pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
}
const stats = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
...pool_res.stats,
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(stats)),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
}
new_config_pgs.hash = tree_hash;
etcd_request.compare.push( etcd_request.compare.push(
{ key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id }, { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
{ key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' }, { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
@ -1502,8 +1443,14 @@ class Mon
etcd_request.success.push( etcd_request.success.push(
{ requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } }, { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
); );
const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0); const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
return txn_res.succeeded; if (!res.succeeded)
{
console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
this.schedule_recheck();
return;
}
console.log('PG configuration successfully changed');
} }
// Schedule next recheck at least at <unixtime> // Schedule next recheck at least at <unixtime>

View File

@ -277,7 +277,6 @@ class blockstore_impl_t
int unsynced_big_write_count = 0, unstable_unsynced = 0; int unsynced_big_write_count = 0, unstable_unsynced = 0;
int unsynced_queued_ops = 0; int unsynced_queued_ops = 0;
allocator *data_alloc = NULL; allocator *data_alloc = NULL;
uint64_t used_blocks = 0;
uint8_t *zero_object; uint8_t *zero_object;
void *metadata_buffer = NULL; void *metadata_buffer = NULL;
@ -431,7 +430,7 @@ public:
inline uint32_t get_block_size() { return dsk.data_block_size; } inline uint32_t get_block_size() { return dsk.data_block_size; }
inline uint64_t get_block_count() { return dsk.block_count; } inline uint64_t get_block_count() { return dsk.block_count; }
inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; } inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; } inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
inline uint64_t get_journal_size() { return dsk.journal_len; } inline uint64_t get_journal_size() { return dsk.journal_len; }
}; };

View File

@ -376,7 +376,6 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
else else
{ {
bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size; bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
bs->used_blocks++;
} }
entries_loaded++; entries_loaded++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
@ -1182,7 +1181,6 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
sp -= bs->dsk.data_block_size; sp -= bs->dsk.data_block_size;
else else
bs->inode_space_stats.erase(oid.inode); bs->inode_space_stats.erase(oid.inode);
bs->used_blocks--;
} }
bs->erase_dirty(dirty_it, dirty_end, clean_loc); bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too // Remove it from the flusher's queue, too

View File

@ -445,7 +445,6 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
if (!exists) if (!exists)
{ {
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size; inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
used_blocks++;
} }
big_to_flush++; big_to_flush++;
} }
@ -456,7 +455,6 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
sp -= dsk.data_block_size; sp -= dsk.data_block_size;
else else
inode_space_stats.erase(dirty_it->first.oid.inode); inode_space_stats.erase(dirty_it->first.oid.inode);
used_blocks--;
big_to_flush++; big_to_flush++;
} }
} }

View File

@ -705,8 +705,6 @@ resume_1:
} }
goto resume_2; goto resume_2;
} }
// Protect from try_send completing the operation immediately
op->inflight_count++;
for (int i = 0; i < op->parts.size(); i++) for (int i = 0; i < op->parts.size(); i++)
{ {
if (!(op->parts[i].flags & PART_SENT)) if (!(op->parts[i].flags & PART_SENT))
@ -730,10 +728,8 @@ resume_1:
} }
} }
} }
op->inflight_count--;
if (op->state == 1) if (op->state == 1)
{ {
// Some suboperations have to be resent
return 0; return 0;
} }
resume_2: resume_2:

View File

@ -149,7 +149,7 @@ public:
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers; std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds; std::map<uint64_t, int> osd_peer_fds;
// op statistics // op statistics
osd_op_stats_t stats, recovery_stats; osd_op_stats_t stats;
void init(); void init();
void parse_config(const json11::Json & config); void parse_config(const json11::Json & config);
@ -175,7 +175,6 @@ public:
bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg); bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
#endif #endif
void inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len);
void measure_exec(osd_op_t *cur_op); void measure_exec(osd_op_t *cur_op);
protected: protected:

View File

@ -24,17 +24,3 @@ osd_op_t::~osd_op_t()
free(buf); free(buf);
} }
} }
bool osd_op_t::is_recovery_related()
{
return (req.hdr.opcode == OSD_OP_SEC_READ ||
req.hdr.opcode == OSD_OP_SEC_WRITE ||
req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
(req.sec_rw.flags & OSD_OP_RECOVERY_RELATED) ||
req.hdr.opcode == OSD_OP_SEC_DELETE &&
(req.sec_del.flags & OSD_OP_RECOVERY_RELATED) ||
req.hdr.opcode == OSD_OP_SEC_STABILIZE &&
(req.sec_stab.flags & OSD_OP_RECOVERY_RELATED) ||
req.hdr.opcode == OSD_OP_SEC_SYNC &&
(req.sec_sync.flags & OSD_OP_RECOVERY_RELATED);
}

View File

@ -173,6 +173,4 @@ struct osd_op_t
osd_op_buf_list_t iov; osd_op_buf_list_t iov;
~osd_op_t(); ~osd_op_t();
bool is_recovery_related();
}; };

View File

@ -131,23 +131,6 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
} }
} }
void osd_messenger_t::inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len)
{
uint64_t usecs = (
(tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - tv_begin.tv_nsec)/1000
);
stats.op_stat_count[opcode]++;
if (!stats.op_stat_count[opcode])
{
stats.op_stat_count[opcode] = 1;
stats.op_stat_sum[opcode] = 0;
stats.op_stat_bytes[opcode] = 0;
}
stats.op_stat_sum[opcode] += usecs;
stats.op_stat_bytes[opcode] += len;
}
void osd_messenger_t::measure_exec(osd_op_t *cur_op) void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{ {
// Measure execution latency // Measure execution latency
@ -159,24 +142,29 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{ {
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end); clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
} }
uint64_t len = 0; stats.op_stat_count[cur_op->req.hdr.opcode]++;
if (!stats.op_stat_count[cur_op->req.hdr.opcode])
{
stats.op_stat_count[cur_op->req.hdr.opcode]++;
stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
}
stats.op_stat_sum[cur_op->req.hdr.opcode] += (
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
if (cur_op->req.hdr.opcode == OSD_OP_READ || if (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE || cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SCRUB) cur_op->req.hdr.opcode == OSD_OP_SCRUB)
{ {
// req.rw.len is internally set to the full object size for scrubs // req.rw.len is internally set to the full object size for scrubs
len = cur_op->req.rw.len; stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
} }
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ || else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{ {
len = cur_op->req.sec_rw.len; stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
}
inc_op_stats(stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
if (cur_op->is_recovery_related())
{
inc_op_stats(recovery_stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
} }
} }

View File

@ -68,21 +68,14 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
} }
} }
if (print_stats_timer_id == -1) print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{ {
print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id) print_stats();
{ });
print_stats(); slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
});
}
if (slow_log_timer_id == -1)
{ {
slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id) print_slow();
{ });
print_slow();
});
}
apply_recovery_tune_interval();
msgr.tfd = this->tfd; msgr.tfd = this->tfd;
msgr.ringloop = this->ringloop; msgr.ringloop = this->ringloop;
@ -104,11 +97,6 @@ osd_t::~osd_t()
tfd->clear_timer(slow_log_timer_id); tfd->clear_timer(slow_log_timer_id);
slow_log_timer_id = -1; slow_log_timer_id = -1;
} }
if (rtune_timer_id >= 0)
{
tfd->clear_timer(rtune_timer_id);
rtune_timer_id = -1;
}
if (print_stats_timer_id >= 0) if (print_stats_timer_id >= 0)
{ {
tfd->clear_timer(print_stats_timer_id); tfd->clear_timer(print_stats_timer_id);
@ -208,30 +196,6 @@ void osd_t::parse_config(bool init)
recovery_queue_depth = config["recovery_queue_depth"].uint64_value(); recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE) if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sleep_us = config["recovery_sleep_us"].uint64_value();
recovery_tune_util_low = config["recovery_tune_util_low"].is_null()
? 0.1 : config["recovery_tune_util_low"].number_value();
if (recovery_tune_util_low < 0.01)
recovery_tune_util_low = 0.01;
recovery_tune_util_high = config["recovery_tune_util_high"].is_null()
? 1.0 : config["recovery_tune_util_high"].number_value();
if (recovery_tune_util_high < 0.01)
recovery_tune_util_high = 0.01;
recovery_tune_client_util_low = config["recovery_tune_client_util_low"].is_null()
? 0 : config["recovery_tune_client_util_low"].number_value();
if (recovery_tune_client_util_low < 0.01)
recovery_tune_client_util_low = 0.01;
recovery_tune_client_util_high = config["recovery_tune_client_util_high"].is_null()
? 0.5 : config["recovery_tune_client_util_high"].number_value();
if (recovery_tune_client_util_high < 0.01)
recovery_tune_client_util_high = 0.01;
auto old_recovery_tune_interval = recovery_tune_interval;
recovery_tune_interval = config["recovery_tune_interval"].is_null()
? 1 : config["recovery_tune_interval"].uint64_value();
recovery_tune_agg_interval = config["recovery_tune_agg_interval"].is_null()
? 10 : config["recovery_tune_agg_interval"].uint64_value();
recovery_tune_sleep_min_us = config["recovery_tune_sleep_min_us"].is_null()
? 10 : config["recovery_tune_sleep_min_us"].uint64_value();
recovery_pg_switch = config["recovery_pg_switch"].uint64_value(); recovery_pg_switch = config["recovery_pg_switch"].uint64_value();
if (recovery_pg_switch < 1) if (recovery_pg_switch < 1)
recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH; recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
@ -310,10 +274,6 @@ void osd_t::parse_config(bool init)
print_slow(); print_slow();
}); });
} }
if (old_recovery_tune_interval != recovery_tune_interval)
{
apply_recovery_tune_interval();
}
} }
void osd_t::bind_socket() void osd_t::bind_socket()
@ -461,6 +421,14 @@ void osd_t::exec_op(osd_op_t *cur_op)
} }
} }
void osd_t::reset_stats()
{
msgr.stats = {};
prev_stats = {};
memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
}
void osd_t::print_stats() void osd_t::print_stats()
{ {
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++) for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
@ -498,20 +466,19 @@ void osd_t::print_stats()
} }
for (int i = 0; i < 2; i++) for (int i = 0; i < 2; i++)
{ {
if (recovery_stat[i].count > recovery_print_prev[i].count) if (recovery_stat_count[0][i] != recovery_stat_count[1][i])
{ {
uint64_t bw = (recovery_stat[i].bytes - recovery_print_prev[i].bytes) / print_stats_interval; uint64_t bw = (recovery_stat_bytes[0][i] - recovery_stat_bytes[1][i]) / print_stats_interval;
printf( printf(
"[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg latency %ld us, delay %ld us\n", osd_num, recovery_stat_names[i], "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s\n", osd_num, recovery_stat_names[i],
(recovery_stat[i].count - recovery_print_prev[i].count) * 1.0 / print_stats_interval, (recovery_stat_count[0][i] - recovery_stat_count[1][i]) * 1.0 / print_stats_interval,
(bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)), (bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
(bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")), (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
(recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count),
recovery_target_sleep_us
); );
recovery_stat_count[1][i] = recovery_stat_count[0][i];
recovery_stat_bytes[1][i] = recovery_stat_bytes[0][i];
} }
} }
memcpy(recovery_print_prev, recovery_stat, sizeof(recovery_stat));
if (corrupted_objects > 0) if (corrupted_objects > 0)
{ {
printf("[OSD %lu] %lu object(s) corrupted\n", osd_num, corrupted_objects); printf("[OSD %lu] %lu object(s) corrupted\n", osd_num, corrupted_objects);
@ -605,8 +572,8 @@ void osd_t::print_slow()
op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK || op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
op->req.hdr.opcode == OSD_OP_SEC_READ_BMP) op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{ {
bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1); bufprintf(" state=%d", PRIV(op->bs_op)->op_state);
int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0; int wait_for = PRIV(op->bs_op)->wait_for;
if (wait_for) if (wait_for)
{ {
bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail); bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail);

View File

@ -34,7 +34,7 @@
#define DEFAULT_AUTOSYNC_INTERVAL 5 #define DEFAULT_AUTOSYNC_INTERVAL 5
#define DEFAULT_AUTOSYNC_WRITES 128 #define DEFAULT_AUTOSYNC_WRITES 128
#define MAX_RECOVERY_QUEUE 2048 #define MAX_RECOVERY_QUEUE 2048
#define DEFAULT_RECOVERY_QUEUE 1 #define DEFAULT_RECOVERY_QUEUE 4
#define DEFAULT_RECOVERY_PG_SWITCH 128 #define DEFAULT_RECOVERY_PG_SWITCH 128
#define DEFAULT_RECOVERY_BATCH 16 #define DEFAULT_RECOVERY_BATCH 16
@ -87,11 +87,6 @@ struct osd_chain_read_t
struct osd_rmw_stripe_t; struct osd_rmw_stripe_t;
struct recovery_stat_t
{
uint64_t count, usec, bytes;
};
class osd_t class osd_t
{ {
// config // config
@ -116,15 +111,7 @@ class osd_t
int immediate_commit = IMMEDIATE_NONE; int immediate_commit = IMMEDIATE_NONE;
int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // "emergency" sync every 5 seconds int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // "emergency" sync every 5 seconds
int autosync_writes = DEFAULT_AUTOSYNC_WRITES; int autosync_writes = DEFAULT_AUTOSYNC_WRITES;
uint64_t recovery_queue_depth = 1; int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
uint64_t recovery_sleep_us = 0;
double recovery_tune_util_low = 0.1;
double recovery_tune_client_util_low = 0;
double recovery_tune_util_high = 1.0;
double recovery_tune_client_util_high = 0.5;
int recovery_tune_interval = 1;
int recovery_tune_agg_interval = 10;
int recovery_tune_sleep_min_us = 10;
int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH; int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
int recovery_sync_batch = DEFAULT_RECOVERY_BATCH; int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
int inode_vanish_time = 60; int inode_vanish_time = 60;
@ -202,18 +189,8 @@ class osd_t
std::map<uint64_t, inode_stats_t> inode_stats; std::map<uint64_t, inode_stats_t> inode_stats;
std::map<uint64_t, timespec> vanishing_inodes; std::map<uint64_t, timespec> vanishing_inodes;
const char* recovery_stat_names[2] = { "degraded", "misplaced" }; const char* recovery_stat_names[2] = { "degraded", "misplaced" };
recovery_stat_t recovery_stat[2]; uint64_t recovery_stat_count[2][2] = {};
recovery_stat_t recovery_print_prev[2]; uint64_t recovery_stat_bytes[2][2] = {};
// recovery auto-tuning
int rtune_timer_id = -1;
uint64_t rtune_avg_lat = 0;
double rtune_client_util = 0, rtune_target_util = 1;
osd_op_stats_t rtune_prev_stats, rtune_prev_recovery_stats;
std::vector<uint64_t> recovery_target_sleep_items;
uint64_t recovery_target_sleep_us = 0;
uint64_t recovery_target_sleep_total = 0;
int recovery_target_sleep_cur = 0, recovery_target_sleep_count = 0;
// cluster connection // cluster connection
void parse_config(bool init); void parse_config(bool init);
@ -231,9 +208,8 @@ class osd_t
void create_osd_state(); void create_osd_state();
void renew_lease(bool reload); void renew_lease(bool reload);
void print_stats(); void print_stats();
void tune_recovery();
void apply_recovery_tune_interval();
void print_slow(); void print_slow();
void reset_stats();
json11::Json get_statistics(); json11::Json get_statistics();
void report_statistics(); void report_statistics();
void report_pg_state(pg_t & pg); void report_pg_state(pg_t & pg);
@ -262,7 +238,6 @@ class osd_t
bool submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data); bool submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
bool pick_next_recovery(osd_recovery_op_t &op); bool pick_next_recovery(osd_recovery_op_t &op);
void submit_recovery_op(osd_recovery_op_t *op); void submit_recovery_op(osd_recovery_op_t *op);
void finish_recovery_op(osd_recovery_op_t *op);
bool continue_recovery(); bool continue_recovery();
pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg); pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg);
@ -304,7 +279,7 @@ class osd_t
bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state); bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op); void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
void handle_primary_bs_subop(osd_op_t *subop); void handle_primary_bs_subop(osd_op_t *subop);
void add_bs_subop_stats(osd_op_t *subop, bool recovery_related = false); void add_bs_subop_stats(osd_op_t *subop);
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval); void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op); void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);

View File

@ -213,14 +213,12 @@ json11::Json osd_t::get_statistics()
st["subop_stats"] = subop_stats; st["subop_stats"] = subop_stats;
st["recovery_stats"] = json11::Json::object { st["recovery_stats"] = json11::Json::object {
{ recovery_stat_names[0], json11::Json::object { { recovery_stat_names[0], json11::Json::object {
{ "count", recovery_stat[0].count }, { "count", recovery_stat_count[0][0] },
{ "bytes", recovery_stat[0].bytes }, { "bytes", recovery_stat_bytes[0][0] },
{ "usec", recovery_stat[0].usec },
} }, } },
{ recovery_stat_names[1], json11::Json::object { { recovery_stat_names[1], json11::Json::object {
{ "count", recovery_stat[1].count }, { "count", recovery_stat_count[0][1] },
{ "bytes", recovery_stat[1].bytes }, { "bytes", recovery_stat_bytes[0][1] },
{ "usec", recovery_stat[1].usec },
} }, } },
}; };
return st; return st;

View File

@ -325,129 +325,26 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
{ {
printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe); printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe);
} }
finish_recovery_op(op); // CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
op->osd_op = NULL;
recovery_ops.erase(op->oid);
delete osd_op;
if (immediate_commit != IMMEDIATE_ALL)
{
recovery_done++;
if (recovery_done >= recovery_sync_batch)
{
// Force sync every <recovery_sync_batch> operations
// This is required not to pile up an excessive amount of delete operations
autosync();
recovery_done = 0;
}
}
continue_recovery();
}; };
exec_op(op->osd_op); exec_op(op->osd_op);
} }
void osd_t::apply_recovery_tune_interval()
{
if (rtune_timer_id >= 0)
{
tfd->clear_timer(rtune_timer_id);
rtune_timer_id = -1;
}
if (recovery_tune_interval != 0)
{
rtune_timer_id = this->tfd->set_timer(recovery_tune_interval*1000, true, [this](int timer_id)
{
tune_recovery();
});
}
else
{
recovery_target_sleep_us = recovery_sleep_us;
}
}
void osd_t::finish_recovery_op(osd_recovery_op_t *op)
{
// CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
delete op->osd_op;
op->osd_op = NULL;
recovery_ops.erase(op->oid);
if (immediate_commit != IMMEDIATE_ALL)
{
recovery_done++;
if (recovery_done >= recovery_sync_batch)
{
// Force sync every <recovery_sync_batch> operations
// This is required not to pile up an excessive amount of delete operations
autosync();
recovery_done = 0;
}
}
continue_recovery();
}
void osd_t::tune_recovery()
{
static int accounted_ops[] = {
OSD_OP_SEC_READ, OSD_OP_SEC_WRITE, OSD_OP_SEC_WRITE_STABLE,
OSD_OP_SEC_STABILIZE, OSD_OP_SEC_SYNC, OSD_OP_SEC_DELETE
};
uint64_t total_client_usec = 0, total_recovery_usec = 0, recovery_count = 0;
for (int i = 0; i < sizeof(accounted_ops)/sizeof(accounted_ops[0]); i++)
{
total_client_usec += (msgr.stats.op_stat_sum[accounted_ops[i]]
- rtune_prev_stats.op_stat_sum[accounted_ops[i]]);
total_recovery_usec += (msgr.recovery_stats.op_stat_sum[accounted_ops[i]]
- rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]]);
recovery_count += (msgr.recovery_stats.op_stat_count[accounted_ops[i]]
- rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]]);
rtune_prev_stats.op_stat_sum[accounted_ops[i]] = msgr.stats.op_stat_sum[accounted_ops[i]];
rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]] = msgr.recovery_stats.op_stat_sum[accounted_ops[i]];
rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]] = msgr.recovery_stats.op_stat_count[accounted_ops[i]];
}
total_client_usec -= total_recovery_usec;
if (recovery_count == 0)
{
return;
}
// example:
// total 3 GB/s
// recovery queue 1
// 120 OSDs
// EC 5+3
// 128kb block_size => 640kb object
// 3000*1024/640/120 = 40 MB/s per OSD = 64 recovered objects per OSD
// = 64*8*2 subops = 1024 recovery subop iops
// 8 recovery subop queue
// => subop avg latency = 0.0078125 sec
// utilisation = 8
// target util 1
// intuitively target latency should be 8x of real
// target_lat = rtune_avg_lat * utilisation / target_util
// = rtune_avg_lat * rtune_avg_lat * rtune_avg_iops / target_util
// = 0.0625
// recovery utilisation will be 1
rtune_client_util = total_client_usec/1000000.0/recovery_tune_interval;
rtune_target_util = (rtune_client_util < recovery_tune_client_util_low
? recovery_tune_util_high
: recovery_tune_util_low + (rtune_client_util >= recovery_tune_client_util_high
? 0 : (recovery_tune_util_high-recovery_tune_util_low)*
(recovery_tune_client_util_high-rtune_client_util)/(recovery_tune_client_util_high-recovery_tune_client_util_low)
)
);
rtune_avg_lat = total_recovery_usec/recovery_count;
uint64_t target_lat = rtune_avg_lat * rtune_avg_lat/1000000.0 * recovery_count/recovery_tune_interval / rtune_target_util;
auto sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
if (recovery_target_sleep_items.size() != recovery_tune_agg_interval)
{
recovery_target_sleep_items.resize(recovery_tune_agg_interval);
for (int i = 0; i < recovery_tune_agg_interval; i++)
recovery_target_sleep_items[i] = 0;
recovery_target_sleep_total = 0;
recovery_target_sleep_cur = 0;
recovery_target_sleep_count = 0;
}
recovery_target_sleep_total -= recovery_target_sleep_items[recovery_target_sleep_cur];
recovery_target_sleep_items[recovery_target_sleep_cur] = sleep_us;
recovery_target_sleep_cur = (recovery_target_sleep_cur+1) % recovery_tune_agg_interval;
recovery_target_sleep_total += sleep_us;
if (recovery_target_sleep_count < recovery_tune_agg_interval)
recovery_target_sleep_count++;
recovery_target_sleep_us = recovery_target_sleep_total / recovery_target_sleep_count;
if (log_level > 4)
{
printf(
"[OSD %lu] auto-tune: client util: %.2f, recovery util: %.2f, lat: %lu us -> target util %.2f, delay %lu us\n",
osd_num, rtune_client_util, total_recovery_usec/1000000.0/recovery_tune_interval,
rtune_avg_lat, rtune_target_util, recovery_target_sleep_us
);
}
}
// Just trigger write requests for degraded objects. They'll be recovered during writing // Just trigger write requests for degraded objects. They'll be recovered during writing
bool osd_t::continue_recovery() bool osd_t::continue_recovery()
{ {

View File

@ -34,7 +34,6 @@
#define OSD_OP_MAX 18 #define OSD_OP_MAX 18
#define OSD_RW_MAX 64*1024*1024 #define OSD_RW_MAX 64*1024*1024
#define OSD_PROTOCOL_VERSION 1 #define OSD_PROTOCOL_VERSION 1
#define OSD_OP_RECOVERY_RELATED (uint32_t)1
// Memory alignment for direct I/O (usually 512 bytes) // Memory alignment for direct I/O (usually 512 bytes)
#ifndef DIRECT_IO_ALIGNMENT #ifndef DIRECT_IO_ALIGNMENT
@ -89,8 +88,7 @@ struct __attribute__((__packed__)) osd_op_sec_rw_t
uint32_t len; uint32_t len;
// bitmap/attribute length - bitmap comes after header, but before data // bitmap/attribute length - bitmap comes after header, but before data
uint32_t attr_len; uint32_t attr_len;
// the only possible flag is OSD_OP_RECOVERY_RELATED uint32_t pad0;
uint32_t flags;
}; };
struct __attribute__((__packed__)) osd_reply_sec_rw_t struct __attribute__((__packed__)) osd_reply_sec_rw_t
@ -111,9 +109,6 @@ struct __attribute__((__packed__)) osd_op_sec_del_t
object_id oid; object_id oid;
// delete version (automatic or specific) // delete version (automatic or specific)
uint64_t version; uint64_t version;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
}; };
struct __attribute__((__packed__)) osd_reply_sec_del_t struct __attribute__((__packed__)) osd_reply_sec_del_t
@ -126,9 +121,6 @@ struct __attribute__((__packed__)) osd_reply_sec_del_t
struct __attribute__((__packed__)) osd_op_sec_sync_t struct __attribute__((__packed__)) osd_op_sec_sync_t
{ {
osd_op_header_t header; osd_op_header_t header;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
}; };
struct __attribute__((__packed__)) osd_reply_sec_sync_t struct __attribute__((__packed__)) osd_reply_sec_sync_t
@ -142,9 +134,6 @@ struct __attribute__((__packed__)) osd_op_sec_stab_t
osd_op_header_t header; osd_op_header_t header;
// obj_ver_id array length in bytes // obj_ver_id array length in bytes
uint64_t len; uint64_t len;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
}; };
typedef osd_op_sec_stab_t osd_op_sec_rollback_t; typedef osd_op_sec_stab_t osd_op_sec_rollback_t;

View File

@ -3,15 +3,13 @@
#include "osd_primary.h" #include "osd_primary.h"
#define SELF_FD -1
void osd_t::autosync() void osd_t::autosync()
{ {
if (immediate_commit != IMMEDIATE_ALL && !autosync_op) if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
{ {
autosync_op = new osd_op_t(); autosync_op = new osd_op_t();
autosync_op->op_type = OSD_OP_IN; autosync_op->op_type = OSD_OP_IN;
autosync_op->peer_fd = SELF_FD; autosync_op->peer_fd = -1;
autosync_op->req = (osd_any_op_t){ autosync_op->req = (osd_any_op_t){
.sync = { .sync = {
.header = { .header = {
@ -87,13 +85,9 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
cur_op->reply.hdr.id = cur_op->req.hdr.id; cur_op->reply.hdr.id = cur_op->req.hdr.id;
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode; cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
cur_op->reply.hdr.retval = retval; cur_op->reply.hdr.retval = retval;
if (cur_op->peer_fd == SELF_FD) if (cur_op->peer_fd == -1)
{ {
// Do not include internal primary writes (recovery/rebalance) into client op statistics msgr.measure_exec(cur_op);
if (cur_op->req.hdr.opcode != OSD_OP_WRITE)
{
msgr.measure_exec(cur_op);
}
// Copy lambda to be unaffected by `delete op` // Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(cur_op->callback)(cur_op); std::function<void(osd_op_t*)>(cur_op->callback)(cur_op);
} }
@ -221,7 +215,6 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
.offset = wr ? si->write_start : si->read_start, .offset = wr ? si->write_start : si->read_start,
.len = subop_len, .len = subop_len,
.attr_len = wr ? clean_entry_bitmap_size : 0, .attr_len = wr ? clean_entry_bitmap_size : 0,
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
}; };
#ifdef OSD_DEBUG #ifdef OSD_DEBUG
printf( printf(
@ -301,8 +294,7 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
" retval = "+std::to_string(bs_op->retval)+")" " retval = "+std::to_string(bs_op->retval)+")"
); );
} }
bool recovery_related = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB; add_bs_subop_stats(subop);
add_bs_subop_stats(subop, recovery_related);
subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode]; subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
subop->reply.hdr.retval = bs_op->retval; subop->reply.hdr.retval = bs_op->retval;
if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE) if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
@ -314,33 +306,30 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
} }
delete bs_op; delete bs_op;
subop->bs_op = NULL; subop->bs_op = NULL;
subop->peer_fd = SELF_FD; subop->peer_fd = -1;
if (recovery_related && recovery_target_sleep_us) handle_primary_subop(subop, cur_op);
{
tfd->set_timer_us(recovery_target_sleep_us, false, [=](int timer_id)
{
handle_primary_subop(subop, cur_op);
});
}
else
{
handle_primary_subop(subop, cur_op);
}
} }
void osd_t::add_bs_subop_stats(osd_op_t *subop, bool recovery_related) void osd_t::add_bs_subop_stats(osd_op_t *subop)
{ {
// Include local blockstore ops in statistics // Include local blockstore ops in statistics
uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode]; uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
timespec tv_end; timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end); clock_gettime(CLOCK_REALTIME, &tv_end);
uint64_t len = (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE) msgr.stats.op_stat_count[opcode]++;
? subop->bs_op->len : 0; if (!msgr.stats.op_stat_count[opcode])
msgr.inc_op_stats(msgr.stats, opcode, subop->tv_begin, tv_end, len);
if (recovery_related)
{ {
// It is OSD_OP_RECOVERY_RELATED msgr.stats.op_stat_count[opcode] = 1;
msgr.inc_op_stats(msgr.recovery_stats, opcode, subop->tv_begin, tv_end, len); msgr.stats.op_stat_sum[opcode] = 0;
msgr.stats.op_stat_bytes[opcode] = 0;
}
msgr.stats.op_stat_sum[opcode] += (
(tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
);
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
{
msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
} }
} }
@ -563,7 +552,6 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
}, },
.oid = chunk.oid, .oid = chunk.oid,
.version = chunk.version, .version = chunk.version,
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} }; } };
subops[i].callback = [cur_op, this](osd_op_t *subop) subops[i].callback = [cur_op, this](osd_op_t *subop)
{ {
@ -621,7 +609,6 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
.id = msgr.next_subop_id++, .id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC, .opcode = OSD_OP_SEC_SYNC,
}, },
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} }; } };
subops[i].callback = [cur_op, this](osd_op_t *subop) subops[i].callback = [cur_op, this](osd_op_t *subop)
{ {
@ -681,7 +668,6 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
.opcode = OSD_OP_SEC_STABILIZE, .opcode = OSD_OP_SEC_STABILIZE,
}, },
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)), .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} }; } };
subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id)); subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
subops[i].callback = [cur_op, this](osd_op_t *subop) subops[i].callback = [cur_op, this](osd_op_t *subop)

View File

@ -292,26 +292,16 @@ resume_7:
{ {
{ {
int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1; int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
recovery_stat[recovery_type].count++; recovery_stat_count[0][recovery_type]++;
if (!recovery_stat[recovery_type].count) // wrapped if (!recovery_stat_count[0][recovery_type])
{ {
memset(&recovery_print_prev[recovery_type], 0, sizeof(recovery_print_prev[recovery_type])); recovery_stat_count[0][recovery_type]++;
memset(&recovery_stat[recovery_type], 0, sizeof(recovery_stat[recovery_type])); recovery_stat_bytes[0][recovery_type] = 0;
recovery_stat[recovery_type].count++;
} }
for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++) for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
{ {
recovery_stat[recovery_type].bytes += op_data->stripes[role].write_end - op_data->stripes[role].write_start; recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
} }
if (!cur_op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
}
uint64_t usec = (
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
recovery_stat[recovery_type].usec += usec;
} }
// Any kind of a non-clean object can have extra chunks, because we don't record objects // Any kind of a non-clean object can have extra chunks, because we don't record objects
// as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks // as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks

View File

@ -42,21 +42,7 @@ void osd_t::secondary_op_callback(osd_op_t *op)
int retval = op->bs_op->retval; int retval = op->bs_op->retval;
delete op->bs_op; delete op->bs_op;
op->bs_op = NULL; op->bs_op = NULL;
if (op->is_recovery_related() && recovery_target_sleep_us) finish_op(op, retval);
{
if (!op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &op->tv_end);
}
tfd->set_timer_us(recovery_target_sleep_us, false, [this, op, retval](int timer_id)
{
finish_op(op, retval);
});
}
else
{
finish_op(op, retval);
}
} }
void osd_t::exec_secondary(osd_op_t *cur_op) void osd_t::exec_secondary(osd_op_t *cur_op)

View File

@ -19,10 +19,10 @@ fi
if [ "$IMMEDIATE_COMMIT" != "" ]; then if [ "$IMMEDIATE_COMMIT" != "" ]; then
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5" NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}' $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
else else
NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5" NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"client_enable_writeback":true}' $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
fi fi
start_osd_on() start_osd_on()
@ -53,7 +53,7 @@ for i in $(seq 1 $OSD_COUNT); do
start_osd $i start_osd $i
done done
(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 & (while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
MON_PID=$! MON_PID=$!
if [ "$SCHEME" = "ec" ]; then if [ "$SCHEME" = "ec" ]; then

View File

@ -18,7 +18,6 @@ try_change()
for i in {1..6}; do for i in {1..6}; do
echo --- Change PG count to $n --- >>testdata/osd$i.log echo --- Change PG count to $n --- >>testdata/osd$i.log
done done
echo --- Change PG count to $n --- >>testdata/mon.log
$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}' $ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}'

View File

@ -15,7 +15,7 @@ $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$
$ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}' $ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}' $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'
node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 & node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$! MON_PID=$!
sleep 2 sleep 2

View File

@ -7,7 +7,7 @@ OSD_COUNT=5
OSD_ARGS="$OSD_ARGS" OSD_ARGS="$OSD_ARGS"
for i in $(seq 1 $OSD_COUNT); do for i in $(seq 1 $OSD_COUNT); do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1)) dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --log_level 10 --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 & build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$! eval OSD${i}_PID=$!
done done
@ -53,11 +53,6 @@ for i in {1..30}; do
fi fi
done done
# Sync so all moved objects are removed from OSD 1 (they aren't removed without a sync)
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 -number_ios=2 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
$ETCDCTL put /vitastor/config/pgs '{"items":{"1":{"1":{"osd_set":[4,5],"primary":0}}}}' $ETCDCTL put /vitastor/config/pgs '{"items":{"1":{"1":{"osd_set":[4,5],"primary":0}}}}'
$ETCDCTL put /vitastor/pg/history/1/1 '{"all_peers":[1,2,3]}' $ETCDCTL put /vitastor/pg/history/1/1 '{"all_peers":[1,2,3]}'

View File

@ -15,7 +15,7 @@ for i in $(seq 1 $OSD_COUNT); do
eval OSD${i}_PID=$! eval OSD${i}_PID=$!
done done
(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 & (while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
MON_PID=$! MON_PID=$!
sleep 3 sleep 3