Compare commits
16 Commits
7b1b3e02d9
...
53184b3f81
Author | SHA1 | Date |
---|---|---|
Vitaliy Filippov | 53184b3f81 | |
Vitaliy Filippov | 48b5f871e0 | |
Vitaliy Filippov | c17f76a3e4 | |
Vitaliy Filippov | a6ab54b1ba | |
Vitaliy Filippov | 99ee8596ea | |
Vitaliy Filippov | c4928e6ecd | |
Vitaliy Filippov | ec7dcd1be5 | |
Vitaliy Filippov | e600bbc151 | |
Vitaliy Filippov | 8b8c1179a7 | |
Vitaliy Filippov | d5a6fa6dd7 | |
Vitaliy Filippov | f757a35a8d | |
Vitaliy Filippov | 1edf86ed26 | |
Vitaliy Filippov | 5ca7cde612 | |
Vitaliy Filippov | 751935ddd8 | |
Vitaliy Filippov | d84dee7098 | |
Vitaliy Filippov | dcc76eee15 |
|
@ -0,0 +1,115 @@
|
|||
## Contributor License Agreement
|
||||
|
||||
> This Agreement is made in the Russian and English languages. **The English
|
||||
text of Agreement is for informational purposes only** and is not binding
|
||||
for the Parties.
|
||||
>
|
||||
> In the event of a conflict between the provisions of the Russian and
|
||||
English versions of this Agreement, the **Russian version shall prevail**.
|
||||
>
|
||||
> Russian version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md
|
||||
|
||||
This document represents the offer of Filippov Vitaliy Vladimirovich
|
||||
("Author"), author and copyright holder of Vitastor software ("Program"),
|
||||
acknowledged by a certificate of Federal Service for Intellectual
|
||||
Property of Russian Federation (Rospatent) # 2021617829 dated 20 May 2021,
|
||||
to "Contributors" to conclude this license agreement as follows
|
||||
("Agreement" or "Offer").
|
||||
|
||||
In accordance with Art. 435, Art. 438 of the Civil Code of the Russian
|
||||
Federation, this Agreement is an offer and in case of acceptance of the
|
||||
offer, an agreement is considered concluded on the conditions specified
|
||||
in the offer.
|
||||
|
||||
1. Applicable Terms. \
|
||||
1.1. "Official Repository" shall mean the computer storage, operated by
|
||||
the Author, containing all prior and future versions of the Source
|
||||
Code of the Program, at Internet addresses https://git.yourcmc.ru/vitalif/vitastor/
|
||||
or https://github.com/vitalif/vitastor/. \
|
||||
1.2. "Contributions" shall mean results of intellectual activity
|
||||
(including, but not limited to, source code, libraries, components,
|
||||
texts, documentation) which can be software or elements of the software
|
||||
and which are provided by Contributors to the Author for inclusion
|
||||
in the Program. \
|
||||
1.3. "Contributor" shall mean a person who provides Contributions to
|
||||
the Author and agrees with all provisions of this Agreement.
|
||||
A Contributor can be: 1) an individual; or 2) a legal entity or an
|
||||
individual entrepreneur in case when an individual provides Contributions
|
||||
on behalf of third parties, including on behalf of his employer.
|
||||
|
||||
2. Subject of the Agreement. \
|
||||
2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors. \
|
||||
2.2. The Contributor grants to the Author the right to use Contributions at his own
|
||||
discretion and without any necessity to get a prior approval from Contributor or
|
||||
any other third party in any way, under a simple (non-exclusive), royalty-free,
|
||||
irrevocable license throughout the world by all means not contrary to law, in whole
|
||||
or as a part of the Program, or other open-source or closed-source computer programs,
|
||||
products or services (hereinafter -- the "License"), including, but not limited to: \
|
||||
2.2.1. to execute Contributions and use them for any tasks; \
|
||||
2.2.2. to publish and distribute Contributions in modified or unmodified form and/or to rent them; \
|
||||
2.2.3. to modify Contributions, add comments, illustrations or any explanations to Contributions while using them; \
|
||||
2.2.4. to create other results of intellectual activity based on Contributions, including derivative works and composite works; \
|
||||
2.2.5. to translate Contributions into other languages, including other programming languages; \
|
||||
2.2.6. to carry out rental and public display of Contributions; \
|
||||
2.2.7. to use Contributions under the trade name and/or any trademark or any other label, or without it, as the Author thinks fit; \
|
||||
2.3. The Contributor grants to the Author the right to sublicense any of the aforementioned
|
||||
rights to third parties on any terms at the Author's discretion. \
|
||||
2.4. The License is provided for the entire duration of Contributor's
|
||||
exclusive intellectual property rights to the Contributions. \
|
||||
2.5. The Contributor grants to the Author the right to decide how and where to mention,
|
||||
or to not mention at all, the fact of his authorship, name, nickname and/or company
|
||||
details when including Contributions into the Program or in any other computer
|
||||
programs, products or services.
|
||||
|
||||
3. Acceptance of the Offer \
|
||||
3.1. The Contributor may provide Contributions to the Author in the form of
|
||||
a "Pull Request" in an Official Repository of the Program or by any
|
||||
other electronic means of communication, including, but not limited to,
|
||||
E-mail or messenger applications. \
|
||||
3.2. The acceptance of the Offer shall be the fact of provision of Contributions
|
||||
to the Author by the Contributor by any means with the following remark:
|
||||
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
|
||||
or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”. \
|
||||
3.3. Date of acceptance of the Offer shall be the date of such provision.
|
||||
|
||||
4. Rights and obligations of the parties. \
|
||||
4.1. The Contributor reserves the right to use Contributions by any lawful means
|
||||
not contrary to this Agreement. \
|
||||
4.2. The Author has the right to refuse to include Contributions into the Program
|
||||
at any moment with no explanation to the Contributor.
|
||||
|
||||
5. Representations and Warranties. \
|
||||
5.1. The person providing Contributions for the purpose of their inclusion
|
||||
in the Program represents and warrants that he is the Contributor
|
||||
or legally acts on the Contributor's behalf. Name or company details
|
||||
of the Contributor shall be provided with the Contribution at the moment
|
||||
of their provision to the Author. \
|
||||
5.2. The Contributor represents and warrants that he legally owns exclusive
|
||||
intellectual property rights to the Contributions. \
|
||||
5.3. The Contributor represents and warrants that any further use of
|
||||
Contributions by the Author as provided by Contributor under the terms
|
||||
of the Agreement does not infringe on intellectual and other rights and
|
||||
legitimate interests of third parties. \
|
||||
5.4. The Contributor represents and warrants that he has all rights and legal
|
||||
capacity needed to accept this Offer. \
|
||||
5.5. The Contributor represents and warrants that Contributions don't
|
||||
contain malware or any information considered illegal under the law
|
||||
of Russian Federation.
|
||||
|
||||
6. Termination of the Agreement \
|
||||
6.1. The Agreement may be terminated by mutual consent of the Author and the
Contributor, formalised in writing, or on grounds prescribed by the law of
the Russian Federation.
|
||||
|
||||
7. Final Clauses \
|
||||
7.1. The Contributor may optionally sign the Agreement in the written form. \
|
||||
7.2. The Agreement is deemed to become effective from the Date of signing of
|
||||
the Agreement and until the expiration of Contributor's exclusive
|
||||
intellectual property rights to the Contributions. \
|
||||
7.3. The Author may unilaterally alter the Agreement without informing Contributors.
|
||||
The new version of the document shall come into effect 3 (three) days after
|
||||
being published in the Official Repository of the Program at Internet address
|
||||
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md).
|
||||
Contributors should keep informed about the actual version of the Agreement themselves. \
|
||||
7.4. If the Author and the Contributor fail to agree on disputable issues,
|
||||
disputes shall be referred to the Moscow Arbitration court.
|
|
@ -0,0 +1,108 @@
|
|||
## Лицензионное соглашение с участником
|
||||
|
||||
> Данная Оферта написана в Русской и Английской версиях. **Версия на английском
|
||||
языке предоставляется в информационных целях** и не связывает стороны договора.
|
||||
>
|
||||
> В случае несоответствий между положениями Русской и Английской версий Договора,
|
||||
**Русская версия имеет приоритет**.
|
||||
>
|
||||
> Английская версия опубликована по адресу https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
|
||||
|
||||
Настоящий договор-оферта (далее по тексту – Оферта, Договор) адресована физическим
|
||||
и юридическим лицам (далее – Участникам) и является официальным публичным предложением
|
||||
Филиппова Виталия Владимировича (далее – Автора) программного обеспечения Vitastor,
|
||||
свидетельство Федеральной службы по интеллектуальной собственности (Роспатент) № 2021617829
|
||||
от 20 мая 2021 г. (далее – Программа) о нижеследующем:
|
||||
|
||||
1. Термины и определения \
|
||||
1.1. Репозиторий – электронное хранилище, содержащее исходный код Программы. \
|
||||
1.2. Доработка – результат интеллектуальной деятельности Участника, включающий
|
||||
в себя изменения или дополнения к исходному коду Программы, которые Участник
|
||||
желает включить в состав Программы для дальнейшего использования и распространения
|
||||
Автором и для этого направляет их Автору. \
|
||||
1.3. Участник – физическое или юридическое лицо, вносящее Доработки в код Программы. \
|
||||
1.4. ГК РФ – Гражданский кодекс Российской Федерации.
|
||||
|
||||
2. Предмет оферты \
|
||||
2.1. Предметом настоящей оферты являются Доработки, отправляемые Участником Автору. \
|
||||
2.2. Участник предоставляет Автору право использовать Доработки по собственному усмотрению
|
||||
и без необходимости предварительного согласования с Участником или иным третьим лицом
|
||||
на условиях простой (неисключительной) безвозмездной безотзывной лицензии, полностью
|
||||
или фрагментарно, в составе Программы или других программ, продуктов или сервисов
|
||||
как с открытым, так и с закрытым исходным кодом, любыми способами, не противоречащими
|
||||
закону, включая, но не ограничиваясь следующими: \
|
||||
2.2.1. Запускать и использовать Доработки для выполнения любых задач; \
|
||||
2.2.2. Распространять, импортировать и доводить Доработки до всеобщего сведения; \
|
||||
2.2.3. Вносить в Доработки изменения, сокращения и дополнения, снабжать Доработки
|
||||
при их использовании комментариями, иллюстрациями или пояснениями; \
|
||||
2.2.4. Создавать на основе Доработок иные результаты интеллектуальной деятельности,
|
||||
в том числе производные и составные произведения; \
|
||||
2.2.5. Переводить Доработки на другие языки, в том числе на другие языки программирования; \
|
||||
2.2.6. Осуществлять прокат и публичный показ Доработок; \
|
||||
2.2.7. Использовать Доработки под любым фирменным наименованием, товарным знаком
|
||||
(знаком обслуживания) или иным обозначением, или без такового. \
|
||||
2.3. Участник предоставляет Автору право сублицензировать полученные права на Доработки
|
||||
третьим лицам на любых условиях на усмотрение Автора. \
|
||||
2.4. Участник предоставляет Автору права на Доработки на территории всего мира. \
|
||||
2.5. Участник предоставляет Автору права на весь срок действия исключительного права
|
||||
Участника на Доработки. \
|
||||
2.6. Участник предоставляет Автору права на Доработки на безвозмездной основе. \
|
||||
2.7. Участник разрешает Автору самостоятельно определять порядок, способ и
|
||||
место указания его имени, реквизитов и/или псевдонима при включении
|
||||
Доработок в состав Программы или других программ, продуктов или сервисов.
|
||||
|
||||
3. Акцепт Оферты \
|
||||
3.1. Участник может передавать Доработки в адрес Автора через зеркала официального
|
||||
Репозитория Программы по адресам https://git.yourcmc.ru/vitalif/vitastor/ или
|
||||
https://github.com/vitalif/vitastor/ в виде “запроса на слияние” (pull request),
|
||||
либо в письменном виде или с помощью любых других электронных средств коммуникации,
|
||||
например, электронной почты или мессенджеров. \
|
||||
3.2. Факт передачи Участником Доработок в адрес Автора любым способом с одной из пометок
|
||||
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
|
||||
или “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”
|
||||
является полным и безоговорочным акцептом (принятием) Участником условий настоящей
|
||||
Оферты, т.е. Участник считается ознакомившимся с настоящим публичным договором и
|
||||
в соответствии с ГК РФ признается лицом, вступившим с Автором в договорные отношения
|
||||
на основании настоящей Оферты. \
|
||||
3.3. Датой акцептирования настоящей Оферты считается дата такой передачи.
|
||||
|
||||
4. Права и обязанности Сторон \
|
||||
4.1. Участник сохраняет за собой право использовать Доработки любым законным
|
||||
способом, не противоречащим настоящему Договору. \
|
||||
4.2. Автор вправе отказать Участнику во включении Доработок в состав
|
||||
Программы без объяснения причин в любой момент по своему усмотрению.
|
||||
|
||||
5. Гарантии и заверения \
|
||||
5.1. Лицо, направляющее Доработки для целей их включения в состав Программы,
|
||||
гарантирует, что является Участником или представителем Участника. Имя или реквизиты
|
||||
Участника должны быть указаны при их передаче в адрес Автора Программы. \
|
||||
5.2. Участник гарантирует, что является законным обладателем исключительных прав
|
||||
на Доработки. \
|
||||
5.3. Участник гарантирует, что на момент акцептирования настоящей Оферты ему
|
||||
ничего не известно (и не могло быть известно) о правах третьих лиц на
|
||||
передаваемые Автору Доработки или их часть, которые могут быть нарушены
|
||||
в связи с передачей Доработок по настоящему Договору. \
|
||||
5.4. Участник гарантирует, что является дееспособным лицом и обладает всеми
|
||||
необходимыми правами для заключения Договора. \
|
||||
5.5. Участник гарантирует, что Доработки не содержат вредоносного ПО, а также
|
||||
любой другой информации, запрещённой к распространению по законам Российской
|
||||
Федерации.
|
||||
|
||||
6. Прекращение действия оферты \
|
||||
6.1. Действие настоящего договора может быть прекращено по соглашению сторон,
|
||||
оформленному в письменном виде, а также вследствие его расторжения по основаниям,
|
||||
предусмотренным законом.
|
||||
|
||||
7. Заключительные положения \
|
||||
7.1. Участник вправе по желанию подписать настоящий Договор в письменном виде. \
|
||||
7.2. Настоящий договор действует с момента его заключения и до истечения срока
|
||||
действия исключительных прав Участника на Доработки. \
|
||||
7.3. Автор имеет право в одностороннем порядке вносить изменения и дополнения в договор
|
||||
без специального уведомления об этом Участников. Новая редакция документа вступает
|
||||
в силу через 3 (Три) календарных дня со дня опубликования в официальном Репозитории
|
||||
Программы по адресу в сети Интернет
|
||||
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md).
|
||||
Участники самостоятельно отслеживают действующие условия Оферты. \
|
||||
7.4. Все споры, возникающие между сторонами в процессе их взаимодействия по настоящему
|
||||
договору, решаются путём переговоров. В случае невозможности урегулирования споров
|
||||
переговорным порядком стороны разрешают их в Арбитражном суде г.Москвы.
|
|
@ -6,8 +6,8 @@
|
|||
|
||||
# Client Parameters
|
||||
|
||||
These parameters apply only to clients and affect their interaction with
|
||||
the cluster.
|
||||
These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
|
||||
affect their interaction with the cluster.
|
||||
|
||||
- [client_max_dirty_bytes](#client_max_dirty_bytes)
|
||||
- [client_max_dirty_ops](#client_max_dirty_ops)
|
||||
|
|
|
@ -6,7 +6,7 @@
|
|||
|
||||
# Параметры клиентского кода
|
||||
|
||||
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
|
||||
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
|
||||
затрагивают логику их работы с кластером.
|
||||
|
||||
- [client_max_dirty_bytes](#client_max_dirty_bytes)
|
||||
|
|
|
@ -19,6 +19,7 @@ them, even without restarting by updating configuration in etcd.
|
|||
- [autosync_interval](#autosync_interval)
|
||||
- [autosync_writes](#autosync_writes)
|
||||
- [recovery_queue_depth](#recovery_queue_depth)
|
||||
- [recovery_sleep_us](#recovery_sleep_us)
|
||||
- [recovery_pg_switch](#recovery_pg_switch)
|
||||
- [recovery_sync_batch](#recovery_sync_batch)
|
||||
- [readonly](#readonly)
|
||||
|
@ -51,6 +52,13 @@ them, even without restarting by updating configuration in etcd.
|
|||
- [scrub_list_limit](#scrub_list_limit)
|
||||
- [scrub_find_best](#scrub_find_best)
|
||||
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
|
||||
- [recovery_tune_interval](#recovery_tune_interval)
|
||||
- [recovery_tune_util_low](#recovery_tune_util_low)
|
||||
- [recovery_tune_util_high](#recovery_tune_util_high)
|
||||
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
|
||||
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
|
||||
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
|
||||
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
|
||||
|
||||
## etcd_report_interval
|
||||
|
||||
|
@ -135,12 +143,24 @@ operations before issuing an fsync operation internally.
|
|||
## recovery_queue_depth
|
||||
|
||||
- Type: integer
|
||||
- Default: 4
|
||||
- Default: 1
|
||||
- Can be changed online: yes
|
||||
|
||||
Maximum recovery operations per one primary OSD at any given moment of time.
|
||||
Currently it's the only parameter available to tune the speed of recovery
|
||||
and rebalancing, but it's planned to implement more.
|
||||
Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
Note that each OSD talks to many other OSDs, so the actual number of parallel
recovery operations per OSD is greater than recovery_queue_depth alone.
Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
|
||||
|
||||
## recovery_sleep_us
|
||||
|
||||
- Type: microseconds
|
||||
- Default: 0
|
||||
- Can be changed online: yes
|
||||
|
||||
Delay for all recovery- and rebalance-related operations. If non-zero,
|
||||
such operations are artificially slowed down to reduce the impact on
|
||||
client I/O.
|
||||
|
||||
## recovery_pg_switch
|
||||
|
||||
|
@ -508,3 +528,81 @@ the variant with most available equal copies is correct. For example, if
|
|||
you have 3 replicas and 1 of them differs, this one is considered to be
|
||||
corrupted. But if there is no "best" version with more copies than all
|
||||
others have then the object is also marked as inconsistent.
|
||||
|
||||
## recovery_tune_interval
|
||||
|
||||
- Type: seconds
|
||||
- Default: 1
|
||||
- Can be changed online: yes
|
||||
|
||||
Interval at which OSD re-considers client and recovery load and automatically
|
||||
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
|
||||
disabled if recovery_tune_interval is set to 0.
|
||||
|
||||
Auto-tuning targets utilization. Utilization is a measure of load and is
|
||||
equal to the product of iops and average latency (so it may be greater
|
||||
than 1). You set "low" and "high" client utilization thresholds and two
|
||||
corresponding target recovery utilization levels. OSD calculates desired
|
||||
recovery utilization from client utilization using linear interpolation
|
||||
and auto-tunes recovery operation delay to make actual recovery utilization
|
||||
match desired.
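
The interpolation described above can be sketched as follows. This is an illustrative Python sketch, not Vitastor's actual code: the function names `utilization` and `desired_recovery_util` are hypothetical, latency is assumed to be expressed in seconds, and the behaviour outside the two thresholds follows the "at or above" / "at or below" wording of the recovery_tune_util_low and recovery_tune_util_high descriptions.

```python
def utilization(iops: float, avg_latency_sec: float) -> float:
    """Utilization as defined above: IOPS multiplied by average latency.

    Assumption: latency is given in seconds, so the result is a
    dimensionless number that may exceed 1.
    """
    return iops * avg_latency_sec


def desired_recovery_util(
    client_util: float,
    client_util_low: float = 0.0,   # recovery_tune_client_util_low default
    client_util_high: float = 0.5,  # recovery_tune_client_util_high default
    util_low: float = 0.1,          # recovery_tune_util_low default
    util_high: float = 1.0,         # recovery_tune_util_high default
) -> float:
    """Hypothetical helper: linearly interpolate the target recovery utilization."""
    if client_util <= client_util_low:
        return util_high  # clients are idle: recovery gets the higher target
    if client_util >= client_util_high:
        return util_low   # clients are busy: recovery gets the lower target
    # Position of the client load between the two thresholds, in (0, 1)
    t = (client_util - client_util_low) / (client_util_high - client_util_low)
    return util_high + t * (util_low - util_high)
```

With the default values, a client utilization of 0.25 lies halfway between the thresholds, so the desired recovery utilization is about 0.55, halfway between 1 and 0.1.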
|
||||
|
||||
This reduces the impact of recovery/rebalance on client operations. The
impact cannot be removed completely, but it should become acceptable.
In some tests, rebalance previously dropped client write speed from 1.5 GB/s
to 50-100 MB/s; with default auto-tuning settings, it now only drops
to ~1 GB/s.
|
||||
|
||||
## recovery_tune_util_low
|
||||
|
||||
- Type: number
|
||||
- Default: 0.1
|
||||
- Can be changed online: yes
|
||||
|
||||
Desired recovery/rebalance utilization when client load is high, i.e. when
|
||||
it is at or above recovery_tune_client_util_high.
|
||||
|
||||
## recovery_tune_util_high
|
||||
|
||||
- Type: number
|
||||
- Default: 1
|
||||
- Can be changed online: yes
|
||||
|
||||
Desired recovery/rebalance utilization when client load is low, i.e. when
|
||||
it is at or below recovery_tune_client_util_low.
|
||||
|
||||
## recovery_tune_client_util_low
|
||||
|
||||
- Type: number
|
||||
- Default: 0
|
||||
- Can be changed online: yes
|
||||
|
||||
Client utilization considered "low".
|
||||
|
||||
## recovery_tune_client_util_high
|
||||
|
||||
- Type: number
|
||||
- Default: 0.5
|
||||
- Can be changed online: yes
|
||||
|
||||
Client utilization considered "high".
|
||||
|
||||
## recovery_tune_agg_interval
|
||||
|
||||
- Type: integer
|
||||
- Default: 10
|
||||
- Can be changed online: yes
|
||||
|
||||
The number of most recent auto-tuning iterations over which the
delay is averaged. Lower values result in quicker response to client
|
||||
load change, higher values result in more stable delay. Default value of 10
|
||||
is usually fine.
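
One way to picture this averaging is a fixed-size window over the most recent delay values. The sketch below assumes a plain moving average and is not a description of how the OSD implements it; the class name `DelayAverager` is hypothetical.

```python
from collections import deque


class DelayAverager:
    """Simple moving average over the last `agg_interval` delay samples."""

    def __init__(self, agg_interval: int = 10):  # recovery_tune_agg_interval default
        # deque with maxlen silently drops the oldest sample when full
        self.samples = deque(maxlen=agg_interval)

    def add(self, delay_us: float) -> float:
        """Record the delay from one tuning iteration and return the current average."""
        self.samples.append(delay_us)
        return sum(self.samples) / len(self.samples)
```

A smaller window forgets old samples sooner, which is why lower values react faster to load changes while higher values smooth the delay.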
|
||||
|
||||
## recovery_tune_sleep_min_us
|
||||
|
||||
- Type: microseconds
|
||||
- Default: 10
|
||||
- Can be changed online: yes
|
||||
|
||||
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
|
||||
than this value are changed to 0.
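
The rule above amounts to a simple threshold check. A minimal sketch, using the hypothetical function name `apply_sleep_min`:

```python
def apply_sleep_min(tuned_sleep_us: float, sleep_min_us: float = 10) -> float:
    """Return 0 when the auto-tuned delay is below the minimum, else the delay itself.

    `sleep_min_us` defaults to the documented recovery_tune_sleep_min_us value.
    """
    return 0 if tuned_sleep_us < sleep_min_us else tuned_sleep_us
```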
|
||||
|
|
|
@ -20,6 +20,7 @@
|
|||
- [autosync_interval](#autosync_interval)
|
||||
- [autosync_writes](#autosync_writes)
|
||||
- [recovery_queue_depth](#recovery_queue_depth)
|
||||
- [recovery_sleep_us](#recovery_sleep_us)
|
||||
- [recovery_pg_switch](#recovery_pg_switch)
|
||||
- [recovery_sync_batch](#recovery_sync_batch)
|
||||
- [readonly](#readonly)
|
||||
|
@ -52,6 +53,13 @@
|
|||
- [scrub_list_limit](#scrub_list_limit)
|
||||
- [scrub_find_best](#scrub_find_best)
|
||||
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
|
||||
- [recovery_tune_interval](#recovery_tune_interval)
|
||||
- [recovery_tune_util_low](#recovery_tune_util_low)
|
||||
- [recovery_tune_util_high](#recovery_tune_util_high)
|
||||
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
|
||||
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
|
||||
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
|
||||
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
|
||||
|
||||
## etcd_report_interval
|
||||
|
||||
|
@ -138,13 +146,25 @@ OSD, чтобы успевать очищать журнал - без них OSD
|
|||
## recovery_queue_depth
|
||||
|
||||
- Тип: целое число
|
||||
- Значение по умолчанию: 4
|
||||
- Значение по умолчанию: 1
|
||||
- Можно менять на лету: да
|
||||
|
||||
Максимальное число операций восстановления на одном первичном OSD в любой
|
||||
момент времени. На данный момент единственный параметр, который можно менять
|
||||
для ускорения или замедления восстановления и перебалансировки данных, но
|
||||
в планах реализация других параметров.
|
||||
Максимальное число параллельных операций восстановления, инициируемых одним
|
||||
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
|
||||
многими другими OSD, так что на практике параллелизм восстановления больше,
|
||||
чем просто recovery_queue_depth. Увеличение значения этого параметра может
|
||||
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
|
||||
разрешает это или если он отключён.
|
||||
|
||||
## recovery_sleep_us
|
||||
|
||||
- Тип: микросекунды
|
||||
- Значение по умолчанию: 0
|
||||
- Можно менять на лету: да
|
||||
|
||||
Delay for all recovery- and rebalance-related operations. If non-zero,
|
||||
such operations are artificially slowed down to reduce the impact on
|
||||
client I/O.
|
||||
|
||||
## recovery_pg_switch
|
||||
|
||||
|
@ -535,3 +555,83 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
|
|||
считается некорректной. Однако, если "лучшую" версию с числом доступных
|
||||
копий большим, чем у всех других версий, найти невозможно, то объект тоже
|
||||
маркируется неконсистентным.
|
||||
|
||||
## recovery_tune_interval
|
||||
|
||||
- Тип: секунды
|
||||
- Значение по умолчанию: 1
|
||||
- Можно менять на лету: да
|
||||
|
||||
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
|
||||
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
|
||||
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
|
||||
устанавливается в значение 0.
|
||||
|
||||
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
|
||||
и равна произведению числа операций в секунду и средней задержки
|
||||
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
|
||||
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
|
||||
целевых уровня утилизации операциями восстановления. OSD рассчитывает
|
||||
желаемый уровень утилизации восстановления линейной интерполяцией от
|
||||
клиентской утилизации и подстраивает задержку операций восстановления
|
||||
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
|
||||
|
||||
Это позволяет снизить влияние восстановления и ребаланса на клиентские
|
||||
операции. Конечно, невозможно исключить такое влияние полностью, но оно
|
||||
должно становиться адекватнее. В некоторых тестах перебалансировка могла
|
||||
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
|
||||
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
|
||||
|
||||
## recovery_tune_util_low
|
||||
|
||||
- Тип: число
|
||||
- Значение по умолчанию: 0.1
|
||||
- Можно менять на лету: да
|
||||
|
||||
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
|
||||
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
|
||||
|
||||
## recovery_tune_util_high
|
||||
|
||||
- Тип: число
|
||||
- Значение по умолчанию: 1
|
||||
- Можно менять на лету: да
|
||||
|
||||
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
|
||||
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
|
||||
|
||||
## recovery_tune_client_util_low
|
||||
|
||||
- Тип: число
|
||||
- Значение по умолчанию: 0
|
||||
- Можно менять на лету: да
|
||||
|
||||
Клиентская утилизация, которая считается "низкой".
|
||||
|
||||
## recovery_tune_client_util_high
|
||||
|
||||
- Тип: число
|
||||
- Значение по умолчанию: 0.5
|
||||
- Можно менять на лету: да
|
||||
|
||||
Клиентская утилизация, которая считается "высокой".
|
||||
|
||||
## recovery_tune_agg_interval
|
||||
|
||||
- Тип: целое число
|
||||
- Значение по умолчанию: 10
|
||||
- Можно менять на лету: да
|
||||
|
||||
Число последних итераций автоподстройки для расчёта задержки как среднего
|
||||
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
|
||||
большие значения делают задержку стабильнее. Значение по умолчанию 10
|
||||
обычно нормальное и не требует изменений.
|
||||
|
||||
## recovery_tune_sleep_min_us
|
||||
|
||||
- Тип: микросекунды
|
||||
- Значение по умолчанию: 10
|
||||
- Можно менять на лету: да
|
||||
|
||||
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
|
||||
Значения ниже данного заменяются на 0.
|
||||
|
|
|
@ -38,6 +38,7 @@ const types = {
|
|||
bool: 'boolean',
|
||||
int: 'integer',
|
||||
sec: 'seconds',
|
||||
float: 'number',
|
||||
ms: 'milliseconds',
|
||||
us: 'microseconds',
|
||||
},
|
||||
|
@ -46,6 +47,7 @@ const types = {
|
|||
bool: 'булево (да/нет)',
|
||||
int: 'целое число',
|
||||
sec: 'секунды',
|
||||
float: 'число',
|
||||
ms: 'миллисекунды',
|
||||
us: 'микросекунды',
|
||||
},
|
||||
|
|
|
@ -107,17 +107,29 @@
|
|||
принудительной отправкой fsync-а.
|
||||
- name: recovery_queue_depth
|
||||
type: int
|
||||
default: 4
|
||||
default: 1
|
||||
online: true
|
||||
info: |
|
||||
Maximum recovery operations per one primary OSD at any given moment of time.
|
||||
Currently it's the only parameter available to tune the speed of recovery
|
||||
and rebalancing, but it's planned to implement more.
|
||||
Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
Note that each OSD talks to many other OSDs, so the actual number of parallel
recovery operations per OSD is greater than recovery_queue_depth alone.
Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
|
||||
info_ru: |
|
||||
Максимальное число операций восстановления на одном первичном OSD в любой
|
||||
момент времени. На данный момент единственный параметр, который можно менять
|
||||
для ускорения или замедления восстановления и перебалансировки данных, но
|
||||
в планах реализация других параметров.
|
||||
Максимальное число параллельных операций восстановления, инициируемых одним
|
||||
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
|
||||
многими другими OSD, так что на практике параллелизм восстановления больше,
|
||||
чем просто recovery_queue_depth. Увеличение значения этого параметра может
|
||||
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
|
||||
разрешает это или если он отключён.
|
||||
- name: recovery_sleep_us
|
||||
type: us
|
||||
default: 0
|
||||
online: true
|
||||
info: |
|
||||
Delay for all recovery- and rebalance-related operations. If non-zero,
|
||||
such operations are artificially slowed down to reduce the impact on
|
||||
client I/O.
|
||||
- name: recovery_pg_switch
|
||||
type: int
|
||||
default: 128
|
||||
|
@ -626,3 +638,101 @@
|
|||
считается некорректной. Однако, если "лучшую" версию с числом доступных
|
||||
копий большим, чем у всех других версий, найти невозможно, то объект тоже
|
||||
маркируется неконсистентным.
|
||||
- name: recovery_tune_interval
|
||||
type: sec
|
||||
default: 1
|
||||
online: true
|
||||
info: |
|
||||
Interval at which OSD re-considers client and recovery load and automatically
|
||||
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
|
||||
disabled if recovery_tune_interval is set to 0.
|
||||
|
||||
Auto-tuning targets utilization. Utilization is a measure of load and is
|
||||
equal to the product of iops and average latency (so it may be greater
|
||||
than 1). You set "low" and "high" client utilization thresholds and two
|
||||
corresponding target recovery utilization levels. OSD calculates desired
|
||||
recovery utilization from client utilization using linear interpolation
|
||||
and auto-tunes recovery operation delay to make actual recovery utilization
|
||||
match desired.
|
||||
|
||||
This reduces the impact of recovery/rebalance on client operations. The
impact cannot be removed completely, but it should become acceptable.
In some tests, rebalance previously dropped client write speed from 1.5 GB/s
to 50-100 MB/s; with default auto-tuning settings, it now only drops
to ~1 GB/s.
|
||||
info_ru: |
|
||||
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
|
||||
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
|
||||
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
|
||||
устанавливается в значение 0.
|
||||
|
||||
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
|
||||
и равна произведению числа операций в секунду и средней задержки
|
||||
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
|
||||
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
|
||||
целевых уровня утилизации операциями восстановления. OSD рассчитывает
|
||||
желаемый уровень утилизации восстановления линейной интерполяцией от
|
||||
клиентской утилизации и подстраивает задержку операций восстановления
|
||||
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
|
||||
|
||||
Это позволяет снизить влияние восстановления и ребаланса на клиентские
|
||||
операции. Конечно, невозможно исключить такое влияние полностью, но оно
|
||||
должно становиться адекватнее. В некоторых тестах перебалансировка могла
|
||||
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
|
||||
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
|
||||
- name: recovery_tune_util_low
|
||||
type: float
|
||||
default: 0.1
|
||||
online: true
|
||||
info: |
|
||||
Desired recovery/rebalance utilization when client load is high, i.e. when
|
||||
it is at or above recovery_tune_client_util_high.
|
||||
info_ru: |
|
||||
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
|
||||
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
|
||||
- name: recovery_tune_util_high
|
||||
type: float
|
||||
default: 1
|
||||
online: true
|
||||
info: |
|
||||
Desired recovery/rebalance utilization when client load is low, i.e. when
|
||||
it is at or below recovery_tune_client_util_low.
|
||||
info_ru: |
|
||||
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
|
||||
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
|
||||
- name: recovery_tune_client_util_low
|
||||
type: float
|
||||
default: 0
|
||||
online: true
|
||||
info: Client utilization considered "low".
|
||||
info_ru: Клиентская утилизация, которая считается "низкой".
|
||||
- name: recovery_tune_client_util_high
|
||||
type: float
|
||||
default: 0.5
|
||||
online: true
|
||||
info: Client utilization considered "high".
|
||||
info_ru: Клиентская утилизация, которая считается "высокой".
|
||||
- name: recovery_tune_agg_interval
|
||||
type: int
|
||||
default: 10
|
||||
online: true
|
||||
info: |
|
||||
The number of most recent auto-tuning iterations over which the
|
||||
delay is averaged. Lower values result in quicker response to client
|
||||
load changes; higher values result in a more stable delay. The default value of 10
|
||||
is usually fine.
|
||||
info_ru: |
|
||||
Число последних итераций автоподстройки для расчёта задержки как среднего
|
||||
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
|
||||
большие значения делают задержку стабильнее. Значение по умолчанию 10
|
||||
обычно нормальное и не требует изменений.
|
||||
- name: recovery_tune_sleep_min_us
|
||||
type: us
|
||||
default: 10
|
||||
online: true
|
||||
info: |
|
||||
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
|
||||
than this value are changed to 0.
|
||||
info_ru: |
|
||||
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
|
||||
Значения ниже данного заменяются на 0.
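The parameter descriptions above can be read as a small calculation: utilization is iops multiplied by average latency, the desired recovery utilization is interpolated linearly between the two client thresholds, and the resulting delay is averaged over the last few iterations and zeroed below a minimum. The following is a minimal sketch of that reading, not the OSD implementation (which is written in C++ and not shown in this part of the diff). The function names are hypothetical; the default values are the ones documented above. How the OSD turns the desired utilization into a concrete delay is not described here, so the sketch only covers interpolation and smoothing.

```javascript
// Illustrative sketch only. Defaults come from the parameter docs above;
// all function names are hypothetical.
const defaults = {
    recovery_tune_util_low: 0.1,         // recovery target when client load is high
    recovery_tune_util_high: 1,          // recovery target when client load is low
    recovery_tune_client_util_low: 0,    // client utilization considered "low"
    recovery_tune_client_util_high: 0.5, // client utilization considered "high"
    recovery_tune_agg_interval: 10,      // iterations to average over
    recovery_tune_sleep_min_us: 10,      // smaller averaged delays become 0
};

// Utilization = iops * average latency (in seconds), so it may exceed 1
function utilization(iops, avgLatencySec)
{
    return iops * avgLatencySec;
}

// Linear interpolation between (client_low, util_high) and (client_high, util_low)
function desiredRecoveryUtil(clientUtil, cfg = defaults)
{
    const lo = cfg.recovery_tune_client_util_low;
    const hi = cfg.recovery_tune_client_util_high;
    if (clientUtil <= lo)
        return cfg.recovery_tune_util_high;
    if (clientUtil >= hi)
        return cfg.recovery_tune_util_low;
    const t = (clientUtil - lo) / (hi - lo);
    return cfg.recovery_tune_util_high
        + t * (cfg.recovery_tune_util_low - cfg.recovery_tune_util_high);
}

// Average the most recent per-iteration delays; values below the minimum become 0
function smoothedSleepUs(delays, cfg = defaults)
{
    const n = Math.max(1, cfg.recovery_tune_agg_interval);
    const last = delays.slice(-n);
    if (!last.length)
        return 0;
    const avg = last.reduce((a, c) => a + c, 0) / last.length;
    return avg < cfg.recovery_tune_sleep_min_us ? 0 : avg;
}
```

With the defaults, an idle client (utilization 0) yields a recovery target of 1, and a client at or above 0.5 yields a target of 0.1.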
|
||||
|
|
|
@ -32,6 +32,7 @@
|
|||
- [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
|
||||
- [Checksums](../config/layout-osd.en.md#data_csum_type)
|
||||
- [Client write-back cache](../config/client.en.md#client_enable_writeback)
|
||||
- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
|
||||
|
||||
## Plugins and tools
|
||||
|
||||
|
|
|
@ -34,6 +34,7 @@
|
|||
- [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
|
||||
- [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
|
||||
- [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
|
||||
- [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)
|
||||
|
||||
## Драйверы и инструменты
|
||||
|
||||
|
|
|
@ -3,6 +3,7 @@
|
|||
|
||||
module.exports = {
|
||||
scale_pg_count,
|
||||
scale_pg_history,
|
||||
};
|
||||
|
||||
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
|
||||
|
@ -43,16 +44,18 @@ function finish_pg_history(merged_history)
|
|||
merged_history.all_peers = Object.values(merged_history.all_peers);
|
||||
}
|
||||
|
||||
function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
|
||||
function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
|
||||
{
|
||||
const old_pg_count = real_prev_pgs.length;
|
||||
const new_pg_history = [];
|
||||
const old_pg_count = prev_pgs.length;
|
||||
const new_pg_count = new_pgs.length;
|
||||
// Add all possibly intersecting PGs to the history of new PGs
|
||||
if (!(new_pg_count % old_pg_count))
|
||||
{
|
||||
// New PG count is a multiple of old PG count
|
||||
for (let i = 0; i < new_pg_count; i++)
|
||||
{
|
||||
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
|
||||
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
|
||||
finish_pg_history(new_pg_history[i]);
|
||||
}
|
||||
}
|
||||
|
@ -64,7 +67,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
|
|||
{
|
||||
for (let j = 0; j < mul; j++)
|
||||
{
|
||||
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
|
||||
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
|
||||
}
|
||||
finish_pg_history(new_pg_history[i]);
|
||||
}
|
||||
|
@ -76,7 +79,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
|
|||
let merged_history = {};
|
||||
for (let i = 0; i < old_pg_count; i++)
|
||||
{
|
||||
add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
|
||||
add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
|
||||
}
|
||||
finish_pg_history(merged_history[1]);
|
||||
for (let i = 0; i < new_pg_count; i++)
|
||||
|
@ -89,6 +92,12 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
|
|||
{
|
||||
new_pg_history[i] = null;
|
||||
}
|
||||
return new_pg_history;
|
||||
}
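The history scaling above depends on which old PGs may share objects with each new PG. A minimal sketch of that mapping follows, assuming objects are assigned to PGs by a modulo of the PG count. It uses 0-based indices, and the helper name `intersectingOldPgs` is hypothetical. The condition for the second branch is not visible in this hunk and is inferred from the `mul` loop, so treat it as an assumption.

```javascript
// Sketch: which old PGs can overlap with new PG `newPg` (0-based)?
function intersectingOldPgs(newPg, oldPgCount, newPgCount)
{
    if (newPgCount % oldPgCount === 0)
    {
        // New PG count is a multiple of the old one: each new PG is a slice of one old PG
        return [ newPg % oldPgCount ];
    }
    if (oldPgCount % newPgCount === 0)
    {
        // Assumed condition: old PG count is a multiple of the new one,
        // so each new PG merges `mul` old PGs
        const mul = oldPgCount / newPgCount;
        const res = [];
        for (let j = 0; j < mul; j++)
        {
            res.push(newPg + j*newPgCount);
        }
        return res;
    }
    // No simple relation: any old PG may overlap, so all histories are merged
    return Array.from({ length: oldPgCount }, (_, i) => i);
}
```

For example, when growing from 4 to 8 PGs, new PG 5 only needs the history of old PG 1, because any hash with remainder 5 modulo 8 has remainder 1 modulo 4.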
|
||||
|
||||
function scale_pg_count(prev_pgs, new_pg_count)
|
||||
{
|
||||
const old_pg_count = prev_pgs.length;
|
||||
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
|
||||
if (prev_pgs.length < new_pg_count)
|
||||
{
|
||||
|
|
349
mon/mon.js
|
@ -59,6 +59,7 @@ const etcd_tree = {
|
|||
etcd_mon_timeout: 1000, // ms. min: 0
|
||||
etcd_mon_retries: 5, // min: 0
|
||||
mon_change_timeout: 1000, // ms. min: 100
|
||||
mon_retry_change_timeout: 50, // ms. min: 10
|
||||
mon_stats_timeout: 1000, // ms. min: 100
|
||||
osd_out_time: 600, // seconds. min: 0
|
||||
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
|
||||
|
@ -110,7 +111,15 @@ const etcd_tree = {
|
|||
autosync_interval: 5,
|
||||
autosync_writes: 128,
|
||||
client_queue_depth: 128, // unused
|
||||
recovery_queue_depth: 4,
|
||||
recovery_queue_depth: 1,
|
||||
recovery_sleep_us: 0,
|
||||
recovery_tune_util_low: 0.1,
|
||||
recovery_tune_client_util_low: 0,
|
||||
recovery_tune_util_high: 1.0,
|
||||
recovery_tune_client_util_high: 0.5,
|
||||
recovery_tune_interval: 1,
|
||||
recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
|
||||
recovery_tune_sleep_min_us: 10, // 10 microseconds
|
||||
recovery_pg_switch: 128,
|
||||
recovery_sync_batch: 16,
|
||||
no_recovery: false,
|
||||
|
@ -490,6 +499,11 @@ class Mon
|
|||
{
|
||||
this.config.mon_change_timeout = 100;
|
||||
}
|
||||
this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
|
||||
if (this.config.mon_retry_change_timeout < 50)
|
||||
{
|
||||
this.config.mon_retry_change_timeout = 50;
|
||||
}
|
||||
this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
|
||||
if (this.config.mon_stats_timeout < 100)
|
||||
{
|
||||
|
@ -1222,6 +1236,89 @@ class Mon
|
|||
return aff_osds;
|
||||
}
|
||||
|
||||
async generate_pool_pgs(pool_id, osd_tree, levels)
|
||||
{
|
||||
const pool_cfg = this.state.config.pools[pool_id];
|
||||
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
|
||||
{
|
||||
return null;
|
||||
}
|
||||
let pool_tree = osd_tree[pool_cfg.root_node || ''];
|
||||
pool_tree = pool_tree ? pool_tree.children : [];
|
||||
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
|
||||
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
|
||||
this.filter_osds_by_block_layout(
|
||||
pool_tree,
|
||||
pool_cfg.block_size || this.config.block_size || 131072,
|
||||
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
|
||||
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
|
||||
);
|
||||
// First try last_clean_pgs to minimize data movement
|
||||
let prev_pgs = [];
|
||||
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
|
||||
{
|
||||
prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
|
||||
}
|
||||
if (!prev_pgs.length)
|
||||
{
|
||||
// Fall back to config/pgs if it's empty
|
||||
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
|
||||
{
|
||||
prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
|
||||
}
|
||||
}
|
||||
const old_pg_count = prev_pgs.length;
|
||||
const optimize_cfg = {
|
||||
osd_tree: pool_tree,
|
||||
pg_count: pool_cfg.pg_count,
|
||||
pg_size: pool_cfg.pg_size,
|
||||
pg_minsize: pool_cfg.pg_minsize,
|
||||
max_combinations: pool_cfg.max_osd_combinations,
|
||||
ordered: pool_cfg.scheme != 'replicated',
|
||||
};
|
||||
let optimize_result;
|
||||
// Re-shuffle PGs if config/pgs.hash is empty
|
||||
if (old_pg_count > 0 && this.state.config.pgs.hash)
|
||||
{
|
||||
if (prev_pgs.length != pool_cfg.pg_count)
|
||||
{
|
||||
// Scale PG count
|
||||
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
|
||||
// because last_clean_pgs may still contain the old number of PGs
|
||||
PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
|
||||
}
|
||||
for (const pg of prev_pgs)
|
||||
{
|
||||
while (pg.length < pool_cfg.pg_size)
|
||||
{
|
||||
pg.push(0);
|
||||
}
|
||||
}
|
||||
optimize_result = await LPOptimizer.optimize_change({
|
||||
prev_pgs,
|
||||
...optimize_cfg,
|
||||
});
|
||||
}
|
||||
else
|
||||
{
|
||||
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
|
||||
}
|
||||
console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
|
||||
LPOptimizer.print_change_stats(optimize_result);
|
||||
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
|
||||
return {
|
||||
pool_id,
|
||||
pgs: optimize_result.int_pgs,
|
||||
stats: {
|
||||
total_raw_tb: optimize_result.space,
|
||||
pg_real_size: pg_effsize || pool_cfg.pg_size,
|
||||
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
|
||||
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
|
||||
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
|
||||
},
|
||||
};
|
||||
}
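The choice of "previous" PGs in `generate_pool_pgs` can be isolated into a small helper. This is a sketch under the assumption that PG numbers are contiguous and start at 1. The name `pickPrevPgs` and its plain-object arguments are hypothetical stand-ins for `this.state.history.last_clean_pgs.items` and `this.state.config.pgs.items`.

```javascript
// Sketch: prefer last_clean_pgs, fall back to config/pgs, then pad with 0 up to pg_size
function pickPrevPgs(lastCleanItems, configItems, poolId, pgSize)
{
    const collect = (items) =>
    {
        const pgs = [];
        for (const pg in ((items||{})[poolId]||{}))
        {
            // Copy the OSD set so that padding does not modify the stored state
            pgs[pg-1] = [ ...items[poolId][pg].osd_set ];
        }
        return pgs;
    };
    let prevPgs = collect(lastCleanItems);
    if (!prevPgs.length)
    {
        prevPgs = collect(configItems);
    }
    for (const pg of prevPgs)
    {
        while (pg.length < pgSize)
        {
            pg.push(0); // 0 means "no OSD" in this slot
        }
    }
    return prevPgs;
}
```

Copying each `osd_set` matters because the padding loop mutates the arrays; without the copy it would also change the state the monitor tracks.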
|
||||
|
||||
async recheck_pgs()
|
||||
{
|
||||
if (this.recheck_pgs_active)
|
||||
|
@ -1236,158 +1333,47 @@ class Mon
|
|||
const { up_osds, levels, osd_tree } = this.get_osd_tree();
|
||||
const tree_cfg = {
|
||||
osd_tree,
|
||||
levels,
|
||||
pools: this.state.config.pools,
|
||||
};
|
||||
const tree_hash = sha1hex(stableStringify(tree_cfg));
|
||||
if (this.state.config.pgs.hash != tree_hash)
|
||||
{
|
||||
// Something has changed
|
||||
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
|
||||
const etcd_request = { compare: [], success: [] };
|
||||
for (const pool_id in (this.state.config.pgs||{}).items||{})
|
||||
console.log('Pool configuration or OSD tree changed, re-optimizing');
|
||||
// First re-optimize PGs, but don't look at history yet
|
||||
const optimize_results = await Promise.all(Object.keys(this.state.config.pools)
|
||||
.map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
|
||||
// Then apply the modification in the form of an optimistic transaction,
|
||||
// each time considering new pg/history modifications (OSDs modify it during rebalance)
|
||||
while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
|
||||
{
|
||||
if (!this.state.config.pools[pool_id])
|
||||
{
|
||||
// Pool deleted. Delete all PGs, but first stop them.
|
||||
if (!await this.stop_all_pgs(pool_id))
|
||||
{
|
||||
this.recheck_pgs_active = false;
|
||||
this.schedule_recheck();
|
||||
return;
|
||||
}
|
||||
const prev_pgs = [];
|
||||
for (const pg in this.state.config.pgs.items[pool_id]||{})
|
||||
{
|
||||
prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
|
||||
}
|
||||
// Also delete pool statistics
|
||||
etcd_request.success.push({ requestDeleteRange: {
|
||||
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
|
||||
} });
|
||||
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
|
||||
}
|
||||
}
|
||||
for (const pool_id in this.state.config.pools)
|
||||
{
|
||||
const pool_cfg = this.state.config.pools[pool_id];
|
||||
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
|
||||
{
|
||||
continue;
|
||||
}
|
||||
let pool_tree = osd_tree[pool_cfg.root_node || ''];
|
||||
pool_tree = pool_tree ? pool_tree.children : [];
|
||||
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
|
||||
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
|
||||
this.filter_osds_by_block_layout(
|
||||
pool_tree,
|
||||
pool_cfg.block_size || this.config.block_size || 131072,
|
||||
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
|
||||
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
|
||||
console.log(
|
||||
'Someone changed PG configuration while we also tried to change it.'+
|
||||
' Retrying in '+this.config.mon_retry_change_timeout+' ms'
|
||||
);
|
||||
// These are for the purpose of building history.osd_sets
|
||||
const real_prev_pgs = [];
|
||||
let pg_history = [];
|
||||
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
|
||||
// Failed to apply - parallel change detected. Wait a bit and retry
|
||||
const old_rev = this.etcd_watch_revision;
|
||||
while (this.etcd_watch_revision === old_rev)
|
||||
{
|
||||
real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
|
||||
if (this.state.pg.history[pool_id] &&
|
||||
this.state.pg.history[pool_id][pg])
|
||||
{
|
||||
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
|
||||
}
|
||||
await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout));
|
||||
}
|
||||
// And these are for the purpose of minimizing data movement
|
||||
let prev_pgs = [];
|
||||
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
|
||||
{
|
||||
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
|
||||
}
|
||||
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
|
||||
const old_pg_count = real_prev_pgs.length;
|
||||
const optimize_cfg = {
|
||||
osd_tree: pool_tree,
|
||||
pg_count: pool_cfg.pg_count,
|
||||
pg_size: pool_cfg.pg_size,
|
||||
pg_minsize: pool_cfg.pg_minsize,
|
||||
max_combinations: pool_cfg.max_osd_combinations,
|
||||
ordered: pool_cfg.scheme != 'replicated',
|
||||
const new_ot = this.get_osd_tree();
|
||||
const new_tcfg = {
|
||||
osd_tree: new_ot.osd_tree,
|
||||
levels: new_ot.levels,
|
||||
pools: this.state.config.pools,
|
||||
};
|
||||
let optimize_result;
|
||||
if (old_pg_count > 0)
|
||||
if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
|
||||
{
|
||||
if (old_pg_count != pool_cfg.pg_count)
|
||||
{
|
||||
// PG count changed. Need to bring all PGs down.
|
||||
if (!await this.stop_all_pgs(pool_id))
|
||||
{
|
||||
this.recheck_pgs_active = false;
|
||||
this.schedule_recheck();
|
||||
return;
|
||||
}
|
||||
}
|
||||
if (prev_pgs.length != pool_cfg.pg_count)
|
||||
{
|
||||
// Scale PG count
|
||||
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
|
||||
// because last_clean_pgs may still contain the old number of PGs
|
||||
const new_pg_history = [];
|
||||
PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
|
||||
pg_history = new_pg_history;
|
||||
}
|
||||
for (const pg of prev_pgs)
|
||||
{
|
||||
while (pg.length < pool_cfg.pg_size)
|
||||
{
|
||||
pg.push(0);
|
||||
}
|
||||
}
|
||||
if (!this.state.config.pgs.hash)
|
||||
{
|
||||
// Re-shuffle PGs
|
||||
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
|
||||
}
|
||||
else
|
||||
{
|
||||
optimize_result = await LPOptimizer.optimize_change({
|
||||
prev_pgs,
|
||||
...optimize_cfg,
|
||||
});
|
||||
}
|
||||
// Configuration actually changed, restart from the beginning
|
||||
this.recheck_pgs_active = false;
|
||||
setImmediate(() => this.recheck_pgs().catch(this.die));
|
||||
return;
|
||||
}
|
||||
else
|
||||
{
|
||||
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
|
||||
}
|
||||
if (old_pg_count != optimize_result.int_pgs.length)
|
||||
{
|
||||
console.log(
|
||||
`PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
|
||||
` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
|
||||
);
|
||||
// Drop stats
|
||||
etcd_request.success.push({ requestDeleteRange: {
|
||||
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
|
||||
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
|
||||
} });
|
||||
}
|
||||
LPOptimizer.print_change_stats(optimize_result);
|
||||
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
|
||||
this.state.pool.stats[pool_id] = {
|
||||
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
|
||||
total_raw_tb: optimize_result.space,
|
||||
pg_real_size: pg_effsize || pool_cfg.pg_size,
|
||||
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
|
||||
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
|
||||
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
|
||||
};
|
||||
etcd_request.success.push({ requestPut: {
|
||||
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
|
||||
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
|
||||
} });
|
||||
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
|
||||
// Configuration didn't change, PG history probably changed, so just retry
|
||||
}
|
||||
new_config_pgs.hash = tree_hash;
|
||||
await this.save_pg_config(new_config_pgs, etcd_request);
|
||||
console.log('PG configuration successfully changed');
|
||||
}
|
||||
else
|
||||
{
|
||||
|
@ -1434,8 +1420,81 @@ class Mon
|
|||
this.recheck_pgs_active = false;
|
||||
}
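The retry loop in `recheck_pgs` follows an optimistic-transaction pattern: the expensive optimization runs once, while the cheap "build and commit" step is repeated whenever a concurrent change is detected. The sketch below shows the pattern against an in-memory stand-in. `FakeStore` and `applyOptimistically` are hypothetical and are not the etcd API, and the revision check is a simplification of the `mod_revision` comparison in the diff. The real code also waits `mon_retry_change_timeout` ms between attempts, which is omitted here to keep the sketch synchronous.

```javascript
// In-memory stand-in for a versioned key (hypothetical, not etcd)
class FakeStore
{
    constructor()
    {
        this.value = null;
        this.revision = 0;    // global revision counter
        this.modRevision = 0; // revision of the last modification of the key
    }

    // Commit only if the key was not modified after `seenRevision`
    txn(seenRevision, newValue)
    {
        if (this.modRevision > seenRevision)
            return false;
        this.revision++;
        this.modRevision = this.revision;
        this.value = newValue;
        return true;
    }
}

function applyOptimistically(store, plan, buildValue, currentHash, maxAttempts = 10)
{
    for (let attempt = 1; attempt <= maxAttempts; attempt++)
    {
        const seen = store.revision;
        // Cheap step, repeated on every attempt with the freshest state
        const value = buildValue(store.value, plan.result);
        if (store.txn(seen, value))
            return { applied: true, restart: false, attempt };
        // Conflict. If the inputs of the plan changed too, the plan itself is stale
        if (currentHash() !== plan.hash)
            return { applied: false, restart: true, attempt };
    }
    return { applied: false, restart: false, attempt: maxAttempts };
}
```

Separating the plan from the commit step means a conflict caused only by PG history updates does not force the optimizer to run again.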
|
||||
|
||||
async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
|
||||
async apply_pool_pgs(results, up_osds, osd_tree, tree_hash)
|
||||
{
|
||||
for (const pool_id in (this.state.config.pgs||{}).items||{})
|
||||
{
|
||||
// We should stop all PGs when deleting a pool or changing its PG count
|
||||
if (!this.state.config.pools[pool_id] ||
|
||||
this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
|
||||
Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
|
||||
{
|
||||
if (!await this.stop_all_pgs(pool_id))
|
||||
{
|
||||
return false;
|
||||
}
|
||||
}
|
||||
}
|
||||
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
|
||||
const etcd_request = { compare: [], success: [] };
|
||||
for (const pool_id in (new_config_pgs||{}).items||{})
|
||||
{
|
||||
if (!this.state.config.pools[pool_id])
|
||||
{
|
||||
const prev_pgs = [];
|
||||
for (const pg in new_config_pgs.items[pool_id]||{})
|
||||
{
|
||||
prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
|
||||
}
|
||||
// Also delete pool statistics
|
||||
etcd_request.success.push({ requestDeleteRange: {
|
||||
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
|
||||
} });
|
||||
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
|
||||
}
|
||||
}
|
||||
for (const pool_res of results)
|
||||
{
|
||||
const pool_id = pool_res.pool_id;
|
||||
const pool_cfg = this.state.config.pools[pool_id];
|
||||
let pg_history = [];
|
||||
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
|
||||
{
|
||||
if (this.state.pg.history[pool_id] &&
|
||||
this.state.pg.history[pool_id][pg])
|
||||
{
|
||||
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
|
||||
}
|
||||
}
|
||||
const real_prev_pgs = [];
|
||||
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
|
||||
{
|
||||
real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
|
||||
}
|
||||
if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
|
||||
{
|
||||
console.log(
|
||||
`Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
|
||||
` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
|
||||
);
|
||||
pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
|
||||
// Drop stats
|
||||
etcd_request.success.push({ requestDeleteRange: {
|
||||
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
|
||||
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
|
||||
} });
|
||||
}
|
||||
const stats = {
|
||||
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
|
||||
...pool_res.stats,
|
||||
};
|
||||
etcd_request.success.push({ requestPut: {
|
||||
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
|
||||
value: b64(JSON.stringify(stats)),
|
||||
} });
|
||||
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
|
||||
}
|
||||
new_config_pgs.hash = tree_hash;
|
||||
etcd_request.compare.push(
|
||||
{ key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
|
||||
{ key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
|
||||
|
@ -1443,14 +1502,8 @@ class Mon
|
|||
etcd_request.success.push(
|
||||
{ requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
|
||||
);
|
||||
const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
|
||||
if (!res.succeeded)
|
||||
{
|
||||
console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
|
||||
this.schedule_recheck();
|
||||
return;
|
||||
}
|
||||
console.log('PG configuration successfully changed');
|
||||
const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
|
||||
return txn_res.succeeded;
|
||||
}
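The "Drop stats" request above deletes a whole key prefix by using `pool_id+'/'` as the start key and `pool_id+'0'` as the end key. This works because `0` is the character immediately after `/` in ASCII and the end of an etcd range is exclusive. A small sketch of the same idea follows. The helper names and the `/vitastor` prefix are assumptions for illustration, and plain string comparison stands in for etcd's byte-wise key ordering (equivalent for ASCII keys).

```javascript
// Sketch: build a [key, range_end) pair that covers every key starting with `prefix`
function prefixRange(prefix)
{
    const last = prefix.charCodeAt(prefix.length-1);
    return {
        key: prefix,
        // Replace the last character with its successor: '/' (0x2F) becomes '0' (0x30)
        range_end: prefix.slice(0, -1) + String.fromCharCode(last+1),
    };
}

// Half-open range check, matching how the range end is treated as exclusive
function inRange(k, range)
{
    return k >= range.key && k < range.range_end;
}
```

Note that for pool 1 the end key reads `.../pg/stats/10`, which looks like pool 10 but is only an exclusive upper bound: keys of pool 10 sort at or after it and are not touched.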
|
||||
|
||||
// Schedule next recheck at least at <unixtime>
|
||||
|
|
|
@ -163,20 +163,10 @@ void blockstore_impl_t::loop()
|
|||
}
|
||||
else if (op->opcode == BS_OP_SYNC)
|
||||
{
|
||||
// wait for all small writes to be submitted
|
||||
// wait for all big writes to complete, submit data device fsync
|
||||
// sync only completed writes?
|
||||
// wait for the data device fsync to complete, then submit journal writes for big writes
|
||||
// then submit an fsync operation
|
||||
if (has_writes)
|
||||
{
|
||||
// Can't submit SYNC before previous writes
|
||||
continue;
|
||||
}
|
||||
wr_st = continue_sync(op);
|
||||
if (wr_st != 2)
|
||||
{
|
||||
has_writes = wr_st > 0 ? 1 : 2;
|
||||
}
|
||||
}
|
||||
else if (op->opcode == BS_OP_STABLE)
|
||||
{
|
||||
|
|
|
@ -277,6 +277,7 @@ class blockstore_impl_t
|
|||
int unsynced_big_write_count = 0, unstable_unsynced = 0;
|
||||
int unsynced_queued_ops = 0;
|
||||
allocator *data_alloc = NULL;
|
||||
uint64_t used_blocks = 0;
|
||||
uint8_t *zero_object;
|
||||
|
||||
void *metadata_buffer = NULL;
|
||||
|
@ -430,7 +431,7 @@ public:
|
|||
|
||||
inline uint32_t get_block_size() { return dsk.data_block_size; }
|
||||
inline uint64_t get_block_count() { return dsk.block_count; }
|
||||
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
|
||||
inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; }
|
||||
inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
|
||||
inline uint64_t get_journal_size() { return dsk.journal_len; }
|
||||
};
|
||||
|
|
|
@ -376,6 +376,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
|
|||
else
|
||||
{
|
||||
bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
|
||||
bs->used_blocks++;
|
||||
}
|
||||
entries_loaded++;
|
||||
#ifdef BLOCKSTORE_DEBUG
|
||||
|
@ -1181,6 +1182,7 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
|
|||
sp -= bs->dsk.data_block_size;
|
||||
else
|
||||
bs->inode_space_stats.erase(oid.inode);
|
||||
bs->used_blocks--;
|
||||
}
|
||||
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
|
||||
// Remove it from the flusher's queue, too
|
||||
|
|
|
@ -445,6 +445,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
|
|||
if (!exists)
|
||||
{
|
||||
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
|
||||
used_blocks++;
|
||||
}
|
||||
big_to_flush++;
|
||||
}
|
||||
|
@ -455,6 +456,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
|
|||
sp -= dsk.data_block_size;
|
||||
else
|
||||
inode_space_stats.erase(dirty_it->first.oid.inode);
|
||||
used_blocks--;
|
||||
big_to_flush++;
|
||||
}
|
||||
}
|
||||
|
|
|
@ -705,6 +705,8 @@ resume_1:
|
|||
}
|
||||
goto resume_2;
|
||||
}
|
||||
// Protect from try_send completing the operation immediately
|
||||
op->inflight_count++;
|
||||
for (int i = 0; i < op->parts.size(); i++)
|
||||
{
|
||||
if (!(op->parts[i].flags & PART_SENT))
|
||||
|
@ -728,8 +730,10 @@ resume_1:
|
|||
}
|
||||
}
|
||||
}
|
||||
op->inflight_count--;
|
||||
if (op->state == 1)
|
||||
{
|
||||
// Some suboperations have to be resent
|
||||
return 0;
|
||||
}
|
||||
resume_2:
|
||||
|
|
|
@ -149,7 +149,7 @@ public:
|
|||
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
|
||||
std::map<uint64_t, int> osd_peer_fds;
|
||||
// op statistics
|
||||
osd_op_stats_t stats;
|
||||
osd_op_stats_t stats, recovery_stats;
|
||||
|
||||
void init();
|
||||
void parse_config(const json11::Json & config);
|
||||
|
@ -175,6 +175,7 @@ public:
|
|||
bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
|
||||
#endif
|
||||
|
||||
void inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len);
|
||||
void measure_exec(osd_op_t *cur_op);
|
||||
|
||||
protected:
|
||||
|
|
|
@ -24,3 +24,17 @@ osd_op_t::~osd_op_t()
|
|||
free(buf);
|
||||
}
|
||||
}
|
||||
|
||||
bool osd_op_t::is_recovery_related()
|
||||
{
|
||||
return (req.hdr.opcode == OSD_OP_SEC_READ ||
|
||||
req.hdr.opcode == OSD_OP_SEC_WRITE ||
|
||||
req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
|
||||
(req.sec_rw.flags & OSD_OP_RECOVERY_RELATED) ||
|
||||
req.hdr.opcode == OSD_OP_SEC_DELETE &&
|
||||
(req.sec_del.flags & OSD_OP_RECOVERY_RELATED) ||
|
||||
req.hdr.opcode == OSD_OP_SEC_STABILIZE &&
|
||||
(req.sec_stab.flags & OSD_OP_RECOVERY_RELATED) ||
|
||||
req.hdr.opcode == OSD_OP_SEC_SYNC &&
|
||||
(req.sec_sync.flags & OSD_OP_RECOVERY_RELATED);
|
||||
}
|
||||
|
|
|
@ -173,4 +173,6 @@ struct osd_op_t
|
|||
osd_op_buf_list_t iov;
|
||||
|
||||
~osd_op_t();
|
||||
|
||||
bool is_recovery_related();
|
||||
};
|
||||
|
|
|
@ -131,6 +131,23 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
|
|||
}
|
||||
}
|
||||
|
||||
void osd_messenger_t::inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len)
|
||||
{
|
||||
uint64_t usecs = (
|
||||
(tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
|
||||
(tv_end.tv_nsec - tv_begin.tv_nsec)/1000
|
||||
);
|
||||
stats.op_stat_count[opcode]++;
|
||||
if (!stats.op_stat_count[opcode])
|
||||
{
|
||||
stats.op_stat_count[opcode] = 1;
|
||||
stats.op_stat_sum[opcode] = 0;
|
||||
stats.op_stat_bytes[opcode] = 0;
|
||||
}
|
||||
stats.op_stat_sum[opcode] += usecs;
|
||||
stats.op_stat_bytes[opcode] += len;
|
||||
}
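Keeping a separate `recovery_stats` structure with the same counters makes it possible to derive recovery utilization the way the documentation defines it: iops multiplied by average latency equals the total latency accumulated during an interval divided by the interval length. The sketch below mirrors the accounting in JavaScript for illustration. The object layout and function names are hypothetical, and the derivation of utilization from counter deltas is an assumption, since the tuning code itself is not part of this hunk.

```javascript
// Sketch only: field names are hypothetical, not the OSD's C++ structures
function newStats()
{
    return { count: {}, sumUs: {}, bytes: {} };
}

// Same arithmetic as the timespec subtraction in inc_op_stats
function usecsBetween(begin, end)
{
    return (end.sec - begin.sec)*1000000 + Math.trunc((end.nsec - begin.nsec)/1000);
}

function incOpStats(stats, opcode, begin, end, len)
{
    stats.count[opcode] = (stats.count[opcode] || 0) + 1;
    stats.sumUs[opcode] = (stats.sumUs[opcode] || 0) + usecsBetween(begin, end);
    stats.bytes[opcode] = (stats.bytes[opcode] || 0) + len;
}

// Every operation goes into the common counters;
// recovery-flagged operations are additionally counted in the recovery counters
function measureExec(all, recovery, op)
{
    incOpStats(all, op.opcode, op.begin, op.end, op.len);
    if (op.recovery)
    {
        incOpStats(recovery, op.opcode, op.begin, op.end, op.len);
    }
}

// iops * avg latency = (n/T) * (sum/n) = sum/T
function utilizationOver(prevSumUs, curSumUs, intervalSec)
{
    return (curSumUs - prevSumUs) / (intervalSec * 1000000);
}
```

Because the operation count cancels out, utilization over an interval needs only two snapshots of the latency sum.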
|
||||
|
||||
void osd_messenger_t::measure_exec(osd_op_t *cur_op)
|
||||
{
|
||||
// Measure execution latency
|
||||
|
@ -142,29 +159,24 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
|
|||
{
|
||||
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
|
||||
}
|
||||
stats.op_stat_count[cur_op->req.hdr.opcode]++;
|
||||
if (!stats.op_stat_count[cur_op->req.hdr.opcode])
|
||||
{
|
||||
stats.op_stat_count[cur_op->req.hdr.opcode]++;
|
||||
stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
|
||||
stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
|
||||
}
|
||||
stats.op_stat_sum[cur_op->req.hdr.opcode] += (
|
||||
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
|
||||
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
|
||||
);
|
||||
uint64_t len = 0;
|
||||
if (cur_op->req.hdr.opcode == OSD_OP_READ ||
|
||||
cur_op->req.hdr.opcode == OSD_OP_WRITE ||
|
||||
cur_op->req.hdr.opcode == OSD_OP_SCRUB)
|
||||
{
|
||||
// req.rw.len is internally set to the full object size for scrubs
|
||||
stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
|
||||
len = cur_op->req.rw.len;
|
||||
}
|
||||
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
|
||||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
|
||||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
|
||||
{
|
||||
stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
|
||||
len = cur_op->req.sec_rw.len;
|
||||
}
|
||||
inc_op_stats(stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
|
||||
if (cur_op->is_recovery_related())
|
||||
{
|
||||
inc_op_stats(recovery_stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
|
||||
}
|
||||
}
|
||||
|
||||
|
|
79
src/osd.cpp
|
@ -68,14 +68,21 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
|
|||
}
|
||||
}
|
||||
|
||||
print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
|
||||
if (print_stats_timer_id == -1)
|
||||
{
|
||||
print_stats();
|
||||
});
|
||||
slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
|
||||
print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
|
||||
{
|
||||
print_stats();
|
||||
});
|
||||
}
|
||||
if (slow_log_timer_id == -1)
|
||||
{
|
||||
print_slow();
|
||||
});
|
||||
slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
|
||||
{
|
||||
print_slow();
|
||||
});
|
||||
}
|
||||
apply_recovery_tune_interval();
|
||||
|
||||
msgr.tfd = this->tfd;
|
||||
msgr.ringloop = this->ringloop;
|
||||
|
@ -97,6 +104,11 @@ osd_t::~osd_t()
|
|||
tfd->clear_timer(slow_log_timer_id);
|
||||
slow_log_timer_id = -1;
|
||||
}
|
||||
if (rtune_timer_id >= 0)
|
||||
{
|
||||
tfd->clear_timer(rtune_timer_id);
|
||||
rtune_timer_id = -1;
|
||||
}
|
||||
if (print_stats_timer_id >= 0)
|
||||
{
|
||||
tfd->clear_timer(print_stats_timer_id);
|
||||
|
@ -196,6 +208,30 @@ void osd_t::parse_config(bool init)
|
|||
recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
|
||||
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
|
||||
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
|
||||
recovery_sleep_us = config["recovery_sleep_us"].uint64_value();
|
||||
recovery_tune_util_low = config["recovery_tune_util_low"].is_null()
|
||||
? 0.1 : config["recovery_tune_util_low"].number_value();
|
||||
if (recovery_tune_util_low < 0.01)
|
||||
recovery_tune_util_low = 0.01;
|
||||
recovery_tune_util_high = config["recovery_tune_util_high"].is_null()
|
||||
? 1.0 : config["recovery_tune_util_high"].number_value();
|
||||
if (recovery_tune_util_high < 0.01)
|
||||
recovery_tune_util_high = 0.01;
|
||||
recovery_tune_client_util_low = config["recovery_tune_client_util_low"].is_null()
|
||||
? 0 : config["recovery_tune_client_util_low"].number_value();
|
||||
if (recovery_tune_client_util_low < 0.01)
|
||||
recovery_tune_client_util_low = 0.01;
|
||||
recovery_tune_client_util_high = config["recovery_tune_client_util_high"].is_null()
|
||||
? 0.5 : config["recovery_tune_client_util_high"].number_value();
|
||||
if (recovery_tune_client_util_high < 0.01)
|
||||
recovery_tune_client_util_high = 0.01;
|
||||
auto old_recovery_tune_interval = recovery_tune_interval;
|
||||
recovery_tune_interval = config["recovery_tune_interval"].is_null()
|
||||
? 1 : config["recovery_tune_interval"].uint64_value();
|
||||
recovery_tune_agg_interval = config["recovery_tune_agg_interval"].is_null()
|
||||
? 10 : config["recovery_tune_agg_interval"].uint64_value();
|
||||
recovery_tune_sleep_min_us = config["recovery_tune_sleep_min_us"].is_null()
|
||||
? 10 : config["recovery_tune_sleep_min_us"].uint64_value();
|
||||
recovery_pg_switch = config["recovery_pg_switch"].uint64_value();
|
||||
if (recovery_pg_switch < 1)
|
||||
recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
|
||||
|
@ -274,6 +310,10 @@ void osd_t::parse_config(bool init)
|
|||
print_slow();
|
||||
});
|
||||
}
|
||||
if (old_recovery_tune_interval != recovery_tune_interval)
|
||||
{
|
||||
apply_recovery_tune_interval();
|
||||
}
|
||||
}
|
||||
|
||||
void osd_t::bind_socket()
|
||||
|
@ -421,14 +461,6 @@ void osd_t::exec_op(osd_op_t *cur_op)
|
|||
}
|
||||
}
|
||||
|
||||
void osd_t::reset_stats()
|
||||
{
|
||||
msgr.stats = {};
|
||||
prev_stats = {};
|
||||
memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
|
||||
memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
|
||||
}
|
||||
|
||||
void osd_t::print_stats()
|
||||
{
|
||||
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
|
||||
|
@ -466,19 +498,20 @@ void osd_t::print_stats()
|
|||
}
|
||||
for (int i = 0; i < 2; i++)
|
||||
{
|
||||
if (recovery_stat_count[0][i] != recovery_stat_count[1][i])
|
||||
if (recovery_stat[i].count > recovery_print_prev[i].count)
|
||||
{
|
||||
uint64_t bw = (recovery_stat_bytes[0][i] - recovery_stat_bytes[1][i]) / print_stats_interval;
|
||||
uint64_t bw = (recovery_stat[i].bytes - recovery_print_prev[i].bytes) / print_stats_interval;
|
||||
printf(
|
||||
"[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s\n", osd_num, recovery_stat_names[i],
|
||||
(recovery_stat_count[0][i] - recovery_stat_count[1][i]) * 1.0 / print_stats_interval,
|
||||
"[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg latency %ld us, delay %ld us\n", osd_num, recovery_stat_names[i],
|
||||
(recovery_stat[i].count - recovery_print_prev[i].count) * 1.0 / print_stats_interval,
|
||||
(bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
|
||||
(bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
|
||||
(bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")),
|
||||
(recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count),
|
||||
recovery_target_sleep_us
|
||||
);
|
||||
recovery_stat_count[1][i] = recovery_stat_count[0][i];
|
||||
recovery_stat_bytes[1][i] = recovery_stat_bytes[0][i];
|
||||
}
|
||||
}
|
||||
memcpy(recovery_print_prev, recovery_stat, sizeof(recovery_stat));
|
||||
if (corrupted_objects > 0)
|
||||
{
|
||||
printf("[OSD %lu] %lu object(s) corrupted\n", osd_num, corrupted_objects);
|
||||
|
@ -572,8 +605,8 @@ void osd_t::print_slow()
|
|||
op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
|
||||
op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
|
||||
{
|
||||
bufprintf(" state=%d", PRIV(op->bs_op)->op_state);
|
||||
int wait_for = PRIV(op->bs_op)->wait_for;
|
||||
bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1);
|
||||
int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0;
|
||||
if (wait_for)
|
||||
{
|
||||
bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail);
|
||||
|
|
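The reworked `print_stats()` hunk above derives per-interval recovery rates by subtracting the previously printed snapshot (`recovery_print_prev`) from the running totals (`recovery_stat`). The arithmetic can be sketched in isolation as follows. The helper names `recovery_delta()` and `bw_unit()` and the `recovery_rate_t` type are hypothetical and do not appear in the diff; only `recovery_stat_t` mirrors the struct added to `src/osd.h`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Mirrors recovery_stat_t from the diff: running totals
struct recovery_stat_t
{
    uint64_t count, usec, bytes;
};

// Hypothetical result type, not present in the diff
struct recovery_rate_t
{
    double ops_per_sec;
    uint64_t bytes_per_sec;
    uint64_t avg_latency_us;
};

// Computes rates between two snapshots taken interval_sec apart.
// Returns zeroes when no recovery operation completed in the interval,
// which also avoids dividing by a zero count delta.
recovery_rate_t recovery_delta(const recovery_stat_t & cur, const recovery_stat_t & prev, uint64_t interval_sec)
{
    assert(interval_sec > 0);
    if (cur.count <= prev.count)
        return { 0, 0, 0 };
    uint64_t ops = cur.count - prev.count;
    return {
        ops * 1.0 / interval_sec,
        (cur.bytes - prev.bytes) / interval_sec,
        (cur.usec - prev.usec) / ops,
    };
}

// Uses the same KB/s, MB/s, GB/s thresholds as the printf() call in the diff
std::string bw_unit(uint64_t bw)
{
    return bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s");
}
```

Keeping one snapshot struct per recovery type, rather than the former `[2][2]` arrays, lets the whole previous state be copied with a single `memcpy` after printing, as the hunk does.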
37
src/osd.h
|
@ -34,7 +34,7 @@
|
|||
#define DEFAULT_AUTOSYNC_INTERVAL 5
|
||||
#define DEFAULT_AUTOSYNC_WRITES 128
|
||||
#define MAX_RECOVERY_QUEUE 2048
|
||||
#define DEFAULT_RECOVERY_QUEUE 4
|
||||
#define DEFAULT_RECOVERY_QUEUE 1
|
||||
#define DEFAULT_RECOVERY_PG_SWITCH 128
|
||||
#define DEFAULT_RECOVERY_BATCH 16
|
||||
|
||||
|
@ -87,6 +87,11 @@ struct osd_chain_read_t
|
|||
|
||||
struct osd_rmw_stripe_t;
|
||||
|
||||
struct recovery_stat_t
|
||||
{
|
||||
uint64_t count, usec, bytes;
|
||||
};
|
||||
|
||||
class osd_t
|
||||
{
|
||||
// config
|
||||
|
@ -111,7 +116,15 @@ class osd_t
|
|||
int immediate_commit = IMMEDIATE_NONE;
|
||||
int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // "emergency" sync every 5 seconds
|
||||
int autosync_writes = DEFAULT_AUTOSYNC_WRITES;
|
||||
int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
|
||||
uint64_t recovery_queue_depth = 1;
|
||||
uint64_t recovery_sleep_us = 0;
|
||||
double recovery_tune_util_low = 0.1;
|
||||
double recovery_tune_client_util_low = 0;
|
||||
double recovery_tune_util_high = 1.0;
|
||||
double recovery_tune_client_util_high = 0.5;
|
||||
int recovery_tune_interval = 1;
|
||||
int recovery_tune_agg_interval = 10;
|
||||
int recovery_tune_sleep_min_us = 10;
|
||||
int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
|
||||
int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
|
||||
int inode_vanish_time = 60;
|
||||
|
@ -189,8 +202,18 @@ class osd_t
|
|||
std::map<uint64_t, inode_stats_t> inode_stats;
|
||||
std::map<uint64_t, timespec> vanishing_inodes;
|
||||
const char* recovery_stat_names[2] = { "degraded", "misplaced" };
|
||||
uint64_t recovery_stat_count[2][2] = {};
|
||||
uint64_t recovery_stat_bytes[2][2] = {};
|
||||
recovery_stat_t recovery_stat[2];
|
||||
recovery_stat_t recovery_print_prev[2];
|
||||
|
||||
// recovery auto-tuning
|
||||
int rtune_timer_id = -1;
|
||||
uint64_t rtune_avg_lat = 0;
|
||||
double rtune_client_util = 0, rtune_target_util = 1;
|
||||
osd_op_stats_t rtune_prev_stats, rtune_prev_recovery_stats;
|
||||
std::vector<uint64_t> recovery_target_sleep_items;
|
||||
uint64_t recovery_target_sleep_us = 0;
|
||||
uint64_t recovery_target_sleep_total = 0;
|
||||
int recovery_target_sleep_cur = 0, recovery_target_sleep_count = 0;
|
||||
|
||||
// cluster connection
|
||||
void parse_config(bool init);
|
||||
|
@ -208,8 +231,9 @@ class osd_t
|
|||
void create_osd_state();
|
||||
void renew_lease(bool reload);
|
||||
void print_stats();
|
||||
void tune_recovery();
|
||||
void apply_recovery_tune_interval();
|
||||
void print_slow();
|
||||
void reset_stats();
|
||||
json11::Json get_statistics();
|
||||
void report_statistics();
|
||||
void report_pg_state(pg_t & pg);
|
||||
|
@ -238,6 +262,7 @@ class osd_t
|
|||
bool submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
|
||||
bool pick_next_recovery(osd_recovery_op_t &op);
|
||||
void submit_recovery_op(osd_recovery_op_t *op);
|
||||
void finish_recovery_op(osd_recovery_op_t *op);
|
||||
bool continue_recovery();
|
||||
pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg);
|
||||
|
||||
|
@ -279,7 +304,7 @@ class osd_t
|
|||
bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
|
||||
void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
|
||||
void handle_primary_bs_subop(osd_op_t *subop);
|
||||
void add_bs_subop_stats(osd_op_t *subop);
|
||||
void add_bs_subop_stats(osd_op_t *subop, bool recovery_related = false);
|
||||
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
|
||||
|
||||
void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
|
||||
|
|
|
@ -213,12 +213,14 @@ json11::Json osd_t::get_statistics()
|
|||
st["subop_stats"] = subop_stats;
|
||||
st["recovery_stats"] = json11::Json::object {
|
||||
{ recovery_stat_names[0], json11::Json::object {
|
||||
{ "count", recovery_stat_count[0][0] },
|
||||
{ "bytes", recovery_stat_bytes[0][0] },
|
||||
{ "count", recovery_stat[0].count },
|
||||
{ "bytes", recovery_stat[0].bytes },
|
||||
{ "usec", recovery_stat[0].usec },
|
||||
} },
|
||||
{ recovery_stat_names[1], json11::Json::object {
|
||||
{ "count", recovery_stat_count[0][1] },
|
||||
{ "bytes", recovery_stat_bytes[0][1] },
|
||||
{ "count", recovery_stat[1].count },
|
||||
{ "bytes", recovery_stat[1].bytes },
|
||||
{ "usec", recovery_stat[1].usec },
|
||||
} },
|
||||
};
|
||||
return st;
|
||||
|
|
|
@ -325,26 +325,129 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
|
|||
{
|
||||
printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe);
|
||||
}
|
||||
// CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
|
||||
op->osd_op = NULL;
|
||||
recovery_ops.erase(op->oid);
|
||||
delete osd_op;
|
||||
if (immediate_commit != IMMEDIATE_ALL)
|
||||
{
|
||||
recovery_done++;
|
||||
if (recovery_done >= recovery_sync_batch)
|
||||
{
|
||||
// Force sync every <recovery_sync_batch> operations
|
||||
// This is required not to pile up an excessive amount of delete operations
|
||||
autosync();
|
||||
recovery_done = 0;
|
||||
}
|
||||
}
|
||||
continue_recovery();
|
||||
finish_recovery_op(op);
|
||||
};
|
||||
exec_op(op->osd_op);
|
||||
}
|
||||
|
||||
void osd_t::apply_recovery_tune_interval()
|
||||
{
|
||||
if (rtune_timer_id >= 0)
|
||||
{
|
||||
tfd->clear_timer(rtune_timer_id);
|
||||
rtune_timer_id = -1;
|
||||
}
|
||||
if (recovery_tune_interval != 0)
|
||||
{
|
||||
rtune_timer_id = this->tfd->set_timer(recovery_tune_interval*1000, true, [this](int timer_id)
|
||||
{
|
||||
tune_recovery();
|
||||
});
|
||||
}
|
||||
else
|
||||
{
|
||||
recovery_target_sleep_us = recovery_sleep_us;
|
||||
}
|
||||
}
|
||||
|
||||
void osd_t::finish_recovery_op(osd_recovery_op_t *op)
|
||||
{
|
||||
// CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
|
||||
delete op->osd_op;
|
||||
op->osd_op = NULL;
|
||||
recovery_ops.erase(op->oid);
|
||||
if (immediate_commit != IMMEDIATE_ALL)
|
||||
{
|
||||
recovery_done++;
|
||||
if (recovery_done >= recovery_sync_batch)
|
||||
{
|
||||
// Force sync every <recovery_sync_batch> operations
|
||||
// This is required not to pile up an excessive amount of delete operations
|
||||
autosync();
|
||||
recovery_done = 0;
|
||||
}
|
||||
}
|
||||
continue_recovery();
|
||||
}
|
||||
|
||||
void osd_t::tune_recovery()
|
||||
{
|
||||
static int accounted_ops[] = {
|
||||
OSD_OP_SEC_READ, OSD_OP_SEC_WRITE, OSD_OP_SEC_WRITE_STABLE,
|
||||
OSD_OP_SEC_STABILIZE, OSD_OP_SEC_SYNC, OSD_OP_SEC_DELETE
|
||||
};
|
||||
uint64_t total_client_usec = 0, total_recovery_usec = 0, recovery_count = 0;
|
||||
for (int i = 0; i < sizeof(accounted_ops)/sizeof(accounted_ops[0]); i++)
|
||||
{
|
||||
total_client_usec += (msgr.stats.op_stat_sum[accounted_ops[i]]
|
||||
- rtune_prev_stats.op_stat_sum[accounted_ops[i]]);
|
||||
total_recovery_usec += (msgr.recovery_stats.op_stat_sum[accounted_ops[i]]
|
||||
- rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]]);
|
||||
recovery_count += (msgr.recovery_stats.op_stat_count[accounted_ops[i]]
|
||||
- rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]]);
|
||||
rtune_prev_stats.op_stat_sum[accounted_ops[i]] = msgr.stats.op_stat_sum[accounted_ops[i]];
|
||||
rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]] = msgr.recovery_stats.op_stat_sum[accounted_ops[i]];
|
||||
rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]] = msgr.recovery_stats.op_stat_count[accounted_ops[i]];
|
||||
}
|
||||
total_client_usec -= total_recovery_usec;
|
||||
if (recovery_count == 0)
|
||||
{
|
||||
return;
|
||||
}
|
||||
// example:
|
||||
// total 3 GB/s
|
||||
// recovery queue 1
|
||||
// 120 OSDs
|
||||
// EC 5+3
|
||||
// 128kb block_size => 640kb object
|
||||
// 3000*1024/640/120 = 40 MB/s per OSD = 64 recovered objects per OSD
|
||||
// = 64*8*2 subops = 1024 recovery subop iops
|
||||
// 8 recovery subop queue
|
||||
// => subop avg latency = 0.0078125 sec
|
||||
// utilisation = 8
|
||||
// target util 1
|
||||
// intuitively target latency should be 8x of real
|
||||
// target_lat = rtune_avg_lat * utilisation / target_util
|
||||
// = rtune_avg_lat * rtune_avg_lat * rtune_avg_iops / target_util
|
||||
// = 0.0625
|
||||
// recovery utilisation will be 1
|
||||
rtune_client_util = total_client_usec/1000000.0/recovery_tune_interval;
|
||||
rtune_target_util = (rtune_client_util < recovery_tune_client_util_low
|
||||
? recovery_tune_util_high
|
||||
: recovery_tune_util_low + (rtune_client_util >= recovery_tune_client_util_high
|
||||
? 0 : (recovery_tune_util_high-recovery_tune_util_low)*
|
||||
(recovery_tune_client_util_high-rtune_client_util)/(recovery_tune_client_util_high-recovery_tune_client_util_low)
|
||||
)
|
||||
);
|
||||
rtune_avg_lat = total_recovery_usec/recovery_count;
|
||||
uint64_t target_lat = rtune_avg_lat * rtune_avg_lat/1000000.0 * recovery_count/recovery_tune_interval / rtune_target_util;
|
||||
auto sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
|
||||
if (recovery_target_sleep_items.size() != recovery_tune_agg_interval)
|
||||
{
|
||||
recovery_target_sleep_items.resize(recovery_tune_agg_interval);
|
||||
for (int i = 0; i < recovery_tune_agg_interval; i++)
|
||||
recovery_target_sleep_items[i] = 0;
|
||||
recovery_target_sleep_total = 0;
|
||||
recovery_target_sleep_cur = 0;
|
||||
recovery_target_sleep_count = 0;
|
||||
}
|
||||
recovery_target_sleep_total -= recovery_target_sleep_items[recovery_target_sleep_cur];
|
||||
recovery_target_sleep_items[recovery_target_sleep_cur] = sleep_us;
|
||||
recovery_target_sleep_cur = (recovery_target_sleep_cur+1) % recovery_tune_agg_interval;
|
||||
recovery_target_sleep_total += sleep_us;
|
||||
if (recovery_target_sleep_count < recovery_tune_agg_interval)
|
||||
recovery_target_sleep_count++;
|
||||
recovery_target_sleep_us = recovery_target_sleep_total / recovery_target_sleep_count;
|
||||
if (log_level > 4)
|
||||
{
|
||||
printf(
|
||||
"[OSD %lu] auto-tune: client util: %.2f, recovery util: %.2f, lat: %lu us -> target util %.2f, delay %lu us\n",
|
||||
osd_num, rtune_client_util, total_recovery_usec/1000000.0/recovery_tune_interval,
|
||||
rtune_avg_lat, rtune_target_util, recovery_target_sleep_us
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
// Just trigger write requests for degraded objects. They'll be recovered during writing
|
||||
bool osd_t::continue_recovery()
|
||||
{
|
||||
|
|
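The core of `tune_recovery()` above is two formulas: a linear interpolation that picks the recovery target utilisation from the measured client utilisation, and the conversion of that target into a per-subop delay. A minimal standalone restatement, assuming hypothetical free functions `pick_target_util()` and `pick_sleep_us()` (the diff computes both inline on `osd_t` members), could look like this. The `assert` preconditions are assumptions of the sketch; the diff itself does not check that the high client threshold exceeds the low one.

```cpp
#include <cassert>
#include <cstdint>

// Restates the nested ternary from tune_recovery():
// below client_util_low recovery may use util_high,
// at or above client_util_high it is limited to util_low,
// and in between the target falls linearly.
double pick_target_util(double client_util, double util_low, double util_high,
    double client_util_low, double client_util_high)
{
    assert(client_util_high > client_util_low);
    if (client_util < client_util_low)
        return util_high;
    if (client_util >= client_util_high)
        return util_low;
    return util_low + (util_high - util_low) *
        (client_util_high - client_util) / (client_util_high - client_util_low);
}

// target_lat = avg_lat * utilisation / target_util, where
// utilisation = avg_lat (in seconds) * recovery subops per second.
// The delay is the extra latency needed to reach target_lat, and is
// dropped to 0 when it would not exceed sleep_min_us.
uint64_t pick_sleep_us(uint64_t total_recovery_usec, uint64_t recovery_count,
    uint64_t interval_sec, double target_util, uint64_t sleep_min_us)
{
    assert(recovery_count > 0 && interval_sec > 0 && target_util > 0);
    uint64_t avg_lat = total_recovery_usec / recovery_count;
    double util = avg_lat / 1000000.0 * recovery_count / interval_sec;
    uint64_t target_lat = (uint64_t)(avg_lat * util / target_util);
    return target_lat > avg_lat + sleep_min_us ? target_lat - avg_lat : 0;
}
```

With the figures from the code comment (1024 recovery subops per second at about 7.8 ms each, target utilisation 1), `pick_sleep_us(8000000, 1024, 1, 1.0, 10)` gives a delay of roughly 55 ms, which corresponds to the target latency of about 0.0625 s mentioned there minus the measured latency.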
|
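The tail of `tune_recovery()` smooths the computed delay with a fixed-size ring buffer of the last `recovery_tune_agg_interval` samples, keeping a running total so that each update is O(1). A sketch of that moving average, assuming a hypothetical wrapper type `sleep_window_t` (the diff stores the same state as separate `osd_t` members: `recovery_target_sleep_items`, `_total`, `_cur` and `_count`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper around the ring buffer state used by tune_recovery()
struct sleep_window_t
{
    std::vector<uint64_t> items; // last N samples, zero-filled initially
    uint64_t total = 0;          // sum of all samples currently in items
    size_t cur = 0;              // slot that the next sample overwrites
    size_t count = 0;            // number of real samples, capped at N

    explicit sleep_window_t(size_t size): items(size, 0)
    {
        assert(size > 0);
    }

    // Replaces the oldest sample with sleep_us and returns the new average
    uint64_t push(uint64_t sleep_us)
    {
        total -= items[cur];
        items[cur] = sleep_us;
        cur = (cur + 1) % items.size();
        total += sleep_us;
        if (count < items.size())
            count++;
        return total / count;
    }
};
```

Dividing by `count` rather than by the window size means the first few averages are not dragged towards zero by the empty slots.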
@ -34,6 +34,7 @@
|
|||
#define OSD_OP_MAX 18
|
||||
#define OSD_RW_MAX 64*1024*1024
|
||||
#define OSD_PROTOCOL_VERSION 1
|
||||
#define OSD_OP_RECOVERY_RELATED (uint32_t)1
|
||||
|
||||
// Memory alignment for direct I/O (usually 512 bytes)
|
||||
#ifndef DIRECT_IO_ALIGNMENT
|
||||
|
@ -88,7 +89,8 @@ struct __attribute__((__packed__)) osd_op_sec_rw_t
|
|||
uint32_t len;
|
||||
// bitmap/attribute length - bitmap comes after header, but before data
|
||||
uint32_t attr_len;
|
||||
uint32_t pad0;
|
||||
// the only possible flag is OSD_OP_RECOVERY_RELATED
|
||||
uint32_t flags;
|
||||
};
|
||||
|
||||
struct __attribute__((__packed__)) osd_reply_sec_rw_t
|
||||
|
@ -109,6 +111,9 @@ struct __attribute__((__packed__)) osd_op_sec_del_t
|
|||
object_id oid;
|
||||
// delete version (automatic or specific)
|
||||
uint64_t version;
|
||||
// the only possible flag is OSD_OP_RECOVERY_RELATED
|
||||
uint32_t flags;
|
||||
uint32_t pad0;
|
||||
};
|
||||
|
||||
struct __attribute__((__packed__)) osd_reply_sec_del_t
|
||||
|
@ -121,6 +126,9 @@ struct __attribute__((__packed__)) osd_reply_sec_del_t
|
|||
struct __attribute__((__packed__)) osd_op_sec_sync_t
|
||||
{
|
||||
osd_op_header_t header;
|
||||
// the only possible flag is OSD_OP_RECOVERY_RELATED
|
||||
uint32_t flags;
|
||||
uint32_t pad0;
|
||||
};
|
||||
|
||||
struct __attribute__((__packed__)) osd_reply_sec_sync_t
|
||||
|
@ -134,6 +142,9 @@ struct __attribute__((__packed__)) osd_op_sec_stab_t
|
|||
osd_op_header_t header;
|
||||
// obj_ver_id array length in bytes
|
||||
uint64_t len;
|
||||
// the only possible flag is OSD_OP_RECOVERY_RELATED
|
||||
uint32_t flags;
|
||||
uint32_t pad0;
|
||||
};
|
||||
typedef osd_op_sec_stab_t osd_op_sec_rollback_t;
|
||||
|
||||
|
|
|
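The protocol hunks above add a `flags` field to the secondary request structs, with `OSD_OP_RECOVERY_RELATED` as the only defined bit. In `osd_op_sec_rw_t` the field takes the place of the former `pad0`, so that struct keeps its size. The sketch below illustrates this with hypothetical trimmed-down structs (`demo_rw_tail_old_t`, `demo_rw_tail_new_t`) and hypothetical helpers (`make_recovery_flags()`, `has_recovery_flag()`). The real structs contain more fields, and the real `is_recovery_related()` is not shown in this diff, so the receiver-side check here is an assumption. `__attribute__((__packed__))` is a GCC/Clang extension, as in the original header.

```cpp
#include <cstdint>

#define OSD_OP_RECOVERY_RELATED (uint32_t)1

// Hypothetical tails of the read/write request before and after the change
struct __attribute__((__packed__)) demo_rw_tail_old_t
{
    uint32_t len;
    uint32_t attr_len;
    uint32_t pad0;
};

struct __attribute__((__packed__)) demo_rw_tail_new_t
{
    uint32_t len;
    uint32_t attr_len;
    uint32_t flags; // reuses the slot previously occupied by pad0
};

// Reusing the padding slot leaves the wire size unchanged
static_assert(sizeof(demo_rw_tail_old_t) == sizeof(demo_rw_tail_new_t), "size must not change");

// Sender side: same condition as in the diff, where from_self stands for
// cur_op->peer_fd == SELF_FD and is_scrub for opcode == OSD_OP_SCRUB
uint32_t make_recovery_flags(bool from_self, bool is_scrub)
{
    return from_self && !is_scrub ? OSD_OP_RECOVERY_RELATED : 0;
}

// Receiver side (assumed): test the bit instead of comparing for equality,
// so that flags added later would not break the check
bool has_recovery_flag(uint32_t flags)
{
    return (flags & OSD_OP_RECOVERY_RELATED) != 0;
}
```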
@ -3,13 +3,15 @@
|
|||
|
||||
#include "osd_primary.h"
|
||||
|
||||
#define SELF_FD -1
|
||||
|
||||
void osd_t::autosync()
|
||||
{
|
||||
if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
|
||||
{
|
||||
autosync_op = new osd_op_t();
|
||||
autosync_op->op_type = OSD_OP_IN;
|
||||
autosync_op->peer_fd = -1;
|
||||
autosync_op->peer_fd = SELF_FD;
|
||||
autosync_op->req = (osd_any_op_t){
|
||||
.sync = {
|
||||
.header = {
|
||||
|
@ -85,9 +87,13 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
|
|||
cur_op->reply.hdr.id = cur_op->req.hdr.id;
|
||||
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
|
||||
cur_op->reply.hdr.retval = retval;
|
||||
if (cur_op->peer_fd == -1)
|
||||
if (cur_op->peer_fd == SELF_FD)
|
||||
{
|
||||
msgr.measure_exec(cur_op);
|
||||
// Do not include internal primary writes (recovery/rebalance) into client op statistics
|
||||
if (cur_op->req.hdr.opcode != OSD_OP_WRITE)
|
||||
{
|
||||
msgr.measure_exec(cur_op);
|
||||
}
|
||||
// Copy lambda to be unaffected by `delete op`
|
||||
std::function<void(osd_op_t*)>(cur_op->callback)(cur_op);
|
||||
}
|
||||
|
@ -215,6 +221,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
|
|||
.offset = wr ? si->write_start : si->read_start,
|
||||
.len = subop_len,
|
||||
.attr_len = wr ? clean_entry_bitmap_size : 0,
|
||||
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
|
||||
};
|
||||
#ifdef OSD_DEBUG
|
||||
printf(
|
||||
|
@ -294,7 +301,8 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
|
|||
" retval = "+std::to_string(bs_op->retval)+")"
|
||||
);
|
||||
}
|
||||
add_bs_subop_stats(subop);
|
||||
bool recovery_related = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB;
|
||||
add_bs_subop_stats(subop, recovery_related);
|
||||
subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
|
||||
subop->reply.hdr.retval = bs_op->retval;
|
||||
if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
|
||||
|
@ -306,30 +314,33 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
|
|||
}
|
||||
delete bs_op;
|
||||
subop->bs_op = NULL;
|
||||
subop->peer_fd = -1;
|
||||
handle_primary_subop(subop, cur_op);
|
||||
subop->peer_fd = SELF_FD;
|
||||
if (recovery_related && recovery_target_sleep_us)
|
||||
{
|
||||
tfd->set_timer_us(recovery_target_sleep_us, false, [=](int timer_id)
|
||||
{
|
||||
handle_primary_subop(subop, cur_op);
|
||||
});
|
||||
}
|
||||
else
|
||||
{
|
||||
handle_primary_subop(subop, cur_op);
|
||||
}
|
||||
}
|
||||
|
||||
void osd_t::add_bs_subop_stats(osd_op_t *subop)
|
||||
void osd_t::add_bs_subop_stats(osd_op_t *subop, bool recovery_related)
|
||||
{
|
||||
// Include local blockstore ops in statistics
|
||||
uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
|
||||
timespec tv_end;
|
||||
clock_gettime(CLOCK_REALTIME, &tv_end);
|
||||
msgr.stats.op_stat_count[opcode]++;
|
||||
if (!msgr.stats.op_stat_count[opcode])
|
||||
uint64_t len = (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
|
||||
? subop->bs_op->len : 0;
|
||||
msgr.inc_op_stats(msgr.stats, opcode, subop->tv_begin, tv_end, len);
|
||||
if (recovery_related)
|
||||
{
|
||||
msgr.stats.op_stat_count[opcode] = 1;
|
||||
msgr.stats.op_stat_sum[opcode] = 0;
|
||||
msgr.stats.op_stat_bytes[opcode] = 0;
|
||||
}
|
||||
msgr.stats.op_stat_sum[opcode] += (
|
||||
(tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
|
||||
(tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
|
||||
);
|
||||
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
|
||||
{
|
||||
msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
|
||||
// It is OSD_OP_RECOVERY_RELATED
|
||||
msgr.inc_op_stats(msgr.recovery_stats, opcode, subop->tv_begin, tv_end, len);
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -552,6 +563,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
|
|||
},
|
||||
.oid = chunk.oid,
|
||||
.version = chunk.version,
|
||||
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
|
||||
} };
|
||||
subops[i].callback = [cur_op, this](osd_op_t *subop)
|
||||
{
|
||||
|
@ -609,6 +621,7 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
|
|||
.id = msgr.next_subop_id++,
|
||||
.opcode = OSD_OP_SEC_SYNC,
|
||||
},
|
||||
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
|
||||
} };
|
||||
subops[i].callback = [cur_op, this](osd_op_t *subop)
|
||||
{
|
||||
|
@ -668,6 +681,7 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
|
|||
.opcode = OSD_OP_SEC_STABILIZE,
|
||||
},
|
||||
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
|
||||
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
|
||||
} };
|
||||
subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
|
||||
subops[i].callback = [cur_op, this](osd_op_t *subop)
|
||||
|
|
|
@ -292,16 +292,26 @@ resume_7:
|
|||
{
|
||||
{
|
||||
int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
|
||||
recovery_stat_count[0][recovery_type]++;
|
||||
if (!recovery_stat_count[0][recovery_type])
|
||||
recovery_stat[recovery_type].count++;
|
||||
if (!recovery_stat[recovery_type].count) // wrapped
|
||||
{
|
||||
recovery_stat_count[0][recovery_type]++;
|
||||
recovery_stat_bytes[0][recovery_type] = 0;
|
||||
memset(&recovery_print_prev[recovery_type], 0, sizeof(recovery_print_prev[recovery_type]));
|
||||
memset(&recovery_stat[recovery_type], 0, sizeof(recovery_stat[recovery_type]));
|
||||
recovery_stat[recovery_type].count++;
|
||||
}
|
||||
for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
|
||||
{
|
||||
recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
|
||||
recovery_stat[recovery_type].bytes += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
|
||||
}
|
||||
if (!cur_op->tv_end.tv_sec)
|
||||
{
|
||||
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
|
||||
}
|
||||
uint64_t usec = (
|
||||
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
|
||||
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
|
||||
);
|
||||
recovery_stat[recovery_type].usec += usec;
|
||||
}
|
||||
// Any kind of a non-clean object can have extra chunks, because we don't record objects
|
||||
// as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
|
||||
|
|
|
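The recovery latency added to `recovery_stat[recovery_type].usec` above is computed from two `timespec` values as a seconds difference plus a nanoseconds difference. The nanoseconds term may be negative when the end time falls in a later second, and the sum is still correct as long as the arithmetic is signed. A small sketch, assuming a hypothetical helper `elapsed_usec()` that is not part of the diff:

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Returns end - begin in microseconds.
// Signed 64-bit arithmetic keeps a negative tv_nsec difference valid.
uint64_t elapsed_usec(const timespec & begin, const timespec & end)
{
    int64_t usec = ((int64_t)end.tv_sec - (int64_t)begin.tv_sec) * 1000000 +
        ((int64_t)end.tv_nsec - (int64_t)begin.tv_nsec) / 1000;
    assert(usec >= 0); // assumption of this sketch: end is not before begin
    return (uint64_t)usec;
}
```

For example, from 1.9 s to 2.1 s the seconds term contributes 1000000 and the nanoseconds term contributes -800000, giving 200000 us.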
@ -42,7 +42,21 @@ void osd_t::secondary_op_callback(osd_op_t *op)
|
|||
int retval = op->bs_op->retval;
|
||||
delete op->bs_op;
|
||||
op->bs_op = NULL;
|
||||
finish_op(op, retval);
|
||||
if (op->is_recovery_related() && recovery_target_sleep_us)
|
||||
{
|
||||
if (!op->tv_end.tv_sec)
|
||||
{
|
||||
clock_gettime(CLOCK_REALTIME, &op->tv_end);
|
||||
}
|
||||
tfd->set_timer_us(recovery_target_sleep_us, false, [this, op, retval](int timer_id)
|
||||
{
|
||||
finish_op(op, retval);
|
||||
});
|
||||
}
|
||||
else
|
||||
{
|
||||
finish_op(op, retval);
|
||||
}
|
||||
}
|
||||
|
||||
void osd_t::exec_secondary(osd_op_t *cur_op)
|
||||
|
|
|
@ -19,10 +19,10 @@ fi
|
|||
|
||||
if [ "$IMMEDIATE_COMMIT" != "" ]; then
|
||||
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
|
||||
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
|
||||
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
|
||||
else
|
||||
NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
|
||||
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
|
||||
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"client_enable_writeback":true}'
|
||||
fi
|
||||
|
||||
start_osd_on()
|
||||
|
@ -53,7 +53,7 @@ for i in $(seq 1 $OSD_COUNT); do
|
|||
start_osd $i
|
||||
done
|
||||
|
||||
(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
|
||||
(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
|
||||
MON_PID=$!
|
||||
|
||||
if [ "$SCHEME" = "ec" ]; then
|
||||
|
|
|
@ -18,6 +18,7 @@ try_change()
|
|||
for i in {1..6}; do
|
||||
echo --- Change PG count to $n --- >>testdata/osd$i.log
|
||||
done
|
||||
echo --- Change PG count to $n --- >>testdata/mon.log
|
||||
|
||||
$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}'
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$
|
|||
$ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
|
||||
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'
|
||||
|
||||
node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
|
||||
node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
|
||||
MON_PID=$!
|
||||
|
||||
sleep 2
|
||||
|
|
|
@ -7,7 +7,7 @@ OSD_COUNT=5
|
|||
OSD_ARGS="$OSD_ARGS"
|
||||
for i in $(seq 1 $OSD_COUNT); do
|
||||
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
|
||||
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
|
||||
build/src/vitastor-osd --log_level 10 --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
|
||||
eval OSD${i}_PID=$!
|
||||
done
|
||||
|
||||
|
@ -53,6 +53,11 @@ for i in {1..30}; do
|
|||
fi
|
||||
done
|
||||
|
||||
# Sync so all moved objects are removed from OSD 1 (they aren't removed without a sync)
|
||||
LD_PRELOAD="build/src/libfio_vitastor.so" \
|
||||
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 -number_ios=2 -rw=write \
|
||||
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
|
||||
|
||||
$ETCDCTL put /vitastor/config/pgs '{"items":{"1":{"1":{"osd_set":[4,5],"primary":0}}}}'
|
||||
|
||||
$ETCDCTL put /vitastor/pg/history/1/1 '{"all_peers":[1,2,3]}'
|
||||
|
|
|
@ -0,0 +1,54 @@
|
|||
#!/bin/bash -ex
|
||||
# Test changing EC 4+1 into EC 4+3
|
||||
|
||||
OSD_COUNT=7
|
||||
PG_COUNT=16
|
||||
SCHEME=ec
|
||||
PG_SIZE=5
|
||||
PG_DATA_SIZE=4
|
||||
PG_MINSIZE=5
|
||||
|
||||
. `dirname $0`/run_3osds.sh
|
||||
|
||||
try_change()
|
||||
{
|
||||
n=$1
|
||||
s=$2
|
||||
|
||||
for i in {1..10}; do
|
||||
($ETCDCTL get /vitastor/config/pgs --print-value-only |\
|
||||
jq -s -e '(.[0].items["1"] | map( ([ .osd_set[] | select(. != 0) ] | length) == '$s' ) | length == '$n')
|
||||
and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])') && \
|
||||
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n'') && \
|
||||
break
|
||||
sleep 1
|
||||
done
|
||||
|
||||
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
|
||||
jq -s -e '(.[0].items["1"] | map( ([ .osd_set[] | select(. != 0) ] | length) == '$s' ) | length == '$n')
|
||||
and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])'); then
|
||||
$ETCDCTL get /vitastor/config/pgs
|
||||
$ETCDCTL get --prefix /vitastor/pg/state/
|
||||
format_error "FAILED: PG SIZE NOT CHANGED OR SOME OSDS DO NOT HAVE PGS"
|
||||
fi
|
||||
|
||||
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n); then
|
||||
$ETCDCTL get /vitastor/config/pgs
|
||||
$ETCDCTL get --prefix /vitastor/pg/state/
|
||||
format_error "FAILED: PGS NOT UP AFTER PG SIZE CHANGE"
|
||||
fi
|
||||
}
|
||||
|
||||
LD_PRELOAD="build/src/libfio_vitastor.so" \
|
||||
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=1M -direct=1 -iodepth=4 \
|
||||
-rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -runtime=10
|
||||
|
||||
PG_SIZE=7
|
||||
POOLCFG='"name":"testpool","failure_domain":"osd","scheme":"ec","parity_chunks":'$((PG_SIZE-PG_DATA_SIZE))
|
||||
$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$PG_COUNT'}}'
|
||||
|
||||
sleep 2
|
||||
|
||||
try_change 16 7
|
||||
|
||||
format_green OK
|
|
@ -15,7 +15,7 @@ for i in $(seq 1 $OSD_COUNT); do
|
|||
eval OSD${i}_PID=$!
|
||||
done
|
||||
|
||||
(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
|
||||
(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
|
||||
MON_PID=$!
|
||||
|
||||
sleep 3
|
||||
|
|