Compare commits

2 Commits

Author SHA1 Message Date
d4ebbeaf5c WIP Auto-tune recovery speed
Some checks failed
Test / test_minsize_1 (push) Successful in 14s
Test / test_snapshot_ec (push) Successful in 29s
Test / test_move_reappear (push) Successful in 21s
Test / test_rm (push) Successful in 16s
Test / test_snapshot_down (push) Successful in 28s
Test / test_snapshot_down_ec (push) Successful in 31s
Test / test_splitbrain (push) Successful in 26s
Test / test_snapshot_chain (push) Successful in 2m29s
Test / test_snapshot_chain_ec (push) Successful in 3m4s
Test / test_rebalance_verify_imm (push) Successful in 3m2s
Test / test_rebalance_verify (push) Successful in 3m55s
Test / test_write (push) Successful in 34s
Test / test_write_no_same (push) Successful in 14s
Test / test_rebalance_verify_ec (push) Successful in 4m16s
Test / test_write_xor (push) Failing after 3m7s
Test / test_rebalance_verify_ec_imm (push) Successful in 4m43s
Test / test_heal_pg_size_2 (push) Successful in 3m28s
Test / test_heal_ec (push) Failing after 3m46s
Test / test_heal_csum_32k_dmj (push) Successful in 5m43s
Test / test_heal_csum_32k_dj (push) Successful in 5m42s
Test / test_heal_csum_32k (push) Successful in 6m6s
Test / test_scrub (push) Successful in 1m9s
Test / test_scrub_zero_osd_2 (push) Successful in 1m13s
Test / test_heal_csum_4k_dmj (push) Successful in 6m30s
Test / test_scrub_xor (push) Successful in 48s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 1m14s
Test / test_scrub_pg_size_3 (push) Failing after 2m10s
Test / test_heal_csum_4k_dj (push) Successful in 5m54s
Test / test_heal_csum_4k (push) Successful in 5m52s
Test / test_scrub_ec (push) Successful in 39s
2023-12-14 01:11:57 +03:00
bf0c29a46c Track recovery op latencies + refactor into a structure 2023-12-14 01:11:57 +03:00
56 changed files with 349 additions and 4477 deletions

115
CLA-en.md

@@ -1,115 +0,0 @@
## Contributor License Agreement
> This Agreement is made in the Russian and English languages. **The English
text of Agreement is for informational purposes only** and is not binding
for the Parties.
>
> In the event of a conflict between the provisions of the Russian and
English versions of this Agreement, the **Russian version shall prevail**.
>
> Russian version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md
This document represents the offer of Filippov Vitaliy Vladimirovich
("Author"), author and copyright holder of Vitastor software ("Program"),
acknowledged by a certificate of Federal Service for Intellectual
Property of Russian Federation (Rospatent) # 2021617829 dated 20 May 2021,
to "Contributors" to conclude this license agreement as follows
("Agreement" or "Offer").
In accordance with Art. 435, Art. 438 of the Civil Code of the Russian
Federation, this Agreement is an offer and in case of acceptance of the
offer, an agreement is considered concluded on the conditions specified
in the offer.
1. Applicable Terms. \
1.1. "Official Repository" shall mean the computer storage, operated by
the Author, containing all prior and future versions of the Source
Code of the Program, at Internet addresses https://git.yourcmc.ru/vitalif/vitastor/
or https://github.com/vitalif/vitastor/. \
1.2. "Contributions" shall mean results of intellectual activity
(including, but not limited to, source code, libraries, components,
texts, documentation) which can be software or elements of the software
and which are provided by Contributors to the Author for inclusion
in the Program. \
1.3. "Contributor" shall mean a person who provides Contributions to
the Author and agrees with all provisions of this Agreement.
A Contributor can be: 1) an individual; or 2) a legal entity or an
individual entrepreneur in case when an individual provides Contributions
on behalf of third parties, including on behalf of his employer.
2. Subject of the Agreement. \
2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors.
2.2. The Contributor grants to the Author the right to use Contributions at his own
discretion and without any necessity to get a prior approval from Contributor or
any other third party in any way, under a simple (non-exclusive), royalty-free,
irrevocable license throughout the world by all means not contrary to law, in whole
or as a part of the Program, or other open-source or closed-source computer programs,
products or services (hereinafter -- the "License"), including, but not limited to: \
2.2.1. to execute Contributions and use them for any tasks; \
2.2.2. to publish and distribute Contributions in modified or unmodified form and/or to rent them; \
2.2.3. to modify Contributions, add comments, illustrations or any explanations to Contributions while using them; \
2.2.4. to create other results of intellectual activity based on Contributions, including derivative works and composite works; \
2.2.5. to translate Contributions into other languages, including other programming languages; \
2.2.6. to carry out rental and public display of Contributions; \
2.2.7. to use Contributions under the trade name and/or any trademark or any other label, or without it, as the Author thinks fit; \
2.3. The Contributor grants to the Author the right to sublicense any of the aforementioned
rights to third parties on any terms at the Author's discretion. \
2.4. The License is provided for the entire duration of Contributor's
exclusive intellectual property rights to the Contributions. \
2.5. The Contributor grants to the Author the right to decide how and where to mention,
or to not mention at all, the fact of his authorship, name, nickname and/or company
details when including Contributions into the Program or in any other computer
programs, products or services.
3. Acceptance of the Offer \
3.1. The Contributor may provide Contributions to the Author in the form of
a "Pull Request" in an Official Repository of the Program or by any
other electronic means of communication, including, but not limited to,
E-mail or messenger applications. \
3.2. The acceptance of the Offer shall be the fact of provision of Contributions
to the Author by the Contributor by any means with the following remark:
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”. \
3.3. Date of acceptance of the Offer shall be the date of such provision.
4. Rights and obligations of the parties. \
4.1. The Contributor reserves the right to use Contributions by any lawful means
not contrary to this Agreement. \
4.2. The Author has the right to refuse to include Contributions into the Program
at any moment with no explanation to the Contributor.
5. Representations and Warranties. \
5.1. The person providing Contributions for the purpose of their inclusion
in the Program represents and warrants that he is the Contributor
or legally acts on the Contributor's behalf. Name or company details
of the Contributor shall be provided with the Contribution at the moment
of their provision to the Author. \
5.2. The Contributor represents and warrants that he legally owns exclusive
intellectual property rights to the Contributions. \
5.3. The Contributor represents and warrants that any further use of \
Contributions by the Author as provided by Contributor under the terms
of the Agreement does not infringe on intellectual and other rights and
legitimate interests of third parties. \
5.4. The Contributor represents and warrants that he has all rights and legal
capacity needed to accept this Offer; \
5.5. The Contributor represents and warrants that Contributions don't
contain malware or any information considered illegal under the law
of Russian Federation.
6. Termination of the Agreement \
6.1. The Agreement may be terminated at will of both Author and Contributor,
formalised in the written form or if the Agreement is terminated on
reasons prescribed by the law of Russian Federation.
7. Final Clauses \
7.1. The Contributor may optionally sign the Agreement in the written form. \
7.2. The Agreement is deemed to become effective from the Date of signing of
the Agreement and until the expiration of Contributor's exclusive
intellectual property rights to the Contributions. \
7.3. The Author may unilaterally alter the Agreement without informing Contributors.
The new version of the document shall come into effect 3 (three) days after
being published in the Official Repository of the Program at Internet address
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md).
Contributors should keep informed about the actual version of the Agreement themselves. \
7.4. If the Author and the Contributor fail to agree on disputable issues,
disputes shall be referred to the Moscow Arbitration court.

108
CLA-ru.md

@@ -1,108 +0,0 @@
## Лицензионное соглашение с участником
> Данная Оферта написана в Русской и Английской версиях. **Версия на английском
языке предоставляется в информационных целях** и не связывает стороны договора.
>
> В случае несоответствий между положениями Русской и Английской версий Договора,
**Русская версия имеет приоритет**.
>
> Английская версия опубликована по адресу https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
Настоящий договор-оферта (далее по тексту Оферта, Договор) адресована физическим
и юридическим лицам (далее Участникам) и является официальным публичным предложением
Филиппова Виталия Владимировича (далее Автора) программного обеспечения Vitastor,
свидетельство Федеральной службы по интеллектуальной собственности (Роспатент) № 2021617829
от 20 мая 2021 г. (далее Программа) о нижеследующем:
1. Термины и определения \
1.1. Репозиторий электронное хранилище, содержащее исходный код Программы. \
1.2. Доработка результат интеллектуальной деятельности Участника, включающий
в себя изменения или дополнения к исходному коду Программы, которые Участник
желает включить в состав Программы для дальнейшего использования и распространения
Автором и для этого направляет их Автору. \
1.3. Участник физическое или юридическое лицо, вносящее Доработки в код Программы. \
1.4. ГК РФ Гражданский кодекс Российской Федерации.
2. Предмет оферты \
2.1. Предметом настоящей оферты являются Доработки, отправляемые Участником Автору. \
2.2. Участник предоставляет Автору право использовать Доработки по собственному усмотрению
и без необходимости предварительного согласования с Участником или иным третьим лицом
на условиях простой (неисключительной) безвозмездной безотзывной лицензии, полностью
или фрагментарно, в составе Программы или других программ, продуктов или сервисов
как с открытым, так и с закрытым исходным кодом, любыми способами, не противоречащими
закону, включая, но не ограничиваясь следующими: \
2.2.1. Запускать и использовать Доработки для выполнения любых задач; \
2.2.2. Распространять, импортировать и доводить Доработки до всеобщего сведения; \
2.2.3. Вносить в Доработки изменения, сокращения и дополнения, снабжать Доработки
при их использовании комментариями, иллюстрациями или пояснениями; \
2.2.4. Создавать на основе Доработок иные результаты интеллектуальной деятельности,
в том числе производные и составные произведения; \
2.2.5. Переводить Доработки на другие языки, в том числе на другие языки программирования; \
2.2.6. Осуществлять прокат и публичный показ Доработок; \
2.2.7. Использовать Доработки под любым фирменным наименованием, товарным знаком
(знаком обслуживания) или иным обозначением, или без такового. \
2.3. Участник предоставляет Автору право сублицензировать полученные права на Доработки
третьим лицам на любых условиях на усмотрение Автора. \
2.4. Участник предоставляет Автору права на Доработки на территории всего мира. \
2.5. Участник предоставляет Автору права на весь срок действия исключительного права
Участника на Доработки. \
2.6. Участник предоставляет Автору права на Доработки на безвозмездной основе. \
2.7. Участник разрешает Автору самостоятельно определять порядок, способ и
место указания его имени, реквизитов и/или псевдонима при включении
Доработок в состав Программы или других программ, продуктов или сервисов.
3. Акцепт Оферты \
3.1. Участник может передавать Доработки в адрес Автора через зеркала официального
Репозитория Программы по адресам https://git.yourcmc.ru/vitalif/vitastor/ или
https://github.com/vitalif/vitastor/ в виде “запроса на слияние” (pull request),
либо в письменном виде или с помощью любых других электронных средств коммуникации,
например, электронной почты или мессенджеров. \
3.2. Факт передачи Участником Доработок в адрес Автора любым способом с одной из пометок
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
или “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”
является полным и безоговорочным акцептом (принятием) Участником условий настоящей
Оферты, т.е. Участник считается ознакомившимся с настоящим публичным договором и
в соответствии с ГК РФ признается лицом, вступившим с Автором в договорные отношения
на основании настоящей Оферты. \
3.3. Датой акцептирования настоящей Оферты считается дата такой передачи.
4. Права и обязанности Сторон \
4.1. Участник сохраняет за собой право использовать Доработки любым законным
способом, не противоречащим настоящему Договору. \
4.2. Автор вправе отказать Участнику во включении Доработок в состав
Программы без объяснения причин в любой момент по своему усмотрению.
5. Гарантии и заверения \
5.1. Лицо, направляющее Доработки для целей их включения в состав Программы,
гарантирует, что является Участником или представителем Участника. Имя или реквизиты
Участника должны быть указаны при их передаче в адрес Автора Программы. \
5.2. Участник гарантирует, что является законным обладателем исключительных прав
на Доработки. \
5.3. Участник гарантирует, что на момент акцептирования настоящей Оферты ему
ничего не известно (и не могло быть известно) о правах третьих лиц на
передаваемые Автору Доработки или их часть, которые могут быть нарушены
в связи с передачей Доработок по настоящему Договору. \
5.4. Участник гарантирует, что является дееспособным лицом и обладает всеми
необходимыми правами для заключения Договора. \
5.5. Участник гарантирует, что Доработки не содержат вредоносного ПО, а также
любой другой информации, запрещённой к распространению по законам Российской
Федерации.
6. Прекращение действия оферты \
6.1. Действие настоящего договора может быть прекращено по соглашению сторон,
оформленному в письменном виде, а также вследствие его расторжения по основаниям,
предусмотренным законом.
7. Заключительные положения \
7.1. Участник вправе по желанию подписать настоящий Договор в письменном виде. \
7.2. Настоящий договор действует с момента его заключения и до истечения срока
действия исключительных прав Участника на Доработки. \
7.3. Автор имеет право в одностороннем порядке вносить изменения и дополнения в договор
без специального уведомления об этом Участников. Новая редакция документа вступает
в силу через 3 (Три) календарных дня со дня опубликования в официальном Репозитории
Программы по адресу в сети Интернет
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md).
Участники самостоятельно отслеживают действующие условия Оферты. \
7.4. Все споры, возникающие между сторонами в процессе их взаимодействия по настоящему
договору, решаются путём переговоров. В случае невозможности урегулирования споров
переговорным порядком стороны разрешают их в Арбитражном суде г.Москвы.

View File

@@ -14,7 +14,6 @@ import (
"strconv"
"strings"
"syscall"
"time"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
@@ -33,7 +32,6 @@ type NodeServer struct
useVduse bool
stateDir string
mounter mount.Interface
restartInterval time.Duration
}
type DeviceState struct
@@ -67,16 +65,6 @@ func NewNodeServer(driver *Driver) *NodeServer
if (ns.useVduse)
{
ns.restoreVduseDaemons()
dur, err := time.ParseDuration(os.Getenv("RESTART_INTERVAL"))
if (err != nil)
{
dur = 10 * time.Second
}
ns.restartInterval = dur
if (ns.restartInterval != time.Duration(0))
{
go ns.restarter()
}
}
return ns
}
@@ -376,21 +364,6 @@ func (ns *NodeServer) unmapVduseById(vdpaId string)
}
}
func (ns *NodeServer) restarter()
{
// Restart dead VDUSE daemons at regular intervals
// Otherwise volume I/O may hang in case of a qemu-storage-daemon crash
// Moreover, it may lead to a kernel panic if the kernel is configured to
// panic on hung tasks
ticker := time.NewTicker(ns.restartInterval)
defer ticker.Stop()
for
{
<-ticker.C
ns.restoreVduseDaemons()
}
}
func (ns *NodeServer) restoreVduseDaemons()
{
pattern := ns.stateDir+"vitastor-vduse-*.json"

View File

@@ -6,8 +6,8 @@
# Client Parameters
These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
affect their interaction with the cluster.
These parameters apply only to clients and affect their interaction with
the cluster.
- [client_max_dirty_bytes](#client_max_dirty_bytes)
- [client_max_dirty_ops](#client_max_dirty_ops)

View File

@@ -6,7 +6,7 @@
# Параметры клиентского кода
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
затрагивают логику их работы с кластером.
- [client_max_dirty_bytes](#client_max_dirty_bytes)

View File

@@ -19,7 +19,6 @@ them, even without restarting by updating configuration in etcd.
- [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly)
@@ -52,13 +51,6 @@ them, even without restarting by updating configuration in etcd.
- [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
## etcd_report_interval
@@ -143,24 +135,12 @@ operations before issuing an fsync operation internally.
## recovery_queue_depth
- Type: integer
- Default: 1
- Default: 4
- Can be changed online: yes
Maximum recovery and rebalance operations initiated by each OSD in parallel.
Note that each OSD talks to a lot of other OSDs so actual number of parallel
recovery operations per each OSD is greater than just recovery_queue_depth.
Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
## recovery_sleep_us
- Type: microseconds
- Default: 0
- Can be changed online: yes
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
Maximum recovery operations per one primary OSD at any given moment of time.
Currently it's the only parameter available to tune the speed of recovery
and rebalancing, but more are planned.
## recovery_pg_switch
@@ -528,81 +508,3 @@ the variant with most available equal copies is correct. For example, if
you have 3 replicas and 1 of them differs, this one is considered to be
corrupted. But if there is no "best" version with more copies than all
others have then the object is also marked as inconsistent.
## recovery_tune_interval
- Type: seconds
- Default: 1
- Can be changed online: yes
Interval at which OSD re-considers client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. OSD calculates desired
recovery utilization from client utilization using linear interpolation
and auto-tunes recovery operation delay to make actual recovery utilization
match desired.
This reduces the impact of recovery/rebalance on client operations. It is
of course impossible to remove it completely, but it should become acceptable.
In some tests, rebalance previously dropped client write speed from 1.5 GB/s
to 50-100 MB/s; with the default auto-tuning settings it now drops only
to ~1 GB/s.
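The interpolation described for recovery_tune_interval can be sketched roughly as follows. This is an illustrative Go sketch under stated assumptions, not the OSD's actual code: the function names `utilization` and `desiredRecoveryUtil` are hypothetical, latency is assumed to be expressed in seconds, and the clamping outside the two client thresholds follows the descriptions of recovery_tune_util_low and recovery_tune_util_high.

```go
package main

import "fmt"

// utilization is the load measure from the text: iops multiplied by
// average latency (assumed to be in seconds), so it may exceed 1.
func utilization(iops, avgLatencySec float64) float64 {
	return iops * avgLatencySec
}

// desiredRecoveryUtil (hypothetical name) maps client utilization to a
// target recovery utilization by linear interpolation:
//   client load <= clientLow  -> recHigh (recovery_tune_util_high)
//   client load >= clientHigh -> recLow  (recovery_tune_util_low)
func desiredRecoveryUtil(clientUtil, clientLow, clientHigh, recHigh, recLow float64) float64 {
	if clientUtil <= clientLow {
		return recHigh
	}
	if clientUtil >= clientHigh {
		return recLow
	}
	// Fraction of the way from the "low" to the "high" client threshold.
	frac := (clientUtil - clientLow) / (clientHigh - clientLow)
	return recHigh + frac*(recLow-recHigh)
}

func main() {
	// Documented defaults: client_util_low=0, client_util_high=0.5,
	// util_high=1, util_low=0.1. The iops and latency are made-up inputs.
	client := utilization(1000, 0.00025)
	fmt.Printf("client=%.2f target recovery=%.2f\n",
		client, desiredRecoveryUtil(client, 0, 0.5, 1, 0.1))
}
```

With the default thresholds, an idle cluster gets the full recovery target of 1, and the target falls linearly to 0.1 as client utilization approaches 0.5.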
## recovery_tune_util_low
- Type: number
- Default: 0.1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
## recovery_tune_util_high
- Type: number
- Default: 1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Type: number
- Default: 0
- Can be changed online: yes
Client utilization considered "low".
## recovery_tune_client_util_high
- Type: number
- Default: 0.5
- Can be changed online: yes
Client utilization considered "high".
## recovery_tune_agg_interval
- Type: integer
- Default: 10
- Can be changed online: yes
The number of most recent auto-tuning iterations that are averaged to
calculate the delay. Lower values result in quicker response to client
load changes, higher values result in a more stable delay. The default value
of 10 is usually fine.
## recovery_tune_sleep_min_us
- Type: microseconds
- Default: 10
- Can be changed online: yes
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.
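One possible reading of recovery_tune_agg_interval and recovery_tune_sleep_min_us, taken together, is sketched below in Go. The `delaySmoother` type and `clampSleep` function are hypothetical names introduced only for illustration; how the OSD actually derives each per-iteration delay is not described here, so the sketch simply takes those values as input.

```go
package main

import "fmt"

// delaySmoother (hypothetical) keeps the last n per-iteration delays,
// where n plays the role of recovery_tune_agg_interval.
type delaySmoother struct {
	window []float64
	n      int
}

// add records a new delay (in microseconds) and returns the average
// over the retained window.
func (s *delaySmoother) add(delayUs float64) float64 {
	n := s.n
	if n < 1 {
		n = 1 // guard against an empty window
	}
	s.window = append(s.window, delayUs)
	if len(s.window) > n {
		s.window = s.window[len(s.window)-n:]
	}
	sum := 0.0
	for _, d := range s.window {
		sum += d
	}
	return sum / float64(len(s.window))
}

// clampSleep applies the recovery_tune_sleep_min_us rule: values below
// the minimum are replaced with 0.
func clampSleep(avgUs, minUs float64) float64 {
	if avgUs < minUs {
		return 0
	}
	return avgUs
}

func main() {
	s := &delaySmoother{n: 10}
	// Made-up sequence of per-iteration delays in microseconds.
	for _, d := range []float64{0, 0, 20, 200} {
		avg := s.add(d)
		fmt.Printf("raw=%v avg=%v sleep=%v\n", d, avg, clampSleep(avg, 10))
	}
}
```

A larger window damps short spikes in the computed delay, which matches the documented trade-off between responsiveness and stability.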

View File

@@ -20,7 +20,6 @@
- [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly)
@@ -53,13 +52,6 @@
- [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
## etcd_report_interval
@@ -146,25 +138,13 @@ OSD, чтобы успевать очищать журнал - без них OSD
## recovery_queue_depth
- Тип: целое число
- Значение по умолчанию: 1
- Значение по умолчанию: 4
- Можно менять на лету: да
Максимальное число параллельных операций восстановления, инициируемых одним
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
многими другими OSD, так что на практике параллелизм восстановления больше,
чем просто recovery_queue_depth. Увеличение значения этого параметра может
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
разрешает это или если он отключён.
## recovery_sleep_us
- Тип: микросекунды
- Значение по умолчанию: 0
- Можно менять на лету: да
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
Максимальное число операций восстановления на одном первичном OSD в любой
момент времени. На данный момент единственный параметр, который можно менять
для ускорения или замедления восстановления и перебалансировки данных, но
в планах реализация других параметров.
## recovery_pg_switch
@@ -555,83 +535,3 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным.
## recovery_tune_interval
- Тип: секунды
- Значение по умолчанию: 1
- Можно менять на лету: да
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
## recovery_tune_util_low
- Тип: число
- Значение по умолчанию: 0.1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
## recovery_tune_util_high
- Тип: число
- Значение по умолчанию: 1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Тип: число
- Значение по умолчанию: 0
- Можно менять на лету: да
Клиентская утилизация, которая считается "низкой".
## recovery_tune_client_util_high
- Тип: число
- Значение по умолчанию: 0.5
- Можно менять на лету: да
Клиентская утилизация, которая считается "высокой".
## recovery_tune_agg_interval
- Тип: целое число
- Значение по умолчанию: 10
- Можно менять на лету: да
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
## recovery_tune_sleep_min_us
- Тип: микросекунды
- Значение по умолчанию: 10
- Можно менять на лету: да
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.

View File

@@ -38,7 +38,6 @@ const types = {
bool: 'boolean',
int: 'integer',
sec: 'seconds',
float: 'number',
ms: 'milliseconds',
us: 'microseconds',
},
@@ -47,7 +46,6 @@ const types = {
bool: 'булево (да/нет)',
int: 'целое число',
sec: 'секунды',
float: 'число',
ms: 'миллисекунды',
us: 'микросекунды',
},

View File

@@ -107,29 +107,17 @@
принудительной отправкой fsync-а.
- name: recovery_queue_depth
type: int
default: 1
default: 4
online: true
info: |
Maximum recovery and rebalance operations initiated by each OSD in parallel.
Note that each OSD talks to a lot of other OSDs so actual number of parallel
recovery operations per each OSD is greater than just recovery_queue_depth.
Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
Maximum recovery operations per one primary OSD at any given moment of time.
Currently it's the only parameter available to tune the speed of recovery
and rebalancing, but more are planned.
info_ru: |
Максимальное число параллельных операций восстановления, инициируемых одним
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
многими другими OSD, так что на практике параллелизм восстановления больше,
чем просто recovery_queue_depth. Увеличение значения этого параметра может
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
разрешает это или если он отключён.
- name: recovery_sleep_us
type: us
default: 0
online: true
info: |
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
Максимальное число операций восстановления на одном первичном OSD в любой
момент времени. На данный момент единственный параметр, который можно менять
для ускорения или замедления восстановления и перебалансировки данных, но
в планах реализация других параметров.
- name: recovery_pg_switch
type: int
default: 128
@@ -638,101 +626,3 @@
считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным.
- name: recovery_tune_interval
type: sec
default: 1
online: true
info: |
Interval at which OSD re-considers client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. OSD calculates desired
recovery utilization from client utilization using linear interpolation
and auto-tunes recovery operation delay to make actual recovery utilization
match desired.
This reduces the impact of recovery/rebalance on client operations. It is
of course impossible to remove it completely, but it should become acceptable.
In some tests, rebalance previously dropped client write speed from 1.5 GB/s
to 50-100 MB/s; with the default auto-tuning settings it now drops only
to ~1 GB/s.
info_ru: |
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
- name: recovery_tune_util_low
type: float
default: 0.1
online: true
info: |
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
- name: recovery_tune_util_high
type: float
default: 1
online: true
info: |
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
- name: recovery_tune_client_util_low
type: float
default: 0
online: true
info: Client utilization considered "low".
info_ru: Клиентская утилизация, которая считается "низкой".
- name: recovery_tune_client_util_high
type: float
default: 0.5
online: true
info: Client utilization considered "high".
info_ru: Клиентская утилизация, которая считается "высокой".
- name: recovery_tune_agg_interval
type: int
default: 10
online: true
info: |
The number of most recent auto-tuning iterations over which the delay is
averaged. Lower values respond faster to client load changes, higher
values give a more stable delay. The default value of 10 is usually fine.
info_ru: |
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
- name: recovery_tune_sleep_min_us
type: us
default: 10
online: true
info: |
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.
info_ru: |
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.
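
The interpolation described for recovery_tune_interval can be sketched roughly as follows. This is only an illustration of the documented idea, not the OSD's actual code: the function names and the `cfg` field names are hypothetical, and only the default values are taken from the parameter descriptions above.

```javascript
// Hypothetical sketch: utilization is ops per second multiplied by the
// average latency in seconds, so it may exceed 1.
function utilization(opsPerSec, avgLatencySec)
{
    return opsPerSec * avgLatencySec;
}

// Hypothetical sketch: desired recovery utilization as a linear function of
// client utilization, clamped at the "low" and "high" client levels.
function targetRecoveryUtil(clientUtil, cfg)
{
    if (clientUtil <= cfg.clientUtilLow)
        return cfg.utilHigh; // client load is low, recovery may run at full target
    if (clientUtil >= cfg.clientUtilHigh)
        return cfg.utilLow; // client load is high, recovery is throttled
    const t = (clientUtil - cfg.clientUtilLow) / (cfg.clientUtilHigh - cfg.clientUtilLow);
    return cfg.utilHigh + (cfg.utilLow - cfg.utilHigh) * t;
}

// Defaults from the parameter list above
const tuneDefaults = {
    utilLow: 0.1,        // recovery_tune_util_low
    utilHigh: 1,         // recovery_tune_util_high
    clientUtilLow: 0,    // recovery_tune_client_util_low
    clientUtilHigh: 0.5, // recovery_tune_client_util_high
};
```

With these defaults, a client utilization halfway between the two levels (0.25) maps to a recovery target halfway between 1 and 0.1. How the OSD then converts the target into recovery_sleep_us is not covered by this sketch.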

View File

@@ -25,7 +25,7 @@ vitastor: vitastor
vitastor_pool testpool
# path to the configuration file
vitastor_config_path /etc/vitastor/vitastor.conf
# etcd address(es), OPTIONAL, required only if missing in the configuration file
# etcd address(es), required only if missing in the configuration file
vitastor_etcd_address 192.168.7.2:2379/v3
# prefix for keys in etcd
vitastor_etcd_prefix /vitastor

View File

@@ -24,7 +24,7 @@ vitastor: vitastor
vitastor_pool testpool
# Путь к файлу конфигурации
vitastor_config_path /etc/vitastor/vitastor.conf
# Адрес(а) etcd, ОПЦИОНАЛЬНЫ, нужны, только если не указаны в vitastor.conf
# Адрес(а) etcd, нужны, только если не указаны в vitastor.conf
vitastor_etcd_address 192.168.7.2:2379/v3
# Префикс ключей метаданных в etcd
vitastor_etcd_prefix /vitastor

View File

@@ -32,7 +32,6 @@
- [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.en.md#data_csum_type)
- [Client write-back cache](../config/client.en.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
## Plugins and tools

View File

@@ -34,7 +34,6 @@
- [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
- [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
- [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
- [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)
## Драйверы и инструменты

View File

@@ -14,13 +14,10 @@ Vitastor has a fio driver which can be installed from the package vitastor-fio.
Use the following command as an example to run tests with fio against a Vitastor cluster:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
If you don't want to access your image by name, you can specify pool number, inode number and size
(`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
You can also specify etcd address(es) explicitly by adding `-etcd=10.115.0.10:2379/v3`, or you
can override configuration file path by adding `-conf=/etc/vitastor/vitastor.conf`.
See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#fio-commands).
See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#команды-fio).

View File

@@ -14,13 +14,10 @@
Используйте следующую команду как пример для запуска тестов кластера Vitastor через fio:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
Вместо обращения к образу по имени (`-image=testimg`) можно указать номер пула, номер инода и размер:
`-pool=1 -inode=1 -size=400G`.
Вы также можете задать адрес(а) подключения к etcd явно, добавив `-etcd=10.115.0.10:2379/v3`,
или переопределить путь к файлу конфигурации, добавив `-conf=/etc/vitastor/vitastor.conf`.
Конкретные команды fio для тестирования производительности можно посмотреть [здесь](../performance/understanding.ru.md#команды-fio).

View File

@@ -34,7 +34,7 @@ vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
--foreground 1 stay in foreground, do not daemonize
```
Example start and mount commands (etcd_address is optional):
Example start and mount commands:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool

View File

@@ -33,7 +33,7 @@ vitastor-nfs [СТАНДАРТНЫЕ ОПЦИИ] [ДРУГИЕ ОПЦИИ]
--foreground 1 не уходить в фон после запуска
```
Пример монтирования Vitastor через NFS (etcd_address необязателен):
Пример монтирования Vitastor через NFS:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool

View File

@@ -16,16 +16,13 @@ Old syntax (-drive):
```
qemu-system-x86_64 -enable-kvm -m 1024 \
-drive 'file=vitastor:image=debian9',
-drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9',
format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
id=virtio-disk0,bootindex=1,write-cache=off' \
-vnc 0.0.0.0:0
```
Etcd address may be specified explicitly by adding `:etcd_host=192.168.7.2\:2379/v3` to `file=`.
Configuration file path may be overridden by adding `:config_path=/etc/vitastor/vitastor.conf`.
New syntax (-blockdev):
```
@@ -53,12 +50,12 @@ You can also specify inode ID, pool and size manually instead of `:image=<IMAGE>
## qemu-img
For qemu-img, you should use `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` as filename.
For qemu-img, you should use `vitastor:etcd_host=<HOST>:image=<IMAGE>` as filename.
For example, to upload a VM image into Vitastor, run:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=debian10'
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian10'
```
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
@@ -75,10 +72,10 @@ the snapshot separately using the following commands (key points are using `skip
`-B backing_file` option):
```
qemu-img convert -f raw 'vitastor:image=testimg@0' \
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \
-O qcow2 testimg_0.qcow2
qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \
-O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
```

View File

@@ -18,16 +18,13 @@
```
qemu-system-x86_64 -enable-kvm -m 1024 \
-drive 'file=vitastor:image=debian9',
-drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9',
format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
id=virtio-disk0,bootindex=1,write-cache=off' \
-vnc 0.0.0.0:0
```
Адрес подключения etcd можно задать явно, если добавить `:etcd_host=192.168.7.2\:2379/v3` к `file=`.
Путь к файлу конфигурации можно переопределить, добавив `:config_path=/etc/vitastor/vitastor.conf`.
Новый синтаксис (-blockdev):
```
@@ -55,12 +52,12 @@ qemu-system-x86_64 -enable-kvm -m 1024 \
## qemu-img
Для qemu-img используйте строку `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` в качестве имени файла диска.
Для qemu-img используйте строку `vitastor:etcd_host=<HOST>:image=<IMAGE>` в качестве имени файла диска.
Например, чтобы загрузить образ диска в Vitastor:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=testimg'
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
```
Если вы не хотите обращаться к образу по имени, вместо `:image=<IMAGE>` можно указать номер пула, номер инода и размер:
@@ -76,10 +73,10 @@ qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=testimg'
с помощью следующих команд (ключевые моменты - использование `skip-parents=1` и опции `-B backing_file.qcow2`):
```
qemu-img convert -f raw 'vitastor:image=testimg@0' \
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \
-O qcow2 testimg_0.qcow2
qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \
-O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
```

View File

@@ -3,7 +3,6 @@
module.exports = {
scale_pg_count,
scale_pg_history,
};
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
@@ -44,18 +43,16 @@ function finish_pg_history(merged_history)
merged_history.all_peers = Object.values(merged_history.all_peers);
}
function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{
const new_pg_history = [];
const old_pg_count = prev_pgs.length;
const new_pg_count = new_pgs.length;
const old_pg_count = real_prev_pgs.length;
// Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count))
{
// New PG count is a multiple of old PG count
for (let i = 0; i < new_pg_count; i++)
{
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
finish_pg_history(new_pg_history[i]);
}
}
@@ -67,7 +64,7 @@ function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
{
for (let j = 0; j < mul; j++)
{
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
}
finish_pg_history(new_pg_history[i]);
}
@@ -79,7 +76,7 @@ function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
let merged_history = {};
for (let i = 0; i < old_pg_count; i++)
{
add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
}
finish_pg_history(merged_history[1]);
for (let i = 0; i < new_pg_count; i++)
@@ -92,12 +89,6 @@ function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
{
new_pg_history[i] = null;
}
return new_pg_history;
}
function scale_pg_count(prev_pgs, new_pg_count)
{
const old_pg_count = prev_pgs.length;
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (prev_pgs.length < new_pg_count)
{
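
The index arithmetic used when building history for a changed PG count can be sketched as below. The helper name `intersectingOldPgs` is hypothetical, indices are 0-based, and the value of `mul` is assumed to be `old_pg_count / new_pg_count` since its definition is not visible in the hunks above.

```javascript
// Hypothetical sketch: which old PGs may share data with new PG `newIndex`
function intersectingOldPgs(newIndex, oldCount, newCount)
{
    if (newCount % oldCount === 0)
    {
        // New PG count is a multiple of the old one: each new PG comes from one old PG
        return [ newIndex % oldCount ];
    }
    if (oldCount % newCount === 0)
    {
        // Old PG count is a multiple of the new one: several old PGs are merged
        const mul = oldCount / newCount; // assumption, see the lead-in
        const result = [];
        for (let j = 0; j < mul; j++)
        {
            result.push(newIndex + j*newCount);
        }
        return result;
    }
    // Otherwise any old PG may intersect with any new PG
    return Array.from({ length: oldCount }, (_, i) => i);
}
```

For example, growing from 4 to 8 PGs maps new PG 5 to old PG 1, while shrinking from 8 to 4 maps new PG 1 to old PGs 1 and 5.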

View File

@@ -59,7 +59,6 @@ const etcd_tree = {
etcd_mon_timeout: 1000, // ms. min: 0
etcd_mon_retries: 5, // min: 0
mon_change_timeout: 1000, // ms. min: 100
mon_retry_change_timeout: 50, // ms. min: 10
mon_stats_timeout: 1000, // ms. min: 100
osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
@@ -113,12 +112,12 @@ const etcd_tree = {
client_queue_depth: 128, // unused
recovery_queue_depth: 1,
recovery_sleep_us: 0,
recovery_tune_util_low: 0.1,
recovery_tune_client_util_low: 0,
recovery_tune_util_high: 1.0,
recovery_tune_client_util_high: 0.5,
recovery_tune_min_util: 0.1,
recovery_tune_min_client_util: 0,
recovery_tune_max_util: 1.0,
recovery_tune_max_client_util: 0.5,
recovery_tune_interval: 1,
recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
recovery_tune_ewma_rate: 0.5,
recovery_tune_sleep_min_us: 10, // 10 microseconds
recovery_pg_switch: 128,
recovery_sync_batch: 16,
@@ -401,7 +400,7 @@ class Mon
this.parse_etcd_addresses(config.etcd_address||config.etcd_url);
this.verbose = config.verbose || 0;
this.initConfig = config;
this.config = { ...config };
this.config = {};
this.etcd_prefix = config.etcd_prefix || '/vitastor';
this.etcd_prefix = this.etcd_prefix.replace(/\/\/+/g, '/').replace(/^\/?(.*[^\/])\/?$/, '/$1');
this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
@@ -499,11 +498,6 @@ class Mon
{
this.config.mon_change_timeout = 100;
}
this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
if (this.config.mon_retry_change_timeout < 50)
{
this.config.mon_retry_change_timeout = 50;
}
this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
if (this.config.mon_stats_timeout < 100)
{
@@ -620,7 +614,7 @@ class Mon
console.log('etcd websocket timed out, restarting it');
this.restart_watcher(cur_addr);
}
}, (Number(this.config.etcd_ws_keepalive_interval) || 30)*1000);
}, (Number(this.config.etcd_keepalive_interval) || 30)*1000);
this.ws.on('error', () => this.restart_watcher(cur_addr));
this.ws.send(JSON.stringify({
create_request: {
@@ -1236,89 +1230,6 @@ class Mon
return aff_osds;
}
async generate_pool_pgs(pool_id, osd_tree, levels)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
return null;
}
let pool_tree = osd_tree[pool_cfg.root_node || ''];
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
this.filter_osds_by_block_layout(
pool_tree,
pool_cfg.block_size || this.config.block_size || 131072,
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
);
// First try last_clean_pgs to minimize data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
}
if (!prev_pgs.length)
{
// Fall back to config/pgs if it's empty
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
}
}
const old_pg_count = prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
ordered: pool_cfg.scheme != 'replicated',
};
let optimize_result;
// Re-shuffle PGs if config/pgs.hash is empty
if (old_pg_count > 0 && this.state.config.pgs.hash)
{
if (prev_pgs.length != pool_cfg.pg_count)
{
// Scale PG count
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
// because last_clean_pgs may still contain the old number of PGs
PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
}
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
return {
pool_id,
pgs: optimize_result.int_pgs,
stats: {
total_raw_tb: optimize_result.space,
pg_real_size: pg_effsize || pool_cfg.pg_size,
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
},
};
}
async recheck_pgs()
{
if (this.recheck_pgs_active)
@@ -1333,47 +1244,158 @@ class Mon
const { up_osds, levels, osd_tree } = this.get_osd_tree();
const tree_cfg = {
osd_tree,
levels,
pools: this.state.config.pools,
};
const tree_hash = sha1hex(stableStringify(tree_cfg));
if (this.state.config.pgs.hash != tree_hash)
{
// Something has changed
console.log('Pool configuration or OSD tree changed, re-optimizing');
// First re-optimize PGs, but don't look at history yet
const optimize_results = await Promise.all(Object.keys(this.state.config.pools)
.map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
// Then apply the modification in the form of an optimistic transaction,
// each time considering new pg/history modifications (OSDs modify it during rebalance)
while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
const etcd_request = { compare: [], success: [] };
for (const pool_id in (this.state.config.pgs||{}).items||{})
{
console.log(
'Someone changed PG configuration while we also tried to change it.'+
' Retrying in '+this.config.mon_retry_change_timeout+' ms'
);
// Failed to apply - parallel change detected. Wait a bit and retry
const old_rev = this.etcd_watch_revision;
while (this.etcd_watch_revision === old_rev)
if (!this.state.config.pools[pool_id])
{
await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout));
// Pool deleted. Delete all PGs, but first stop them.
if (!await this.stop_all_pgs(pool_id))
{
this.recheck_pgs_active = false;
this.schedule_recheck();
return;
}
const prev_pgs = [];
for (const pg in this.state.config.pgs.items[pool_id]||{})
{
prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
}
// Also delete pool statistics
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
}
const new_ot = this.get_osd_tree();
const new_tcfg = {
osd_tree: new_ot.osd_tree,
levels: new_ot.levels,
pools: this.state.config.pools,
};
if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
{
// Configuration actually changed, restart from the beginning
this.recheck_pgs_active = false;
setImmediate(() => this.recheck_pgs().catch(this.die));
return;
}
// Configuration didn't change, PG history probably changed, so just retry
}
console.log('PG configuration successfully changed');
for (const pool_id in this.state.config.pools)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
continue;
}
let pool_tree = osd_tree[pool_cfg.root_node || ''];
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
this.filter_osds_by_block_layout(
pool_tree,
pool_cfg.block_size || this.config.block_size || 131072,
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
);
// These are for the purpose of building history.osd_sets
const real_prev_pgs = [];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
}
// And these are for the purpose of minimizing data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
}
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
const old_pg_count = real_prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
ordered: pool_cfg.scheme != 'replicated',
};
let optimize_result;
if (old_pg_count > 0)
{
if (old_pg_count != pool_cfg.pg_count)
{
// PG count changed. Need to bring all PGs down.
if (!await this.stop_all_pgs(pool_id))
{
this.recheck_pgs_active = false;
this.schedule_recheck();
return;
}
}
if (prev_pgs.length != pool_cfg.pg_count)
{
// Scale PG count
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
// because last_clean_pgs may still contain the old number of PGs
const new_pg_history = [];
PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
pg_history = new_pg_history;
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
}
if (!this.state.config.pgs.hash)
{
// Re-shuffle PGs
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
else
{
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
if (old_pg_count != optimize_result.int_pgs.length)
{
console.log(
`PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
);
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
}
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
this.state.pool.stats[pool_id] = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
total_raw_tb: optimize_result.space,
pg_real_size: pg_effsize || pool_cfg.pg_size,
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
}
new_config_pgs.hash = tree_hash;
await this.save_pg_config(new_config_pgs, etcd_request);
}
else
{
@@ -1420,81 +1442,8 @@ class Mon
this.recheck_pgs_active = false;
}
async apply_pool_pgs(results, up_osds, osd_tree, tree_hash)
async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
{
for (const pool_id in (this.state.config.pgs||{}).items||{})
{
// We should stop all PGs when deleting a pool or changing its PG count
if (!this.state.config.pools[pool_id] ||
this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
{
if (!await this.stop_all_pgs(pool_id))
{
return false;
}
}
}
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
const etcd_request = { compare: [], success: [] };
for (const pool_id in (new_config_pgs||{}).items||{})
{
if (!this.state.config.pools[pool_id])
{
const prev_pgs = [];
for (const pg in new_config_pgs.items[pool_id]||{})
{
prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
}
// Also delete pool statistics
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
}
}
for (const pool_res of results)
{
const pool_id = pool_res.pool_id;
const pool_cfg = this.state.config.pools[pool_id];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
}
const real_prev_pgs = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
}
if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
{
console.log(
`Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
);
pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
}
const stats = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
...pool_res.stats,
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(stats)),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
}
new_config_pgs.hash = tree_hash;
etcd_request.compare.push(
{ key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
{ key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
@@ -1502,8 +1451,14 @@ class Mon
etcd_request.success.push(
{ requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
);
const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
return txn_res.succeeded;
const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
if (!res.succeeded)
{
console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
this.schedule_recheck();
return;
}
console.log('PG configuration successfully changed');
}
// Schedule next recheck at least at <unixtime>
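
The optimistic transaction used to save the PG configuration can be sketched as a plain request builder. This is a minimal illustration under assumptions: `buildPgConfigTxn` is a hypothetical name, `b64` is re-implemented here with Node's `Buffer`, and only the compare and put entries visible in the hunk above are reproduced.

```javascript
// Base64 helper (etcd's JSON API expects base64-encoded keys and values)
const b64 = (str) => Buffer.from(str).toString('base64');

// Hypothetical sketch: build the /kv/txn body that writes config/pgs only if
// this monitor still holds the master lease and config/pgs has not been
// modified at or after the revision we have already seen.
function buildPgConfigTxn(etcdPrefix, leaseId, watchRevision, newConfigPgs)
{
    return {
        compare: [
            { key: b64(etcdPrefix+'/mon/master'), target: 'LEASE', lease: ''+leaseId },
            { key: b64(etcdPrefix+'/config/pgs'), target: 'MOD', mod_revision: ''+watchRevision, result: 'LESS' },
        ],
        success: [
            { requestPut: { key: b64(etcdPrefix+'/config/pgs'), value: b64(JSON.stringify(newConfigPgs)) } },
        ],
    };
}
```

If the transaction response reports `succeeded` as false, one of the compares failed, which is the case both variants of the code above treat as a parallel change and retry.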

View File

@@ -110,6 +110,7 @@ sub properties
vitastor_etcd_address => {
description => 'IP address(es) of etcd.',
type => 'string',
format => 'pve-storage-portal-dns-list',
},
vitastor_etcd_prefix => {
description => 'Prefix for Vitastor etcd metadata',

View File

@@ -181,25 +181,6 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
# vitastor-kv
add_executable(vitastor-kv
kv_cli.cpp
kv_db.cpp
kv_db.h
)
target_link_libraries(vitastor-kv
vitastor_client
)
add_executable(vitastor-kv-stress
kv_stress.cpp
kv_db.cpp
kv_db.h
)
target_link_libraries(vitastor-kv-stress
vitastor_client
)
# vitastor-nfs
add_executable(vitastor-nfs
nfs_proxy.cpp

View File

@@ -8,7 +8,6 @@
#include <stdio.h>
#include <stdexcept>
#include <set>
#include "addr_util.h"
@@ -136,7 +135,7 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
}
}
std::set<std::string> addresses;
std::vector<std::string> addresses;
ifaddrs *list, *ifa;
if (getifaddrs(&list) == -1)
{
@@ -150,8 +149,7 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
}
int family = ifa->ifa_addr->sa_family;
if ((family == AF_INET || family == AF_INET6 && include_v6) &&
// Do not skip loopback addresses if the address filter is specified
(ifa->ifa_flags & (IFF_UP | IFF_RUNNING | (masks.size() ? 0 : IFF_LOOPBACK))) == (IFF_UP | IFF_RUNNING))
(ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
{
void *addr_ptr;
if (family == AF_INET)
@@ -184,11 +182,11 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
{
throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
}
addresses.insert(std::string(addr));
addresses.push_back(std::string(addr));
}
}
freeifaddrs(list);
return std::vector<std::string>(addresses.begin(), addresses.end());
return addresses;
}
int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port)
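
The two variants of the interface flag check in the hunk above differ only in how loopback interfaces are treated. The sketch below restates both in JavaScript for illustration. The function names are hypothetical and the flag values are the Linux ones from `<net/if.h>`.

```javascript
// Linux interface flag values
const IFF_UP = 0x1, IFF_LOOPBACK = 0x8, IFF_RUNNING = 0x40;

// Variant 1: always skip loopback interfaces
function isUsableSkipLoopback(flags)
{
    return (flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) === (IFF_UP | IFF_RUNNING);
}

// Variant 2: do not skip loopback interfaces if an address filter is specified
function isUsableWithFilter(flags, hasMasks)
{
    return (flags & (IFF_UP | IFF_RUNNING | (hasMasks ? 0 : IFF_LOOPBACK))) === (IFF_UP | IFF_RUNNING);
}
```

Including IFF_LOOPBACK in the mask while leaving it out of the expected value means "the bit must be clear". Dropping it from the mask makes the check ignore that bit.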

View File

@@ -277,7 +277,6 @@ class blockstore_impl_t
int unsynced_big_write_count = 0, unstable_unsynced = 0;
int unsynced_queued_ops = 0;
allocator *data_alloc = NULL;
uint64_t used_blocks = 0;
uint8_t *zero_object;
void *metadata_buffer = NULL;
@@ -431,7 +430,7 @@ public:
inline uint32_t get_block_size() { return dsk.data_block_size; }
inline uint64_t get_block_count() { return dsk.block_count; }
inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; }
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
inline uint64_t get_journal_size() { return dsk.journal_len; }
};

View File

@@ -376,7 +376,6 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
else
{
bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
bs->used_blocks++;
}
entries_loaded++;
#ifdef BLOCKSTORE_DEBUG
@@ -1182,7 +1181,6 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
sp -= bs->dsk.data_block_size;
else
bs->inode_space_stats.erase(oid.inode);
bs->used_blocks--;
}
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too

View File

@@ -445,7 +445,6 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
used_blocks++;
}
big_to_flush++;
}
@@ -456,7 +455,6 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
sp -= dsk.data_block_size;
else
inode_space_stats.erase(dirty_it->first.oid.inode);
used_blocks--;
big_to_flush++;
}
}

View File

@@ -6,7 +6,7 @@
#include "cluster_client_impl.h"
#include "http_client.h" // json_is_true
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config)
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{
wb = new writeback_cache_t();
@@ -359,8 +359,6 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_co
{
up_wait_retry_interval = 50;
}
// log_level
log_level = config["log_level"].uint64_value();
msgr.parse_config(config);
st_cli.parse_config(config);
st_cli.load_pgs();
@@ -534,7 +532,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
return;
}
if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) &&
!op->version /* no CAS writeback */)
!op->version /* FIXME no CAS writeback */)
{
if (wb->writebacks_active >= client_max_writeback_iodepth)
{
@@ -555,7 +553,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
}
if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
{
if (!(op->flags & OP_FLUSH_BUFFER) && !op->version /* no CAS write-repeat */)
if (!(op->flags & OP_FLUSH_BUFFER))
{
wb->copy_write(op, CACHE_WRITTEN);
}
@@ -705,8 +703,6 @@ resume_1:
}
goto resume_2;
}
// Protect from try_send completing the operation immediately
op->inflight_count++;
for (int i = 0; i < op->parts.size(); i++)
{
if (!(op->parts[i].flags & PART_SENT))
@@ -730,10 +726,8 @@ resume_1:
}
}
}
op->inflight_count--;
if (op->state == 1)
{
// Some suboperations have to be resent
return 0;
}
resume_2:
@@ -1153,15 +1147,12 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
if (op->retval != -EINTR && op->retval != -EIO && op->retval != -ENOSPC)
{
stop_fd = part->op.peer_fd;
if (op->retval != -EPIPE || log_level > 0)
{
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
}
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
}
else if (log_level > 0)
else
{
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d)\n",

View File

@@ -91,7 +91,7 @@ class cluster_client_t
uint64_t client_max_buffered_ops = 0;
uint64_t client_max_writeback_iodepth = 0;
int log_level = 0;
int log_level;
int up_wait_retry_interval = 500; // ms
int retry_timeout_id = 0;
@@ -121,7 +121,7 @@ public:
json11::Json::object cli_config, file_config, etcd_global_config;
json11::Json::object config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config);
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
~cluster_client_t();
void execute(cluster_op_t *op);
void execute_raw(osd_num_t osd_num, osd_op_t *op);


@@ -440,25 +440,16 @@ std::vector<std::string> disk_tool_t::get_new_data_parts(vitastor_dev_info_t & d
{
// Use this partition
use_parts.push_back(part["uuid"].string_value());
osds_exist++;
}
else
{
std::string part_path = "/dev/disk/by-partuuid/"+strtolower(part["uuid"].string_value());
bool is_meta = sb["params"]["meta_device"].string_value() == part_path;
bool is_journal = sb["params"]["journal_device"].string_value() == part_path;
bool is_data = sb["params"]["data_device"].string_value() == part_path;
fprintf(
stderr, "%s is already initialized for OSD %lu%s, skipping\n",
part["node"].string_value().c_str(), sb["params"]["osd_num"].uint64_value(),
(is_data ? " data" : (is_meta ? " meta" : (is_journal ? " journal" : "")))
stderr, "%s is already initialized for OSD %lu, skipping\n",
part["node"].string_value().c_str(), sb["params"]["osd_num"].uint64_value()
);
if (is_data || sb["params"]["data_device"].string_value().substr(0, 22) != "/dev/disk/by-partuuid/")
{
osds_size += part["size"].uint64_value()*dev.pt["sectorsize"].uint64_value();
osds_exist++;
}
osds_size += part["size"].uint64_value()*dev.pt["sectorsize"].uint64_value();
}
osds_exist++;
}
}
// Still create OSD(s) if a disk has no more than (max_other_percent) other data


@@ -333,7 +333,7 @@ void etcd_state_client_t::start_etcd_watcher()
etcd_watch_ws = NULL;
}
if (this->log_level > 1)
fprintf(stderr, "Trying to connect to etcd websocket at %s, watch from revision %lu\n", etcd_address.c_str(), etcd_watch_revision);
fprintf(stderr, "Trying to connect to etcd websocket at %s\n", etcd_address.c_str());
etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", etcd_slow_timeout,
[this, cur_addr = selected_etcd_address](const http_response_t *msg)
{
@@ -356,8 +356,8 @@ void etcd_state_client_t::start_etcd_watcher()
watch_id == ETCD_PG_HISTORY_WATCH_ID ||
watch_id == ETCD_OSD_STATE_WATCH_ID)
etcd_watches_initialised++;
if (etcd_watches_initialised == ETCD_TOTAL_WATCHES && this->log_level > 0)
fprintf(stderr, "Successfully subscribed to etcd at %s, revision %lu\n", cur_addr.c_str(), etcd_watch_revision);
if (etcd_watches_initialised == 4 && this->log_level > 0)
fprintf(stderr, "Successfully subscribed to etcd at %s\n", cur_addr.c_str());
}
if (data["result"]["canceled"].bool_value())
{
@@ -393,13 +393,9 @@ void etcd_state_client_t::start_etcd_watcher()
exit(1);
}
}
if (etcd_watches_initialised == ETCD_TOTAL_WATCHES && !data["result"]["header"]["revision"].is_null())
if (etcd_watches_initialised == 4)
{
// Protect against a revision being split into multiple messages and some
// of them being lost. Even though I'm not sure if etcd actually splits them
// Also sometimes etcd sends something without a header, like:
// {"error": {"grpc_code": 14, "http_code": 503, "http_status": "Service Unavailable", "message": "error reading from server: EOF"}}
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value()+1;
addresses_to_try.clear();
}
// First gather all changes into a hash to remove multiple overwrites
@@ -511,7 +507,7 @@ void etcd_state_client_t::start_ws_keepalive()
{
ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int)
{
if (!etcd_watch_ws || etcd_watches_initialised < ETCD_TOTAL_WATCHES)
if (!etcd_watch_ws)
{
// Do nothing
}
@@ -640,28 +636,18 @@ void etcd_state_client_t::load_pgs()
on_load_pgs_hook(false);
return;
}
reset_pg_exists();
if (!etcd_watch_revision)
{
etcd_watch_revision = data["header"]["revision"].uint64_value()+1;
if (this->log_level > 3)
{
fprintf(stderr, "Loaded revision %lu of PG configuration\n", etcd_watch_revision-1);
}
}
for (auto & res: data["responses"].array_items())
{
for (auto & kv_json: res["response_range"]["kvs"].array_items())
{
auto kv = parse_etcd_kv(kv_json);
if (this->log_level > 3)
{
fprintf(stderr, "Loaded key: %s -> %s\n", kv.key.c_str(), kv.value.dump().c_str());
}
parse_state(kv);
}
}
clean_nonexistent_pgs();
on_load_pgs_hook(true);
start_etcd_watcher();
});
@@ -682,73 +668,6 @@ void etcd_state_client_t::load_pgs()
}
#endif
void etcd_state_client_t::reset_pg_exists()
{
for (auto & pool_item: pool_config)
{
for (auto & pg_item: pool_item.second.pg_config)
{
pg_item.second.state_exists = false;
pg_item.second.history_exists = false;
}
}
seen_peers.clear();
}
void etcd_state_client_t::clean_nonexistent_pgs()
{
for (auto & pool_item: pool_config)
{
for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); )
{
auto & pg_cfg = pg_it->second;
if (!pg_cfg.config_exists && !pg_cfg.state_exists && !pg_cfg.history_exists)
{
if (this->log_level > 3)
{
fprintf(stderr, "PG %u/%u disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
}
pool_item.second.pg_config.erase(pg_it++);
}
else
{
if (!pg_cfg.state_exists)
{
if (this->log_level > 3)
{
fprintf(stderr, "PG %u/%u primary OSD disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
}
parse_state((etcd_kv_t){
.key = etcd_prefix+"/pg/state/"+std::to_string(pool_item.first)+"/"+std::to_string(pg_it->first),
});
}
if (!pg_cfg.history_exists)
{
if (this->log_level > 3)
{
fprintf(stderr, "PG %u/%u history disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
}
parse_state((etcd_kv_t){
.key = etcd_prefix+"/pg/history/"+std::to_string(pool_item.first)+"/"+std::to_string(pg_it->first),
});
}
pg_it++;
}
}
}
for (auto & peer_item: peer_states)
{
if (seen_peers.find(peer_item.first) == seen_peers.end())
{
fprintf(stderr, "OSD %lu state disappeared after reload, forgetting it\n", peer_item.first);
parse_state((etcd_kv_t){
.key = etcd_prefix+"/osd/state/"+std::to_string(peer_item.first),
});
}
}
seen_peers.clear();
}
void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
const std::string & key = kv.key;
@@ -903,7 +822,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
for (auto & pg_item: pool_item.second.pg_config)
{
pg_item.second.config_exists = false;
pg_item.second.exists = false;
}
}
for (auto & pool_item: value["items"].object_items())
@@ -926,7 +845,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
continue;
}
auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
parsed_cfg.config_exists = true;
parsed_cfg.exists = true;
parsed_cfg.pause = pg_item.second["pause"].bool_value();
parsed_cfg.primary = pg_item.second["primary"].uint64_value();
parsed_cfg.target_set.clear();
@@ -947,7 +866,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
int n = 0;
for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
{
if (pg_it->second.config_exists && pg_it->first != ++n)
if (pg_it->second.exists && pg_it->first != ++n)
{
fprintf(
stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
@@ -955,7 +874,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
);
for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
{
pg_it->second.config_exists = false;
pg_it->second.exists = false;
}
n = 0;
break;
@@ -980,7 +899,6 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
pg_cfg.target_history.clear();
pg_cfg.all_peers.clear();
pg_cfg.history_exists = !value.is_null();
// Refuse to start PG if any set of the <osd_sets> has no live OSDs
for (auto & hist_item: value["osd_sets"].array_items())
{
@@ -1033,15 +951,11 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
else if (value.is_null())
{
auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
pg_cfg.state_exists = false;
pg_cfg.cur_primary = 0;
pg_cfg.cur_state = 0;
this->pool_config[pool_id].pg_config[pg_num].cur_primary = 0;
this->pool_config[pool_id].pg_config[pg_num].cur_state = 0;
}
else
{
auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
pg_cfg.state_exists = true;
osd_num_t cur_primary = value["primary"].uint64_value();
int state = 0;
for (auto & e: value["state"].array_items())
@@ -1069,8 +983,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
return;
}
pg_cfg.cur_primary = cur_primary;
pg_cfg.cur_state = state;
this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
this->pool_config[pool_id].pg_config[pg_num].cur_state = state;
}
}
else if (key.substr(0, etcd_prefix.length()+11) == etcd_prefix+"/osd/state/")
@@ -1084,7 +998,6 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
value["port"].int64_value() > 0 && value["port"].int64_value() < 65536)
{
this->peer_states[peer_osd] = value;
this->seen_peers.insert(peer_osd);
}
else
{


@@ -3,8 +3,6 @@
#pragma once
#include <set>
#include "json11/json11.hpp"
#include "osd_id.h"
#include "timerfd_manager.h"
@@ -13,7 +11,6 @@
#define ETCD_PG_STATE_WATCH_ID 2
#define ETCD_PG_HISTORY_WATCH_ID 3
#define ETCD_OSD_STATE_WATCH_ID 4
#define ETCD_TOTAL_WATCHES 4
#define DEFAULT_BLOCK_SIZE 128*1024
#define MIN_DATA_BLOCK_SIZE 4*1024
@@ -33,7 +30,7 @@ struct etcd_kv_t
struct pg_config_t
{
bool config_exists, history_exists, state_exists;
bool exists;
osd_num_t primary;
std::vector<osd_num_t> target_set;
std::vector<std::vector<osd_num_t>> target_history;
@@ -64,21 +61,21 @@ struct pool_config_t
struct inode_config_t
{
uint64_t num = 0;
uint64_t num;
std::string name;
uint64_t size = 0;
inode_t parent_id = 0;
bool readonly = false;
uint64_t size;
inode_t parent_id;
bool readonly;
// Arbitrary metadata
json11::Json meta;
// Change revision of the metadata in etcd
uint64_t mod_revision = 0;
uint64_t mod_revision;
};
struct inode_watch_t
{
std::string name;
inode_config_t cfg = {};
inode_config_t cfg;
};
struct http_co_t;
@@ -116,7 +113,6 @@ public:
uint64_t etcd_watch_revision = 0;
std::map<pool_id_t, pool_config_t> pool_config;
std::map<osd_num_t, json11::Json> peer_states;
std::set<osd_num_t> seen_peers;
std::map<inode_t, inode_config_t> inode_config;
std::map<std::string, inode_t> inode_by_name;
@@ -142,8 +138,6 @@ public:
void start_ws_keepalive();
void load_global_config();
void load_pgs();
void reset_pg_exists();
void clean_nonexistent_pgs();
void parse_state(const etcd_kv_t & kv);
void parse_config(const json11::Json & config);
void insert_inode_config(const inode_config_t & cfg);
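
The `reset_pg_exists()` / `clean_nonexistent_pgs()` pair that this comparison drops from `etcd_state_client_t` follows a mark-and-sweep pattern: clear the "exists" flags, re-apply the freshly loaded keys, then forget whatever was not seen. The following is only a minimal sketch of that idea under simplifying assumptions. `entry_t` and `reload()` are hypothetical names, and a plain `std::map` snapshot stands in for the etcd response; it is not the actual Vitastor code.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for pg_config_t: just the flag used by the sweep and a payload
struct entry_t
{
    bool exists = false;
    std::string value;
};

// Re-apply a full snapshot to the cache and forget entries absent from it.
// Returns the number of forgotten entries.
int reload(std::map<int, entry_t> & cache, const std::map<int, std::string> & snapshot)
{
    // 1. Mark: nothing has been seen in the new snapshot yet
    for (auto & item: cache)
        item.second.exists = false;
    // 2. Apply the snapshot, flagging every key it contains
    for (auto & kv: snapshot)
    {
        auto & e = cache[kv.first];
        e.exists = true;
        e.value = kv.second;
    }
    // 3. Sweep: erase() returns the next valid iterator, so the loop stays safe
    int forgotten = 0;
    for (auto it = cache.begin(); it != cache.end(); )
    {
        if (!it->second.exists)
        {
            it = cache.erase(it);
            forgotten++;
        }
        else
            it++;
    }
    return forgotten;
}
```

Calling `reload()` first with keys 1, 2, 3 and then with keys 1 and 3 forgets key 2 on the second call. Without the sweep step, a key deleted from etcd while the watcher was disconnected would stay in the cache indefinitely.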


@@ -1,401 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database test CLI
#define _XOPEN_SOURCE
#include <limits.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "epoll_manager.h"
#include "str_util.h"
#include "kv_db.h"
const char *exe_name = NULL;
class kv_cli_t
{
public:
kv_dbw_t *db = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
bool interactive = false;
int in_progress = 0;
char *cur_cmd = NULL;
int cur_cmd_size = 0, cur_cmd_alloc = 0;
bool finished = false, eof = false;
json11::Json::object cfg;
~kv_cli_t();
static json11::Json::object parse_args(int narg, const char *args[]);
void run(const json11::Json::object & cfg);
void read_cmd();
void next_cmd();
void handle_cmd(const std::string & cmd, std::function<void()> cb);
};
kv_cli_t::~kv_cli_t()
{
if (cur_cmd)
{
free(cur_cmd);
cur_cmd = NULL;
}
cur_cmd_alloc = 0;
if (db)
delete db;
if (cli)
{
cli->flush();
delete cli;
}
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
json11::Json::object kv_cli_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor Key/Value CLI\n"
"(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
"\n"
"USAGE: %s [--etcd_address ADDR] [OTHER OPTIONS]\n",
exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
}
return cfg;
}
void kv_cli_t::run(const json11::Json::object & cfg)
{
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
db = new kv_dbw_t(cli);
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Run
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
try
{
epmgr->tfd->set_fd_handler(0, false, [this](int fd, int events)
{
if (events & EPOLLIN)
{
read_cmd();
}
if (events & EPOLLRDHUP)
{
epmgr->tfd->set_fd_handler(0, false, NULL);
finished = true;
}
});
interactive = true;
printf("> ");
}
catch (std::exception & e)
{
// Can't add to epoll, STDIN is probably a file
read_cmd();
}
while (!finished)
{
ringloop->loop();
if (!finished)
ringloop->wait();
}
// Destroy the client
delete db;
db = NULL;
cli->flush();
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
void kv_cli_t::read_cmd()
{
if (!cur_cmd_alloc)
{
cur_cmd_alloc = 65536;
cur_cmd = (char*)malloc_or_die(cur_cmd_alloc);
}
while (cur_cmd_size < cur_cmd_alloc)
{
int r = read(0, cur_cmd+cur_cmd_size, cur_cmd_alloc-cur_cmd_size);
if (r < 0 && errno != EAGAIN)
fprintf(stderr, "Error reading from stdin: %s\n", strerror(errno));
if (r > 0)
cur_cmd_size += r;
if (r == 0)
eof = true;
if (r <= 0)
break;
}
next_cmd();
}
void kv_cli_t::next_cmd()
{
if (in_progress > 0)
{
return;
}
int pos = 0;
for (; pos < cur_cmd_size; pos++)
{
if (cur_cmd[pos] == '\n' || cur_cmd[pos] == '\r')
{
auto cmd = trim(std::string(cur_cmd, pos));
pos++;
memmove(cur_cmd, cur_cmd+pos, cur_cmd_size-pos);
cur_cmd_size -= pos;
in_progress++;
handle_cmd(cmd, [this]()
{
in_progress--;
if (interactive)
printf("> ");
next_cmd();
if (!in_progress)
read_cmd();
});
break;
}
}
if (eof && !in_progress)
{
finished = true;
}
}
void kv_cli_t::handle_cmd(const std::string & cmd, std::function<void()> cb)
{
if (cmd == "")
{
cb();
return;
}
auto pos = cmd.find_first_of(" \t");
if (pos != std::string::npos)
{
while (pos < cmd.size()-1 && (cmd[pos+1] == ' ' || cmd[pos+1] == '\t'))
pos++;
}
auto opname = strtolower(pos == std::string::npos ? cmd : cmd.substr(0, pos));
if (opname == "open")
{
uint64_t pool_id = 0;
inode_t inode_id = 0;
uint32_t kv_block_size = 0;
int scanned = sscanf(cmd.c_str() + pos+1, "%lu %lu %u", &pool_id, &inode_id, &kv_block_size);
if (scanned == 2)
{
kv_block_size = 4096;
}
if (scanned < 2 || !pool_id || !inode_id || !kv_block_size || (kv_block_size & (kv_block_size-1)) != 0)
{
fprintf(stderr, "Usage: open <pool_id> <inode_id> [block_size]. Block size must be a power of 2. Default is 4096.\n");
cb();
return;
}
cfg["kv_block_size"] = (uint64_t)kv_block_size;
db->open(INODE_WITH_POOL(pool_id, inode_id), cfg, [=](int res)
{
if (res < 0)
fprintf(stderr, "Error opening index: %s (code %d)\n", strerror(-res), res);
else
printf("Index opened. Current size: %lu bytes\n", db->get_size());
cb();
});
}
else if (opname == "config")
{
auto pos2 = cmd.find_first_of(" \t", pos+1);
if (pos2 == std::string::npos)
{
fprintf(stderr, "Usage: config <property> <value>\n");
cb();
return;
}
auto key = trim(cmd.substr(pos+1, pos2-pos-1));
auto value = parse_size(trim(cmd.substr(pos2+1)));
if (key != "kv_memory_limit" &&
key != "kv_allocate_blocks" &&
key != "kv_evict_max_misses" &&
key != "kv_evict_attempts_per_level" &&
key != "kv_evict_unused_age" &&
key != "kv_log_level")
{
fprintf(
stderr, "Allowed properties: kv_memory_limit, kv_allocate_blocks,"
" kv_evict_max_misses, kv_evict_attempts_per_level, kv_evict_unused_age, kv_log_level\n"
);
}
else
{
cfg[key] = value;
db->set_config(cfg);
}
cb();
}
else if (opname == "get" || opname == "set" || opname == "del")
{
if (opname == "get" || opname == "del")
{
if (pos == std::string::npos)
{
fprintf(stderr, "Usage: %s <key>\n", opname.c_str());
cb();
return;
}
auto key = trim(cmd.substr(pos+1));
if (opname == "get")
{
db->get(key, [this, cb](int res, const std::string & value)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
{
write(1, value.c_str(), value.size());
write(1, "\n", 1);
}
cb();
});
}
else
{
db->del(key, [this, cb](int res)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
printf("OK\n");
cb();
});
}
}
else
{
auto pos2 = cmd.find_first_of(" \t", pos+1);
if (pos2 == std::string::npos)
{
fprintf(stderr, "Usage: set <key> <value>\n");
cb();
return;
}
auto key = trim(cmd.substr(pos+1, pos2-pos-1));
auto value = trim(cmd.substr(pos2+1));
db->set(key, value, [this, cb](int res)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
printf("OK\n");
cb();
});
}
}
else if (opname == "list")
{
std::string start, end;
if (pos != std::string::npos)
{
auto pos2 = cmd.find_first_of(" \t", pos+1);
if (pos2 != std::string::npos)
{
start = trim(cmd.substr(pos+1, pos2-pos-1));
end = trim(cmd.substr(pos2+1));
}
else
{
start = trim(cmd.substr(pos+1));
}
}
void *handle = db->list_start(start);
db->list_next(handle, [=](int res, const std::string & key, const std::string & value)
{
if (res < 0)
{
if (res != -ENOENT)
{
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
}
db->list_close(handle);
cb();
}
else
{
printf("%s = %s\n", key.c_str(), value.c_str());
db->list_next(handle, NULL);
}
});
}
else if (opname == "close")
{
db->close([=]()
{
printf("Index closed\n");
cb();
});
}
else if (opname == "quit" || opname == "q")
{
::close(0);
finished = true;
}
else
{
fprintf(
stderr, "Unknown operation: %s. Supported operations:\n"
"open <pool_id> <inode_id> [block_size]\n"
"config <property> <value>\n"
"get <key>\nset <key> <value>\ndel <key>\nlist [<start> [end]]\n"
"close\nquit\n", opname.c_str()
);
cb();
}
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
kv_cli_t *p = new kv_cli_t();
p->run(kv_cli_t::parse_args(narg, args));
delete p;
return 0;
}
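
`kv_cli_t::read_cmd()` and `next_cmd()` above accumulate raw bytes from stdin and cut one command at a time at the first `\n` or `\r`, keeping any incomplete tail for the next read. A minimal sketch of that splitting step follows, assuming a `std::string` buffer instead of the `malloc`/`memmove` buffer used in the file. `take_line()` is a hypothetical helper and does no trimming.

```cpp
#include <string>

// Move the first complete line from <buf> into <line> and drop it from the buffer.
// Returns false and leaves both arguments untouched if no terminator has arrived yet.
bool take_line(std::string & buf, std::string & line)
{
    size_t pos = buf.find_first_of("\r\n");
    if (pos == std::string::npos)
        return false;
    line = buf.substr(0, pos);
    // Remove the line together with its single terminator character
    buf.erase(0, pos+1);
    return true;
}
```

As in the original loop, a `\r\n` pair produces one extra empty line. `handle_cmd()` treats an empty command as a no-op, so this is harmless there.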

File diff suppressed because it is too large


@@ -1,39 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database
// Parallel optimistic B-Tree O:-)
#pragma once
#include "cluster_client.h"
struct kv_db_t;
struct kv_dbw_t
{
kv_dbw_t(cluster_client_t *cli);
~kv_dbw_t();
void open(inode_t inode_id, json11::Json cfg, std::function<void(int)> cb);
void set_config(json11::Json cfg);
void close(std::function<void()> cb);
uint64_t get_size();
void get(const std::string & key, std::function<void(int res, const std::string & value)> cb,
bool allow_old_cached = false);
void set(const std::string & key, const std::string & value, std::function<void(int res)> cb,
std::function<bool(int res, const std::string & value)> cas_compare = NULL);
void del(const std::string & key, std::function<void(int res)> cb,
std::function<bool(int res, const std::string & value)> cas_compare = NULL);
void update(const std::string & key,
std::function<int(int res, const std::string & old_value, std::string & new_value)> cas_compare,
std::function<bool(int res)> cb);
void* list_start(const std::string & start);
void list_next(void *handle, std::function<void(int res, const std::string & key, const std::string & value)> cb);
void list_close(void *handle);
kv_db_t *db;
};


@@ -1,697 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database stress tester / benchmark
#define _XOPEN_SOURCE
#include <limits.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "epoll_manager.h"
#include "str_util.h"
#include "kv_db.h"
const char *exe_name = NULL;
struct kv_test_listing_t
{
uint64_t count = 0, done = 0;
void *handle = NULL;
std::string next_after;
std::set<std::string> inflights;
timespec tv_begin;
bool error = false;
};
struct kv_test_lat_t
{
const char *name = NULL;
uint64_t usec = 0, count = 0;
};
struct kv_test_stat_t
{
kv_test_lat_t get, add, update, del, list;
uint64_t list_keys = 0;
};
class kv_test_t
{
public:
// Config
json11::Json::object kv_cfg;
std::string key_prefix, key_suffix;
uint64_t inode_id = 0;
uint64_t op_count = 1000000;
uint64_t runtime_sec = 0;
uint64_t parallelism = 4;
uint64_t reopen_prob = 1;
uint64_t get_prob = 30000;
uint64_t add_prob = 20000;
uint64_t update_prob = 20000;
uint64_t del_prob = 5000;
uint64_t list_prob = 300;
uint64_t min_key_len = 10;
uint64_t max_key_len = 70;
uint64_t min_value_len = 50;
uint64_t max_value_len = 300;
uint64_t min_list_count = 10;
uint64_t max_list_count = 1000;
uint64_t print_stats_interval = 1;
bool json_output = false;
uint64_t log_level = 1;
bool trace = false;
bool stop_on_error = false;
// FIXME: Multiple clients
kv_test_stat_t stat, prev_stat;
timespec prev_stat_time, start_stat_time;
// State
kv_dbw_t *db = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
ring_consumer_t consumer;
bool finished = false;
uint64_t total_prob = 0;
uint64_t ops_sent = 0, ops_done = 0;
int stat_timer_id = -1;
int in_progress = 0;
bool reopening = false;
std::set<kv_test_listing_t*> listings;
std::set<std::string> changing_keys;
std::map<std::string, std::string> values;
~kv_test_t();
static json11::Json::object parse_args(int narg, const char *args[]);
void parse_config(json11::Json cfg);
void run(json11::Json cfg);
void loop();
void print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time);
void print_total_stats();
void start_change(const std::string & key);
void stop_change(const std::string & key);
void add_stat(kv_test_lat_t & stat, timespec tv_begin);
};
kv_test_t::~kv_test_t()
{
if (db)
delete db;
if (cli)
{
cli->flush();
delete cli;
}
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
json11::Json::object kv_test_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor Key/Value DB stress tester / benchmark\n"
"(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
"\n"
"USAGE: %s --pool_id POOL_ID --inode_id INODE_ID [OPTIONS]\n"
" --op_count 1000000\n"
" Total operations to run during test. 0 means unlimited\n"
" --key_prefix \"\"\n"
" Prefix for all keys read or written (to avoid collisions)\n"
" --key_suffix \"\"\n"
" Suffix for all keys read or written (to avoid collisions, but scan all DB)\n"
" --runtime 0\n"
" Run for this number of seconds. 0 means unlimited\n"
" --parallelism 4\n"
" Run this number of operations in parallel\n"
" --get_prob 30000\n"
" Fraction of key retrieve operations\n"
" --add_prob 20000\n"
" Fraction of key addition operations\n"
" --update_prob 20000\n"
" Fraction of key update operations\n"
" --del_prob 5000\n"
" Fraction of key delete operations\n"
" --list_prob 300\n"
" Fraction of listing operations\n"
" --min_key_len 10\n"
" Minimum key size in bytes\n"
" --max_key_len 70\n"
" Maximum key size in bytes\n"
" --min_value_len 50\n"
" Minimum value size in bytes\n"
" --max_value_len 300\n"
" Maximum value size in bytes\n"
" --min_list_count 10\n"
" Minimum number of keys read in listing (0 = all keys)\n"
" --max_list_count 1000\n"
" Maximum number of keys read in listing\n"
" --print_stats 1\n"
" Print operation statistics every this number of seconds\n"
" --json\n"
" JSON output\n"
" --stop_on_error 0\n"
" Stop on first execution error, mismatch, lost key or extra key during listing\n"
" --kv_memory_limit 128M\n"
" Maximum memory to use for vitastor-kv index cache\n"
" --kv_allocate_blocks 4\n"
" Number of PG blocks used for new tree block allocation in parallel\n"
" --kv_evict_max_misses 10\n"
" Eviction algorithm parameter: retry eviction from another random spot\n"
" if this number of keys is used currently or was used recently\n"
" --kv_evict_attempts_per_level 3\n"
" Retry eviction at most this number of times per tree level, starting\n"
" with bottom-most levels\n"
" --kv_evict_unused_age 1000\n"
" Evict only keys unused during this number of last operations\n"
" --kv_log_level 1\n"
" Log level. 0 = errors, 1 = warnings, 10 = trace operations\n",
exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
}
return cfg;
}
void kv_test_t::parse_config(json11::Json cfg)
{
inode_id = INODE_WITH_POOL(cfg["pool_id"].uint64_value(), cfg["inode_id"].uint64_value());
if (cfg["op_count"].uint64_value() > 0)
op_count = cfg["op_count"].uint64_value();
key_prefix = cfg["key_prefix"].string_value();
key_suffix = cfg["key_suffix"].string_value();
if (cfg["runtime"].uint64_value() > 0)
runtime_sec = cfg["runtime"].uint64_value();
if (cfg["parallelism"].uint64_value() > 0)
parallelism = cfg["parallelism"].uint64_value();
if (!cfg["reopen_prob"].is_null())
reopen_prob = cfg["reopen_prob"].uint64_value();
if (!cfg["get_prob"].is_null())
get_prob = cfg["get_prob"].uint64_value();
if (!cfg["add_prob"].is_null())
add_prob = cfg["add_prob"].uint64_value();
if (!cfg["update_prob"].is_null())
update_prob = cfg["update_prob"].uint64_value();
if (!cfg["del_prob"].is_null())
del_prob = cfg["del_prob"].uint64_value();
if (!cfg["list_prob"].is_null())
list_prob = cfg["list_prob"].uint64_value();
if (!cfg["min_key_len"].is_null())
min_key_len = cfg["min_key_len"].uint64_value();
if (cfg["max_key_len"].uint64_value() > 0)
max_key_len = cfg["max_key_len"].uint64_value();
if (!cfg["min_value_len"].is_null())
min_value_len = cfg["min_value_len"].uint64_value();
if (cfg["max_value_len"].uint64_value() > 0)
max_value_len = cfg["max_value_len"].uint64_value();
if (!cfg["min_list_count"].is_null())
min_list_count = cfg["min_list_count"].uint64_value();
if (!cfg["max_list_count"].is_null())
max_list_count = cfg["max_list_count"].uint64_value();
if (!cfg["print_stats"].is_null())
print_stats_interval = cfg["print_stats"].uint64_value();
if (!cfg["json"].is_null())
json_output = true;
if (!cfg["stop_on_error"].is_null())
stop_on_error = cfg["stop_on_error"].bool_value();
if (!cfg["kv_memory_limit"].is_null())
kv_cfg["kv_memory_limit"] = cfg["kv_memory_limit"];
if (!cfg["kv_allocate_blocks"].is_null())
kv_cfg["kv_allocate_blocks"] = cfg["kv_allocate_blocks"];
if (!cfg["kv_evict_max_misses"].is_null())
kv_cfg["kv_evict_max_misses"] = cfg["kv_evict_max_misses"];
if (!cfg["kv_evict_attempts_per_level"].is_null())
kv_cfg["kv_evict_attempts_per_level"] = cfg["kv_evict_attempts_per_level"];
if (!cfg["kv_evict_unused_age"].is_null())
kv_cfg["kv_evict_unused_age"] = cfg["kv_evict_unused_age"];
if (!cfg["kv_log_level"].is_null())
{
log_level = cfg["kv_log_level"].uint64_value();
trace = log_level >= 10;
kv_cfg["kv_log_level"] = cfg["kv_log_level"];
}
total_prob = reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob;
stat.get.name = "get";
stat.add.name = "add";
stat.update.name = "update";
stat.del.name = "del";
stat.list.name = "list";
}
void kv_test_t::run(json11::Json cfg)
{
srand48(time(NULL));
parse_config(cfg);
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
db = new kv_dbw_t(cli);
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Run
reopening = true;
db->open(inode_id, kv_cfg, [this](int res)
{
reopening = false;
if (res < 0)
{
fprintf(stderr, "ERROR: Open index: %d (%s)\n", res, strerror(-res));
exit(1);
}
if (trace)
printf("Index opened\n");
ringloop->wakeup();
});
consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&consumer);
if (print_stats_interval)
stat_timer_id = epmgr->tfd->set_timer(print_stats_interval*1000, true, [this](int) { print_stats(prev_stat, prev_stat_time); });
clock_gettime(CLOCK_REALTIME, &start_stat_time);
prev_stat_time = start_stat_time;
while (!finished)
{
ringloop->loop();
if (!finished)
ringloop->wait();
}
if (stat_timer_id >= 0)
epmgr->tfd->clear_timer(stat_timer_id);
ringloop->unregister_consumer(&consumer);
// Print total stats
print_total_stats();
// Destroy the client
delete db;
db = NULL;
cli->flush();
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
static const char *base64_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@+/";
std::string random_str(int len)
{
std::string str;
str.resize(len);
for (int i = 0; i < len; i++)
{
str[i] = base64_chars[lrand48() % 64];
}
return str;
}
void kv_test_t::loop()
{
if (reopening)
{
return;
}
if (ops_done >= op_count)
{
finished = true;
}
while (!finished && ops_sent < op_count && in_progress < parallelism)
{
uint64_t dice = (lrand48() % total_prob);
if (dice < reopen_prob)
{
reopening = true;
db->close([this]()
{
if (trace)
printf("Index closed\n");
db->open(inode_id, kv_cfg, [this](int res)
{
reopening = false;
if (res < 0)
{
fprintf(stderr, "ERROR: Reopen index: %d (%s)\n", res, strerror(-res));
finished = true;
return;
}
if (trace)
printf("Index reopened\n");
ringloop->wakeup();
});
});
return;
}
else if (dice < reopen_prob+get_prob)
{
// get existing
auto key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
if (changing_keys.find(key) != changing_keys.end())
continue;
in_progress++;
ops_sent++;
if (trace)
printf("get %s\n", key.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->get(key, [this, key, tv_begin](int res, const std::string & value)
{
add_stat(stat.get, tv_begin);
ops_done++;
in_progress--;
auto it = values.find(key);
if (res != (it == values.end() ? -ENOENT : 0))
{
fprintf(stderr, "ERROR: get %s: %d (%s)\n", key.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else if (it != values.end() && value != it->second)
{
fprintf(stderr, "ERROR: get %s: mismatch: %s vs %s\n", key.c_str(), value.c_str(), it->second.c_str());
if (stop_on_error)
exit(1);
}
ringloop->wakeup();
});
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob)
{
bool is_add = false;
std::string key;
if (dice < reopen_prob+get_prob+add_prob)
{
// add
is_add = true;
uint64_t key_len = min_key_len + (max_key_len > min_key_len ? lrand48() % (max_key_len-min_key_len) : 0);
key = key_prefix + random_str(key_len) + key_suffix;
}
else
{
// update
key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
}
if (changing_keys.find(key) != changing_keys.end())
continue;
uint64_t value_len = min_value_len + (max_value_len > min_value_len ? lrand48() % (max_value_len-min_value_len) : 0);
auto value = random_str(value_len);
start_change(key);
ops_sent++;
in_progress++;
if (trace)
printf("set %s = %s\n", key.c_str(), value.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->set(key, value, [this, key, value, tv_begin, is_add](int res)
{
add_stat(is_add ? stat.add : stat.update, tv_begin);
stop_change(key);
ops_done++;
in_progress--;
if (res != 0)
{
fprintf(stderr, "ERROR: set %s = %s: %d (%s)\n", key.c_str(), value.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else
{
values[key] = value;
}
ringloop->wakeup();
}, NULL);
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob)
{
// delete
auto key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
if (changing_keys.find(key) != changing_keys.end())
continue;
start_change(key);
ops_sent++;
in_progress++;
if (trace)
printf("del %s\n", key.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->del(key, [this, key, tv_begin](int res)
{
add_stat(stat.del, tv_begin);
stop_change(key);
ops_done++;
in_progress--;
if (res != 0)
{
fprintf(stderr, "ERROR: del %s: %d (%s)\n", key.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else
{
values.erase(key);
}
ringloop->wakeup();
}, NULL);
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob)
{
// list
ops_sent++;
in_progress++;
auto key = random_str(max_key_len);
auto lst = new kv_test_listing_t;
auto k_it = values.lower_bound(key);
lst->count = min_list_count + (max_list_count > min_list_count ? lrand48() % (max_list_count-min_list_count) : 0);
lst->handle = db->list_start(k_it == values.begin() ? key_prefix : key);
lst->next_after = k_it == values.begin() ? key_prefix : key;
lst->inflights = changing_keys;
listings.insert(lst);
if (trace)
printf("list from %s\n", key.c_str());
clock_gettime(CLOCK_REALTIME, &lst->tv_begin);
db->list_next(lst->handle, [this, lst](int res, const std::string & key, const std::string & value)
{
if (log_level >= 11)
printf("list: %s = %s\n", key.c_str(), value.c_str());
if (res >= 0 && key_prefix.size() && (key.size() < key_prefix.size() ||
key.substr(0, key_prefix.size()) != key_prefix))
{
// stop at this key
res = -ENOENT;
}
if (res < 0 || (lst->count > 0 && lst->done >= lst->count))
{
add_stat(stat.list, lst->tv_begin);
if (res == 0)
{
// ok (done >= count)
}
else if (res != -ENOENT)
{
fprintf(stderr, "ERROR: list: %d (%s)\n", res, strerror(-res));
lst->error = true;
}
else
{
auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
while (k_it != values.end())
{
while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
k_it++;
if (k_it != values.end())
{
fprintf(stderr, "ERROR: list: missing key %s\n", (k_it++)->first.c_str());
lst->error = true;
}
}
}
if (lst->error && stop_on_error)
exit(1);
ops_done++;
in_progress--;
db->list_close(lst->handle);
delete lst;
listings.erase(lst);
ringloop->wakeup();
}
else
{
stat.list_keys++;
// Do not check modified keys in listing
// Listing may return their old or new state
if ((!key_suffix.size() || key.size() >= key_suffix.size() &&
key.substr(key.size()-key_suffix.size()) == key_suffix) &&
lst->inflights.find(key) == lst->inflights.end())
{
lst->done++;
auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
while (true)
{
while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
{
k_it++;
}
if (k_it == values.end() || k_it->first > key)
{
fprintf(stderr, "ERROR: list: extra key %s\n", key.c_str());
lst->error = true;
break;
}
else if (k_it->first < key)
{
fprintf(stderr, "ERROR: list: missing key %s\n", k_it->first.c_str());
lst->error = true;
lst->next_after = k_it->first;
k_it++;
}
else
{
if (k_it->second != value)
{
fprintf(stderr, "ERROR: list: mismatch: %s = %s but should be %s\n",
key.c_str(), value.c_str(), k_it->second.c_str());
lst->error = true;
}
lst->next_after = k_it->first;
break;
}
}
}
db->list_next(lst->handle, NULL);
}
});
}
}
}
void kv_test_t::add_stat(kv_test_lat_t & stat, timespec tv_begin)
{
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
int64_t usec = (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
if (usec > 0)
{
stat.usec += usec;
stat.count++;
}
}
void kv_test_t::print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time)
{
timespec cur_stat_time;
clock_gettime(CLOCK_REALTIME, &cur_stat_time);
int64_t usec = (cur_stat_time.tv_sec - prev_stat_time.tv_sec)*1000000 +
(cur_stat_time.tv_nsec - prev_stat_time.tv_nsec)/1000;
if (usec > 0)
{
kv_test_lat_t *lats[] = { &stat.get, &stat.add, &stat.update, &stat.del, &stat.list };
kv_test_lat_t *prev[] = { &prev_stat.get, &prev_stat.add, &prev_stat.update, &prev_stat.del, &prev_stat.list };
if (!json_output)
{
char buf[128] = { 0 };
for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
{
snprintf(buf, sizeof(buf)-1, "%.1f %s/s (%lu us)", (lats[i]->count-prev[i]->count)*1000000.0/usec,
lats[i]->name, (lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count > 0 ? lats[i]->count-prev[i]->count : 1));
int k;
for (k = strlen(buf); k < strlen(lats[i]->name)+21; k++)
buf[k] = ' ';
buf[k] = 0;
printf("%s", buf);
}
printf("\n");
}
else
{
int64_t runtime = (cur_stat_time.tv_sec - start_stat_time.tv_sec)*1000000 +
(cur_stat_time.tv_nsec - start_stat_time.tv_nsec)/1000;
printf("{\"runtime\":%.1f", (double)runtime/1000000.0);
for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
{
if (lats[i]->count > prev[i]->count)
{
printf(
",\"%s\":{\"avg\":{\"iops\":%.1f,\"usec\":%lu},\"total\":{\"count\":%lu,\"usec\":%lu}}",
lats[i]->name, (lats[i]->count-prev[i]->count)*1000000.0/usec,
(lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count),
lats[i]->count, lats[i]->usec
);
}
}
printf("}\n");
}
}
prev_stat = stat;
prev_stat_time = cur_stat_time;
}
void kv_test_t::print_total_stats()
{
if (!json_output)
printf("Total:\n");
kv_test_stat_t start_stats;
timespec start_stat_time = this->start_stat_time;
print_stats(start_stats, start_stat_time);
}
void kv_test_t::start_change(const std::string & key)
{
changing_keys.insert(key);
for (auto lst: listings)
{
lst->inflights.insert(key);
}
}
void kv_test_t::stop_change(const std::string & key)
{
changing_keys.erase(key);
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
kv_test_t *p = new kv_test_t();
p->run(kv_test_t::parse_args(narg, args));
delete p;
return 0;
}
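
Review note: the latency accounting in `add_stat()` and `print_stats()` above reduces to a `timespec` difference in microseconds that is accumulated into a (usec, count) pair. A minimal standalone sketch of that idea follows. The names `lat_acc_t`, `elapsed_usec`, `add_sample` and `avg_usec` are hypothetical and do not appear in the source.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical accumulator mirroring the usec/count pair used by add_stat()
struct lat_acc_t
{
    uint64_t usec = 0;
    uint64_t count = 0;
};

// Difference between two timestamps in whole microseconds (negative if end < begin)
inline int64_t elapsed_usec(const timespec & begin, const timespec & end)
{
    return (int64_t)(end.tv_sec - begin.tv_sec)*1000000 +
        (end.tv_nsec - begin.tv_nsec)/1000;
}

// Record one sample; as in the code above, non-positive durations are skipped
inline void add_sample(lat_acc_t & acc, const timespec & begin, const timespec & end)
{
    int64_t usec = elapsed_usec(begin, end);
    if (usec > 0)
    {
        acc.usec += usec;
        acc.count++;
    }
}

// Average latency in microseconds, 0 when nothing was recorded
inline uint64_t avg_usec(const lat_acc_t & acc)
{
    return acc.count ? acc.usec/acc.count : 0;
}
```

Skipping non-positive samples matters here because the timestamps come from `CLOCK_REALTIME`, which can step backwards when the system clock is adjusted.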


@@ -149,7 +149,7 @@ public:
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
// op statistics
osd_op_stats_t stats, recovery_stats;
osd_op_stats_t stats;
void init();
void parse_config(const json11::Json & config);
@@ -175,7 +175,6 @@ public:
bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
#endif
void inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len);
void measure_exec(osd_op_t *cur_op);
protected:


@@ -24,17 +24,3 @@ osd_op_t::~osd_op_t()
free(buf);
}
}
bool osd_op_t::is_recovery_related()
{
return (req.hdr.opcode == OSD_OP_SEC_READ ||
req.hdr.opcode == OSD_OP_SEC_WRITE ||
req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
(req.sec_rw.flags & OSD_OP_RECOVERY_RELATED) ||
req.hdr.opcode == OSD_OP_SEC_DELETE &&
(req.sec_del.flags & OSD_OP_RECOVERY_RELATED) ||
req.hdr.opcode == OSD_OP_SEC_STABILIZE &&
(req.sec_stab.flags & OSD_OP_RECOVERY_RELATED) ||
req.hdr.opcode == OSD_OP_SEC_SYNC &&
(req.sec_sync.flags & OSD_OP_RECOVERY_RELATED);
}


@@ -173,6 +173,4 @@ struct osd_op_t
osd_op_buf_list_t iov;
~osd_op_t();
bool is_recovery_related();
};


@@ -131,23 +131,6 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
}
}
void osd_messenger_t::inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len)
{
uint64_t usecs = (
(tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - tv_begin.tv_nsec)/1000
);
stats.op_stat_count[opcode]++;
if (!stats.op_stat_count[opcode])
{
stats.op_stat_count[opcode] = 1;
stats.op_stat_sum[opcode] = 0;
stats.op_stat_bytes[opcode] = 0;
}
stats.op_stat_sum[opcode] += usecs;
stats.op_stat_bytes[opcode] += len;
}
void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{
// Measure execution latency
@@ -159,24 +142,29 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
}
uint64_t len = 0;
stats.op_stat_count[cur_op->req.hdr.opcode]++;
if (!stats.op_stat_count[cur_op->req.hdr.opcode])
{
stats.op_stat_count[cur_op->req.hdr.opcode]++;
stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
}
stats.op_stat_sum[cur_op->req.hdr.opcode] += (
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
if (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SCRUB)
{
// req.rw.len is internally set to the full object size for scrubs
len = cur_op->req.rw.len;
stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
len = cur_op->req.sec_rw.len;
}
inc_op_stats(stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
if (cur_op->is_recovery_related())
{
inc_op_stats(recovery_stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
}
}
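
Review note: both variants of the statistics code in this hunk (the shared `inc_op_stats()` helper and the inlined form in `measure_exec()`) use the same wrap-around rule: when the per-opcode counter overflows to zero, all three accumulators restart together so that averages stay consistent. A reduced sketch of that rule, assuming a hypothetical 4-slot table instead of the real opcode range:

```cpp
#include <cstdint>

// Hypothetical table size; the real code indexes these arrays by OSD opcode
constexpr int SKETCH_OP_MAX = 4;

struct op_stats_sketch_t
{
    uint64_t count[SKETCH_OP_MAX] = {};
    uint64_t sum_usec[SKETCH_OP_MAX] = {};
    uint64_t bytes[SKETCH_OP_MAX] = {};
};

// Account one finished operation of the given opcode
inline void inc_op(op_stats_sketch_t & st, int opcode, uint64_t usec, uint64_t len)
{
    st.count[opcode]++;
    if (!st.count[opcode])
    {
        // The counter wrapped around: restart all accumulators together,
        // otherwise sum/count would yield a meaningless average
        st.count[opcode] = 1;
        st.sum_usec[opcode] = 0;
        st.bytes[opcode] = 0;
    }
    st.sum_usec[opcode] += usec;
    st.bytes[opcode] += len;
}
```

Consumers that keep a "previous" snapshot for delta computation need to reset that snapshot at the same moment, which is presumably why the `// wrapped` branch later in this diff also clears the saved copies.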


@@ -209,27 +209,19 @@ void osd_t::parse_config(bool init)
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sleep_us = config["recovery_sleep_us"].uint64_value();
recovery_tune_util_low = config["recovery_tune_util_low"].is_null()
? 0.1 : config["recovery_tune_util_low"].number_value();
if (recovery_tune_util_low < 0.01)
recovery_tune_util_low = 0.01;
recovery_tune_util_high = config["recovery_tune_util_high"].is_null()
? 1.0 : config["recovery_tune_util_high"].number_value();
if (recovery_tune_util_high < 0.01)
recovery_tune_util_high = 0.01;
recovery_tune_client_util_low = config["recovery_tune_client_util_low"].is_null()
? 0 : config["recovery_tune_client_util_low"].number_value();
if (recovery_tune_client_util_low < 0.01)
recovery_tune_client_util_low = 0.01;
recovery_tune_client_util_high = config["recovery_tune_client_util_high"].is_null()
? 0.5 : config["recovery_tune_client_util_high"].number_value();
if (recovery_tune_client_util_high < 0.01)
recovery_tune_client_util_high = 0.01;
recovery_tune_min_util = config["recovery_tune_min_util"].is_null()
? 0.1 : config["recovery_tune_min_util"].number_value();
recovery_tune_max_util = config["recovery_tune_max_util"].is_null()
? 1.0 : config["recovery_tune_max_util"].number_value();
recovery_tune_min_client_util = config["recovery_tune_min_client_util"].is_null()
? 0 : config["recovery_tune_min_client_util"].number_value();
recovery_tune_max_client_util = config["recovery_tune_max_client_util"].is_null()
? 0.5 : config["recovery_tune_max_client_util"].number_value();
auto old_recovery_tune_interval = recovery_tune_interval;
recovery_tune_interval = config["recovery_tune_interval"].is_null()
? 1 : config["recovery_tune_interval"].uint64_value();
recovery_tune_agg_interval = config["recovery_tune_agg_interval"].is_null()
? 10 : config["recovery_tune_agg_interval"].uint64_value();
recovery_tune_ewma_rate = config["recovery_tune_ewma_rate"].is_null()
? 0.5 : config["recovery_tune_ewma_rate"].number_value();
recovery_tune_sleep_min_us = config["recovery_tune_sleep_min_us"].is_null()
? 10 : config["recovery_tune_sleep_min_us"].uint64_value();
recovery_pg_switch = config["recovery_pg_switch"].uint64_value();
@@ -502,12 +494,11 @@ void osd_t::print_stats()
{
uint64_t bw = (recovery_stat[i].bytes - recovery_print_prev[i].bytes) / print_stats_interval;
printf(
"[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg latency %ld us, delay %ld us\n", osd_num, recovery_stat_names[i],
"[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg lat %ld us\n", osd_num, recovery_stat_names[i],
(recovery_stat[i].count - recovery_print_prev[i].count) * 1.0 / print_stats_interval,
(bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
(bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")),
(recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count),
recovery_target_sleep_us
(recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count)
);
}
}
@@ -605,8 +596,8 @@ void osd_t::print_slow()
op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1);
int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0;
bufprintf(" state=%d", PRIV(op->bs_op)->op_state);
int wait_for = PRIV(op->bs_op)->wait_for;
if (wait_for)
{
bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail);


@@ -118,12 +118,12 @@ class osd_t
int autosync_writes = DEFAULT_AUTOSYNC_WRITES;
uint64_t recovery_queue_depth = 1;
uint64_t recovery_sleep_us = 0;
double recovery_tune_util_low = 0.1;
double recovery_tune_client_util_low = 0;
double recovery_tune_util_high = 1.0;
double recovery_tune_client_util_high = 0.5;
double recovery_tune_min_util = 0.1;
double recovery_tune_min_client_util = 0;
double recovery_tune_max_util = 1.0;
double recovery_tune_max_client_util = 0.5;
int recovery_tune_interval = 1;
int recovery_tune_agg_interval = 10;
double recovery_tune_ewma_rate = 0.5;
int recovery_tune_sleep_min_us = 10;
int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
@@ -209,11 +209,10 @@ class osd_t
int rtune_timer_id = -1;
uint64_t rtune_avg_lat = 0;
double rtune_client_util = 0, rtune_target_util = 1;
osd_op_stats_t rtune_prev_stats, rtune_prev_recovery_stats;
std::vector<uint64_t> recovery_target_sleep_items;
osd_op_stats_t rtune_prev_stats;
recovery_stat_t rtune_prev_recovery[2];
uint64_t recovery_target_queue_depth = 1;
uint64_t recovery_target_sleep_us = 0;
uint64_t recovery_target_sleep_total = 0;
int recovery_target_sleep_cur = 0, recovery_target_sleep_count = 0;
// cluster connection
void parse_config(bool init);
@@ -304,7 +303,7 @@ class osd_t
bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
void handle_primary_bs_subop(osd_op_t *subop);
void add_bs_subop_stats(osd_op_t *subop, bool recovery_related = false);
void add_bs_subop_stats(osd_op_t *subop);
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);


@@ -651,7 +651,7 @@ void osd_t::apply_pg_config()
{
pg_num_t pg_num = kv.first;
auto & pg_cfg = kv.second;
bool take = pg_cfg.config_exists && pg_cfg.primary == this->osd_num &&
bool take = pg_cfg.exists && pg_cfg.primary == this->osd_num &&
!pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num);
auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;


@@ -325,7 +325,17 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
{
printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe);
}
finish_recovery_op(op);
if (recovery_target_sleep_us)
{
this->tfd->set_timer_us(recovery_target_sleep_us, false, [this, op](int timer_id)
{
finish_recovery_op(op);
});
}
else
{
finish_recovery_op(op);
}
};
exec_op(op->osd_op);
}
@@ -346,6 +356,7 @@ void osd_t::apply_recovery_tune_interval()
}
else
{
recovery_target_queue_depth = recovery_queue_depth;
recovery_target_sleep_us = recovery_sleep_us;
}
}
@@ -372,78 +383,47 @@ void osd_t::finish_recovery_op(osd_recovery_op_t *op)
void osd_t::tune_recovery()
{
static int accounted_ops[] = {
OSD_OP_SEC_READ, OSD_OP_SEC_WRITE, OSD_OP_SEC_WRITE_STABLE,
OSD_OP_SEC_STABILIZE, OSD_OP_SEC_SYNC, OSD_OP_SEC_DELETE
};
uint64_t total_client_usec = 0, total_recovery_usec = 0, recovery_count = 0;
for (int i = 0; i < sizeof(accounted_ops)/sizeof(accounted_ops[0]); i++)
static int total_client_ops[] = { OSD_OP_READ, OSD_OP_WRITE, OSD_OP_SYNC, OSD_OP_DELETE };
uint64_t total_client_usec = 0;
for (int i = 0; i < sizeof(total_client_ops)/sizeof(total_client_ops[0]); i++)
{
total_client_usec += (msgr.stats.op_stat_sum[accounted_ops[i]]
- rtune_prev_stats.op_stat_sum[accounted_ops[i]]);
total_recovery_usec += (msgr.recovery_stats.op_stat_sum[accounted_ops[i]]
- rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]]);
recovery_count += (msgr.recovery_stats.op_stat_count[accounted_ops[i]]
- rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]]);
rtune_prev_stats.op_stat_sum[accounted_ops[i]] = msgr.stats.op_stat_sum[accounted_ops[i]];
rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]] = msgr.recovery_stats.op_stat_sum[accounted_ops[i]];
rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]] = msgr.recovery_stats.op_stat_count[accounted_ops[i]];
total_client_usec += (msgr.stats.op_stat_sum[total_client_ops[i]] - rtune_prev_stats.op_stat_sum[total_client_ops[i]]);
rtune_prev_stats.op_stat_sum[total_client_ops[i]] = msgr.stats.op_stat_sum[total_client_ops[i]];
}
total_client_usec -= total_recovery_usec;
uint64_t total_recovery_usec = 0, recovery_count = 0;
total_recovery_usec += recovery_stat[0].usec-rtune_prev_recovery[0].usec;
total_recovery_usec += recovery_stat[1].usec-rtune_prev_recovery[1].usec;
recovery_count += recovery_stat[0].count-rtune_prev_recovery[0].count;
recovery_count += recovery_stat[1].count-rtune_prev_recovery[1].count;
memcpy(rtune_prev_recovery, recovery_stat, sizeof(recovery_stat));
if (recovery_count == 0)
{
return;
}
// example:
// total 3 GB/s
// recovery queue 1
// 120 OSDs
// EC 5+3
// 128kb block_size => 640kb object
// 3000*1024/640/120 = 40 MB/s per OSD = 64 recovered objects per OSD
// = 64*8*2 subops = 1024 recovery subop iops
// 8 recovery subop queue
// => subop avg latency = 0.0078125 sec
// utilisation = 8
// target util 1
// intuitively target latency should be 8x of real
// target_lat = rtune_avg_lat * utilisation / target_util
// = rtune_avg_lat * rtune_avg_lat * rtune_avg_iops / target_util
// = 0.0625
// recovery utilisation will be 1
rtune_client_util = total_client_usec/1000000.0/recovery_tune_interval;
rtune_target_util = (rtune_client_util < recovery_tune_client_util_low
? recovery_tune_util_high
: recovery_tune_util_low + (rtune_client_util >= recovery_tune_client_util_high
? 0 : (recovery_tune_util_high-recovery_tune_util_low)*
(recovery_tune_client_util_high-rtune_client_util)/(recovery_tune_client_util_high-recovery_tune_client_util_low)
rtune_avg_lat = total_recovery_usec/recovery_count*recovery_tune_ewma_rate +
rtune_avg_lat*(1-recovery_tune_ewma_rate);
// client_util = count/interval * usec/1000000.0/count = usec/1000000.0/interval :-)
double client_util = total_client_usec/1000000.0/recovery_tune_interval;
rtune_client_util = rtune_client_util*(1-recovery_tune_ewma_rate) + client_util*recovery_tune_ewma_rate;
rtune_target_util = (rtune_client_util < recovery_tune_min_client_util
? recovery_tune_max_util
: recovery_tune_min_util + (rtune_client_util >= recovery_tune_max_client_util
? 0 : (recovery_tune_max_util-recovery_tune_min_util)*
(recovery_tune_max_client_util-rtune_client_util)/(recovery_tune_max_client_util-recovery_tune_min_client_util)
)
);
rtune_avg_lat = total_recovery_usec/recovery_count;
uint64_t target_lat = rtune_avg_lat * rtune_avg_lat/1000000.0 * recovery_count/recovery_tune_interval / rtune_target_util;
auto sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
if (recovery_target_sleep_items.size() != recovery_tune_agg_interval)
{
recovery_target_sleep_items.resize(recovery_tune_agg_interval);
for (int i = 0; i < recovery_tune_agg_interval; i++)
recovery_target_sleep_items[i] = 0;
recovery_target_sleep_total = 0;
recovery_target_sleep_cur = 0;
recovery_target_sleep_count = 0;
}
recovery_target_sleep_total -= recovery_target_sleep_items[recovery_target_sleep_cur];
recovery_target_sleep_items[recovery_target_sleep_cur] = sleep_us;
recovery_target_sleep_cur = (recovery_target_sleep_cur+1) % recovery_tune_agg_interval;
recovery_target_sleep_total += sleep_us;
if (recovery_target_sleep_count < recovery_tune_agg_interval)
recovery_target_sleep_count++;
recovery_target_sleep_us = recovery_target_sleep_total / recovery_target_sleep_count;
if (log_level > 4)
recovery_target_queue_depth = (int)rtune_target_util + (rtune_target_util < 1 || rtune_target_util-(int)rtune_target_util >= 0.1 ? 1 : 0);
// ideal_iops = 1s / real_latency
// ;; target_iops = target_util * ideal_iops
// => target_lat = target_queue * 1s / target_iops
// => target_lat = target_queue / target_util * real_latency
uint64_t target_lat = recovery_target_queue_depth/rtune_target_util * rtune_avg_lat;
recovery_target_sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
if (log_level > 3)
{
printf(
"[OSD %lu] auto-tune: client util: %.2f, recovery util: %.2f, lat: %lu us -> target util %.2f, delay %lu us\n",
osd_num, rtune_client_util, total_recovery_usec/1000000.0/recovery_tune_interval,
rtune_avg_lat, rtune_target_util, recovery_target_sleep_us
"recovery tune: client util %.2f (ewma %.2f), target util %.2f -> queue %ld, lat %lu us, real %lu us, pause %lu us\n",
client_util, rtune_client_util, rtune_target_util, recovery_target_queue_depth, target_lat, rtune_avg_lat, recovery_target_sleep_us
);
}
}
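
Review note: the variant of `tune_recovery()` in this hunk that uses `recovery_target_queue_depth` maps client utilisation to a target recovery utilisation by linear interpolation, then derives a per-operation pause from `target_lat = queue / target_util * real_latency`. The sketch below isolates those two steps so the arithmetic can be checked on its own. `tune_params_t`, `target_util` and `sleep_us` are hypothetical names; the default values are taken from the config defaults shown in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle; defaults follow the recovery_tune_* defaults in the diff
struct tune_params_t
{
    double min_util = 0.1, max_util = 1.0;
    double min_client_util = 0, max_client_util = 0.5;
    uint64_t sleep_min_us = 10;
};

// Idle clients map to max_util, busy clients to min_util, linear in between
inline double target_util(const tune_params_t & p, double client_util)
{
    assert(p.max_client_util > p.min_client_util);
    if (client_util < p.min_client_util)
        return p.max_util;
    if (client_util >= p.max_client_util)
        return p.min_util;
    return p.min_util + (p.max_util-p.min_util) *
        (p.max_client_util-client_util) / (p.max_client_util-p.min_client_util);
}

// Pause to add after each recovery op so that effective utilisation approaches `util`.
// Pauses shorter than sleep_min_us are dropped, as in the diff.
inline uint64_t sleep_us(const tune_params_t & p, uint64_t queue_depth, double util, uint64_t avg_lat_us)
{
    assert(util > 0);
    uint64_t target_lat = (uint64_t)(queue_depth / util * avg_lat_us);
    return target_lat > avg_lat_us + p.sleep_min_us ? target_lat - avg_lat_us : 0;
}
```

For example, with a queue depth of 1, a target utilisation of 0.5 and a measured latency of 1000 us, the target latency is 2000 us, so each operation is followed by a 1000 us pause.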
@@ -451,7 +431,7 @@ void osd_t::tune_recovery()
// Just trigger write requests for degraded objects. They'll be recovered during writing
bool osd_t::continue_recovery()
{
while (recovery_ops.size() < recovery_queue_depth)
while (recovery_ops.size() < recovery_target_queue_depth)
{
osd_recovery_op_t op;
if (pick_next_recovery(op))
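
Review note: the other variant of `tune_recovery()` in this file smooths the computed delay over the last `recovery_tune_agg_interval` samples using `recovery_target_sleep_items`, a running total and a cursor. That is a fixed-size ring buffer average; a standalone sketch, with the hypothetical name `sliding_avg_t`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-window moving average over the last `window` samples
struct sliding_avg_t
{
    std::vector<uint64_t> items;
    uint64_t total = 0;
    size_t cur = 0, count = 0;

    explicit sliding_avg_t(size_t window): items(window, 0)
    {
        assert(window > 0);
    }

    // Replace the oldest sample with v and return the new average
    uint64_t push(uint64_t v)
    {
        total -= items[cur];
        items[cur] = v;
        cur = (cur+1) % items.size();
        total += v;
        // Until the window fills up, divide by the number of real samples only
        if (count < items.size())
            count++;
        return total / count;
    }
};
```

Keeping a running total makes each update O(1) regardless of window size, at the cost of one stored value per slot.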


@@ -34,7 +34,6 @@
#define OSD_OP_MAX 18
#define OSD_RW_MAX 64*1024*1024
#define OSD_PROTOCOL_VERSION 1
#define OSD_OP_RECOVERY_RELATED (uint32_t)1
// Memory alignment for direct I/O (usually 512 bytes)
#ifndef DIRECT_IO_ALIGNMENT
@@ -89,8 +88,7 @@ struct __attribute__((__packed__)) osd_op_sec_rw_t
uint32_t len;
// bitmap/attribute length - bitmap comes after header, but before data
uint32_t attr_len;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
};
struct __attribute__((__packed__)) osd_reply_sec_rw_t
@@ -111,9 +109,6 @@ struct __attribute__((__packed__)) osd_op_sec_del_t
object_id oid;
// delete version (automatic or specific)
uint64_t version;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
};
struct __attribute__((__packed__)) osd_reply_sec_del_t
@@ -126,9 +121,6 @@ struct __attribute__((__packed__)) osd_reply_sec_del_t
struct __attribute__((__packed__)) osd_op_sec_sync_t
{
osd_op_header_t header;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
};
struct __attribute__((__packed__)) osd_reply_sec_sync_t
@@ -142,9 +134,6 @@ struct __attribute__((__packed__)) osd_op_sec_stab_t
osd_op_header_t header;
// obj_ver_id array length in bytes
uint64_t len;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
};
typedef osd_op_sec_stab_t osd_op_sec_rollback_t;


@@ -221,7 +221,6 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
.offset = wr ? si->write_start : si->read_start,
.len = subop_len,
.attr_len = wr ? clean_entry_bitmap_size : 0,
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
};
#ifdef OSD_DEBUG
printf(
@@ -301,8 +300,7 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
" retval = "+std::to_string(bs_op->retval)+")"
);
}
bool recovery_related = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB;
add_bs_subop_stats(subop, recovery_related);
add_bs_subop_stats(subop);
subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
subop->reply.hdr.retval = bs_op->retval;
if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
@@ -314,33 +312,30 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
}
delete bs_op;
subop->bs_op = NULL;
subop->peer_fd = SELF_FD;
if (recovery_related && recovery_target_sleep_us)
{
tfd->set_timer_us(recovery_target_sleep_us, false, [=](int timer_id)
{
handle_primary_subop(subop, cur_op);
});
}
else
{
handle_primary_subop(subop, cur_op);
}
subop->peer_fd = -1;
handle_primary_subop(subop, cur_op);
}
void osd_t::add_bs_subop_stats(osd_op_t *subop, bool recovery_related)
void osd_t::add_bs_subop_stats(osd_op_t *subop)
{
// Include local blockstore ops in statistics
uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
uint64_t len = (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
? subop->bs_op->len : 0;
msgr.inc_op_stats(msgr.stats, opcode, subop->tv_begin, tv_end, len);
if (recovery_related)
msgr.stats.op_stat_count[opcode]++;
if (!msgr.stats.op_stat_count[opcode])
{
// It is OSD_OP_RECOVERY_RELATED
msgr.inc_op_stats(msgr.recovery_stats, opcode, subop->tv_begin, tv_end, len);
msgr.stats.op_stat_count[opcode] = 1;
msgr.stats.op_stat_sum[opcode] = 0;
msgr.stats.op_stat_bytes[opcode] = 0;
}
msgr.stats.op_stat_sum[opcode] += (
(tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
);
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
{
msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
}
}
@@ -563,7 +558,6 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
},
.oid = chunk.oid,
.version = chunk.version,
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} };
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
@@ -621,7 +615,6 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} };
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
@@ -681,7 +674,6 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
.opcode = OSD_OP_SEC_STABILIZE,
},
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} };
subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
subops[i].callback = [cur_op, this](osd_op_t *subop)


@@ -296,6 +296,7 @@ resume_7:
if (!recovery_stat[recovery_type].count) // wrapped
{
memset(&recovery_print_prev[recovery_type], 0, sizeof(recovery_print_prev[recovery_type]));
memset(&rtune_prev_recovery[recovery_type], 0, sizeof(rtune_prev_recovery[recovery_type]));
memset(&recovery_stat[recovery_type], 0, sizeof(recovery_stat[recovery_type]));
recovery_stat[recovery_type].count++;
}


@@ -42,21 +42,7 @@ void osd_t::secondary_op_callback(osd_op_t *op)
int retval = op->bs_op->retval;
delete op->bs_op;
op->bs_op = NULL;
if (op->is_recovery_related() && recovery_target_sleep_us)
{
if (!op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &op->tv_end);
}
tfd->set_timer_us(recovery_target_sleep_us, false, [this, op, retval](int timer_id)
{
finish_op(op, retval);
});
}
else
{
finish_op(op, retval);
}
finish_op(op, retval);
}
void osd_t::exec_secondary(osd_op_t *cur_op)


@@ -90,12 +90,6 @@ void timerfd_manager_t::clear_timer(int timer_id)
void timerfd_manager_t::set_nearest()
{
if (onstack > 0)
{
// Prevent re-entry
return;
}
onstack++;
again:
if (!timers.size())
{
@@ -145,7 +139,6 @@ again:
}
wait_state = wait_state | 1;
}
onstack--;
}
void timerfd_manager_t::handle_readable()
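
Review note: the `onstack` counter touched by this hunk is a re-entry guard. If a timer callback ends up calling `set_nearest()` again while an outer call is still on the stack, the nested call returns immediately. The pattern in isolation, under the hypothetical name `reentry_guard_sketch_t` and with a generic callback in place of the timer logic:

```cpp
#include <functional>

// Hypothetical illustration of the onstack-style guard; not the timerfd_manager_t API
struct reentry_guard_sketch_t
{
    int onstack = 0;
    int runs = 0;

    // Runs fn unless a call on this object is already in progress.
    // Returns false when the nested call was suppressed.
    bool run_once(const std::function<void()> & fn)
    {
        if (onstack > 0)
        {
            // Prevent re-entry
            return false;
        }
        onstack++;
        runs++;
        fn();
        onstack--;
        return true;
    }
};
```

A guard like this only protects against recursion on a single thread; it is not a lock and gives no protection against concurrent callers.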


@@ -22,7 +22,6 @@ class timerfd_manager_t
int timerfd;
int nearest = -1;
int id = 1;
int onstack = 0;
std::vector<timerfd_timer_t> timers;
void inc_timer(timerfd_timer_t & t);


@@ -19,10 +19,10 @@ fi
if [ "$IMMEDIATE_COMMIT" != "" ]; then
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
else
NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"client_enable_writeback":true}'
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
fi
start_osd_on()
@@ -53,7 +53,7 @@ for i in $(seq 1 $OSD_COUNT); do
start_osd $i
done
(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
MON_PID=$!
if [ "$SCHEME" = "ec" ]; then


@@ -18,7 +18,6 @@ try_change()
for i in {1..6}; do
echo --- Change PG count to $n --- >>testdata/osd$i.log
done
echo --- Change PG count to $n --- >>testdata/mon.log
$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}'


@@ -15,7 +15,7 @@ $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$
$ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'
node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
sleep 2


@@ -7,7 +7,7 @@ OSD_COUNT=5
OSD_ARGS="$OSD_ARGS"
for i in $(seq 1 $OSD_COUNT); do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --log_level 10 --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$!
done
@@ -53,11 +53,6 @@ for i in {1..30}; do
fi
done
# Sync so all moved objects are removed from OSD 1 (they aren't removed without a sync)
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 -number_ios=2 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
$ETCDCTL put /vitastor/config/pgs '{"items":{"1":{"1":{"osd_set":[4,5],"primary":0}}}}'
$ETCDCTL put /vitastor/pg/history/1/1 '{"all_peers":[1,2,3]}'


@@ -1,54 +0,0 @@
#!/bin/bash -ex
# Test changing EC 4+1 into EC 4+3
OSD_COUNT=7
PG_COUNT=16
SCHEME=ec
PG_SIZE=5
PG_DATA_SIZE=4
PG_MINSIZE=5
. `dirname $0`/run_3osds.sh
try_change()
{
n=$1
s=$2
for i in {1..10}; do
($ETCDCTL get /vitastor/config/pgs --print-value-only |\
jq -s -e '(.[0].items["1"] | map( ([ .osd_set[] | select(. != 0) ] | length) == '$s' ) | length == '$n')
and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])') && \
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n'') && \
break
sleep 1
done
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
jq -s -e '(.[0].items["1"] | map( ([ .osd_set[] | select(. != 0) ] | length) == '$s' ) | length == '$n')
and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])'); then
$ETCDCTL get /vitastor/config/pgs
$ETCDCTL get --prefix /vitastor/pg/state/
format_error "FAILED: PG SIZE NOT CHANGED OR SOME OSDS DO NOT HAVE PGS"
fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n); then
$ETCDCTL get /vitastor/config/pgs
$ETCDCTL get --prefix /vitastor/pg/state/
format_error "FAILED: PGS NOT UP AFTER PG SIZE CHANGE"
fi
}
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=1M -direct=1 -iodepth=4 \
-rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -runtime=10
PG_SIZE=7
POOLCFG='"name":"testpool","failure_domain":"osd","scheme":"ec","parity_chunks":'$((PG_SIZE-PG_DATA_SIZE))
$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$PG_COUNT'}}'
sleep 2
try_change 16 7
format_green OK


@@ -15,7 +15,7 @@ for i in $(seq 1 $OSD_COUNT); do
eval OSD${i}_PID=$!
done
(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
MON_PID=$!
sleep 3