Compare commits: v0.8.8...mon-self-r (15 commits)

Commit SHA1s:
d258a6e76b, 77155ab7bd, a409598b16, f4c6765522, ad2916068a,
321cb435a6, cfcf4f4355, e0fb17bfee, 5b9031fecc, 5da1d8e1b5,
44f86f1999, 2d9a80c6f6, 5e295e346e, d9c0898b7c, 04cfb48361
@@ -17,14 +17,16 @@ Configuration parameters can be set in 3 places:
 - Configuration file (`/etc/vitastor/vitastor.conf` or other path)
 - etcd key `/vitastor/config/global`. Most variables can be set there, but etcd
   connection parameters should obviously be set in the configuration file.
-- Command line of Vitastor components: OSD, mon, fio and QEMU options,
-  OpenStack/Proxmox/etc configuration. The latter doesn't allow to set all
-  variables directly, but it allows to override the configuration file and
-  set everything you need inside it.
+- Command line of Vitastor components: OSD (when you run it without vitastor-disk),
+  mon, fio and QEMU options, OpenStack/Proxmox/etc configuration. The latter
+  doesn't allow to set all variables directly, but it allows to override the
+  configuration file and set everything you need inside it.
+- OSD superblocks created by [vitastor-disk](../usage/disk.en.md) contain
+  primarily disk layout parameters of specific OSDs. In fact, these parameters
+  are automatically passed into the command line of vitastor-osd process, so
+  they have the same "status" as command-line parameters.
 
 In the future, additional configuration methods may be added:
-- OSD superblock which will, by design, contain parameters related to the disk
-  layout and to one specific OSD.
 - OSD-specific keys in etcd like `/vitastor/config/osd/<number>`.
 
 ## Parameter Reference
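The precedence between the configuration sources listed in the hunk above (file, then the etcd `/vitastor/config/global` key, then command-line options) can be sketched as a simple merge. This is a hypothetical illustration, not Vitastor's actual code; the parameter names used below are only examples.

```javascript
// Hypothetical sketch of the precedence described above: the configuration
// file is overridden by /vitastor/config/global from etcd, which is in turn
// overridden by command-line options. Not Vitastor's actual implementation.
function merge_config(file_config, etcd_global, cmdline)
{
    // Later sources win; etcd connection parameters must come from the file,
    // because they are needed before etcd can be contacted at all
    return { ...file_config, ...etcd_global, ...cmdline };
}

const merged = merge_config(
    { etcd_address: '10.0.0.1:2379', log_level: 1 },
    { log_level: 3, immediate_commit: 'all' },
    { log_level: 5 }
);
console.log(merged);
```

The result keeps `etcd_address` from the file, `immediate_commit` from etcd, and the command-line `log_level` wins over both.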
@@ -19,14 +19,17 @@
 - The etcd key `/vitastor/config/global`. Most parameters can be set there,
   except, naturally, the etcd connection parameters themselves, which must be
   set in the configuration file
-- On the command line of Vitastor components: OSD, monitor, fio and QEMU
-  options, OpenStack/Proxmox/etc settings. The latter usually don't expose the
-  full set of parameters directly, but they allow you to specify the path to a
-  configuration file and to set any parameters in it.
+- On the command line of Vitastor components: OSD (when started manually without
+  vitastor-disk), monitor, fio and QEMU options, OpenStack/Proxmox/etc settings.
+  The latter usually don't expose the full set of parameters directly, but they
+  allow you to specify the path to a configuration file and to set any
+  parameters in it.
+- In the OSD superblock written by [vitastor-disk](../usage/disk.ru.md) - the
+  parameters related to the disk format and to that specific OSD. In fact, when
+  an OSD starts, these parameters are automatically passed into the command line
+  of the vitastor-osd process, so their "status" is equivalent to OSD
+  command-line parameters.
 
 In the future, other configuration methods may also be added:
-- An OSD superblock which will store the OSD parameters related to the disk
-  format and to that specific OSD.
 - OSD-specific keys in etcd like `/vitastor/config/osd/<number>`.
 
 ## Parameter List
@@ -6,10 +6,10 @@
 
 # Proxmox VE
 
-To enable Vitastor support in Proxmox Virtual Environment (6.4-7.3 are supported):
+To enable Vitastor support in Proxmox Virtual Environment (6.4-7.4 are supported):
 
 - Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts:
-  buster for 6.4, bullseye for 7.3, pve7.1 for 7.1, pve7.2 for 7.2
+  buster for 6.4, bullseye for 7.4, pve7.1 for 7.1, pve7.2 for 7.2, pve7.3 for 7.3
 - Install vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from Vitastor repository
 - Define storage in `/etc/pve/storage.cfg` (see below)
 - Block network access from VMs to Vitastor network (to OSDs and etcd),
@@ -6,10 +6,10 @@
 
 # Proxmox
 
-To connect Vitastor to Proxmox Virtual Environment (versions 6.4-7.3 are supported):
+To connect Vitastor to Proxmox Virtual Environment (versions 6.4-7.4 are supported):
 
 - Add the corresponding Vitastor Debian repository into sources.list on the Proxmox hosts:
-  buster for 6.4, bullseye for 7.3, pve7.1 for 7.1, pve7.2 for 7.2
+  buster for 6.4, bullseye for 7.4, pve7.1 for 7.1, pve7.2 for 7.2, pve7.3 for 7.3
 - Install the vitastor-client, pve-qemu-kvm and pve-storage-vitastor (* or see the note) packages from the Vitastor repository
 - Define the storage in `/etc/pve/storage.cfg` (see below)
 - Be sure to block access from virtual machines to the Vitastor network (OSDs and etcd), because Vitastor does not (yet) support authentication
@@ -45,7 +45,9 @@ On the monitor hosts:
   }
   ```
 - Initialize OSDs:
-  - SSD-only: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]`
+  - SSD-only: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]`. You can add
+    `--disable_data_fsync off` to leave disk cache enabled if you use desktop
+    SSDs without capacitors.
   - Hybrid, SSD+HDD: `vitastor-disk prepare --hybrid /dev/sdXXX [/dev/sdYYY ...]`.
     Pass all your devices (HDD and SSD) to this script — it will partition disks and initialize journals on its own.
     This script skips HDDs which are already partitioned so if you want to use non-empty disks for
@@ -53,7 +55,9 @@ On the monitor hosts:
   but some free unpartitioned space must be available because the script creates new partitions for journals.
 - You can change OSD configuration in units or in `vitastor.conf`.
   Check [Configuration Reference](../config.en.md) for parameter descriptions.
-- If all your drives have capacitors, create global configuration in etcd: \
+- If all your drives have capacitors, and even if not, but if you ran `vitastor-disk`
+  without `--disable_data_fsync off` at the first step, then put the following
+  setting into etcd: \
  `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
 - Start all OSDs: `systemctl start vitastor.target`
 
@@ -75,6 +79,10 @@ etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
 
 After you do this, one of the monitors will configure PGs and OSDs will start them.
 
+If you use HDDs you should also add `"block_size": 1048576` to pool configuration.
+The other option is to add it into /vitastor/config/global, in this case it will
+apply to all pools by default.
+
 ## Check cluster status
 
 `vitastor-cli status`
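The HDD advice added in the hunk above boils down to one extra key in the pool definition stored at `/vitastor/config/pools`. A sketch of building that value (the pool number, name and the presence of other keys are illustrative, not a complete pool definition):

```javascript
// Sketch of a pool configuration value with the 1 MB block size recommended
// above for HDDs. Only block_size is taken from the text; the rest of the
// pool definition is an illustrative placeholder.
const pools = {
    '2': {
        name: 'ecpool',
        block_size: 1048576, // 1 MB objects for HDD-backed pools
    },
};

// This JSON string is what would be written to /vitastor/config/pools
// with `etcdctl --endpoints=... put /vitastor/config/pools '...'`
const value = JSON.stringify(pools);
console.log(value);
```

Putting `block_size` into `/vitastor/config/global` instead would make it the default for every pool, as the text notes.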
@@ -45,7 +45,9 @@
   }
   ```
 - Initialize the OSDs:
-  - SSD: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]`
+  - SSD: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]`. If you use
+    desktop SSDs without capacitors, you can leave the cache enabled by
+    adding the `--disable_data_fsync off` option.
   - Hybrid, SSD+HDD: `vitastor-disk prepare --hybrid /dev/sdXXX [/dev/sdYYY ...]`.
     Pass all your SSDs and HDDs to the script on the command line in sequence; it will automatically allocate
     partitions for journals on the SSDs and data on the HDDs. The script skips HDDs that already have partitions
@@ -54,8 +56,11 @@
   for journals, so free unpartitioned space must be available on the SSDs.
 - You can change OSD parameters in the systemd units or in `vitastor.conf`. For parameter descriptions,
   see the [Configuration Reference](../config.ru.md).
-- If all your disks are server-grade ones with capacitors, record this in the global configuration in etcd: \
-  `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
+- If all your disks are server-grade ones with capacitors, and even if they are not, but you
+  did not add the `--disable_data_fsync off` option at the first step and `vitastor-disk`
+  did not complain about being unable to disable the disk cache, put the following setting
+  into the global configuration in etcd: \
+  `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`.
 - Start all OSDs: `systemctl start vitastor.target`
 
 ## Create a pool
@@ -76,6 +81,10 @@ etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
 
 After this, one of the monitors should configure the PGs, and the OSDs should start them.
 
+If you use HDDs, add the `"block_size": 1048576` option to the pool configuration.
+This option can also be added to /vitastor/config/global, in which case it will
+apply to all pools by default.
+
 ## Check the cluster status
 
 `vitastor-cli status`
@@ -43,16 +43,16 @@ function finish_pg_history(merged_history)
     merged_history.all_peers = Object.values(merged_history.all_peers);
 }
 
-function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
+function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
 {
-    const old_pg_count = prev_pgs.length;
+    const old_pg_count = real_prev_pgs.length;
     // Add all possibly intersecting PGs to the history of new PGs
     if (!(new_pg_count % old_pg_count))
     {
         // New PG count is a multiple of old PG count
         for (let i = 0; i < new_pg_count; i++)
         {
-            add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
+            add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
             finish_pg_history(new_pg_history[i]);
         }
     }
@@ -64,7 +64,7 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
         {
             for (let j = 0; j < mul; j++)
            {
-                add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
+                add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
             }
             finish_pg_history(new_pg_history[i]);
         }
@@ -76,7 +76,7 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
         let merged_history = {};
         for (let i = 0; i < old_pg_count; i++)
         {
-            add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
+            add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
         }
         finish_pg_history(merged_history[1]);
         for (let i = 0; i < new_pg_count; i++)
@@ -90,15 +90,15 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
         new_pg_history[i] = null;
     }
     // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
-    if (old_pg_count < new_pg_count)
+    if (prev_pgs.length < new_pg_count)
     {
-        for (let i = old_pg_count; i < new_pg_count; i++)
+        for (let i = prev_pgs.length; i < new_pg_count; i++)
         {
-            prev_pgs[i] = prev_pgs[i % old_pg_count];
+            prev_pgs[i] = prev_pgs[i % prev_pgs.length];
         }
     }
-    else if (old_pg_count > new_pg_count)
+    else if (prev_pgs.length > new_pg_count)
     {
-        prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
+        prev_pgs.splice(new_pg_count, prev_pgs.length-new_pg_count);
     }
 }
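The core of `scale_pg_count` above is the intersection mapping: when the new PG count is a multiple of the old one, new PG `i` can only intersect old PG `i % old_pg_count`, so it inherits that PG's history. A standalone sketch of just that mapping (not the actual PGUtil code):

```javascript
// Standalone sketch of the history mapping used by scale_pg_count above for
// the "new count is a multiple of the old count" case: new PG i inherits the
// history of old PG (i % old_pg_count).
function map_new_to_old(old_pg_count, new_pg_count)
{
    if (new_pg_count % old_pg_count)
        throw new Error('new PG count must be a multiple of the old one');
    const mapping = [];
    for (let i = 0; i < new_pg_count; i++)
        mapping[i] = i % old_pg_count;
    return mapping;
}

// Scaling 2 PGs to 6: each old PG is split across 3 new PGs
console.log(map_new_to_old(2, 6)); // [ 0, 1, 0, 1, 0, 1 ]
```

The point of the patch itself is that this mapping must use `real_prev_pgs` (the actual previous PG list) while `prev_pgs` is separately padded or truncated only as a hint for the lp_solve optimizer.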
@@ -13,7 +13,7 @@ for (let i = 2; i < process.argv.length; i++)
     {
         console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' [--verbose 1]'+
             ' [--etcd_address "http://127.0.0.1:2379,..."] [--config_path /etc/vitastor/vitastor.conf]'+
-            ' [--etcd_prefix "/vitastor"] [--etcd_start_timeout 5]');
+            ' [--etcd_prefix "/vitastor"] [--etcd_start_timeout 5] [--restart_interval 5]');
         process.exit();
     }
     else if (process.argv[i].substr(0, 2) == '--')
71 mon/mon.js
@@ -561,7 +561,7 @@ class Mon
         }
         if (!this.ws)
         {
-            this.die('Failed to open etcd watch websocket');
+            await this.die('Failed to open etcd watch websocket');
         }
         const cur_addr = this.selected_etcd_url;
         this.ws_alive = true;
@@ -728,7 +728,7 @@ class Mon
             const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
             if (!res.result.TTL)
             {
-                this.die('Lease expired');
+                await this.die('Lease expired');
             }
         }, this.config.etcd_mon_timeout);
         if (!this.signals_set)
@@ -740,11 +740,34 @@ class Mon
     }
 
     async on_stop(status)
     {
+        if (this.ws_keepalive_timer)
+        {
+            clearInterval(this.ws_keepalive_timer);
+            this.ws_keepalive_timer = null;
+        }
+        if (this.lease_timer)
+        {
             clearInterval(this.lease_timer);
+            this.lease_timer = null;
+        }
+        if (this.etcd_lease_id)
+        {
+            const lease_id = this.etcd_lease_id;
+            this.etcd_lease_id = null;
-        await this.etcd_call('/lease/revoke', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
+            await this.etcd_call('/lease/revoke', { ID: lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
+        }
+        if (!status || !this.initConfig.restart_interval)
+        {
             process.exit(status);
+        }
+        else
+        {
+            console.log('Restarting after '+this.initConfig.restart_interval+' seconds');
+            await new Promise(ok => setTimeout(ok, this.initConfig.restart_interval*1000));
+            await this.start();
+        }
     }
 
     async become_master()
     {
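The reworked `on_stop()` above follows a specific order: stop the timers first, then revoke the etcd lease exactly once (clearing `etcd_lease_id` before the `await`), and only then exit or restart. A simplified, synchronous model of that ordering (not the actual Mon class):

```javascript
// Simplified model (not the actual Mon class) of the shutdown ordering in
// on_stop() above: stop timers first, then revoke the etcd lease exactly
// once, even if on_stop() is called again.
function make_mon(revoke)
{
    return {
        lease_timer: setInterval(() => {}, 1000),
        etcd_lease_id: 123,
        on_stop()
        {
            if (this.lease_timer)
            {
                clearInterval(this.lease_timer);
                this.lease_timer = null;
            }
            if (this.etcd_lease_id)
            {
                // Clear the ID first so a re-entrant call can't revoke twice
                const lease_id = this.etcd_lease_id;
                this.etcd_lease_id = null;
                revoke(lease_id);
            }
        },
    };
}

const revoked = [];
const mon = make_mon(id => revoked.push(id));
mon.on_stop();
mon.on_stop();
console.log(revoked); // the lease is revoked only once
```

In the real patch the revoke is an awaited etcd call, which is exactly why the ID must be cleared before awaiting: the keepalive timer or a second `on_stop()` could otherwise run during the await.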
@@ -956,7 +979,7 @@ class Mon
         return alive_set[this.rng() % alive_set.length];
     }
 
-    save_new_pgs_txn(request, pool_id, up_osds, osd_tree, prev_pgs, new_pgs, pg_history)
+    save_new_pgs_txn(save_to, request, pool_id, up_osds, osd_tree, prev_pgs, new_pgs, pg_history)
     {
         const aff_osds = this.get_affinity_osds(this.state.config.pools[pool_id], up_osds, osd_tree);
         const pg_items = {};
@@ -1009,14 +1032,14 @@ class Mon
                 });
             }
         }
-        this.state.config.pgs.items = this.state.config.pgs.items || {};
+        save_to.items = save_to.items || {};
         if (!new_pgs.length)
         {
-            delete this.state.config.pgs.items[pool_id];
+            delete save_to.items[pool_id];
         }
         else
         {
-            this.state.config.pgs.items[pool_id] = pg_items;
+            save_to.items[pool_id] = pg_items;
         }
     }
 
@@ -1160,6 +1183,7 @@ class Mon
         if (this.state.config.pgs.hash != tree_hash)
         {
             // Something has changed
+            const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
             const etcd_request = { compare: [], success: [] };
             for (const pool_id in (this.state.config.pgs||{}).items||{})
             {
@@ -1180,7 +1204,7 @@ class Mon
                     etcd_request.success.push({ requestDeleteRange: {
                         key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
                     } });
-                    this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
+                    this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
                 }
             }
             for (const pool_id in this.state.config.pools)
@@ -1234,7 +1258,7 @@ class Mon
                         return;
                     }
                     const new_pg_history = [];
-                    PGUtil.scale_pg_count(prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
+                    PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
                     pg_history = new_pg_history;
                 }
                 for (const pg of prev_pgs)
@@ -1287,14 +1311,15 @@ class Mon
                     key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
                     value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
                 } });
-                this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
+                this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
             }
-            this.state.config.pgs.hash = tree_hash;
-            await this.save_pg_config(etcd_request);
+            new_config_pgs.hash = tree_hash;
+            await this.save_pg_config(new_config_pgs, etcd_request);
         }
         else
         {
             // Nothing changed, but we still want to recheck the distribution of primaries
+            let new_config_pgs;
             let changed = false;
             for (const pool_id in this.state.config.pools)
             {
@@ -1314,31 +1339,35 @@ class Mon
                         const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds, aff_osds);
                         if (pg_cfg.primary != new_primary)
                        {
+                            if (!new_config_pgs)
+                            {
+                                new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
+                            }
                            console.log(
                                `Moving pool ${pool_id} (${pool_cfg.name || 'unnamed'}) PG ${pg_num}`+
                                ` primary OSD from ${pg_cfg.primary} to ${new_primary}`
                            );
                            changed = true;
-                            pg_cfg.primary = new_primary;
+                            new_config_pgs.items[pool_id][pg_num].primary = new_primary;
                        }
                    }
                }
            }
            if (changed)
            {
-                await this.save_pg_config();
+                await this.save_pg_config(new_config_pgs);
            }
        }
    }

-    async save_pg_config(etcd_request = { compare: [], success: [] })
+    async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
     {
         etcd_request.compare.push(
             { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
             { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
         );
         etcd_request.success.push(
-            { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(this.state.config.pgs)) } },
+            { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
         );
         const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
         if (!res.succeeded)
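The thread running through the mon.js hunks above is one pattern: instead of mutating `this.state.config.pgs` in place and then trying to save it, the monitor now deep-copies it, edits the copy (`new_config_pgs`), and publishes the copy through a guarded etcd transaction, so a failed save leaves local state untouched. A minimal sketch of that pattern, where `put_if_unchanged` stands in for the real `/kv/txn` compare-and-set call:

```javascript
// Minimal sketch of the copy-then-compare-and-set pattern introduced above.
// put_if_unchanged stands in for the etcd /kv/txn call guarded by the
// mod_revision compare; it returns true iff the transaction succeeded.
function update_pgs(state, put_if_unchanged, edit)
{
    const new_config_pgs = JSON.parse(JSON.stringify(state.config.pgs));
    edit(new_config_pgs);
    if (put_if_unchanged(new_config_pgs))
        state.config.pgs = new_config_pgs;
    // On failure the local state is untouched and the update can be retried
    return state.config.pgs;
}

const state = { config: { pgs: { hash: 'old', items: {} } } };
update_pgs(state, () => false, pgs => { pgs.hash = 'new'; });
console.log(state.config.pgs.hash); // still 'old' - failed CAS changed nothing
update_pgs(state, () => true, pgs => { pgs.hash = 'new'; });
console.log(state.config.pgs.hash); // 'new'
```

This is why `save_new_pgs_txn` and `save_pg_config` both grow a `save_to`/`new_config_pgs` parameter in the diff: every mutation is redirected into the copy.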
@@ -1765,14 +1794,13 @@ class Mon
                 return res.json;
             }
         }
-        this.die();
+        await this.die();
     }
 
-    _die(err)
+    async _die(err)
     {
-        // In fact we can just try to rejoin
         console.error(new Error(err || 'Cluster connection failed'));
-        process.exit(1);
+        await this.on_stop(1);
     }
 
     local_ips(all)
@@ -1817,6 +1845,7 @@ function POST(url, body, timeout)
             clearTimeout(timer_id);
             let res_body = '';
             res.setEncoding('utf8');
+            res.on('error', no);
             res.on('data', chunk => { res_body += chunk; });
             res.on('end', () =>
             {
@@ -1836,6 +1865,8 @@ function POST(url, body, timeout)
                 }
             });
         });
+        req.on('error', no);
+        req.on('close', () => no(new Error('Connection closed prematurely')));
         req.write(body_text);
         req.end();
     });
@@ -15,4 +15,4 @@ StartLimitInterval=0
 RestartSec=10
 
 [Install]
-WantedBy=vitastor.target
+WantedBy=multi-user.target
@@ -307,6 +307,18 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
        }
        PRIV(op)->wait_for = 0;
    }
+    else if (PRIV(op)->wait_for == WAIT_FREE)
+    {
+        if (!data_alloc->get_free_count() && big_to_flush > 0)
+        {
+#ifdef BLOCKSTORE_DEBUG
+            printf("Still waiting for free space on the data device\n");
+#endif
+            return;
+        }
+        flusher->release_trim();
+        PRIV(op)->wait_for = 0;
+    }
    else
    {
        throw std::runtime_error("BUG: op->wait_for value is unexpected");
@@ -160,6 +160,8 @@ struct __attribute__((__packed__)) dirty_entry
 #define WAIT_JOURNAL 3
 // Suspend operation until the next journal sector buffer is free
 #define WAIT_JOURNAL_BUFFER 4
+// Suspend operation until there is some free space on the data device
+#define WAIT_FREE 5
 
 struct fulfill_read_t
 {
@@ -263,6 +265,7 @@ class blockstore_impl_t
 
     struct journal_t journal;
     journal_flusher_t *flusher;
+    int big_to_flush = 0;
     int write_iodepth = 0;
 
     bool live = false, queue_stall = false;
@@ -201,6 +201,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
    }
    while (1)
    {
+        if ((IS_BIG_WRITE(dirty_it->second.state) || IS_DELETE(dirty_it->second.state)) &&
+            IS_STABLE(dirty_it->second.state))
+        {
+            big_to_flush--;
+        }
        if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc &&
            dirty_it->second.location != UINT64_MAX)
        {
@@ -446,6 +446,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
            {
                inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
            }
+            big_to_flush++;
        }
        else if (IS_DELETE(dirty_it->second.state))
        {
@@ -454,6 +455,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                sp -= dsk.data_block_size;
            else
                inode_space_stats.erase(dirty_it->first.oid.inode);
+            big_to_flush++;
        }
    }
    if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) ||
@@ -271,6 +271,13 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        if (loc == UINT64_MAX)
        {
            // no space
+            if (big_to_flush > 0)
+            {
+                // hope that some space will be available after flush
+                flusher->request_trim();
+                PRIV(op)->wait_for = WAIT_FREE;
+                return 0;
+            }
            cancel_all_writes(op, dirty_it, -ENOSPC);
            return 2;
        }
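The blockstore hunks above add one decision to the out-of-space path: if stable big writes or deletions are still waiting to be flushed (`big_to_flush > 0`), the write is parked in the new `WAIT_FREE` state because flushing may free data blocks; otherwise it fails immediately. A schematic model of that decision (in JavaScript for illustration, not the actual C++ code):

```javascript
// Schematic model (not the actual C++ blockstore code) of the decision added
// in dequeue_write() above: when allocation fails, park the write in WAIT_FREE
// if at least one stable big write or deletion is pending flush, since the
// flush may free space; otherwise fail the write with ENOSPC right away.
function on_no_space(big_to_flush)
{
    if (big_to_flush > 0)
        return 'WAIT_FREE'; // request a trim/flush and retry later
    return '-ENOSPC';       // nothing can free space - cancel the write
}

console.log(on_no_space(2), on_no_space(0)); // WAIT_FREE -ENOSPC
```

The counterpart changes keep `big_to_flush` accurate: `mark_stable()` increments it when a big write or deletion becomes flushable, and `erase_dirty()` decrements it when such an entry is flushed away.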
@@ -54,6 +54,13 @@ void epoll_manager_t::set_fd_handler(int fd, bool wr, std::function<void(int, in
    ev.events = (wr ? EPOLLOUT : 0) | EPOLLIN | EPOLLRDHUP | EPOLLET;
    if (epoll_ctl(epoll_fd, exists ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev) < 0)
    {
+        if (errno == ENOENT)
+        {
+            // The FD is probably already closed
+            epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL);
+            epoll_handlers.erase(fd);
+            return;
+        }
        throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
    }
    epoll_handlers[fd] = handler;
@@ -191,7 +191,7 @@ struct __attribute__((__packed__)) osd_op_rw_t
    uint64_t inode;
    // offset
    uint64_t offset;
-    // length
+    // length. 0 means to read all bitmaps of the specified range, but no data.
    uint32_t len;
    // flags (for future)
    uint32_t flags;
@@ -186,11 +186,23 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
     cur_op->reply.rw.bitmap_len = 0;
     {
         auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
+        if (cur_op->req.rw.len == 0)
+        {
+            // len=0 => bitmap read
+            for (int role = 0; role < op_data->pg_data_size; role++)
+            {
+                op_data->stripes[role].read_start = 0;
+                op_data->stripes[role].read_end = UINT32_MAX;
+            }
+        }
+        else
+        {
             for (int role = 0; role < op_data->pg_data_size; role++)
             {
                 op_data->stripes[role].read_start = op_data->stripes[role].req_start;
                 op_data->stripes[role].read_end = op_data->stripes[role].req_end;
             }
+        }
         // Determine version
         auto vo_it = pg.ver_override.find(op_data->oid);
         op_data->target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
@@ -151,6 +151,13 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
         {
             int stripe_num = rep ? 0 : role;
             osd_op_t *subop = op_data->subops + i;
+            uint32_t subop_len = wr
+                ? stripes[stripe_num].write_end - stripes[stripe_num].write_start
+                : stripes[stripe_num].read_end - stripes[stripe_num].read_start;
+            if (!wr && stripes[stripe_num].read_end == UINT32_MAX)
+            {
+                subop_len = 0;
+            }
             if (role_osd_num == this->osd_num)
             {
                 clock_gettime(CLOCK_REALTIME, &subop->tv_begin);
@@ -169,7 +176,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
                     },
                     .version = op_version,
                     .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
-                    .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
+                    .len = subop_len,
                     .buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
                     .bitmap = stripes[stripe_num].bmp_buf,
                 });
@@ -199,7 +206,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
                     },
                     .version = op_version,
                     .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
-                    .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
+                    .len = subop_len,
                     .attr_len = wr ? clean_entry_bitmap_size : 0,
                 };
 #ifdef OSD_DEBUG
@@ -218,9 +225,9 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
                 }
                 else
                 {
-                    if (stripes[stripe_num].read_end > stripes[stripe_num].read_start)
+                    if (subop_len > 0)
                     {
-                        subop->iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
+                        subop->iov.push_back(stripes[stripe_num].read_buf, subop_len);
                     }
                 }
                 subop->callback = [cur_op, this](osd_op_t *subop)
@@ -28,7 +28,9 @@ static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t &
     }
     else
     {
-        if (stripe.read_end < end)
+        if (stripe.read_end < end && end != UINT32_MAX ||
+            // UINT32_MAX means that stripe only needs bitmap, end != 0 => needs also data
+            stripe.read_end == UINT32_MAX && end != 0)
             stripe.read_end = end;
         if (stripe.read_start > start)
             stripe.read_start = start;
@@ -104,6 +106,8 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bi
             prev = other;
         }
         else if (prev >= 0)
+        {
+            if (stripes[role].read_end != UINT32_MAX)
             {
                 assert(stripes[role].read_start >= stripes[prev].read_start &&
                     stripes[role].read_start >= stripes[other].read_start);
@@ -112,10 +116,13 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bi
                     (uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                     stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                 );
+            }
             memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
             prev = -1;
         }
         else
+        {
+            if (stripes[role].read_end != UINT32_MAX)
             {
                 assert(stripes[role].read_start >= stripes[other].read_start);
                 memxor(
@@ -123,6 +130,7 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bi
                     (uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                     stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                 );
+            }
             memxor(stripes[role].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
         }
     }
@@ -355,9 +363,11 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
     int wanted_base = 0, wanted = 0;
     uint64_t read_start = 0, read_end = 0;
     auto recover_seq = [&]()
+    {
+        if (read_end != UINT32_MAX)
         {
             int orig = 0;
-            for (int other = 0; other < pg_size; other++)
+            for (int other = 0; other < pg_size && orig < pg_minsize; other++)
             {
                 if (stripes[other].read_end != 0 && !stripes[other].missing)
                 {
@@ -370,6 +380,7 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
                 read_end-read_start, pg_minsize, wanted, dectable + wanted_base*32*pg_minsize,
                 data_ptrs, data_ptrs + pg_minsize
             );
+        }
         wanted_base += wanted;
         wanted = 0;
     };
@@ -391,6 +402,32 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
         {
             recover_seq();
         }
+    // Recover bitmaps
+    if (bitmap_size > 0)
+    {
+        for (int role = 0; role < pg_minsize; role++)
+        {
+            if (stripes[role].read_end != 0 && stripes[role].missing)
+            {
+                data_ptrs[pg_minsize + (wanted++)] = (uint8_t*)stripes[role].bmp_buf;
+            }
+        }
+        if (wanted > 0)
+        {
+            int orig = 0;
+            for (int other = 0; other < pg_size && orig < pg_minsize; other++)
+            {
+                if (stripes[other].read_end != 0 && !stripes[other].missing)
+                {
+                    data_ptrs[orig++] = (uint8_t*)stripes[other].bmp_buf;
+                }
+            }
+            ec_encode_data(
+                bitmap_size, pg_minsize, wanted, dectable,
+                data_ptrs, data_ptrs + pg_minsize
+            );
+        }
+    }
 }
 #else
 void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
@@ -412,7 +449,8 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
         if (stripes[role].read_end != 0 && stripes[role].missing)
         {
             recovered = true;
-            if (stripes[role].read_end > stripes[role].read_start)
+            if (stripes[role].read_end > stripes[role].read_start &&
+                stripes[role].read_end != UINT32_MAX)
             {
                 for (int other = 0; other < pg_size; other++)
                 {
@@ -531,7 +569,8 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
     uint64_t buf_size = add_size;
     for (int role = 0; role < read_pg_size; role++)
     {
-        if (stripes[role].read_end != 0)
+        if (stripes[role].read_end != 0 &&
+            stripes[role].read_end != UINT32_MAX)
         {
             buf_size += stripes[role].read_end - stripes[role].read_start;
         }
@@ -541,7 +580,8 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
     uint64_t buf_pos = add_size;
     for (int role = 0; role < read_pg_size; role++)
     {
-        if (stripes[role].read_end != 0)
+        if (stripes[role].read_end != 0 &&
+            stripes[role].read_end != UINT32_MAX)
        {
             stripes[role].read_buf = (uint8_t*)buf + buf_pos;
             buf_pos += stripes[role].read_end - stripes[role].read_start;
@@ -23,6 +23,7 @@ struct osd_rmw_stripe_t
     void *read_buf, *write_buf;
     void *bmp_buf;
     uint32_t req_start, req_end;
+    // read_end=UINT32_MAX means to only read bitmap, but not data
     uint32_t read_start, read_end;
     uint32_t write_start, write_end;
     bool missing;
@@ -27,6 +27,7 @@ void test13();
 void test14();
 void test15(bool second);
 void test16();
+void test_recover_22_d2();
 
 int main(int narg, char *args[])
 {
@@ -61,6 +62,8 @@ int main(int narg, char *args[])
     test15(true);
     // Test 16
     test16();
+    // Test 17
+    test_recover_22_d2();
     // End
     printf("all ok\n");
     return 0;
@@ -1045,7 +1048,12 @@ void test16()
     assert(stripes[3].read_buf == (uint8_t*)read_buf+2*128*1024);
     set_pattern(stripes[1].read_buf, 128*1024, PATTERN2);
     memcpy(stripes[3].read_buf, rmw_buf, 128*1024);
+    memset(stripes[0].bmp_buf, 0xa8, bmp);
+    memset(stripes[2].bmp_buf, 0xb7, bmp);
+    assert(bitmaps[1] == 0xFFFFFFFF);
+    assert(bitmaps[3] == 0xF1F1F1F1);
     reconstruct_stripes_ec(stripes, 4, 2, bmp);
+    assert(*(uint32_t*)stripes[3].bmp_buf == 0xF1F1F1F1);
     assert(bitmaps[0] == 0xFFFFFFFF);
     check_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
     free(read_buf);
@@ -1054,3 +1062,47 @@ void test16()
     free(write_buf);
     use_ec(4, 2, false);
 }
+
+/***
+
+17. EC 2+2 recover second data block
+
+***/
+
+void test_recover_22_d2()
+{
+    const int bmp = 128*1024 / 4096 / 8;
+    use_ec(4, 2, true);
+    osd_num_t osd_set[4] = { 1, 0, 3, 4 };
+    osd_rmw_stripe_t stripes[4] = {};
+    unsigned bitmaps[4] = { 0 };
+    // Read 0-256K
+    split_stripes(2, 128*1024, 0, 256*1024, stripes);
+    assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
+    assert(stripes[1].req_start == 0 && stripes[1].req_end == 128*1024);
+    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
+    assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
+    uint8_t *data_buf = (uint8_t*)malloc_or_die(128*1024*4);
+    for (int i = 0; i < 4; i++)
+    {
+        stripes[i].read_start = stripes[i].req_start;
+        stripes[i].read_end = stripes[i].req_end;
+        stripes[i].read_buf = data_buf + i*128*1024;
+        stripes[i].bmp_buf = bitmaps + i;
+    }
+    // Read using parity
+    assert(extend_missing_stripes(stripes, osd_set, 2, 4) == 0);
+    assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
+    assert(stripes[3].read_start == 0 && stripes[3].read_end == 0);
+    bitmaps[0] = 0xffffffff;
+    bitmaps[2] = 0;
+    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
+    set_pattern(stripes[2].read_buf, 128*1024, PATTERN1^PATTERN2);
+    // Reconstruct
+    reconstruct_stripes_ec(stripes, 4, 2, bmp);
+    check_pattern(stripes[1].read_buf, 128*1024, PATTERN2);
+    assert(bitmaps[1] == 0xFFFFFFFF);
+    free(data_buf);
+    // Done
+    use_ec(4, 2, false);
+}
@@ -36,12 +36,12 @@ for i in $(seq 2 $ETCD_COUNT); do
     ETCD_URL="$ETCD_URL,http://$ETCD_IP:$((ETCD_PORT+2*i-2))"
     ETCD_CLUSTER="$ETCD_CLUSTER,etcd$i=http://$ETCD_IP:$((ETCD_PORT+2*i-1))"
 done
-ETCDCTL="${ETCD}ctl --endpoints=$ETCD_URL"
+ETCDCTL="${ETCD}ctl --endpoints=$ETCD_URL --dial-timeout=5s --command-timeout=10s"
 
 start_etcd()
 {
     local i=$1
-    $ETCD -name etcd$i --data-dir ./testdata/etcd$i \
+    ionice -c2 -n0 $ETCD -name etcd$i --data-dir ./testdata/etcd$i \
         --advertise-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) --listen-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) \
         --initial-advertise-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) --listen-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) \
         --initial-cluster-token vitastor-tests-etcd --initial-cluster-state new \
@@ -53,8 +53,11 @@ start_etcd()
 for i in $(seq 1 $ETCD_COUNT); do
     start_etcd $i
 done
-if [ $ETCD_COUNT -gt 1 ]; then
-    sleep 1
+for i in {1..10}; do
+    ${ETCD}ctl --endpoints=$ETCD_URL --dial-timeout=1s --command-timeout=1s member list >/dev/null && break
+done
+if [[ $i = 10 ]]; then
+    format_error "Failed to start etcd"
 fi
 
 echo leak:fio >> testdata/lsan-suppress.txt
@@ -39,7 +39,7 @@ done
 cd mon
 npm install
 cd ..
-node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log &
+node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 --restart_interval 5 &>./testdata/mon.log &
 MON_PID=$!
 
 if [ "$SCHEME" = "ec" ]; then
@@ -100,13 +100,13 @@ wait_finish_rebalance()
     sec=$1
     i=0
     while [[ $i -lt $sec ]]; do
-        ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32') && \
+        ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT) && \
             break
-        if [ $i -eq 60 ]; then
-            format_error "Rebalance couldn't finish in $sec seconds"
-        fi
         sleep 1
         i=$((i+1))
+        if [ $i -eq $sec ]; then
+            format_error "Rebalance couldn't finish in $sec seconds"
+        fi
     done
 }
 
@@ -117,3 +117,14 @@ check_qemu()
         sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
     fi
 }
+
+check_nbd()
+{
+    if [[ -d /sys/module/nbd && ! -e /dev/nbd0 ]]; then
+        max_part=$(cat /sys/module/nbd/parameters/max_part)
+        nbds_max=$(cat /sys/module/nbd/parameters/nbds_max)
+        for i in $(seq 1 $nbds_max); do
+            mknod /dev/nbd$((i-1)) b 43 $(((i-1)*(max_part+1)))
+        done
+    fi
+}
@@ -15,10 +15,10 @@ done
 
 sleep 2
 
-for i in {1..10}; do
+for i in {1..30}; do
     ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
         jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4"])') && \
-    ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$PG_COUNT'') && \
+    ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$PG_COUNT) && \
     break
     sleep 1
 done
@@ -28,7 +28,7 @@ if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
     format_error "FAILED: OSD NOT ADDED INTO DISTRIBUTION"
 fi
 
-wait_finish_rebalance 10
+wait_finish_rebalance 20
 
 sleep 1
 kill -9 $OSD4_PID
@@ -37,7 +37,7 @@ build/src/vitastor-cli --etcd_address $ETCD_URL rm-osd --force 4
 
 sleep 2
 
-for i in {1..10}; do
+for i in {1..30}; do
     ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
         jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3"])') && \
     ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT'') && \
@@ -50,6 +50,6 @@ if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
     format_error "FAILED: OSD NOT REMOVED FROM DISTRIBUTION"
 fi
 
-wait_finish_rebalance 10
+wait_finish_rebalance 20
 
 format_green OK
@@ -18,7 +18,7 @@ $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicate
 cd mon
 npm install
 cd ..
-node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 --restart_interval 5 &>./testdata/mon.log &
 MON_PID=$!
 
 sleep 2
@@ -4,6 +4,8 @@ OSD_COUNT=7
 PG_COUNT=32
 . `dirname $0`/run_3osds.sh
+
+check_nbd
 
 IMG_SIZE=256
 
 $ETCDCTL put /vitastor/config/inode/1/1 '{"name":"testimg","size":'$((IMG_SIZE*1024*1024))'}'
@@ -14,7 +14,7 @@ for i in $(seq 1 $OSD_COUNT); do
     eval OSD${i}_PID=$!
 done
 
-node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 --restart_interval 5 &>./testdata/mon.log &
 MON_PID=$!
 
 sleep 3