Compare commits

...

47 Commits

Author SHA1 Message Date
fa90f287da Release 0.8.3
- Implement a new "vitastor-disk purge" command to remove OSDs with safety checks
- Implement a new "vitastor-cli rm-osd" command to only remove OSD metadata from etcd
- Fix a bug where the monitor could ignore OSD removal and other /osd/stats key changes
- Fix a bug where garbage could be returned when reading objects being written at the same time
- Fix a rare write stall where journal space could be not reclaimed where there
  were no new operations in the flush queue
- Fix a rare peering stall caused by a previous long listing operations queues limiting attempt
- Fix total object count statistic in OSD on object creation
- Add missing offset&len into vitastor-disk dump-journal for big_writes, fix JSON format
- Make vitastor-cli print help on missing command
- Make vitastor-cli translate all '-' to '_' in CLI options
2022-12-27 02:40:55 +03:00
795020674d Loop journal flusher when the queue is empty but there is a trim request 2022-12-27 02:28:20 +03:00
8e12285629 Fix vitastor-disk purge (now it works) 2022-12-27 02:28:20 +03:00
b9b50ab4cc Implement vitastor-disk purge command 2022-12-26 02:48:48 +03:00
0d8625f92d Make vitastor-cli print help on missing command 2022-12-26 02:48:48 +03:00
2f3c2c5140 Implement safety check for OSD removal, translate all '-' to '_' in cli options
'-' to '_' translation fixes a bug with create --image_meta
2022-12-26 02:48:48 +03:00
4ebdd02b0f Remove LIST op limiter
It doesn't prevent OSD slow ops but may itself lead to stalls :)
2022-12-26 02:48:48 +03:00
bf6fdc4141 Check add/rm osd with 2048 PGs 2022-12-26 02:48:48 +03:00
c2244331e6 Add vitastor-cli rm-osd command 2022-12-26 02:48:48 +03:00
3de57e87b1 Recheck OSD tree in monitor on /osd/stats changes 2022-12-26 02:48:48 +03:00
2d4cc688b2 Add a remove-osd test 2022-12-26 02:48:48 +03:00
31bd1ec145 Fix object creation check for statistics 2022-12-21 02:51:11 +03:00
c08d1f2dfe Add missing offset&len into big_writes journal dump, fix commas again 2022-12-21 02:51:11 +03:00
1d80bcc8d0 Fix blockstore returning garbage for unstable reads if there is an in-flight version
"In-flight" versions are added into dirty_db when writes are enqueued. And they
weren't ignored by subsequent reads even though they didn't have data location yet.
This bug was leading to test_heal.sh not passing sometimes with replicated setups.
2022-12-21 02:48:24 +03:00
5ef8bed75f Release 0.8.2
- Fix QEMU driver compatibility with QEMU 7.0 and < 2.9
- Add patches for pve-qemu-kvm 7.1 (PVE 7.3) and pve-qemu-kvm 6.2 (PVE 7.2)
- Fix Proxmox driver location in the pve-storage-vitastor package
- Disable HDD autodetection in non-hybrid mode
- Explicitly warn about a buggy kernels on -EAGAIN in io_uring
- Final fix for the lack of zeroing out of old metadata entries
  (do not crash with "big_write journal_entry was allocated over another object"
  in some cases after an unclean OSD shutdown)
- Wait for data writes before fsyncing data if data fsync is enabled
- Never try to wait for free space inside blockstore thus stalling OSDs
- Fix a rare crash in osd_peering due to callback ordering
- Fix a rare duplication of ping & op message IDs
- Fix a rare use-after-free during pings
- Add --force to vitastor-disk read-sb
- Make vitastor-disk dump metadata object IDs in hex, add forgotten commas
- Fix vitastor-disk SCSI disk cache check
2022-12-17 17:54:13 +03:00
8669998e5e Fix discard_list_subop() for local ops 2022-12-17 17:54:13 +03:00
b457327e77 Oops. Fix metadata read after fixes :-) 2022-12-17 17:31:57 +03:00
f7fa9d5e34 Fix SCSI device cache type check 2022-12-17 17:31:57 +03:00
49b88b01f9 Fix clang build 2022-12-17 16:25:26 +03:00
71688bcb59 Disable HDD autodetection in non-hybrid mode 2022-12-17 16:12:15 +03:00
552e207d2b Explicitly print errors about -EAGAIN in io_uring 2022-12-17 15:49:49 +03:00
5464821fa5 Final fix for the lack of zeroing out of old metadata entries
If a crash occurs during flushing a redirect-write it may happen so that
the disk contains both old and new metadata entries. This is OK, but prior
to 0.8.0 after this situation OSDs started without problem, but then they
crashed after some more overwrites with a "tried to overwrite non-zero
metadata entry" error. 0.8.0 introduced a change that was intended to fix
this situation, but rather than fixing it it prevented OSDs from starting,
now because of a "big_write journal_entry was allocated over another object"
error... :-)

This change finally fixes the original issue.

Followup to 54ef2c389f
2022-12-17 14:50:31 +03:00
6917a32ca8 Add --force to vitastor-disk read-sb 2022-12-17 02:47:15 +03:00
f8722a8bd5 Dump meta in hex 2022-12-17 01:50:38 +03:00
9c2f69c9fa Add forgotten commas to vitastor-disk dump-journal 2022-12-17 01:22:58 +03:00
1a93e3f33a Wait for data writes before fsyncing data if data fsync is enabled 2022-12-16 20:46:55 +03:00
3f35744052 Fix compatibility with QEMU aio_set_fd_handler signatures in 7.0 and < 2.9 2022-12-15 19:17:17 +03:00
66f14ac019 Update notes about Proxmox 7.1-7.3 2022-12-15 18:57:28 +03:00
1364009931 Add patches for pve-qemu-kvm 7.1 (PVE 7.3) and pve-qemu-kvm 6.2 (PVE 7.2) 2022-12-14 19:01:36 +03:00
d7e30b8353 Fix pve-storage-vitastor filename 2022-12-14 16:41:35 +03:00
cb437913d3 Never try to wait for free space inside blockstore 2022-12-12 00:27:05 +03:00
472bce58ab Fix rare crash in osd_peering due to callback ordering 2022-12-12 00:27:05 +03:00
7a71e7ef01 Fix possible duplication of ping & op message IDs 2022-12-04 00:16:47 +03:00
c71e5e7bbd Fix possible use-after-free during pings 2022-12-04 00:16:47 +03:00
8fdf30b21f Release 0.8.1
- Remove an additional data copy operation when flushing journal (should
  slightly increase write performance)
- Fix a bug where new writes in the inmemory_journal=false mode could overwrite
  the data currently read by a parallel read operation
- Fix degraded parity writes for EC N+K when K>1 where the bug could also lead
  to an "assertion failed" error
- Fix missing journal space check for "big" writes which could lead to
  "prefill_single_journal_entry(): assertion failed..." error in OSD
- Fix possible "assertion failed: next->prev_wait >= 0" in client in rare cases
- Fix missing "len" field in vitastor-disk write-journal big_writes
- Fix possible crash of a full OSD (ENOSPC)
- Fix CSI build scripts to include newest packages every time
- Fix CSI endpoint in the liveness probe manifest
2022-11-20 11:44:09 +03:00
238037ae31 Make journal trimmer wait until reads are completed when inmemory_journal is false
Without this new writes may in theory overwrite journal data being read at that time
2022-11-20 01:49:21 +03:00
09a8864686 Fix degraded parity writes for EC N+K when K>1
Fixes possible `calc_rmw_parity_ec(): Assertion `bufs[i][curbuf[i]].buf' failed` error
2022-11-20 00:50:13 +03:00
6e6f6ecbb0 Add missing journal space check for big_writes
Fixes possible `prefill_single_journal_entry(): Assertion `!journal.sector_info[journal.cur_sector].flush_count' failed` error
2022-11-20 00:50:13 +03:00
9491f81419 Add missing documentation for OSD tags 2022-11-20 00:50:13 +03:00
44c2b30167 Take newest packages every time when rebuilding CSI 2022-11-20 00:50:13 +03:00
bf8a0581cd Fix possible "assertion failed: next->prev_wait >= 0" in client 2022-11-20 00:50:13 +03:00
5953942042 Add crc32c test utility 2022-11-20 00:50:13 +03:00
a276a1f737 Do not copy journal data additional time when flushing 2022-11-20 00:50:13 +03:00
cc24e5796e Add a FIXME 2022-11-20 00:50:09 +03:00
6e26732e6a Fix skipped "len" field in vitastor-disk write-journal big_writes 2022-11-12 12:01:40 +03:00
b4edc79449 Fix possible segfault on ENOSPC 2022-11-12 11:59:43 +03:00
5f26887d32 Fix csi endpoint in liveness probe 2022-11-10 18:37:37 +03:00
74 changed files with 1454 additions and 308 deletions

View File

@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8)
project(vitastor)
set(VERSION "0.8.0")
set(VERSION "0.8.3")
add_subdirectory(src)

View File

@@ -18,15 +18,19 @@ ENV CSI_ENDPOINT=""
RUN apt-get update && \
apt-get install -y wget && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
(echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update && \
apt-get install -y e2fsprogs xfsprogs vitastor kmod && \
apt-get install -y e2fsprogs xfsprogs kmod && \
apt-get clean && \
(echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
COPY --from=build /app/vitastor-csi /bin/
RUN (echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
apt-get update && \
apt-get install -y vitastor-client && \
apt-get clean
ENTRYPOINT ["/bin/vitastor-csi"]

View File

@@ -1,4 +1,4 @@
VERSION ?= v0.8.0
VERSION ?= v0.8.3
all: build push

View File

@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.8.0
image: vitalif/vitastor-csi:v0.8.3
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
@@ -102,7 +102,7 @@ spec:
- "--health-port=9898"
env:
- name: CSI_ENDPOINT
value: unix://csi/csi.sock
value: unix:///csi/csi.sock
volumeMounts:
- mountPath: /csi
name: socket-dir

View File

@@ -116,7 +116,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.8.0
image: vitalif/vitastor-csi:v0.8.3
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.8.0"
vitastorCSIDriverVersion = "0.8.3"
)
// Config struct fills the parameters of request or user input

4
debian/changelog vendored
View File

@@ -1,10 +1,10 @@
vitastor (0.8.0-1) unstable; urgency=medium
vitastor (0.8.3-1) unstable; urgency=medium
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300
vitastor (0.8.0-1) unstable; urgency=medium
vitastor (0.8.3-1) unstable; urgency=medium
* Implement NFS proxy
* Add documentation

View File

@@ -1 +1 @@
patches/PVE_VitastorPlugin.pm usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm
patches/VitastorPlugin.pm usr/share/perl5/PVE/Storage/Custom/

View File

@@ -34,8 +34,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.8.0; \
cd vitastor-0.8.0; \
cp -r /root/vitastor vitastor-0.8.3; \
cd vitastor-0.8.3; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -48,8 +48,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.8.0.orig.tar.xz vitastor-0.8.0; \
cd vitastor-0.8.0; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.8.3.orig.tar.xz vitastor-0.8.3; \
cd vitastor-0.8.3; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -82,7 +82,7 @@ Parent node reference is required for intermediate tree nodes.
Separate OSD settings are set in etc keys `/vitastor/config/osd/<number>`
in JSON format `{"<key>":<value>}`.
As of now, there is only one setting:
As of now, two settings are supported:
## reweight
@@ -96,6 +96,15 @@ This means an OSD configured with reweight lower than 1 receives less PGs than
it normally would. An OSD with reweight = 0 won't store any data. You can set
reweight to 0 to trigger rebalance and remove all data from an OSD.
## tags
- Type: string or array of strings
Sets tag or multiple tags for this OSD. Tags can be used to group OSDs into
subsets and then use a specific subset for pool instead of all OSDs.
For example you can mark SSD OSDs with tag "ssd" and HDD OSDs with "hdd" and
such tags will work as device classes.
# Pool parameters
## name

View File

@@ -81,7 +81,10 @@
Настройки отдельных OSD задаются в ключах etcd `/vitastor/config/osd/<number>`
в JSON-формате `{"<key>":<value>}`.
На данный момент поддерживается одна настройка:
На данный момент поддерживаются две настройки:
- [reweight](#reweight)
- [tags](#tags)
## reweight
@@ -96,6 +99,15 @@
хранении данных вообще. Вы можете установить reweight в 0, чтобы убрать
все данные с OSD.
## tags
- Тип: строка или массив строк
Задаёт тег или набор тегов для данного OSD. Теги можно использовать, чтобы
делить OSD на множества и потом размещать пул только на части OSD, а не на
всех. Можно, например, пометить SSD OSD тегом "ssd", а HDD тегом "hdd", в
этом смысле теги работают аналогично классам устройств.
# Параметры
## name

View File

@@ -6,10 +6,10 @@
# Proxmox VE
To enable Vitastor support in Proxmox Virtual Environment (6.4 and 7.1 are supported):
To enable Vitastor support in Proxmox Virtual Environment (6.4-7.3 are supported):
- Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts
(buster for 6.4, bullseye for 7.1)
- Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts:
buster for 6.4, bullseye for 7.3, pve7.1 for 7.1, pve7.2 for 7.2
- Install vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from Vitastor repository
- Define storage in `/etc/pve/storage.cfg` (see below)
- Block network access from VMs to Vitastor network (to OSDs and etcd),
@@ -35,5 +35,5 @@ vitastor: vitastor
vitastor_nbd 0
```
\* Note: you can also manually copy [patches/PVE_VitastorPlugin.pm](patches/PVE_VitastorPlugin.pm) to Proxmox hosts
\* Note: you can also manually copy [patches/VitastorPlugin.pm](patches/VitastorPlugin.pm) to Proxmox hosts
as `/usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm` instead of installing pve-storage-vitastor.

View File

@@ -6,10 +6,10 @@
# Proxmox
Чтобы подключить Vitastor к Proxmox Virtual Environment (поддерживаются версии 6.4 и 7.1):
Чтобы подключить Vitastor к Proxmox Virtual Environment (поддерживаются версии 6.4-7.3):
- Добавьте соответствующий Debian-репозиторий Vitastor в sources.list на хостах Proxmox
(buster для 6.4, bullseye для 7.1)
- Добавьте соответствующий Debian-репозиторий Vitastor в sources.list на хостах Proxmox:
buster для 6.4, bullseye для 7.3, pve7.1 для 7.1, pve7.2 для 7.2
- Установите пакеты vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* или см. сноску) из репозитория Vitastor
- Определите тип хранилища в `/etc/pve/storage.cfg` (см. ниже)
- Обязательно заблокируйте доступ от виртуальных машин к сети Vitastor (OSD и etcd), т.к. Vitastor (пока) не поддерживает аутентификацию
@@ -35,5 +35,5 @@ vitastor: vitastor
```
\* Примечание: вместо установки пакета pve-storage-vitastor вы можете вручную скопировать файл
[patches/PVE_VitastorPlugin.pm](patches/PVE_VitastorPlugin.pm) на хосты Proxmox как
[patches/VitastorPlugin.pm](patches/VitastorPlugin.pm) на хосты Proxmox как
`/usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm`.

View File

@@ -20,6 +20,7 @@ It supports the following commands:
- [rm-data](#rm-data)
- [merge-data](#merge-data)
- [alloc-osd](#alloc-osd)
- [rm-osd](#rm-osd)
Global options:
@@ -175,3 +176,14 @@ Merge layer data without changing metadata. Merge `<from>`..`<to>` to `<target>`
`vitastor-cli alloc-osd`
Allocate a new OSD number and reserve it by creating empty `/osd/stats/<n>` key.
## rm-osd
`vitastor-cli rm-osd [--force] [--allow-data-loss] [--dry-run] <osd_id> [osd_id...]`
Remove metadata and configuration for specified OSD(s) from etcd.
Refuses to remove OSDs with data without `--force` and `--allow-data-loss`.
With `--dry-run` only checks if deletion is possible without data loss and
redundancy degradation.

View File

@@ -21,6 +21,7 @@ vitastor-cli - интерфейс командной строки для адм
- [rm-data](#rm-data)
- [merge-data](#merge-data)
- [alloc-osd](#alloc-osd)
- [rm-osd](#rm-osd)
Глобальные опции:
@@ -186,3 +187,14 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
Атомарно выделить новый номер OSD и зарезервировать его, создав в etcd пустой
ключ `/osd/stats/<n>`.
## rm-osd
`vitastor-cli rm-osd [--force] [--allow-data-loss] [--dry-run] <osd_id> [osd_id...]`
Удалить метаданные и конфигурацию для заданных OSD из etcd.
Отказывается удалять OSD с данными без опций `--force` и `--allow-data-loss`.
С опцией `--dry-run` только проверяет, возможно ли удаление без потери данных и деградации
избыточности.

View File

@@ -14,6 +14,7 @@ It supports the following commands:
- [upgrade-simple](#upgrade-simple)
- [resize](#resize)
- [start/stop/restart/enable/disable](#start/stop/restart/enable/disable)
- [purge](#purge)
- [read-sb](#read-sb)
- [write-sb](#write-sb)
- [udev](#udev)
@@ -155,11 +156,22 @@ Commands are passed to `systemctl` with `vitastor-osd@<num>` units as arguments.
When `--now` is added to enable/disable, OSDs are also immediately started/stopped.
## purge
`vitastor-disk purge [--force] [--allow-data-loss] <device> [device2 device3 ...]`
Purge Vitastor OSD(s) on specified device(s). Uses `vitastor-cli rm-osd` to check
if deletion is possible without data loss and to actually remove metadata from etcd.
`--force` and `--allow-data-loss` options may be used to ignore safety check results.
Requires `vitastor-cli`, `sfdisk` and `partprobe` (from parted) utilities.
## read-sb
`vitastor-disk read-sb <device>`
`vitastor-disk read-sb [--force] <device>`
Try to read Vitastor OSD superblock from `<device>` and print it in JSON format.
`--force` allows to ignore validation errors.
## write-sb

View File

@@ -14,6 +14,7 @@ vitastor-disk - инструмент командной строки для уп
- [upgrade-simple](#upgrade-simple)
- [resize](#resize)
- [start/stop/restart/enable/disable](#start/stop/restart/enable/disable)
- [purge](#purge)
- [read-sb](#read-sb)
- [write-sb](#write-sb)
- [udev](#udev)
@@ -158,12 +159,25 @@ throttle_target_mbs, throttle_target_parallelism, throttle_threshold_us.
Когда к командам включения/выключения добавляется параметр `--now`, OSD также сразу
запускаются/останавливаются.
## purge
`vitastor-disk purge [--force] [--allow-data-loss] <device> [device2 device3 ...]`
Удалить OSD на заданном диске/дисках. Использует `vitastor-cli rm-osd` для проверки
возможности удаления без потери данных и для удаления OSD из etcd. Опции `--force`
и `--allow-data-loss` служат для обхода данной защиты в случае необходимости.
Команде требуются утилиты `vitastor-cli`, `sfdisk` и `partprobe` (из состава parted).
## read-sb
`vitastor-disk read-sb <device>`
`vitastor-disk read-sb [--force] <device>`
Прочитать суперблок OSD с диска `<device>` и вывести его в формате JSON.
Опция `--force` позволяет читать суперблок, даже если он считается некорректным
из-за ошибок валидации.
## write-sb
`vitastor-disk write-sb <device>`

View File

@@ -1693,6 +1693,7 @@ class Mon
// Do not clear these to null
kv.value = kv.value || {};
}
const old = cur[key_parts[key_parts.length-1]];
cur[key_parts[key_parts.length-1]] = kv.value;
if (key === 'config/global')
{
@@ -1717,6 +1718,11 @@ class Mon
}
else if (key_parts[0] === 'osd' && key_parts[1] === 'stats')
{
// Recheck OSD tree on OSD addition/deletion
if ((!old) != (!kv.value) || old && kv.value && (old.size != kv.value.size || old.time != kv.value.time))
{
this.schedule_recheck();
}
// Recheck PGs <osd_out_time> later
this.schedule_next_recheck_at(
!this.state.osd.stats[key[2]] ? 0 : this.state.osd.stats[key[2]].time+this.config.osd_out_time

View File

@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.8.0'
VERSION = '0.8.3'
LOG = logging.getLogger(__name__)

View File

@@ -0,0 +1,169 @@
Index: qemu/block/meson.build
===================================================================
--- qemu.orig/block/meson.build
+++ qemu/block/meson.build
@@ -91,6 +91,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
Index: qemu/meson.build
===================================================================
--- qemu.orig/meson.build
+++ qemu/meson.build
@@ -838,6 +838,26 @@ if not get_option('rbd').auto() or have_
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1459,6 +1479,7 @@ config_host_data.set('CONFIG_LINUX_AIO',
config_host_data.set('CONFIG_LINUX_IO_URING', linux_io_uring.found())
config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
config_host_data.set('CONFIG_SECCOMP', seccomp.found())
@@ -3424,6 +3445,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'xfsctl support': config_host.has_key('CONFIG_XFS')}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
Index: qemu/meson_options.txt
===================================================================
--- qemu.orig/meson_options.txt
+++ qemu/meson_options.txt
@@ -121,6 +121,8 @@ option('lzo', type : 'feature', value :
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('gtk', type : 'feature', value : 'auto',
description: 'GTK+ user interface')
option('sdl', type : 'feature', value : 'auto',
Index: qemu/qapi/block-core.json
===================================================================
--- qemu.orig/qapi/block-core.json
+++ qemu/qapi/block-core.json
@@ -3179,7 +3179,7 @@
'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
'pbs',
- 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor', 'vmdk', 'vpc', 'vvfat' ] }
##
# @BlockdevOptionsFile:
@@ -4125,6 +4125,28 @@
'*server': ['InetSocketAddressBase'] } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4520,6 +4542,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'vmdk': 'BlockdevOptionsGenericCOWFormat',
'vpc': 'BlockdevOptionsGenericFormat',
'vvfat': 'BlockdevOptionsVVFAT'
@@ -4910,6 +4933,17 @@
'*encrypt' : 'RbdEncryptionCreateOptions' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -5108,6 +5142,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
Index: qemu/scripts/ci/org.centos/stream/8/x86_64/configure
===================================================================
--- qemu.orig/scripts/ci/org.centos/stream/8/x86_64/configure
+++ qemu/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -31,7 +31,7 @@
--with-git=meson \
--with-git-submodules=update \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -183,6 +183,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \

View File

@@ -0,0 +1,169 @@
Index: qemu/block/meson.build
===================================================================
--- qemu.orig/block/meson.build
+++ qemu/block/meson.build
@@ -111,6 +111,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
Index: qemu/meson.build
===================================================================
--- qemu.orig/meson.build
+++ qemu/meson.build
@@ -967,6 +967,26 @@ if not get_option('rbd').auto() or have_
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1802,6 +1822,7 @@ config_host_data.set('CONFIG_NUMA', numa
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_PROFILER', get_option('profiler'))
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
@@ -3965,6 +3986,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
Index: qemu/meson_options.txt
===================================================================
--- qemu.orig/meson_options.txt
+++ qemu/meson_options.txt
@@ -167,6 +167,8 @@ option('lzo', type : 'feature', value :
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
Index: qemu/qapi/block-core.json
===================================================================
--- qemu.orig/qapi/block-core.json
+++ qemu/qapi/block-core.json
@@ -3209,7 +3209,7 @@
'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
'pbs',
- 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor', 'vmdk', 'vpc', 'vvfat' ] }
##
# @BlockdevOptionsFile:
@@ -4149,6 +4149,28 @@
'*server': ['InetSocketAddressBase'] } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4593,6 +4615,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'vmdk': 'BlockdevOptionsGenericCOWFormat',
'vpc': 'BlockdevOptionsGenericFormat',
'vvfat': 'BlockdevOptionsVVFAT'
@@ -4985,6 +5008,17 @@
'*encrypt' : 'RbdEncryptionCreateOptions' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -5182,6 +5216,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
Index: qemu/scripts/ci/org.centos/stream/8/x86_64/configure
===================================================================
--- qemu.orig/scripts/ci/org.centos/stream/8/x86_64/configure
+++ qemu/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -31,7 +31,7 @@
--with-git=meson \
--with-git-submodules=update \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -179,6 +179,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \

View File

@@ -9,7 +9,7 @@ for i in "$DIR"/qemu-*-vitastor.patch "$DIR"/pve-qemu-*-vitastor.patch; do
echo '===================================================================' >> $i
echo '--- /dev/null' >> $i
echo '+++ a/block/vitastor.c' >> $i
echo '@@ -0,0 +1,'$(wc -l "$DIR"/../src/qemu_driver.c)' @@' >> $i
echo '@@ -0,0 +1,'$(wc -l "$DIR"/../src/qemu_driver.c | cut -d ' ' -f 1)' @@' >> $i
cat "$DIR"/../src/qemu_driver.c | sed 's/^/+/' >> $i
fi
done

View File

@@ -25,4 +25,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.8.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.8.0$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.8.3/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.8.3$(rpm --eval '%dist').tar.gz *

View File

@@ -58,7 +58,7 @@
+BuildRequires: gperftools-devel
+BuildRequires: libusbx-devel >= 1.0.21
%if %{have_usbredir}
BuildRequires: usbredir-devel >= 0.8.0
BuildRequires: usbredir-devel >= 0.8.2
%endif
@@ -856,12 +861,13 @@ BuildRequires: virglrenderer-devel
# For smartcard NSS support

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.8.0.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.8.3.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.8.0
Version: 0.8.3
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.8.0.el7.tar.gz
Source0: vitastor-0.8.3.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.8.0.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.8.3.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.8.0
Version: 0.8.3
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.8.0.el8.tar.gz
Source0: vitastor-0.8.3.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@@ -15,7 +15,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.8.0")
add_definitions(-DVERSION="0.8.3")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@@ -140,6 +140,7 @@ add_library(vitastor_client SHARED
cli_merge.cpp
cli_rm_data.cpp
cli_rm.cpp
cli_rm_osd.cpp
)
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client
@@ -263,6 +264,14 @@ target_link_libraries(test_cas
vitastor_client
)
# test_crc32
add_executable(test_crc32
test_crc32.cpp
)
target_link_libraries(test_crc32
vitastor_blk
)
# test_cluster_client
add_executable(test_cluster_client
test_cluster_client.cpp

View File

@@ -35,24 +35,14 @@ journal_flusher_co::journal_flusher_co()
{
bs->live = true;
if (data->res != data->iov.iov_len)
{
throw std::runtime_error(
"data read operation failed during flush ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). can't continue, sorry :-("
);
}
bs->disk_error_abort("read operation during flush", data->res, data->iov.iov_len);
wait_count--;
};
simple_callback_w = [this](ring_data_t* data)
{
bs->live = true;
if (data->res != data->iov.iov_len)
{
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). state "+std::to_string(wait_state)+". in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
);
}
bs->disk_error_abort("write operation during flush", data->res, data->iov.iov_len);
wait_count--;
};
}
@@ -87,7 +77,7 @@ void journal_flusher_t::loop()
cur_flusher_count--;
}
}
for (int i = 0; (active_flushers > 0 || dequeuing) && i < cur_flusher_count; i++)
for (int i = 0; (active_flushers > 0 || dequeuing || trim_wanted > 0) && i < cur_flusher_count; i++)
co[i].loop();
}
@@ -306,6 +296,8 @@ bool journal_flusher_co::loop()
goto resume_20;
else if (wait_state == 21)
goto resume_21;
else if (wait_state == 22)
goto resume_22;
resume_0:
if (flusher->flush_queue.size() < flusher->min_flusher_count && !flusher->trim_wanted ||
!flusher->flush_queue.size() || !flusher->dequeuing)
@@ -511,6 +503,13 @@ resume_1:
);
wait_count++;
}
// Wait for data writes before fsyncing it
resume_22:
if (wait_count > 0)
{
wait_state = 22;
return false;
}
// Sync data before writing metadata
resume_16:
resume_17:
@@ -521,7 +520,7 @@ resume_1:
return false;
}
resume_5:
// And metadata writes, but only after data writes complete
// Submit metadata writes, but only when data is written and fsynced
if (!bs->inmemory_meta && meta_new.it->second.state == 0 || wait_count > 0)
{
// metadata sector is still being read or data is still being written, wait for it
@@ -615,7 +614,12 @@ resume_1:
}
for (it = v.begin(); it != v.end(); it++)
{
free(it->buf);
// Free it if it's not taken from the journal
if (it->buf && (!bs->journal.inmemory || it->buf < bs->journal.buffer ||
it->buf >= (uint8_t*)bs->journal.buffer + bs->journal.len))
{
free(it->buf);
}
}
v.clear();
// And sync metadata (in batches - not per each operation!)
@@ -760,16 +764,17 @@ bool journal_flusher_co::scan_dirty(int wait_base)
{
submit_offset = dirty_it->second.location + offset - dirty_it->second.offset;
submit_len = it == v.end() || it->offset >= end_offset ? end_offset-offset : it->offset-offset;
it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len, .buf = memalign_or_die(MEM_ALIGNMENT, submit_len) });
it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len });
copy_count++;
if (bs->journal.inmemory)
{
// Take it from memory
memcpy(it->buf, (uint8_t*)bs->journal.buffer + submit_offset, submit_len);
// Take it from memory, don't copy it
it->buf = (uint8_t*)bs->journal.buffer + submit_offset;
}
else
{
// Read it from disk
it->buf = memalign_or_die(MEM_ALIGNMENT, submit_len);
await_sqe(0);
data->iov = (struct iovec){ it->buf, (size_t)submit_len };
data->callback = simple_callback_r;

View File

@@ -107,7 +107,7 @@ void blockstore_impl_t::loop()
// has_writes == 0 - no writes before the current queue item
// has_writes == 1 - some writes in progress
// has_writes == 2 - tried to submit some writes, but failed
int has_writes = 0, op_idx = 0, new_idx = 0, done_lists = 0;
int has_writes = 0, op_idx = 0, new_idx = 0;
for (; op_idx < submit_queue.size(); op_idx++, new_idx++)
{
auto op = submit_queue[op_idx];
@@ -188,13 +188,8 @@ void blockstore_impl_t::loop()
else if (op->opcode == BS_OP_LIST)
{
// LIST doesn't have to be blocked by previous modifications
// But don't do a lot of LISTs at once, because they're blocking and potentially slow
if (single_tick_list_limit <= 0 || done_lists < single_tick_list_limit)
{
process_list(op);
done_lists++;
wr_st = 2;
}
process_list(op);
wr_st = 2;
}
if (wr_st == 2)
{
@@ -306,17 +301,6 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
// do not submit
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for a journal buffer\n");
#endif
return;
}
PRIV(op)->wait_for = 0;
}
else if (PRIV(op)->wait_for == WAIT_FREE)
{
if (!data_alloc->get_free_count() && flusher->is_active())
{
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for free space on the data device\n");
#endif
return;
}
@@ -687,3 +671,16 @@ void blockstore_impl_t::dump_diagnostics()
journal.dump_diagnostics();
flusher->dump_diagnostics();
}
void blockstore_impl_t::disk_error_abort(const char *op, int retval, int expected)
{
if (retval == -EAGAIN)
{
fprintf(stderr, "EAGAIN error received from a disk %s during flush."
" It must never happen with io_uring and indicates a kernel bug."
" Please upgrade your kernel. Aborting.\n", op);
exit(1);
}
fprintf(stderr, "Disk %s failed: result is %d, expected %d. Can't continue, sorry :-(\n", op, retval, expected);
exit(1);
}

View File

@@ -160,12 +160,11 @@ struct __attribute__((__packed__)) dirty_entry
#define WAIT_JOURNAL 3
// Suspend operation until the next journal sector buffer is free
#define WAIT_JOURNAL_BUFFER 4
// Suspend operation until there is some free space on the data device
#define WAIT_FREE 5
struct fulfill_read_t
{
uint64_t offset, len;
uint64_t journal_sector; // sector+1 if used and !journal.inmemory, otherwise 0
};
#define PRIV(op) ((blockstore_op_private_t*)(op)->private_data)
@@ -241,8 +240,6 @@ class blockstore_impl_t
int throttle_target_parallelism = 1;
// Minimum difference in microseconds between target and real execution times to throttle the response
int throttle_threshold_us = 50;
// Maximum number of LIST operations to be processed between
int single_tick_list_limit = 1;
/******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer;
@@ -293,6 +290,7 @@ class blockstore_impl_t
// Journaling
void prepare_journal_sector_write(int sector, blockstore_op_t *op);
void handle_journal_write(ring_data_t *data, uint64_t flush_id);
void disk_error_abort(const char *op, int retval, int expected);
// Asynchronous init
int initialized;
@@ -305,7 +303,7 @@ class blockstore_impl_t
// Read
int dequeue_read(blockstore_op_t *read_op);
int fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
uint32_t item_state, uint64_t item_version, uint64_t item_location);
uint32_t item_state, uint64_t item_version, uint64_t item_location, uint64_t journal_sector);
int fulfill_read_push(blockstore_op_t *op, void *buf, uint64_t offset, uint64_t len,
uint32_t item_state, uint64_t item_version);
void handle_read_event(ring_data_t *data, blockstore_op_t *op);

View File

@@ -48,14 +48,12 @@ void blockstore_init_meta::handle_event(ring_data_t *data, int buf_num)
int blockstore_init_meta::loop()
{
if (wait_state == 1)
goto resume_1;
else if (wait_state == 2)
goto resume_2;
else if (wait_state == 3)
goto resume_3;
else if (wait_state == 4)
goto resume_4;
if (wait_state == 1) goto resume_1;
else if (wait_state == 2) goto resume_2;
else if (wait_state == 3) goto resume_3;
else if (wait_state == 4) goto resume_4;
else if (wait_state == 5) goto resume_5;
else if (wait_state == 6) goto resume_6;
printf("Reading blockstore metadata\n");
if (bs->inmemory_meta)
metadata_buffer = bs->metadata_buffer;
@@ -140,6 +138,7 @@ resume_1:
// Skip superblock
md_offset = bs->dsk.meta_block_size;
next_offset = md_offset;
entries_per_block = bs->dsk.meta_block_size / bs->dsk.clean_entry_size;
// Read the rest of the metadata
resume_2:
if (next_offset < bs->dsk.meta_len && submitted == 0)
@@ -179,17 +178,15 @@ resume_2:
if (bufs[i].state == INIT_META_READ_DONE)
{
// Handle result
unsigned entries_per_block = bs->dsk.meta_block_size / bs->dsk.clean_entry_size;
bool changed = false;
for (uint64_t sector = 0; sector < bufs[i].size; sector += bs->dsk.meta_block_size)
{
// handle <count> entries
changed = changed || handle_entries(
bufs[i].buf + sector, entries_per_block,
((bufs[i].offset + sector - md_offset) / bs->dsk.meta_block_size) * entries_per_block
);
if (handle_meta_block(bufs[i].buf + sector, entries_per_block,
((bufs[i].offset + sector - md_offset) / bs->dsk.meta_block_size) * entries_per_block))
changed = true;
}
if (changed && !bs->inmemory_meta)
if (changed && !bs->inmemory_meta && !bs->readonly)
{
// write the modified buffer back
GET_SQE();
@@ -211,6 +208,43 @@ resume_2:
wait_state = 2;
return 1;
}
if (entries_to_zero.size() && !bs->inmemory_meta && !bs->readonly)
{
// we have to zero out additional entries
for (i = 0; i < entries_to_zero.size(); )
{
next_offset = entries_to_zero[i]/entries_per_block;
for (j = i; j < entries_to_zero.size() && entries_to_zero[j]/entries_per_block == next_offset; j++) {}
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
submitted++;
resume_5:
if (submitted > 0)
{
wait_state = 5;
return 1;
}
for (; i < j; i++)
{
uint64_t pos = (entries_to_zero[i] % entries_per_block);
memset((uint8_t*)metadata_buffer + pos*bs->dsk.clean_entry_size, 0, bs->dsk.clean_entry_size);
}
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
submitted++;
resume_6:
if (submitted > 0)
{
wait_state = 6;
return 1;
}
}
entries_to_zero.clear();
}
// metadata read finished
printf("Metadata entries loaded: %lu, free blocks: %lu / %lu\n", entries_loaded, bs->data_alloc->get_free_count(), bs->dsk.block_count);
if (!bs->inmemory_meta)
@@ -236,10 +270,13 @@ resume_2:
return 0;
}
bool blockstore_init_meta::handle_entries(uint8_t *buf, uint64_t count, uint64_t done_cnt)
bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_block, uint64_t done_cnt)
{
bool updated = false;
for (uint64_t i = 0; i < count; i++)
uint64_t max_i = entries_per_block;
if (max_i > bs->dsk.block_count-done_cnt)
max_i = bs->dsk.block_count-done_cnt;
for (uint64_t i = 0; i < max_i; i++)
{
clean_disk_entry *entry = (clean_disk_entry*)(buf + i*bs->dsk.clean_entry_size);
if (!bs->inmemory_meta && bs->dsk.clean_entry_bitmap_size)
@@ -255,17 +292,35 @@ bool blockstore_init_meta::handle_entries(uint8_t *buf, uint64_t count, uint64_t
if (clean_it != clean_db.end())
{
// free the previous block
// here we have to zero out the entry because otherwise we'll hit
// here we have to zero out the previous entry because otherwise we'll hit
// "tried to overwrite non-zero metadata entry" later
updated = true;
memset(entry, 0, bs->dsk.clean_entry_size);
uint64_t old_clean_loc = clean_it->second.location >> bs->dsk.block_order;
if (bs->inmemory_meta)
{
uint64_t sector = (old_clean_loc / entries_per_block) * bs->dsk.meta_block_size;
uint64_t pos = (old_clean_loc % entries_per_block);
clean_disk_entry *old_entry = (clean_disk_entry*)((uint8_t*)bs->metadata_buffer + sector + pos*bs->dsk.clean_entry_size);
memset(old_entry, 0, bs->dsk.clean_entry_size);
}
else if (old_clean_loc >= done_cnt)
{
updated = true;
uint64_t sector = ((old_clean_loc - done_cnt) / entries_per_block) * bs->dsk.meta_block_size;
uint64_t pos = (old_clean_loc % entries_per_block);
clean_disk_entry *old_entry = (clean_disk_entry*)(buf + sector + pos*bs->dsk.clean_entry_size);
memset(old_entry, 0, bs->dsk.clean_entry_size);
}
else
{
entries_to_zero.push_back(clean_it->second.location >> bs->dsk.block_order);
}
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
clean_it->second.location >> bs->dsk.block_order,
old_clean_loc,
clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
done_cnt+i);
#endif
bs->data_alloc->set(clean_it->second.location >> bs->dsk.block_order, false);
bs->data_alloc->set(old_clean_loc, false);
}
else
{

View File

@@ -24,7 +24,10 @@ class blockstore_init_meta
uint64_t md_offset = 0;
uint64_t next_offset = 0;
uint64_t entries_loaded = 0;
bool handle_entries(uint8_t *buf, uint64_t count, uint64_t done_cnt);
unsigned entries_per_block = 0;
int i = 0, j = 0;
std::vector<uint64_t> entries_to_zero;
bool handle_meta_block(uint8_t *buf, uint64_t count, uint64_t done_cnt);
void handle_event(ring_data_t *data, int buf_num);
public:
blockstore_init_meta(blockstore_impl_t *bs);

View File

@@ -198,10 +198,7 @@ void blockstore_impl_t::handle_journal_write(ring_data_t *data, uint64_t flush_i
if (data->res != data->iov.iov_len)
{
// FIXME: our state becomes corrupted after a write error. maybe do something better than just die
throw std::runtime_error(
"journal write failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
);
disk_error_abort("journal write", data->res, data->iov.iov_len);
}
auto fl_it = journal.flushing_ops.upper_bound((pending_journaling_t){ .flush_id = flush_id });
if (fl_it != journal.flushing_ops.end() && fl_it->flush_id == flush_id)

View File

@@ -42,7 +42,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
// FIXME I've seen a bug here so I want some tests
int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
uint32_t item_state, uint64_t item_version, uint64_t item_location)
uint32_t item_state, uint64_t item_version, uint64_t item_location, uint64_t journal_sector)
{
uint32_t cur_start = item_start;
if (cur_start < read_op->offset + read_op->len && item_end > read_op->offset)
@@ -72,6 +72,7 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
fulfill_read_t el = {
.offset = cur_start,
.len = it == PRIV(read_op)->read_vec.end() || it->offset >= item_end ? item_end-cur_start : it->offset-cur_start,
.journal_sector = journal_sector,
};
it = PRIV(read_op)->read_vec.insert(it, el);
if (!fulfill_read_push(read_op,
@@ -138,7 +139,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
while (dirty_it->first.oid == read_op->oid)
{
dirty_entry& dirty = dirty_it->second;
bool version_ok = read_op->version >= dirty_it->first.version;
bool version_ok = !IS_IN_FLIGHT(dirty.state) && read_op->version >= dirty_it->first.version;
if (IS_SYNCED(dirty.state))
{
if (!version_ok && read_op->version != 0)
@@ -156,8 +157,10 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
memcpy(read_op->bitmap, bmp_ptr, dsk.clean_entry_bitmap_size);
}
}
// If inmemory_journal is false, journal trim will have to wait until the read is completed
if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset)))
dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset),
(IS_JOURNAL(dirty.state) ? dirty.journal_sector+1 : 0)))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
@@ -171,7 +174,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
dirty_it--;
}
}
if (clean_it != clean_db.end())
if (clean_found)
{
if (!result_version)
{
@@ -186,7 +189,8 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{
if (!dsk.clean_entry_bitmap_size)
{
if (!fulfill_read(read_op, fulfilled, 0, dsk.data_block_size, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0, clean_it->second.location))
if (!fulfill_read(read_op, fulfilled, 0, dsk.data_block_size,
(BS_ST_BIG_WRITE | BS_ST_STABLE), 0, clean_it->second.location, 0))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
@@ -207,7 +211,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{
// fill with zeroes
assert(fulfill_read(read_op, fulfilled, bmp_start * dsk.bitmap_granularity,
bmp_end * dsk.bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
bmp_end * dsk.bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0, 0));
}
bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
@@ -218,7 +222,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{
if (!fulfill_read(read_op, fulfilled, bmp_start * dsk.bitmap_granularity,
bmp_end * dsk.bitmap_granularity, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0,
clean_it->second.location + bmp_start * dsk.bitmap_granularity))
clean_it->second.location + bmp_start * dsk.bitmap_granularity, 0))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
@@ -233,7 +237,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
else if (fulfilled < read_op->len)
{
// fill remaining parts with zeroes
assert(fulfill_read(read_op, fulfilled, 0, dsk.data_block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
assert(fulfill_read(read_op, fulfilled, 0, dsk.data_block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0, 0));
}
assert(fulfilled == read_op->len);
read_op->version = result_version;
@@ -249,6 +253,15 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
FINISH_OP(read_op);
return 2;
}
if (!journal.inmemory)
{
// Journal trim has to wait until the read is completed - record journal sector usage
for (auto & rv: PRIV(read_op)->read_vec)
{
if (rv.journal_sector)
journal.used_sectors[rv.journal_sector-1]++;
}
}
read_op->retval = 0;
return 2;
}
@@ -264,6 +277,19 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
}
if (PRIV(op)->pending_ops == 0)
{
if (!journal.inmemory)
{
// Release journal sector usage
for (auto & rv: PRIV(op)->read_vec)
{
if (rv.journal_sector)
{
auto used = --journal.used_sectors[rv.journal_sector-1];
if (used == 0)
journal.used_sectors.erase(rv.journal_sector-1);
}
}
}
if (op->retval == 0)
op->retval = op->len;
FINISH_OP(op);

View File

@@ -222,7 +222,7 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
#endif
data_alloc->set(dirty_it->second.location >> dsk.block_order, false);
}
int used = --journal.used_sectors[dirty_it->second.journal_sector];
auto used = --journal.used_sectors[dirty_it->second.journal_sector];
#ifdef BLOCKSTORE_DEBUG
printf(
"remove usage of journal offset %08lx by %lx:%lx v%lu (%d refs)\n", dirty_it->second.journal_sector,

View File

@@ -139,7 +139,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
uint8_t *bmp_ptr = (uint8_t*)(dsk.clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp);
uint32_t bit = op->offset/dsk.bitmap_granularity;
uint32_t bits_left = op->len/dsk.bitmap_granularity;
while (!(bit % 8) && bits_left > 8)
while (!(bit % 8) && bits_left >= 8)
{
// Copy bytes
bmp_ptr[bit/8] = ((uint8_t*)op->bitmap)[bit/8];
@@ -182,7 +182,8 @@ void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_
bool found = false;
for (auto other_op: submit_queue)
{
if (!found && other_op == op)
// <op> may be present in queue multiple times due to moving operations in submit_queue
if (other_op == op)
found = true;
else if (found && other_op->oid == op->oid &&
(other_op->opcode == BS_OP_WRITE || other_op->opcode == BS_OP_WRITE_STABLE))
@@ -260,12 +261,6 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (loc == UINT64_MAX)
{
// no space
if (flusher->is_active())
{
// hope that some space will be available after flush
PRIV(op)->wait_for = WAIT_FREE;
return 0;
}
cancel_all_writes(op, dirty_it, -ENOSPC);
return 2;
}
@@ -448,6 +443,12 @@ int blockstore_impl_t::continue_write(blockstore_op_t *op)
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, 1,
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
BS_SUBMIT_CHECK_SQES(1);
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
@@ -585,10 +586,7 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
if (data->res != data->iov.iov_len)
{
// FIXME: our state becomes corrupted after a write error. maybe do something better than just die
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
);
disk_error_abort("data write", data->res, data->iov.iov_len);
}
PRIV(op)->pending_ops--;
assert(PRIV(op)->pending_ops >= 0);

View File

@@ -76,6 +76,12 @@ static const char* help_text =
"vitastor-cli alloc-osd\n"
" Allocate a new OSD number and reserve it by creating empty /osd/stats/<n> key.\n"
"\n"
"vitastor-cli rm-osd [--force] [--allow-data-loss] [--dry-run] <osd_id> [osd_id...]\n"
" Remove metadata and configuration for specified OSD(s) from etcd.\n"
" Refuses to remove OSDs with data without --force and --allow-data-loss.\n"
" With --dry-run only checks if deletion is possible without data loss and\n"
" redundancy degradation.\n"
"\n"
"Use vitastor-cli --help <command> for command details or vitastor-cli --help --all for all details.\n"
"\n"
"GLOBAL OPTIONS:\n"
@@ -95,43 +101,47 @@ static json11::Json::object parse_args(int narg, const char *args[])
cfg["progress"] = "1";
for (int i = 1; i < narg; i++)
{
if (args[i][0] == '-' && args[i][1] == 'h')
if (args[i][0] == '-' && args[i][1] == 'h' && args[i][2] == 0)
{
cfg["help"] = "1";
}
else if (args[i][0] == '-' && args[i][1] == 'l')
else if (args[i][0] == '-' && args[i][1] == 'l' && args[i][2] == 0)
{
cfg["long"] = "1";
}
else if (args[i][0] == '-' && args[i][1] == 'n')
else if (args[i][0] == '-' && args[i][1] == 'n' && args[i][2] == 0)
{
cfg["count"] = args[++i];
}
else if (args[i][0] == '-' && args[i][1] == 'p')
else if (args[i][0] == '-' && args[i][1] == 'p' && args[i][2] == 0)
{
cfg["pool"] = args[++i];
}
else if (args[i][0] == '-' && args[i][1] == 's')
else if (args[i][0] == '-' && args[i][1] == 's' && args[i][2] == 0)
{
cfg["size"] = args[++i];
}
else if (args[i][0] == '-' && args[i][1] == 'r')
else if (args[i][0] == '-' && args[i][1] == 'r' && args[i][2] == 0)
{
cfg["reverse"] = "1";
}
else if (args[i][0] == '-' && args[i][1] == 'f')
else if (args[i][0] == '-' && args[i][1] == 'f' && args[i][2] == 0)
{
cfg["force"] = "1";
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = i == narg-1 || !strcmp(opt, "json") || !strcmp(opt, "wait-list") ||
!strcmp(opt, "long") || !strcmp(opt, "del") || !strcmp(opt, "no-color") ||
cfg[opt] = i == narg-1 || !strcmp(opt, "json") ||
!strcmp(opt, "wait-list") || !strcmp(opt, "wait_list") ||
!strcmp(opt, "long") || !strcmp(opt, "del") ||
!strcmp(opt, "no-color") || !strcmp(opt, "no_color") ||
!strcmp(opt, "readonly") || !strcmp(opt, "readwrite") ||
!strcmp(opt, "force") || !strcmp(opt, "reverse") ||
!strcmp(opt, "allow-data-loss") || !strcmp(opt, "allow_data_loss") ||
!strcmp(opt, "dry-run") || !strcmp(opt, "dry_run") ||
!strcmp(opt, "help") || !strcmp(opt, "all") ||
!strcmp(opt, "writers-stopped") && strcmp("1", args[i+1]) != 0
(!strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped")) && strcmp("1", args[i+1]) != 0
? "1" : args[++i];
}
else
@@ -139,10 +149,6 @@ static json11::Json::object parse_args(int narg, const char *args[])
cmd.push_back(std::string(args[i]));
}
}
if (cfg["help"].bool_value())
{
print_help(help_text, "vitastor-cli", cmd.size() ? cmd[0].string_value() : "", cfg["all"].bool_value());
}
if (!cmd.size())
{
std::string exe(exe_name);
@@ -151,6 +157,10 @@ static json11::Json::object parse_args(int narg, const char *args[])
cmd.push_back("rm-data");
}
}
if (!cmd.size() || cfg["help"].bool_value())
{
print_help(help_text, "vitastor-cli", cmd.size() ? cmd[0].string_value() : "", cfg["all"].bool_value());
}
cfg["command"] = cmd;
return cfg;
}
@@ -225,6 +235,16 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
// Delete inode data
action_cb = p->start_rm_data(cfg);
}
else if (cmd[0] == "rm-osd")
{
// Delete OSD metadata from etcd
if (cmd.size() > 1)
{
cmd.erase(cmd.begin(), cmd.begin()+1);
cfg["osd_id"] = cmd;
}
action_cb = p->start_rm_osd(cfg);
}
else if (cmd[0] == "merge-data")
{
// Merge layer data without affecting metadata

View File

@@ -45,7 +45,7 @@ public:
cli_result_t etcd_err;
json11::Json etcd_result;
void parse_config(json11::Json cfg);
void parse_config(json11::Json::object & cfg);
void change_parent(inode_t cur, inode_t new_parent, cli_result_t *result);
inode_config_t* get_inode_cfg(const std::string & name);
@@ -64,6 +64,7 @@ public:
std::function<bool(cli_result_t &)> start_merge(json11::Json);
std::function<bool(cli_result_t &)> start_flatten(json11::Json);
std::function<bool(cli_result_t &)> start_rm(json11::Json);
std::function<bool(cli_result_t &)> start_rm_osd(json11::Json cfg);
std::function<bool(cli_result_t &)> start_alloc_osd(json11::Json cfg);
// Should be called like loop_and_wait(start_status(), <completion callback>)

View File

@@ -100,9 +100,20 @@ inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
return NULL;
}
void cli_tool_t::parse_config(json11::Json cfg)
void cli_tool_t::parse_config(json11::Json::object & cfg)
{
color = !cfg["no-color"].bool_value();
for (auto kv_it = cfg.begin(); kv_it != cfg.end();)
{
// Translate all options with - to _
if (kv_it->first.find("-") != std::string::npos)
{
cfg[str_replace(kv_it->first, "-", "_")] = kv_it->second;
cfg.erase(kv_it++);
}
else
kv_it++;
}
color = !cfg["no_color"].bool_value();
json_output = cfg["json"].bool_value();
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
@@ -112,7 +123,7 @@ void cli_tool_t::parse_config(json11::Json cfg)
parallel_osds = 4;
log_level = cfg["log_level"].int64_value();
progress = cfg["progress"].uint64_value() ? true : false;
list_first = cfg["wait-list"].uint64_value() ? true : false;
list_first = cfg["wait_list"].uint64_value() ? true : false;
}
struct cli_result_looper_t

View File

@@ -517,7 +517,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_create(json11::Json cfg)
image_creator->force_size = cfg["force_size"].bool_value();
if (cfg["image_meta"].is_object())
{
image_creator->new_meta = cfg["image-meta"];
image_creator->new_meta = cfg["image_meta"];
}
if (cfg["snapshot"].string_value() != "")
{

View File

@@ -133,7 +133,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_flatten(json11::Json cfg)
auto flattener = new snap_flattener_t();
flattener->parent = this;
flattener->target_name = cfg["image"].string_value();
flattener->fsync_interval = cfg["fsync-interval"].uint64_value();
flattener->fsync_interval = cfg["fsync_interval"].uint64_value();
if (!flattener->fsync_interval)
flattener->fsync_interval = 128;
if (!cfg["cas"].is_null())

View File

@@ -631,8 +631,8 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_merge(json11::Json cfg)
merger->from_name = cfg["from"].string_value();
merger->to_name = cfg["to"].string_value();
merger->target_name = cfg["target"].string_value();
merger->delete_source = cfg["delete-source"].string_value() != "";
merger->fsync_interval = cfg["fsync-interval"].uint64_value();
merger->delete_source = cfg["delete_source"].string_value() != "";
merger->fsync_interval = cfg["fsync_interval"].uint64_value();
if (!merger->fsync_interval)
merger->fsync_interval = 128;
if (!cfg["cas"].is_null())

View File

@@ -236,7 +236,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_modify(json11::Json cfg)
changer->force = cfg["force"].bool_value();
changer->set_readonly = cfg["readonly"].bool_value();
changer->set_readwrite = cfg["readwrite"].bool_value();
changer->fsync_interval = cfg["fsync-interval"].uint64_value();
changer->fsync_interval = cfg["fsync_interval"].uint64_value();
if (!changer->fsync_interval)
changer->fsync_interval = 128;
// FIXME Check that the image doesn't have children when shrinking

View File

@@ -639,7 +639,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_rm(json11::Json cfg)
snap_remover->parent = this;
snap_remover->from_name = cfg["from"].string_value();
snap_remover->to_name = cfg["to"].string_value();
snap_remover->fsync_interval = cfg["fsync-interval"].uint64_value();
snap_remover->fsync_interval = cfg["fsync_interval"].uint64_value();
if (!snap_remover->fsync_interval)
snap_remover->fsync_interval = 128;
if (!cfg["cas"].is_null())

View File

@@ -218,7 +218,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_rm_data(json11::Json cfg)
remover->inode = (remover->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (((uint64_t)remover->pool_id) << (64-POOL_ID_BITS));
}
remover->pool_id = INODE_POOL(remover->inode);
remover->min_offset = cfg["min-offset"].uint64_value();
remover->min_offset = cfg["min_offset"].uint64_value();
return [remover](cli_result_t & result)
{
remover->loop();

230
src/cli_rm_osd.cpp Normal file
View File

@@ -0,0 +1,230 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <ctype.h>
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
#include <algorithm>
// Delete OSD metadata from etcd
struct rm_osd_t
{
cli_tool_t *parent;
bool dry_run, force_warning, force_dataloss;
std::vector<uint64_t> osd_ids;
int state = 0;
cli_result_t result;
std::set<uint64_t> to_remove;
json11::Json::array pool_effects;
bool is_warning, is_dataloss;
bool is_done()
{
return state == 100;
}
void loop()
{
if (state == 1)
goto resume_1;
if (!osd_ids.size())
{
result = (cli_result_t){ .err = EINVAL, .text = "OSD numbers are not specified" };
state = 100;
return;
}
for (auto osd_id: osd_ids)
{
if (!osd_id)
{
result = (cli_result_t){ .err = EINVAL, .text = "OSD number can't be zero" };
state = 100;
return;
}
to_remove.insert(osd_id);
}
// Check if OSDs are still used in data distribution
is_warning = is_dataloss = false;
for (auto & pp: parent->cli->st_cli.pool_config)
{
// Will OSD deletion make pool incomplete / down / degraded?
bool pool_incomplete = false, pool_down = false, pool_degraded = false;
bool hist_incomplete = false, hist_degraded = false;
auto & pool_cfg = pp.second;
uint64_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
for (auto & pgp: pool_cfg.pg_config)
{
auto & pg_cfg = pgp.second;
int pg_cursize = 0, pg_rm = 0;
for (auto pg_osd: pg_cfg.target_set)
{
if (pg_osd != 0)
{
pg_cursize++;
if (to_remove.find(pg_osd) != to_remove.end())
pg_rm++;
}
}
for (auto & hist_item: pg_cfg.target_history)
{
int hist_size = 0, hist_rm = 0;
for (auto & old_osd: hist_item)
{
if (old_osd != 0)
{
hist_size++;
if (to_remove.find(old_osd) != to_remove.end())
hist_rm++;
}
}
if (hist_rm > 0)
{
hist_degraded = true;
if (hist_size-hist_rm == 0)
pool_incomplete = true;
else if (hist_size-hist_rm < pg_data_size)
hist_incomplete = true;
}
}
if (pg_rm > 0)
{
pool_degraded = true;
if (pg_cursize-pg_rm < pg_data_size)
pool_incomplete = true;
else if (pg_cursize-pg_rm < pool_cfg.pg_minsize)
pool_down = true;
}
}
if (pool_incomplete || pool_down || pool_degraded || hist_incomplete || hist_degraded)
{
pool_effects.push_back(json11::Json::object {
{ "pool_id", (uint64_t)pool_cfg.id },
{ "pool_name", pool_cfg.name },
{ "effect", (pool_incomplete
? "incomplete"
: (hist_incomplete
? "has_incomplete"
: (pool_down
? "offline"
: (pool_degraded
? "degraded"
: (hist_degraded ? "has_degraded" : "?")
)
)
)
) },
});
is_warning = true;
if (pool_incomplete || hist_incomplete)
is_dataloss = true;
}
}
result.data = json11::Json::object {
{ "osd_ids", osd_ids },
{ "pool_errors", pool_effects },
};
if (is_dataloss || is_warning || dry_run)
{
std::string error;
for (auto & e: pool_effects)
{
error += "Pool "+e["pool_name"].string_value()+" (ID "+e["pool_id"].as_string()+") will have "+(
e["effect"] == "has_incomplete"
? std::string("INCOMPLETE objects (DATA LOSS)")
: (e["effect"] == "incomplete"
? std::string("INCOMPLETE PGs (DATA LOSS)")
: (e["effect"] == "has_degraded"
? std::string("DEGRADED objects")
: strtoupper(e["effect"].string_value())+" PGs"))
)+" after deleting OSD(s).\n";
}
if (is_dataloss && !force_dataloss && !dry_run)
error += "OSDs not deleted. Please move data to other OSDs or bypass this check with --allow-data-loss if you know what you are doing.\n";
else if (is_warning && !force_warning && !dry_run)
error += "OSDs not deleted. Please move data to other OSDs or bypass this check with --force if you know what you are doing.\n";
else if (!is_dataloss && !is_warning && dry_run)
error += "OSDs can be deleted without data loss.\n";
result.text = error;
if (dry_run || is_dataloss && !force_dataloss || is_warning && !force_warning)
{
result.err = is_dataloss || is_warning ? EBUSY : 0;
state = 100;
return;
}
}
// Remove keys from etcd
{
json11::Json::array rm_items;
for (auto osd_id: osd_ids)
{
rm_items.push_back("/config/osd/"+std::to_string(osd_id));
rm_items.push_back("/osd/stats/"+std::to_string(osd_id));
rm_items.push_back("/osd/state/"+std::to_string(osd_id));
rm_items.push_back("/osd/inodestats/"+std::to_string(osd_id));
rm_items.push_back("/osd/space/"+std::to_string(osd_id));
}
for (int i = 0; i < rm_items.size(); i++)
{
rm_items[i] = json11::Json::object {
{ "request_delete_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+rm_items[i].string_value()
) },
} },
};
}
parent->etcd_txn(json11::Json::object { { "success", rm_items } });
}
resume_1:
state = 1;
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
std::string ids = "";
for (auto osd_id: osd_ids)
{
ids += (ids.size() ? ", " : "")+std::to_string(osd_id);
}
ids = (osd_ids.size() > 1 ? "OSDs " : "OSD ")+ids+(osd_ids.size() > 1 ? " are" : " is")+" removed from etcd";
state = 100;
result.text = (result.text != "" ? ids+"\n"+result.text : ids);
result.err = 0;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_rm_osd(json11::Json cfg)
{
auto rm_osd = new rm_osd_t();
rm_osd->parent = this;
rm_osd->dry_run = cfg["dry_run"].bool_value();
rm_osd->force_dataloss = cfg["allow_data_loss"].bool_value();
rm_osd->force_warning = rm_osd->force_dataloss || cfg["force"].bool_value();
if (cfg["osd_id"].is_number() || cfg["osd_id"].is_string())
rm_osd->osd_ids.push_back(cfg["osd_id"].uint64_value());
else
{
for (auto & id: cfg["osd_id"].array_items())
rm_osd->osd_ids.push_back(id.uint64_value());
}
return [rm_osd](cli_result_t & result)
{
rm_osd->loop();
if (rm_osd->is_done())
{
result = rm_osd->result;
delete rm_osd;
return true;
}
return false;
};
}

View File

@@ -221,9 +221,11 @@ void cluster_client_t::erase_op(cluster_op_t *op)
if (op_queue_tail == op)
op_queue_tail = op->prev;
op->next = op->prev = NULL;
std::function<void(cluster_op_t*)>(op->callback)(op);
if (!(flags & OP_IMMEDIATE_COMMIT))
inc_wait(opcode, flags, next, -1);
// Call callback at the end to avoid inconsistencies in prev_wait
// if the callback adds more operations itself
std::function<void(cluster_op_t*)>(op->callback)(op);
}
void cluster_client_t::continue_ops(bool up_retry)
@@ -939,7 +941,7 @@ bool cluster_client_t::try_send(cluster_op_t *op, int i)
.req = { .rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = op_id++,
.id = next_op_id(),
.opcode = op->opcode == OSD_OP_READ_BITMAP ? OSD_OP_READ : op->opcode,
},
.inode = op->cur_inode,
@@ -1067,7 +1069,7 @@ void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
.req = {
.hdr = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = op_id++,
.id = next_op_id(),
.opcode = OSD_OP_SYNC,
},
},
@@ -1179,5 +1181,5 @@ void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *par
uint64_t cluster_client_t::next_op_id()
{
return op_id++;
return msgr.next_subop_id++;
}

View File

@@ -85,7 +85,6 @@ class cluster_client_t
int up_wait_retry_interval = 500; // ms
int retry_timeout_id = 0;
uint64_t op_id = 1;
std::vector<cluster_op_t*> offline_ops;
cluster_op_t *op_queue_head = NULL, *op_queue_tail = NULL;
std::map<object_id, cluster_buffer_t> dirty_buffers;

View File

@@ -196,7 +196,7 @@ void cluster_client_t::send_list(inode_list_osd_t *cur_list)
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = op_id++,
.id = next_op_id(),
.opcode = OSD_OP_SEC_LIST,
},
.list_pg = cur_list->pg->pg_num,

View File

@@ -52,11 +52,12 @@ static const char *help_text =
" --disable_data_fsync 0 Disable data device cache and fsync (default off)\n"
" --disable_meta_fsync 0 Disable metadata device cache and fsync (default off)\n"
" --disable_journal_fsync 0 Disable journal device cache and fsync (default off)\n"
" --hdd Enable HDD defaults (1M block, 1G journal, throttling)\n"
" --force Bypass partition safety checks (for emptiness and so on)\n"
" \n"
" Options (both modes):\n"
" --journal_size 1G/32M Set journal size (area or partition size)\n"
" --block_size 1M/128k Set blockstore object size\n"
" --journal_size 32M/1G Set journal size (area or partition size)\n"
" --block_size 128k/1M Set blockstore object size\n"
" --bitmap_granularity 4k Set bitmap granularity\n"
" --data_device_block 4k Override data device block size\n"
" --meta_device_block 4k Override metadata device block size\n"
@@ -109,8 +110,16 @@ static const char *help_text =
" Commands are passed to systemctl with vitastor-osd@<num> units as arguments.\n"
" When --now is added to enable/disable, OSDs are also immediately started/stopped.\n"
"\n"
"vitastor-disk read-sb <device>\n"
"vitastor-disk purge [--force] [--allow-data-loss] <device> [device2 device3 ...]\n"
" Purge Vitastor OSD(s) on specified device(s). Uses vitastor-cli rm-osd to check\n"
" if deletion is possible without data loss and to actually remove metadata from etcd.\n"
" --force and --allow-data-loss options may be used to ignore safety check results.\n"
" \n"
" Requires `vitastor-cli`, `sfdisk` and `partprobe` (from parted) utilities.\n"
"\n"
"vitastor-disk read-sb [--force] <device>\n"
" Try to read Vitastor OSD superblock from <device> and print it in JSON format.\n"
" --force allows to ignore validation errors.\n"
"\n"
"vitastor-disk write-sb <device>\n"
" Read JSON from STDIN and write it into Vitastor OSD superblock on <device>.\n"
@@ -195,6 +204,10 @@ int main(int argc, char *argv[])
{
self.options["hybrid"] = "1";
}
else if (!strcmp(argv[i], "--hdd"))
{
self.options["hdd"] = "1";
}
else if (!strcmp(argv[i], "--help") || !strcmp(argv[i], "-h"))
{
cmd.insert(cmd.begin(), (char*)"help");
@@ -207,6 +220,10 @@ int main(int argc, char *argv[])
{
self.options["force"] = "1";
}
else if (!strcmp(argv[i], "--allow-data-loss"))
{
self.options["allow_data_loss"] = "1";
}
else if (argv[i][0] == '-' && argv[i][1] == '-')
{
char *key = argv[i]+2;
@@ -339,6 +356,10 @@ int main(int argc, char *argv[])
}
return self.systemd_start_stop_osds(systemd_cmd, std::vector<std::string>(cmd.begin()+1, cmd.end()));
}
else if (!strcmp(cmd[0], "purge"))
{
return self.purge_devices(std::vector<std::string>(cmd.begin()+1, cmd.end()));
}
else if (!strcmp(cmd[0], "exec-osd"))
{
if (cmd.size() != 2)

View File

@@ -56,7 +56,7 @@ struct disk_tool_t
uint64_t meta_pos;
uint64_t journal_pos, journal_calc_data_pos;
bool first, first2;
bool first_block, first_entry;
allocator *data_alloc;
std::map<uint64_t, uint64_t> data_remap;
@@ -108,10 +108,11 @@ struct disk_tool_t
int read_sb(std::string device);
int write_sb(std::string device);
int exec_osd(std::string device);
int systemd_start_stop_osds(std::vector<std::string> cmd, std::vector<std::string> devices);
int systemd_start_stop_osds(const std::vector<std::string> & cmd, const std::vector<std::string> & devices);
int pre_exec_osd(std::string device);
int purge_devices(const std::vector<std::string> & devices);
json11::Json read_osd_superblock(std::string device, bool expect_exist = true);
json11::Json read_osd_superblock(std::string device, bool expect_exist = true, bool ignore_nonref = false);
uint32_t write_osd_superblock(std::string device, json11::Json params);
int prepare_one(std::map<std::string, std::string> options, int is_hdd = -1);
@@ -139,3 +140,4 @@ int write_zero(int fd, uint64_t offset, uint64_t size);
json11::Json read_parttable(std::string dev);
uint64_t dev_size_from_parttable(json11::Json pt);
uint64_t free_from_parttable(json11::Json pt);
int fix_partition_type(std::string dev_by_uuid);

View File

@@ -13,7 +13,7 @@ int disk_tool_t::dump_journal()
fprintf(stderr, "Invalid journal block size\n");
return 1;
}
first = true;
first_block = true;
if (json)
printf("[\n");
if (all)
@@ -38,8 +38,8 @@ int disk_tool_t::dump_journal()
}
if (json)
{
printf("%s{\"offset\":\"0x%lx\"", first ? "" : ",\n", journal_pos);
first = false;
printf("%s{\"offset\":\"0x%lx\"", first_block ? "" : ",\n", journal_pos);
first_block = false;
}
if (s == dsk.journal_block_size)
{
@@ -55,10 +55,10 @@ int disk_tool_t::dump_journal()
printf("offset %08lx:\n", journal_pos);
else
printf(",\"entries\":[\n");
first2 = true;
first_entry = true;
process_journal_block(journal_buf, [this](int num, journal_entry *je) { dump_journal_entry(num, je, json); });
if (json)
printf(first2 ? "]}" : "\n]}");
printf(first_entry ? "]}" : "\n]}");
}
else
{
@@ -75,34 +75,30 @@ int disk_tool_t::dump_journal()
}
else
{
first_entry = true;
process_journal([this](void *data)
{
first2 = true;
if (json && dump_with_blocks)
first_entry = true;
if (!json)
printf("offset %08lx:\n", journal_pos);
auto pos = journal_pos;
int r = process_journal_block(data, [this, pos](int num, journal_entry *je)
{
if (json && first2)
{
if (dump_with_blocks)
printf("%s{\"offset\":\"0x%lx\",\"entries\":[\n", first ? "" : ",\n", pos);
first = false;
}
if (json && dump_with_blocks && first_entry)
printf("%s{\"offset\":\"0x%lx\",\"entries\":[\n", first_block ? "" : ",\n", pos);
dump_journal_entry(num, je, json);
first_block = false;
});
if (json)
{
if (dump_with_blocks && !first2)
printf("\n]}");
}
else if (r <= 0)
if (json && dump_with_blocks && !first_entry)
printf("\n]}");
else if (!json && r <= 0)
printf("end of the journal\n");
return r;
});
}
if (json)
printf(first ? "]\n" : "\n]\n");
printf(first_block ? "]\n" : "\n]\n");
return 0;
}
@@ -209,9 +205,9 @@ void disk_tool_t::dump_journal_entry(int num, journal_entry *je, bool json)
{
if (json)
{
if (!first2)
if (!first_entry)
printf(",\n");
first2 = false;
first_entry = false;
printf(
"{\"crc32\":\"%08x\",\"valid\":%s,\"crc32_prev\":\"%08x\"",
je->crc32, (je_crc32(je) == je->crc32 ? "true" : "false"), je->crc32_prev
@@ -275,10 +271,12 @@ void disk_tool_t::dump_journal_entry(int num, journal_entry *je, bool json)
else if (je->type == JE_BIG_WRITE || je->type == JE_BIG_WRITE_INSTANT)
{
printf(
json ? ",\"type\":\"big_write%s\",\"inode\":\"0x%lx\",\"stripe\":\"0x%lx\",\"ver\":\"%lu\",\"loc\":\"0x%lx\""
: "je_big_write%s oid=%lx:%lx ver=%lu loc=%08lx",
json ? ",\"type\":\"big_write%s\",\"inode\":\"0x%lx\",\"stripe\":\"0x%lx\",\"ver\":\"%lu\",\"offset\":%u,\"len\":%u,\"loc\":\"0x%lx\""
: "je_big_write%s oid=%lx:%lx ver=%lu offset=%u len=%u loc=%08lx",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location
je->big_write.oid.inode, je->big_write.oid.stripe,
je->big_write.version, je->big_write.offset, je->big_write.len,
je->big_write.location
);
if (je->big_write.size > sizeof(journal_entry_big_write))
{
@@ -424,6 +422,8 @@ int disk_tool_t::write_json_journal(json11::Json entries)
.stripe = sscanf_json(NULL, rec["stripe"]),
},
.version = rec["ver"].uint64_value(),
.offset = (uint32_t)rec["offset"].uint64_value(),
.len = (uint32_t)rec["len"].uint64_value(),
.location = sscanf_json(NULL, rec["loc"]),
};
fromhexstr(rec["bitmap"].string_value(), dsk.clean_entry_bitmap_size, ((uint8_t*)ne) + sizeof(journal_entry_big_write));

View File

@@ -124,14 +124,14 @@ void disk_tool_t::dump_meta_header(blockstore_meta_header_v1_t *hdr)
{
printf("{\"version\":\"0.5\",\"meta_block_size\":%lu,\"entries\":[\n", dsk.meta_block_size);
}
first = true;
first_entry = true;
}
void disk_tool_t::dump_meta_entry(uint64_t block_num, clean_disk_entry *entry, uint8_t *bitmap)
{
printf(
#define ENTRY_FMT "{\"block\":%lu,\"pool\":%u,\"inode\":%lu,\"stripe\":%lu,\"version\":%lu"
(first ? ENTRY_FMT : (",\n" ENTRY_FMT)),
#define ENTRY_FMT "{\"block\":%lu,\"pool\":%u,\"inode\":\"0x%lx\",\"stripe\":\"0x%lx\",\"version\":%lu"
(first_entry ? ENTRY_FMT : (",\n" ENTRY_FMT)),
#undef ENTRY_FMT
block_num, INODE_POOL(entry->oid.inode), INODE_NO_POOL(entry->oid.inode),
entry->oid.stripe, entry->version
@@ -154,7 +154,7 @@ void disk_tool_t::dump_meta_entry(uint64_t block_num, clean_disk_entry *entry, u
{
printf("}");
}
first = false;
first_entry = false;
}
int disk_tool_t::write_json_meta(json11::Json meta)

View File

@@ -61,6 +61,11 @@ int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_
fprintf(stderr, "%s already contains Vitastor OSD superblock, not creating OSD without --force\n", dev.c_str());
return 1;
}
if (fix_partition_type(dev) != 0)
{
fprintf(stderr, "%s has incorrect type and we failed to change it to Vitastor type\n", dev.c_str());
return 1;
}
}
}
for (auto dev: std::vector<std::string>{"data", "meta", "journal"})
@@ -317,7 +322,8 @@ json11::Json disk_tool_t::add_partitions(vitastor_dev_info_t & devinfo, std::vec
{
script += "+ "+size+" "+std::string(VITASTOR_PART_TYPE)+"\n";
}
if (shell_exec({ "sfdisk", "--force", devinfo.path }, script, NULL, NULL) != 0)
std::string out;
if (shell_exec({ "sfdisk", "--no-reread", "--force", devinfo.path }, script, &out, NULL) != 0)
{
fprintf(stderr, "Failed to add %lu partition(s) with sfdisk\n", sizes.size());
return {};
@@ -351,7 +357,8 @@ json11::Json disk_tool_t::add_partitions(vitastor_dev_info_t & devinfo, std::vec
{
iter++;
// Run partprobe
if (iter > 1 || (r = shell_exec({ "partprobe", devinfo.path }, "", NULL, NULL)) != 0)
std::string out;
if (iter > 1 || (r = shell_exec({ "partprobe", devinfo.path }, "", &out, NULL)) != 0)
{
fprintf(
stderr, iter == 1 && r == 255
@@ -539,7 +546,7 @@ int disk_tool_t::prepare(std::vector<std::string> devices)
fprintf(stderr, "Device list (positional arguments) and --hybrid are incompatible with --data_device\n");
return 1;
}
return prepare_one(options);
return prepare_one(options, options.find("hdd") != options.end() ? 1 : 0);
}
if (!devices.size())
{
@@ -549,12 +556,12 @@ int disk_tool_t::prepare(std::vector<std::string> devices)
options.erase("data_device");
options.erase("meta_device");
options.erase("journal_device");
bool hybrid = options.find("hybrid") != options.end();
auto devinfo = collect_devices(devices);
if (!devinfo.size())
{
return 1;
}
bool hybrid = options.find("hybrid") != options.end();
uint64_t osd_per_disk = stoull_full(options["osd_per_disk"]);
if (!osd_per_disk)
osd_per_disk = 1;
@@ -612,7 +619,8 @@ int disk_tool_t::prepare(std::vector<std::string> devices)
return 1;
}
}
prepare_one(options, dev.is_hdd ? 1 : 0);
// Treat all disks as SSDs if not in the hybrid mode
prepare_one(options, hybrid && dev.is_hdd ? 1 : 0);
}
}
}

View File

@@ -5,6 +5,7 @@
#include "disk_tool.h"
#include "rw_blocking.h"
#include "str_util.h"
struct __attribute__((__packed__)) vitastor_disk_superblock_t
{
@@ -54,7 +55,7 @@ int disk_tool_t::udev_import(std::string device)
int disk_tool_t::read_sb(std::string device)
{
json11::Json sb = read_osd_superblock(device);
json11::Json sb = read_osd_superblock(device, true, options.find("force") != options.end());
if (sb.is_null())
{
return 1;
@@ -123,7 +124,7 @@ uint32_t disk_tool_t::write_osd_superblock(std::string device, json11::Json para
return sb_size;
}
json11::Json disk_tool_t::read_osd_superblock(std::string device, bool expect_exist)
json11::Json disk_tool_t::read_osd_superblock(std::string device, bool expect_exist, bool ignore_errors)
{
vitastor_disk_superblock_t *sb = NULL;
uint8_t *buf = NULL;
@@ -144,7 +145,7 @@ json11::Json disk_tool_t::read_osd_superblock(std::string device, bool expect_ex
goto ex;
}
sb = (vitastor_disk_superblock_t*)buf;
if (sb->magic != VITASTOR_DISK_MAGIC)
if (sb->magic != VITASTOR_DISK_MAGIC && !ignore_errors)
{
if (expect_exist)
fprintf(stderr, "Invalid OSD superblock on %s: magic number mismatch\n", device.c_str());
@@ -172,7 +173,7 @@ json11::Json disk_tool_t::read_osd_superblock(std::string device, bool expect_ex
}
sb = (vitastor_disk_superblock_t*)buf;
}
if (sb->crc32c != crc32c(0, &sb->size, sb->size - ((uint8_t*)&sb->size - buf)))
if (sb->crc32c != crc32c(0, &sb->size, sb->size - ((uint8_t*)&sb->size - buf)) && !ignore_errors)
{
if (expect_exist)
fprintf(stderr, "Invalid OSD superblock on %s: crc32 mismatch\n", device.c_str());
@@ -186,14 +187,14 @@ json11::Json disk_tool_t::read_osd_superblock(std::string device, bool expect_ex
goto ex;
}
// Validate superblock
if (!osd_params["osd_num"].uint64_value())
if (!osd_params["osd_num"].uint64_value() && !ignore_errors)
{
if (expect_exist)
fprintf(stderr, "OSD superblock on %s lacks osd_num\n", device.c_str());
osd_params = json11::Json();
goto ex;
}
if (osd_params["data_device"].string_value() == "")
if (osd_params["data_device"].string_value() == "" && !ignore_errors)
{
if (expect_exist)
fprintf(stderr, "OSD superblock on %s lacks data_device\n", device.c_str());
@@ -226,7 +227,7 @@ json11::Json disk_tool_t::read_osd_superblock(std::string device, bool expect_ex
{
device_type = "journal";
}
else
else if (!ignore_errors)
{
if (expect_exist)
fprintf(stderr, "Invalid OSD superblock on %s: does not refer to the device itself\n", device.c_str());
@@ -246,7 +247,7 @@ ex:
return osd_params;
}
int disk_tool_t::systemd_start_stop_osds(std::vector<std::string> cmd, std::vector<std::string> devices)
int disk_tool_t::systemd_start_stop_osds(const std::vector<std::string> & cmd, const std::vector<std::string> & devices)
{
if (!devices.size())
{
@@ -306,8 +307,7 @@ int disk_tool_t::exec_osd(std::string device)
argv[i] = (char*)argstr[i].c_str();
}
argv[argstr.size()] = NULL;
execvpe(osd_binary.c_str(), argv, environ);
return 0;
return execvpe(osd_binary.c_str(), argv, environ);
}
static int check_disabled_cache(std::string dev)
@@ -362,3 +362,140 @@ int disk_tool_t::pre_exec_osd(std::string device)
}
return 0;
}
int disk_tool_t::purge_devices(const std::vector<std::string> & devices)
{
std::vector<uint64_t> osd_numbers;
json11::Json::array superblocks;
for (auto & device: devices)
{
json11::Json sb = read_osd_superblock(device);
if (!sb.is_null())
{
uint64_t osd_num = sb["params"]["osd_num"].uint64_value();
osd_numbers.push_back(osd_num);
superblocks.push_back(sb);
}
}
if (!osd_numbers.size())
{
return 0;
}
std::vector<std::string> rm_osd_cli = { "vitastor-cli", "rm-osd" };
for (auto osd_num: osd_numbers)
{
rm_osd_cli.push_back(std::to_string(osd_num));
}
// Check for data loss
rm_osd_cli.push_back("--dry-run");
std::string dry_run_ignore_stdout;
if (shell_exec(rm_osd_cli, "", &dry_run_ignore_stdout, NULL) != 0)
{
return 1;
}
// Disable & stop OSDs
std::vector<std::string> systemctl_cli = { "systemctl", "disable", "--now" };
for (auto osd_num: osd_numbers)
{
systemctl_cli.push_back("vitastor-osd@"+std::to_string(osd_num));
}
if (shell_exec(systemctl_cli, "", NULL, NULL) != 0)
{
return 1;
}
// Remove OSD metadata
rm_osd_cli.pop_back();
if (options["force"] != "")
{
rm_osd_cli.push_back("--force");
}
else if (options["allow_data_loss"] != "")
{
rm_osd_cli.push_back("--allow-data-loss");
}
if (shell_exec(rm_osd_cli, "", NULL, NULL) != 0)
{
return 1;
}
// Destroy OSD superblocks
uint8_t *buf = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, 4096);
for (auto & sb: superblocks)
{
for (auto dev_type: std::vector<std::string>{ "data", "meta", "journal" })
{
auto dev = sb["real_"+dev_type+"_device"].string_value();
if (dev != "")
{
int fd = -1, r = open(dev.c_str(), O_DIRECT|O_RDWR);
if (r >= 0)
{
fd = r;
r = read_blocking(fd, buf, 4096);
if (r == 4096)
{
// Clear magic and CRC
memset(buf, 0, 12);
r = lseek64(fd, 0, 0);
if (r == 0)
{
r = write_blocking(fd, buf, 4096);
if (r == 4096)
r = 0;
}
}
}
if (fd >= 0)
close(fd);
if (r != 0)
{
fprintf(stderr, "Failed to clear OSD %lu %s device %s superblock: %s\n",
sb["params"]["osd_num"].uint64_value(), dev_type.c_str(), dev.c_str(), strerror(errno));
}
else
{
fprintf(stderr, "OSD %lu %s device %s superblock cleared\n",
sb["params"]["osd_num"].uint64_value(), dev_type.c_str(), dev.c_str());
}
if (sb["params"][dev_type+"_device"].string_value().substr(0, 22) == "/dev/disk/by-partuuid/")
{
// Delete the partition itself
auto uuid_to_del = strtolower(sb["params"][dev_type+"_device"].string_value().substr(22));
auto parent_dev = get_parent_device(dev);
if (parent_dev == "" || parent_dev == dev)
{
fprintf(stderr, "Failed to delete partition %s: failed to find parent device\n", dev.c_str());
continue;
}
auto pt = read_parttable("/dev/"+parent_dev);
if (!pt.is_object())
continue;
json11::Json::array newpt = pt["partitions"].array_items();
for (int i = 0; i < newpt.size(); i++)
{
if (strtolower(newpt[i]["uuid"].string_value()) == uuid_to_del)
{
auto old_part = newpt[i];
newpt.erase(newpt.begin()+i, newpt.begin()+i+1);
vitastor_dev_info_t devinfo = {
.path = "/dev/"+parent_dev,
.pt = json11::Json::object{ { "partitions", newpt } },
};
add_partitions(devinfo, {});
struct stat st;
if (stat(old_part["node"].string_value().c_str(), &st) == 0 ||
errno != ENOENT)
{
std::string out;
shell_exec({ "partprobe", "/dev/"+parent_dev }, "", &out, NULL);
}
break;
}
}
}
}
}
}
free(buf);
buf = NULL;
return 0;
}

View File

@@ -38,42 +38,6 @@ static std::map<std::string, std::string> read_vitastor_unit(std::string unit)
return r;
}
static int fix_partition_type(std::string dev_by_uuid)
{
auto uuid = strtolower(dev_by_uuid.substr(dev_by_uuid.rfind('/')+1));
std::string parent_dev = get_parent_device(realpath_str(dev_by_uuid, false));
if (parent_dev == "")
return 1;
auto pt = read_parttable("/dev/"+parent_dev);
if (pt.is_null())
return 1;
std::string script = "label: gpt\n\n";
for (const auto & part: pt["partitions"].array_items())
{
bool this_part = (strtolower(part["uuid"].string_value()) == uuid);
if (this_part && strtolower(part["type"].string_value()) == "e7009fac-a5a1-4d72-af72-53de13059903")
{
// Already correct type
return 0;
}
script += part["node"].string_value()+": ";
bool first = true;
for (const auto & kv: part.object_items())
{
if (kv.first != "node")
{
script += (first ? "" : ", ")+kv.first+"="+
(kv.first == "type" && this_part
? "e7009fac-a5a1-4d72-af72-53de13059903"
: (kv.second.is_string() ? kv.second.string_value() : kv.second.dump()));
first = false;
}
}
script += "\n";
}
return shell_exec({ "sfdisk", "--no-reread", "--force", "/dev/"+parent_dev }, script, NULL, NULL);
}
int disk_tool_t::upgrade_simple_unit(std::string unit)
{
if (stoull_full(unit) != 0)

View File

@@ -145,10 +145,10 @@ int disable_cache(std::string dev)
closedir(dir);
// Check cache_type
scsi_disk += "/cache_type";
std::string cache_type = read_file(scsi_disk);
std::string cache_type = trim(read_file(scsi_disk));
if (cache_type == "")
return 1;
if (cache_type == "write back")
if (cache_type != "write through")
{
int fd = open(scsi_disk.c_str(), O_WRONLY);
if (fd < 0 || write_blocking(fd, (void*)"write through", strlen("write through")) != strlen("write through"))
@@ -239,7 +239,8 @@ int shell_exec(const std::vector<std::string> & cmd, const std::string & in, std
{
// Child
dup2(child_stdin[0], 0);
dup2(child_stdout[1], 1);
if (out)
dup2(child_stdout[1], 1);
if (err)
dup2(child_stderr[1], 2);
close(child_stdin[0]);
@@ -250,9 +251,7 @@ int shell_exec(const std::vector<std::string> & cmd, const std::string & in, std
close(child_stderr[1]);
char *argv[cmd.size()+1];
for (int i = 0; i < cmd.size(); i++)
{
argv[i] = (char*)cmd[i].c_str();
}
argv[cmd.size()] = NULL;
execvp(argv[0], argv);
std::string full_cmd;
@@ -354,3 +353,40 @@ uint64_t free_from_parttable(json11::Json pt)
free *= pt["sectorsize"].uint64_value();
return free;
}
int fix_partition_type(std::string dev_by_uuid)
{
auto uuid = strtolower(dev_by_uuid.substr(dev_by_uuid.rfind('/')+1));
std::string parent_dev = get_parent_device(realpath_str(dev_by_uuid, false));
if (parent_dev == "")
return 1;
auto pt = read_parttable("/dev/"+parent_dev);
if (pt.is_null() || pt.is_bool())
return 1;
std::string script = "label: gpt\n\n";
for (const auto & part: pt["partitions"].array_items())
{
bool this_part = (strtolower(part["uuid"].string_value()) == uuid);
if (this_part && strtolower(part["type"].string_value()) == "e7009fac-a5a1-4d72-af72-53de13059903")
{
// Already correct type
return 0;
}
script += part["node"].string_value()+": ";
bool first = true;
for (const auto & kv: part.object_items())
{
if (kv.first != "node")
{
script += (first ? "" : ", ")+kv.first+"="+
(kv.first == "type" && this_part
? "e7009fac-a5a1-4d72-af72-53de13059903"
: (kv.second.is_string() ? kv.second.string_value() : kv.second.dump()));
first = false;
}
}
script += "\n";
}
std::string out;
return shell_exec({ "sfdisk", "--no-reread", "--force", "/dev/"+parent_dev }, script, &out, NULL);
}

View File

@@ -80,12 +80,20 @@ void osd_messenger_t::init()
};
op->callback = [this, cl](osd_op_t *op)
{
auto cl_it = clients.find(op->peer_fd);
if (cl_it == clients.end() || cl_it->second != cl)
{
// client is already dropped
delete op;
return;
}
int fail_fd = (op->reply.hdr.retval != 0 ? op->peer_fd : -1);
auto fail_osd_num = cl->osd_num;
cl->ping_time_remaining = 0;
delete op;
if (fail_fd >= 0)
{
fprintf(stderr, "Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
fprintf(stderr, "Ping failed for OSD %lu (client %d), disconnecting peer\n", fail_osd_num, fail_fd);
stop_client(fail_fd, true);
}
};

View File

@@ -9,6 +9,8 @@
#include "str_util.h"
#include "osd.h"
#define SELF_FD -1
// Peering loop
void osd_t::handle_peers()
{
@@ -317,7 +319,7 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
// Self
osd_op_t *op = new osd_op_t();
op->op_type = 0;
op->peer_fd = -1;
op->peer_fd = SELF_FD;
clock_gettime(CLOCK_REALTIME, &op->tv_begin);
op->bs_op = new blockstore_op_t();
op->bs_op->opcode = BS_OP_SYNC;
@@ -336,8 +338,8 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
ps->list_ops.erase(role_osd);
submit_list_subop(role_osd, ps);
};
bs->enqueue_op(op->bs_op);
ps->list_ops[role_osd] = op;
bs->enqueue_op(op->bs_op);
}
else
{
@@ -371,8 +373,8 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
ps->list_ops.erase(role_osd);
submit_list_subop(role_osd, ps);
};
msgr.outbox_push(op);
ps->list_ops[role_osd] = op;
msgr.outbox_push(op);
}
}
@@ -383,7 +385,7 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
// Self
osd_op_t *op = new osd_op_t();
op->op_type = 0;
op->peer_fd = -1;
op->peer_fd = SELF_FD;
clock_gettime(CLOCK_REALTIME, &op->tv_begin);
op->bs_op = new blockstore_op_t();
op->bs_op->opcode = BS_OP_LIST;
@@ -415,8 +417,8 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
op->bs_op = NULL;
delete op;
};
bs->enqueue_op(op->bs_op);
ps->list_ops[role_osd] = op;
bs->enqueue_op(op->bs_op);
}
else
{
@@ -463,14 +465,14 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
ps->list_ops.erase(role_osd);
delete op;
};
msgr.outbox_push(op);
ps->list_ops[role_osd] = op;
msgr.outbox_push(op);
}
}
void osd_t::discard_list_subop(osd_op_t *list_op)
{
if (list_op->peer_fd == 0)
if (list_op->peer_fd == SELF_FD)
{
// Self
list_op->bs_op->callback = [list_op](blockstore_op_t *bs_op)

View File

@@ -22,7 +22,7 @@ struct osd_primary_op_data_t
pg_num_t pg_num;
object_id oid;
uint64_t target_ver;
uint64_t fact_ver = 0;
uint64_t orig_ver = 0, fact_ver = 0;
uint64_t scheme = 0;
int n_subops = 0, done = 0, errors = 0, epipe = 0;
int degraded = 0, pg_size, pg_data_size;

View File

@@ -138,6 +138,7 @@ resume_3:
}
}
// Send writes
op_data->orig_ver = op_data->fact_ver;
if ((op_data->fact_ver >> (64-PG_EPOCH_BITS)) < pg.epoch)
{
op_data->target_ver = ((uint64_t)pg.epoch << (64-PG_EPOCH_BITS)) | 1;
@@ -194,7 +195,7 @@ resume_7:
{
return;
}
if (op_data->fact_ver == 1)
if (op_data->orig_ver == 0)
{
// Object is created
pg.clean_count++;

View File

@@ -478,15 +478,18 @@ void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_
{
if (write_osd_set[role] != 0)
{
write_parity = 1;
write_parity++;
if (write_osd_set[role] != read_osd_set[role])
{
start = 0;
end = chunk_size;
for (int r2 = pg_minsize; r2 < role; r2++)
{
stripes[r2].write_start = start;
stripes[r2].write_end = end;
if (write_osd_set[r2] != 0)
{
stripes[r2].write_start = start;
stripes[r2].write_end = end;
}
}
}
stripes[role].write_start = start;
@@ -555,7 +558,7 @@ void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_
}
}
// Allocate read buffers
void *rmw_buf = alloc_read_buffer(stripes, pg_size, (write_parity ? pg_size-pg_minsize : 0) * (end - start));
void *rmw_buf = alloc_read_buffer(stripes, pg_size, write_parity * (end - start));
// Position write buffers
uint64_t buf_pos = 0, in_pos = 0;
for (int role = 0; role < pg_size; role++)
@@ -804,13 +807,11 @@ void calc_rmw_parity_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, bitmap_granularity, start, end);
if (end != 0)
{
int i;
for (i = pg_minsize; i < pg_size; i++)
{
int write_parity = 0;
for (int i = pg_minsize; i < pg_size; i++)
if (write_osd_set[i] != 0)
break;
}
if (i < pg_size)
write_parity++;
if (write_parity > 0)
{
// Calculate new coding chunks
buf_len_t bufs[pg_size][3];
@@ -830,8 +831,11 @@ void calc_rmw_parity_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
}
for (int i = pg_minsize; i < pg_size; i++)
{
bufs[i][nbuf[i]++] = { .buf = stripes[i].write_buf, .len = end-start };
positions[i] = start;
if (write_osd_set[i] != 0)
{
bufs[i][nbuf[i]++] = { .buf = stripes[i].write_buf, .len = end-start };
positions[i] = start;
}
}
uint32_t pos = start;
while (pos < end)
@@ -839,31 +843,37 @@ void calc_rmw_parity_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
uint32_t next_end = end;
for (int i = 0; i < pg_size; i++)
{
assert(curbuf[i] < nbuf[i]);
assert(bufs[i][curbuf[i]].buf);
data_ptrs[i] = (uint8_t*)bufs[i][curbuf[i]].buf + pos-positions[i];
uint32_t this_end = bufs[i][curbuf[i]].len + positions[i];
if (next_end > this_end)
next_end = this_end;
if (i < pg_minsize || write_osd_set[i] != 0)
{
assert(curbuf[i] < nbuf[i]);
assert(bufs[i][curbuf[i]].buf);
data_ptrs[i] = (uint8_t*)bufs[i][curbuf[i]].buf + pos-positions[i];
uint32_t this_end = bufs[i][curbuf[i]].len + positions[i];
if (next_end > this_end)
next_end = this_end;
}
}
assert(next_end > pos);
for (int i = 0; i < pg_size; i++)
{
uint32_t this_end = bufs[i][curbuf[i]].len + positions[i];
if (next_end >= this_end)
if (i < pg_minsize || write_osd_set[i] != 0)
{
positions[i] += bufs[i][curbuf[i]].len;
curbuf[i]++;
uint32_t this_end = bufs[i][curbuf[i]].len + positions[i];
if (next_end >= this_end)
{
positions[i] += bufs[i][curbuf[i]].len;
curbuf[i]++;
}
}
}
#ifdef WITH_ISAL
ec_encode_data(
next_end-pos, pg_minsize, pg_size-pg_minsize, matrix->isal_data,
next_end-pos, pg_minsize, write_parity, matrix->isal_data,
(uint8_t**)data_ptrs, (uint8_t**)data_ptrs+pg_minsize
);
#else
jerasure_matrix_encode(
pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W, matrix->je_data,
pg_minsize, write_parity, OSD_JERASURE_W, matrix->je_data,
(char**)data_ptrs, (char**)data_ptrs+pg_minsize, next_end-pos
);
#endif
@@ -871,16 +881,19 @@ void calc_rmw_parity_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
}
for (int i = 0; i < pg_size; i++)
{
data_ptrs[i] = stripes[i].bmp_buf;
if (i < pg_minsize || write_osd_set[i] != 0)
{
data_ptrs[i] = stripes[i].bmp_buf;
}
}
#ifdef WITH_ISAL
ec_encode_data(
bitmap_size, pg_minsize, pg_size-pg_minsize, matrix->isal_data,
bitmap_size, pg_minsize, write_parity, matrix->isal_data,
(uint8_t**)data_ptrs, (uint8_t**)data_ptrs+pg_minsize
);
#else
jerasure_matrix_encode(
pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W, matrix->je_data,
pg_minsize, write_parity, OSD_JERASURE_W, matrix->je_data,
(char**)data_ptrs, (char**)data_ptrs+pg_minsize, bitmap_size
);
#endif

View File

@@ -20,6 +20,7 @@ void test11();
void test12();
void test13();
void test14();
void test15();
int main(int narg, char *args[])
{
@@ -47,6 +48,8 @@ int main(int narg, char *args[])
test13();
// Test 14
test14();
// Test 15
test15();
// End
printf("all ok\n");
return 0;
@@ -706,7 +709,7 @@ void test13()
/***
13. basic jerasure 2+1 test
14. basic jerasure 2+1 test
calc_rmw(offset=128K-4K, len=8K, osd_set=[1,2,0], write_set=[1,2,3])
= {
read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 0 ] ],
@@ -727,13 +730,13 @@ void test14()
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = {};
unsigned bitmaps[3] = { 0 };
// Test 13.0
// Test 14.0
void *write_buf = malloc_or_die(8192);
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
assert(stripes[0].req_start == 128*1024-4096 && stripes[0].req_end == 128*1024);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 4096);
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 13.1
// Test 14.1
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, bmp);
for (int i = 0; i < 3; i++)
stripes[i].bmp_buf = bitmaps+i;
@@ -750,7 +753,7 @@ void test14()
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
assert(stripes[2].write_buf == rmw_buf);
// Test 13.2 - encode
// Test 14.2 - encode
set_pattern(write_buf, 8192, PATTERN3);
set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN2);
@@ -767,7 +770,7 @@ void test14()
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
assert(stripes[2].write_buf == rmw_buf);
// Test 13.3 - decode and verify
// Test 14.3 - decode and verify
osd_num_t read_osd_set[4] = { 0, 2, 3 };
memset(stripes, 0, sizeof(stripes));
split_stripes(2, 128*1024, 0, 128*1024, stripes);
@@ -802,3 +805,74 @@ void test14()
free(write_buf);
use_ec(3, 2, false);
}
/***
15. EC 2+2 partial overwrite with 1 missing stripe
calc_rmw(offset=64K+28K, len=4K, osd_set=[1,2,3,0], write_set=[1,2,3,0])
= {
read: [ [ 28K, 32K ], [ 0, 0 ], [ 0, 0 ], [ 0, 0 ] ],
write: [ [ 0, 0 ], [ 28K, 32K ], [ 28K, 32K ], [ 0, 0 ] ],
input buffer: [ write1 ],
rmw buffer: [ write2, read0 ],
}
***/
void test15()
{
const int bmp = 64*1024 / 4096 / 8;
use_ec(4, 2, true);
osd_num_t osd_set[4] = { 1, 2, 3, 0 };
osd_num_t write_osd_set[4] = { 1, 2, 3, 0 };
osd_rmw_stripe_t stripes[4] = {};
unsigned bitmaps[4] = { 0 };
// Test 15.0
void *write_buf = malloc_or_die(4096);
split_stripes(2, 64*1024, (64+28)*1024, 4096, stripes);
assert(stripes[0].req_start == 0 && stripes[0].req_end == 0);
assert(stripes[1].req_start == 28*1024 && stripes[1].req_end == 32*1024);
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
// Test 15.1
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 3, write_osd_set, 64*1024, bmp);
for (int i = 0; i < 4; i++)
stripes[i].bmp_buf = bitmaps+i;
assert(rmw_buf);
assert(stripes[0].read_start == 28*1024 && stripes[0].read_end == 32*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 0);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 0);
assert(stripes[3].read_start == 0 && stripes[3].read_end == 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 28*1024 && stripes[1].write_end == 32*1024);
assert(stripes[2].write_start == 28*1024 && stripes[2].write_end == 32*1024);
assert(stripes[3].write_start == 0 && stripes[3].write_end == 0);
assert(stripes[0].read_buf == (uint8_t*)rmw_buf+4*1024);
assert(stripes[1].read_buf == NULL);
assert(stripes[2].read_buf == NULL);
assert(stripes[3].read_buf == NULL);
assert(stripes[0].write_buf == NULL);
assert(stripes[1].write_buf == (uint8_t*)write_buf);
assert(stripes[2].write_buf == rmw_buf);
assert(stripes[3].write_buf == NULL);
// Test 15.2 - encode
set_pattern(write_buf, 4*1024, PATTERN1);
set_pattern(stripes[0].read_buf, 4*1024, PATTERN2);
memset(stripes[0].bmp_buf, 0, bmp);
memset(stripes[1].bmp_buf, 0, bmp);
calc_rmw_parity_ec(stripes, 4, 2, osd_set, write_osd_set, 64*1024, bmp);
assert(*(uint32_t*)stripes[2].bmp_buf == 0x80);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 28*1024 && stripes[1].write_end == 32*1024);
assert(stripes[2].write_start == 28*1024 && stripes[2].write_end == 32*1024);
assert(stripes[3].write_start == 0 && stripes[3].write_end == 0);
assert(stripes[0].write_buf == NULL);
assert(stripes[1].write_buf == (uint8_t*)write_buf);
assert(stripes[2].write_buf == rmw_buf);
assert(stripes[3].write_buf == NULL);
check_pattern(stripes[2].write_buf, 4*1024, PATTERN1^PATTERN2); // first parity is always xor :)
// Done
free(rmw_buf);
free(write_buf);
use_ec(3, 2, false);
}

View File

@@ -206,6 +206,25 @@ static void coroutine_fn vitastor_co_get_metadata(VitastorRPC *task)
}
}
static void vitastor_aio_set_fd_handler(void *ctx, int fd, int unused1, IOHandler *fd_read, IOHandler *fd_write, void *unused2, void *opaque)
{
aio_set_fd_handler(ctx, fd,
#if QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 5 || QEMU_VERSION_MAJOR >= 3
0 /*is_external*/,
#endif
fd_read, fd_write,
#if QEMU_VERSION_MAJOR == 1 && QEMU_VERSION_MINOR <= 6 || QEMU_VERSION_MAJOR < 1
NULL /*io_flush*/,
#endif
#if QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 9 || QEMU_VERSION_MAJOR >= 3
NULL /*io_poll*/,
#endif
#if QEMU_VERSION_MAJOR >= 7
NULL /*io_poll_ready*/,
#endif
opaque);
}
static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, Error **errp)
{
VitastorClient *client = bs->opaque;
@@ -221,7 +240,7 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
client->rdma_gid_index = qdict_get_try_int(options, "rdma-gid-index", 0);
client->rdma_mtu = qdict_get_try_int(options, "rdma-mtu", 0);
client->proxy = vitastor_c_create_qemu(
(QEMUSetFDHandler*)aio_set_fd_handler, bdrv_get_aio_context(bs), client->config_path, client->etcd_host, client->etcd_prefix,
vitastor_aio_set_fd_handler, bdrv_get_aio_context(bs), client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
client->image = g_strdup(qdict_get_try_str(options, "image"));
@@ -238,9 +257,9 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
}
else
{
qemu_coroutine_enter(qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task));
bdrv_coroutine_enter(bs, qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task));
BDRV_POLL_WHILE(bs, !task.complete);
}
BDRV_POLL_WHILE(bs, !task.complete);
client->watch = (void*)task.ret;
client->readonly = client->readonly || vitastor_c_inode_get_readonly(client->watch);
client->size = vitastor_c_inode_get_size(client->watch);
@@ -428,7 +447,13 @@ static void vitastor_co_read_cb(void *opaque, long retval, uint64_t version)
vitastor_co_generic_bh_cb(opaque, retval);
}
static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags)
static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs,
#if QEMU_VERSION_MAJOR >= 7 || QEMU_VERSION_MAJOR == 6 && QEMU_VERSION_MINOR >= 2
int64_t offset, int64_t bytes, QEMUIOVector *iov, BdrvRequestFlags flags
#else
uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags
#endif
)
{
VitastorClient *client = bs->opaque;
VitastorRPC task;
@@ -448,7 +473,13 @@ static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset
return task.ret;
}
static int coroutine_fn vitastor_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags)
static int coroutine_fn vitastor_co_pwritev(BlockDriverState *bs,
#if QEMU_VERSION_MAJOR >= 7 || QEMU_VERSION_MAJOR == 6 && QEMU_VERSION_MINOR >= 2
int64_t offset, int64_t bytes, QEMUIOVector *iov, BdrvRequestFlags flags
#else
uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags
#endif
)
{
VitastorClient *client = bs->opaque;
VitastorRPC task;

View File

@@ -56,6 +56,16 @@ std::string base64_decode(const std::string &in)
return out;
}
std::string strtoupper(const std::string & in)
{
std::string s = in;
for (int i = 0; i < s.length(); i++)
{
s[i] = toupper(s[i]);
}
return s;
}
std::string strtolower(const std::string & in)
{
std::string s = in;

View File

@@ -8,6 +8,7 @@
std::string base64_encode(const std::string &in);
std::string base64_decode(const std::string &in);
uint64_t parse_size(std::string size_str, bool *ok = NULL);
std::string strtoupper(const std::string & in);
std::string strtolower(const std::string & in);
std::string trim(const std::string & in, const char *rm_chars = " \n\r\t");
std::string str_replace(const std::string & in, const std::string & needle, const std::string & replacement);

28
src/test_crc32.cpp Normal file
View File

@@ -0,0 +1,28 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include "malloc_or_die.h"
#include "errno.h"
#include "crc32c.h"
int main(int narg, char *args[])
{
int bufsize = 65536;
uint8_t *buf = (uint8_t*)malloc_or_die(bufsize);
uint32_t csum = 0;
while (1)
{
int r = read(0, buf, bufsize);
if (r <= 0 && errno != EAGAIN && errno != EINTR)
break;
csum = crc32c(csum, buf, r);
}
free(buf);
printf("%08x\n", csum);
return 0;
}

View File

@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor
Description: Vitastor client library
Version: 0.8.0
Version: 0.8.3
Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir}

View File

@@ -24,6 +24,7 @@ typedef void VitastorIOHandler(void *opaque, long retval);
// QEMU
typedef void IOHandler(void *opaque);
// is_external and poll_fn are not required, but are here for compatibility
typedef void QEMUSetFDHandler(void *ctx, int fd, int is_external, IOHandler *fd_read, IOHandler *fd_write, void *poll_fn, void *opaque);
vitastor_c *vitastor_c_create_qemu(QEMUSetFDHandler *aio_set_fd_handler, void *aio_context,

View File

@@ -61,15 +61,31 @@ fi
POOLCFG='"name":"testpool","failure_domain":"osd",'$POOLCFG
$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$PG_COUNT'}}'
sleep 2
wait_up()
{
local sec=$1
local i=0
local configured=0
while [[ $i -lt $sec ]]; do
if $ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and ([ .[0].items["1"][] |
select(((.osd_set | select(. != 0) | sort | unique) | length) == '$PG_SIZE') ] | length) == '$PG_COUNT; then
configured=1
if $ETCDCTL get /vitastor/pg/state/1/ --prefix --print-value-only | jq -s -e '[ .[] | select(.state == ["active"]) ] | length == '$PG_COUNT; then
break
fi
fi
sleep 1
i=$((i+1))
if [ $i -eq $sec ]; then
if [[ $configured -ne 0 ]]; then
format_error "FAILED: $PG_COUNT PG(s) NOT CONFIGURED"
fi
format_error "FAILED: $PG_COUNT PG(s) NOT UP"
fi
done
}
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(.[0].items["1"] | map((.osd_set | select(. > 0)) | length == '$PG_SIZE') | length) == '$PG_COUNT); then
format_error "FAILED: $PG_COUNT PGS NOT CONFIGURED"
fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$PG_COUNT); then
format_error "FAILED: $PG_COUNT PGS NOT UP"
fi
wait_up 60
try_reweight()
{

View File

@@ -1,6 +1,6 @@
#!/bin/bash -ex
PG_COUNT=16
PG_COUNT=2048
. `dirname $0`/run_3osds.sh
@@ -10,8 +10,7 @@ LD_PRELOAD="build/src/libfio_vitastor.so" \
for i in 4; do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) &>./testdata/osd$i.log &
eval OSD${i}_PID=$!
start_osd $i
done
sleep 2
@@ -33,4 +32,28 @@ if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '(
format_error "FAILED: $PG_COUNT PGS NOT ACTIVE"
fi
sleep 1
kill -9 $OSD4_PID
sleep 1
build/src/vitastor-cli --etcd_address $ETCD_URL rm-osd --force 4
sleep 2
for i in {1..10}; do
($ETCDCTL get /vitastor/config/pgs --print-value-only |\
jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3"])') && \
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT'') && \
break
sleep 1
done
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3"])'); then
format_error "FAILED: OSD NOT REMOVED FROM DISTRIBUTION"
fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT''); then
format_error "FAILED: $PG_COUNT PGS NOT ACTIVE"
fi
format_green OK