Compare commits

..

85 Commits

Author SHA1 Message Date
Vitaliy Filippov 0b41ffc08d Release 0.6.0
Warning: upgrading from 0.5.x is currently not supported!
Please create an issue if you really need upgrade capability.

New features:
- Snapshots and Copy-on-Write clones
- Inode (image) names
- Inode I/O and space statistics
- Write throttling for smoothing random write workloads in SSD+HDD configurations
2021-04-11 00:49:18 +03:00
Vitaliy Filippov 64eeb79051 Prevent 0.6.x OSDs from talking to 0.5.x
The new protocol is almost compatible - it has bitmaps, but also it has
a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients
compatible, but for now I just assume nobody needs it.

If I'm wrong and anybody requests to upgrade their production 0.5.x system
to 0.6.x I'll fix it.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov 2a02f3c4c7 Add metadata superblock and check it on start
Refuse to start if the superblock is missing or bad version;
zero out the metadata area when initializing superblock.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov f684d9101a Refuse to start with old journal version 2021-04-10 17:44:12 +03:00
Vitaliy Filippov c72fddd714 Notes about master/0.5.x 2021-04-10 17:44:12 +03:00
Vitaliy Filippov a1f2f19489 Do not increment inode statistics if the object already exists 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 82c1a7ec67 Fix statistics reporting, split inode number into pool & inode 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2ab423d4ef Implement journaled write throttling for the SSD+HDD case 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 4694811eab Add microsecond accuracy to set_timer 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6b988de17d Remove timerfd_interval 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 37efdc2a83 Fix bitmap_set for replicated pools 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 591cad09c9 Fix bitmaps for objects larger than 128K 2021-04-10 17:44:12 +03:00
Vitaliy Filippov b907ad50aa Oops, forgot to add external bitmaps to blockstore in some places 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 7308d6a6c0 Note about etcd 3.4.15 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 5f5b6ef150 Enable chained reads in the client 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 38a3df4a0e Implement chained (optimized) read in the primary OSD code 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6950b8e3a0 Watch inode metadata revisions 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 0cea3576fb Add "read bitmaps" operation to secondary OSD protocol 2021-04-10 17:44:12 +03:00
Vitaliy Filippov f01eea07d3 Add simplified interface to read blockstore bitmaps synchronously 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2c2f08aca2 Shorten some structure names 2021-04-10 17:44:12 +03:00
Vitaliy Filippov d6524670e1 Introduce data distribution locality 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 879ecfa74d Fix wording 2021-04-10 17:44:12 +03:00
Vitaliy Filippov aea2d19d35 Change Telegram chat link 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 04f86dc00b Fix Russian README for CMake build 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 7aeb2cbac7 Capture all by value in qemu_proxy 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 519f081006 Add LICENSE 2021-04-10 17:44:12 +03:00
Vitaliy Filippov e50f703e1d Add Russian version of the README 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2612d3198a Introduce image names and metadata storage in etcd
Each inode has: image name, parent inode number & pool, size and readonly flag

Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-04-10 17:44:12 +03:00
Vitaliy Filippov ab39ce2bbb Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling 2021-04-10 17:44:12 +03:00
Vitaliy Filippov d0c2e31312 Add a test for snapshots, fix bugs. Now the test passes 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 9038d42327 Fix several snapshot I/O bugs 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 691f066055 Actual snapshot support (untested) 2021-04-10 17:44:12 +03:00
Vitaliy Filippov ffe1cd4c79 Report inode I/O statistics, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 4ae1b84c67 Report inode space usage statistics to etcd, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
Vitaliy Filippov c35963967f Add inode space usage statistics tracking to blockstore 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 0aa2dd2890 Send bitmaps with primary-reads, actually read bitmaps for READ ops 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6bf88883ac Allocate bitmaps along with stripes to avoid memory fragmentation 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 004f265393 Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 860ac24762 Add "external" bitmap support to the secondary OSD protocol 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6107a4d07b Add "external" bitmap support to blockstore 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 95c29b9dc3 Add "external" bitmap support to osd_rmw 2021-04-10 17:44:12 +03:00
Vitaliy Filippov d99407dcec Check QEMU block-vitastor.so during the test 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6909807068 Allow to start the OSD just to flush the journal completely 2021-04-10 17:44:12 +03:00
Vitaliy Filippov ec90fe6ec1 Release 0.5.13
Another followup to 0.5.11
2021-04-09 12:10:16 +03:00
Vitaliy Filippov 18c72f4835 Correct reenterability fix (now verified with a test)
It's rather funny but 0.5.12 has to be re-published again
2021-04-09 12:10:16 +03:00
Vitaliy Filippov 59fbcef734 Release 0.5.12
Fix qemu driver broken in 0.5.11 :)
2021-04-08 15:47:18 +03:00
Vitaliy Filippov 40b7c21fb1 Followup to 307c1731c1 - fix mark_stable 2021-04-08 15:47:18 +03:00
Vitaliy Filippov efb3678606 Fix qemu-img broken in 0.5.11
Caused by the lack of reenterability of the main cluster_client function
2021-04-08 14:59:20 +03:00
Vitaliy Filippov 462650134e Release 0.5.11
Another bunch of fixes, including important ones. Now OSDs are stable in SSD+HDD
configurations and everything is mostly ready for the merge of master branch.

Features:

- Add min_flusher_count configuration (good for HDDs)
- Shuffle PGs for better data device utilisation
- Make OSDs benefit from the immediate_commit=small setting if it's applicable

Bug fixes:

- Rework client code to fix write ordering during operation replay
- Rework error handling code so OSDs don't crash in reaction to a crash of their peer OSDs
- Fix several block layer problems related to the journal, some of which
  were leading to double allocations of the same block during journal replay
- Fix monitors crashing during the removal of OSD keys from etcd
- Fix data fsyncs being incorrectly disabled when only disable_journal_fsync was set
- Always zero out unused part of request/reply headers
- Fix some theoretically possible read/write ordering issues
- Don't try to "recover" misplaced objects if it would make them degraded
- Fix heartbeats sometimes preventing OSD to establish connections
2021-04-08 01:18:46 +03:00
Vitaliy Filippov 8d87e32175 Fix msgr_op.h includes 2021-04-08 01:18:46 +03:00
Vitaliy Filippov b0b2e7df3c Fix use-after-free in keepalive_timer and rework stop_client()
The bug reproduced if fio was temporarily stopped with SIGSTOP
during write test and then resumed after 10 seconds. In this case
"pings" were failed for all clients and fio process crashed with
'use-after-free' in keepalive_timer. It happened because it called
stop_client while having a live iterator to the map.
2021-04-07 11:06:31 +03:00
Vitaliy Filippov 97efb9e299 Do not crash on PG re-peering events when operations are in progress 2021-04-07 11:06:31 +03:00
Vitaliy Filippov f6d705383a Fix client connection recovery bugs, add dirty_ops limit 2021-04-07 11:06:31 +03:00
Vitaliy Filippov 68567c0e1f Fix messenger possibly trying to connect to the same OSD twice 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 04b00003e9 Log ping failures 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 307c1731c1 Forget all dirty_entries before stable big_write or delete during initialisation
This fixes a 'double_alloc' assertion in the following case:
- big_write object #1 v1 to block #100
- big_write object #1 v2 to block #101
- big_write object #2 v1 to block #100
2021-04-07 01:30:38 +03:00
Vitaliy Filippov 75a6a556b5 Shuffle PGs for better data device utilisation 2021-04-07 01:30:38 +03:00
Vitaliy Filippov a48e2bbf18 Fix write replay ordering when immediate_commit != all
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage

Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
2021-04-03 14:51:52 +03:00
Vitaliy Filippov 688821665a Remove stoull_full() from etcd_state_client.cpp 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 3e162d95a0 Remove http_client.h include from etcd_state_client.h 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 829381b335 Extract some definitions to msgr_op.{cpp,h} 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 54f2353f24 Use bitmap granularity for alignment checks 2021-04-03 14:36:04 +03:00
Vitaliy Filippov e47f6fba60 Remove cluster_client_t::stop() 2021-04-03 14:35:42 +03:00
Vitaliy Filippov 883bf84a16 Fix build 2021-04-03 01:47:15 +03:00
Vitaliy Filippov 52097c4856 Stop flushing when less than min_flusher_count operations are available (unless a trim is forced) 2021-04-03 00:53:28 +03:00
Vitaliy Filippov e1355cbc74 Report failed operation name in cluster_client 2021-04-03 00:53:28 +03:00
Vitaliy Filippov 8f8b90be7a Add min_flusher_count configuration 2021-04-03 00:53:28 +03:00
Vitaliy Filippov ad9f619370 Skip double allocs when reading journal 2021-04-03 00:53:28 +03:00
Vitaliy Filippov f4769ba7c7 Collapse create+delete journal entry pairs if they're already flushed
Old journal replay mechanism could lead to a double allocation of the same
block and a "Fatal error: tried to overwrite non-zero metadata entry"
2021-04-03 00:53:28 +03:00
Vitaliy Filippov 843b7052d2 Add an assertion when clearing deleted metadata entries, add debug details when freeing blocks 2021-04-03 00:53:28 +03:00
Vitaliy Filippov df99e232ee Deduplicate osd_sets in pg history + raise request size limit for etcd 2021-04-03 00:53:28 +03:00
Vitaliy Filippov 3a40fa4127 Fix monitor errors in case of OSD removal 2021-03-27 01:15:18 +03:00
Vitaliy Filippov 4095bcc558 Do not ignore object deletion journal entries when they are preceded by a big write 2021-03-25 11:00:10 +03:00
Vitaliy Filippov 564d64e271 Add some details for debug prints 2021-03-25 11:00:10 +03:00
Vitaliy Filippov cf54741c95 Followup to 05db1308aa
Don't do anything with the object state after errors because
it's freed by PG re-peer in this case
2021-03-25 11:00:10 +03:00
Vitaliy Filippov 18a5fafa2a Fix rollback 2021-03-25 02:41:58 +03:00
Vitaliy Filippov 06f4978085 Fix fsync check in blockstore_flush (data fsyncs were disabled instead of journal fsyncs) 2021-03-25 02:41:58 +03:00
Vitaliy Filippov 7ebf1588c5 Check for immediate_commit==small in the OSD code 2021-03-25 02:41:58 +03:00
Vitaliy Filippov b0ad1e1e6d Remember writes as "unsynced" only after completing them
Previously BS_OP_SYNC could take unfinished writes and add them into the journal before
they were actually completed. This was leading to crashes with the message
"BUG: Unexpected dirty_entry 2000000000001:9f2a0000 v3 unstable state during flush: 338"
2021-03-25 02:41:58 +03:00
Vitaliy Filippov 0949f08407 Extract osd_primary write and sync code into separate files 2021-03-24 14:20:56 +03:00
Vitaliy Filippov 04a1f18fa5 Assign .req as a whole to always zero out the remaining part
Also clear .reply before processing the operation
2021-03-24 14:20:56 +03:00
Vitaliy Filippov cf9a641d66 Skip disconnected OSDs during sync 2021-03-24 14:20:56 +03:00
Vitaliy Filippov 05db1308aa Fix two potential read/write ordering problems (even though not yet seen in tests)
- Write operations could be 'stabilized' and previous versions could be
  purged from OSDs before the removal of version_override and following
  reads could potentially hit different version in EC pools
- Object was marked clean after completing the delete during recovery, so
  reads could in theory hit a deleted version and return nothing
2021-03-24 14:20:56 +03:00
Vitaliy Filippov 98b54ca948 Don't try to "recover" misplaced objects if it would make them degraded 2021-03-21 01:37:23 +03:00
Vitaliy Filippov 23225c5e62 Do not run ping on clients that are not yet connected 2021-03-21 01:37:23 +03:00
75 changed files with 4076 additions and 2012 deletions

View File

@ -22,6 +22,7 @@ Vitastor на данный момент находится в статусе п
Однако следующее уже реализовано:
0.5.x (стабильная версия):
- Базовая часть - надёжное кластерное блочное хранилище без единой точки отказа
- Производительность ;-D
- Несколько схем отказоустойчивости: репликация, XOR n+1 (1 диск чётности), коды коррекции ошибок
@ -42,20 +43,23 @@ Vitastor на данный момент находится в статусе п
- NBD-прокси для монтирования образов ядром ("блочное устройство в режиме пользователя")
- Утилита удаления образов/инодов (vitastor-rm)
- Пакеты для Debian и CentOS
0.6.x (master-ветка):
- Статистика операций ввода/вывода и занятого места в разрезе инодов
- Именование инодов через хранение их метаданных в etcd
- Снапшоты и copy-on-write клоны
- Сглаживание производительности случайной записи в SSD+HDD конфигурациях
## Планы разработки
## Планы развития
- Более корректные скрипты разметки дисков и автоматического запуска OSD
- Другие инструменты администрирования
- Плагины для OpenStack, Kubernetes, OpenNebula, Proxmox и других облачных систем
- iSCSI-прокси
- Таймауты операций и более быстрое выявление отказов
- Более быстрое переключение при отказах
- Фоновая проверка целостности без контрольных сумм (сверка реплик)
- Контрольные суммы
- Оптимизации для гибридных SSD+HDD хранилищ
- Поддержка SSD-кэширования (tiered storage)
- Поддержка RDMA и NVDIMM
- Web-интерфейс
- Возможно, сжатие
@ -359,9 +363,9 @@ Vitastor с однопоточной NBD прокси на том же стен
так как в 5.4 есть как минимум 1 известный баг, ведущий к зависанию с io_uring и контроллером HP SmartArray.
- Установите liburing 0.4 или более новый и его заголовки.
- Установите lp_solve.
- Установите etcd. Внимание: вам нужна версия с исправлением отсюда: https://github.com/vitalif/etcd/,
из ветки release-3.4, так как в etcd есть баг, который [будет](https://github.com/etcd-io/etcd/pull/12402)
исправлен только в 3.4.15. Баг приводит к неспособности Vitastor запустить PG, когда их хотя бы 500 штук.
- Установите etcd, версии не ниже 3.4.15. Более ранние версии работать не будут из-за различных багов,
например [#12402](https://github.com/etcd-io/etcd/pull/12402). Также вы можете взять версию 3.4.13 с
этим конкретным исправлением из ветки release-3.4 репозитория https://github.com/vitalif/etcd/.
- Установите node.js 10 или новее.
- Установите gcc и g++ 8.x или новее.
- Склонируйте данный репозиторий с подмодулями: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.

View File

@ -16,6 +16,7 @@ with configurable redundancy (replication or erasure codes/XOR).
Vitastor is currently a pre-release, a lot of features are missing and you can still expect
breaking changes in the future. However, the following is implemented:
0.5.x (stable):
- Basic part: highly-available block storage with symmetric clustering and no SPOF
- Performance ;-D
- Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
@ -36,9 +37,12 @@ breaking changes in the future. However, the following is implemented:
- NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm)
- Packaging for Debian and CentOS
0.6.x (master):
- Per-inode I/O and space usage statistics
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Write throttling to smooth random write workloads in SSD+HDD configurations
## Roadmap
@ -46,10 +50,10 @@ breaking changes in the future. However, the following is implemented:
- Other administrative tools
- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Operation timeouts and better failure detection
- Faster failover
- Scrubbing without checksums (verification of replicas)
- Checksums
- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
- Tiered storage
- RDMA and NVDIMM support
- Web GUI
- Compression (possibly)
@ -315,10 +319,9 @@ Vitastor with single-thread NBD on the same hardware:
there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd. Attention: you need a fixed version from here: https://github.com/vitalif/etcd/,
branch release-3.4, because there is a bug in upstream etcd which makes Vitastor OSDs fail to
move PGs out of "starting" state if you have at least around ~500 PGs or so. The custom build
will be unnecessary when etcd merges the fix: https://github.com/etcd-io/etcd/pull/12402.
- Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take 3.4.13
with this specific fix from here: https://github.com/vitalif/etcd/, branch release-3.4.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.

2
debian/changelog vendored
View File

@ -1,4 +1,4 @@
vitastor (0.5.10-1) unstable; urgency=medium
vitastor (0.6.0-1) unstable; urgency=medium
* Bugfixes

View File

@ -40,10 +40,10 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.5.10; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.10/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.5.10/fio; \
cd vitastor-0.5.10; \
cp -r /root/vitastor vitastor-0.6.0; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.0/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.6.0/fio; \
cd vitastor-0.6.0; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \
@ -59,8 +59,8 @@ RUN set -e -x; \
echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.10.orig.tar.xz vitastor-0.5.10; \
cd vitastor-0.5.10; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.0.orig.tar.xz vitastor-0.6.0; \
cd vitastor-0.6.0; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@ -104,6 +104,17 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
return res;
}
function shuffle(array)
{
for (let i = array.length - 1, j, x; i > 0; i--)
{
j = Math.floor(Math.random() * (i + 1));
x = array[i];
array[i] = array[j];
array[j] = x;
}
}
function make_int_pgs(weights, pg_count)
{
const total_weight = Object.values(weights).reduce((a, c) => Number(a) + Number(c), 0);
@ -120,6 +131,7 @@ function make_int_pgs(weights, pg_count)
weight_left -= weights[pg_name];
pg_left -= n;
}
shuffle(int_pgs);
return int_pgs;
}

View File

@ -53,7 +53,6 @@ ExecStart=/usr/bin/vitastor-osd \\
--osd_num $OSD_NUM \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 256 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\

View File

@ -32,7 +32,8 @@ ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_
--advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
--initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
--initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
--initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
--initial-cluster-state new --max-txn-ops=100000 --max-request-bytes=104857600 \\
--auto-compaction-retention=10 --auto-compaction-mode=revision
WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
User=etcd

View File

@ -34,7 +34,7 @@ const etcd_allow = new RegExp('^'+[
'pg/stats/[1-9]\\d*/[1-9]\\d*',
'pg/history/[1-9]\\d*/[1-9]\\d*',
'history/last_clean_pgs',
'inode/stats/[1-9]\\d*',
'inode/stats/[1-9]\\d*/[1-9]\\d*',
'stats',
].join('$|^')+'$');
@ -96,7 +96,8 @@ const etcd_tree = {
disable_device_lock,
// blockstore - configurable
max_write_iodepth,
flusher_count,
min_flusher_count: 1,
max_flusher_count: 256,
inmemory_metadata,
inmemory_journal,
journal_sector_buffer_count,
@ -210,7 +211,7 @@ const etcd_tree = {
/* <pool_id>: {
<pg_id>: {
primary: osd_num_t,
state: ("starting"|"peering"|"incomplete"|"active"|"stopping"|"offline"|
state: ("starting"|"peering"|"incomplete"|"active"|"repeering"|"stopping"|"offline"|
"degraded"|"has_incomplete"|"has_degraded"|"has_misplaced"|"has_unclean"|
"has_invalid"|"left_on_dead")[],
}
@ -579,7 +580,7 @@ class Mon
for (const osd_num of this.all_osds().sort((a, b) => a - b))
{
const stat = this.state.osd.stats[osd_num];
if (stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
if (stat && stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
{
// Numeric IDs are reserved for OSDs
const osd_cfg = this.state.config.osd[osd_num];
@ -730,6 +731,11 @@ class Mon
pg_history[i].osd_sets = pg_history[i].osd_sets || [];
pg_history[i].osd_sets.push(prev_pgs[i]);
}
if (pg_history[i] && pg_history[i].osd_sets)
{
pg_history[i].osd_sets = Object.values(pg_history[i].osd_sets
.reduce((a, c) => { a[c.join(' ')] = c; return a; }, {}));
}
});
for (let i = 0; i < new_pgs.length || i < prev_pgs.length; i++)
{
@ -880,7 +886,7 @@ class Mon
{
// Take configuration and state, check it against the stored configuration hash
// Recalculate PGs and save them to etcd if the configuration is changed
// FIXME: Also do not change anything if the distribution is good enough and no PGs are degraded
// FIXME: Do not change anything if the distribution is good and random enough and no PGs are degraded
const { up_osds, levels, osd_tree } = this.get_osd_tree();
const tree_cfg = {
osd_tree,
@ -939,7 +945,14 @@ class Mon
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
}
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
const old_pg_count = prev_pgs.length;
const old_pg_count = real_prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
};
let optimize_result;
if (old_pg_count > 0)
{
@ -966,24 +979,23 @@ class Mon
pg.pop();
}
}
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
osd_tree: pool_tree,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
});
if (!this.state.config.pgs.hash)
{
// Re-shuffle PGs
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
else
{
optimize_result = await LPOptimizer.optimize_initial({
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
if (old_pg_count != optimize_result.int_pgs.length)
{
console.log(
@ -1108,7 +1120,7 @@ class Mon
const op_stats = {}, subop_stats = {}, recovery_stats = {};
for (const osd in this.state.osd.stats)
{
const st = this.state.osd.stats[osd];
const st = this.state.osd.stats[osd]||{};
for (const op in st.op_stats||{})
{
op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n };
@ -1166,23 +1178,31 @@ class Mon
});
for (const osd_num in this.state.osd.space)
{
for (const inode_num in this.state.osd.space[osd_num])
for (const pool_id in this.state.osd.space[osd_num])
{
inode_stats[inode_num] = inode_stats[inode_num] || inode_stub();
inode_stats[inode_num].raw_used += BigInt(this.state.osd.space[osd_num][inode_num]||0);
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in this.state.osd.space[osd_num][pool_id])
{
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
}
}
}
for (const osd_num in this.state.osd.inodestats)
{
const ist = this.state.osd.inodestats[osd_num];
for (const inode_num in ist)
for (const pool_id in ist)
{
inode_stats[inode_num] = inode_stats[inode_num] || inode_stub();
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in ist[pool_id])
{
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
for (const op of [ 'read', 'write', 'delete' ])
{
inode_stats[inode_num][op].count += BigInt(ist[inode_num][op].count||0);
inode_stats[inode_num][op].usec += BigInt(ist[inode_num][op].usec||0);
inode_stats[inode_num][op].bytes += BigInt(ist[inode_num][op].bytes||0);
inode_stats[pool_id][inode_num][op].count += BigInt(ist[pool_id][inode_num][op].count||0);
inode_stats[pool_id][inode_num][op].usec += BigInt(ist[pool_id][inode_num][op].usec||0);
inode_stats[pool_id][inode_num][op].bytes += BigInt(ist[pool_id][inode_num][op].bytes||0);
}
}
}
}
@ -1248,13 +1268,16 @@ class Mon
this.serialize_bigints(stats);
this.serialize_bigints(inode_stats);
txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });
for (const inode_num in inode_stats)
for (const pool_id in inode_stats)
{
for (const inode_num in inode_stats[pool_id])
{
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/inode/stats/'+inode_num),
value: b64(JSON.stringify(inode_stats[inode_num])),
key: b64(this.etcd_prefix+'/inode/stats/'+pool_id+'/'+inode_num),
value: b64(JSON.stringify(inode_stats[pool_id][inode_num])),
} });
}
}
if (txn.length)
{
await this.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);

View File

@ -51,7 +51,7 @@ async function run()
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size;
const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
: Math.round(meta_size/1024/1024*100)/100+" MB");
@ -65,6 +65,9 @@ async function run()
);
}
process.stdout.write(
(options.device_block_size != 4096 ?
` --meta_block_size ${options.device}\n`+
` --journal_block-size ${options.device}\n` : '')+
` --data_device ${options.device}\n`+
` --journal_offset ${options.journal_offset}\n`+
` --meta_offset ${meta_offset}\n`+

View File

@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.5.10/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.10$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.6.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.0$(rpm --eval '%dist').tar.gz *

View File

@ -37,7 +37,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.10.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.0.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@ -1,11 +1,11 @@
Name: vitastor
Version: 0.5.10
Version: 0.6.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.10.el7.tar.gz
Source0: vitastor-0.6.0.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.10.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.0.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@ -1,11 +1,11 @@
Name: vitastor
Version: 0.5.10
Version: 0.6.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.10.el8.tar.gz
Source0: vitastor-0.6.0.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@ -13,8 +13,8 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6-dev")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith)
add_definitions(-DVERSION="0.6.0")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
add_link_options(-fsanitize=address -fno-omit-frame-pointer)
@ -66,7 +66,8 @@ target_link_libraries(fio_vitastor_blk
# vitastor-osd
add_executable(vitastor-osd
osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_subops.cpp etcd_state_client.cpp messenger.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_primary.cpp osd_primary_chain.cpp osd_primary_sync.cpp osd_primary_write.cpp osd_primary_subops.cpp
etcd_state_client.cpp messenger.cpp msgr_stop.cpp msgr_op.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
)
target_link_libraries(vitastor-osd
@ -86,7 +87,7 @@ target_link_libraries(fio_vitastor_sec
# libvitastor_client.so
add_library(vitastor_client SHARED
cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
)
target_link_libraries(vitastor_client
@ -161,7 +162,8 @@ target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)
# stub_uring_osd
add_executable(stub_uring_osd
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_stop.cpp msgr_op.cpp
msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_link_libraries(stub_uring_osd
${LIBURING_LIBRARIES}
@ -175,8 +177,17 @@ target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp)
# test_cluster_client
add_executable(test_cluster_client
test_cluster_client.cpp
pg_states.cpp osd_ops.cpp cluster_client.cpp msgr_op.cpp mock/messenger.cpp msgr_stop.cpp
etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mock)
## test_blockstore, test_shit
#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
#add_executable(test_blockstore test_blockstore.cpp)
#target_link_libraries(test_blockstore blockstore)
#add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
#target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)

View File

@ -37,6 +37,21 @@ allocator::~allocator()
delete[] mask;
}
bool allocator::get(uint64_t addr)
{
if (addr >= size)
{
return false;
}
uint64_t p2 = 1, offset = 0;
while (p2 * 64 < size)
{
offset += p2;
p2 = p2 * 64;
}
return ((mask[offset + addr/64] >> (addr % 64)) & 1);
}
void allocator::set(uint64_t addr, bool value)
{
if (addr >= size)

View File

@ -16,6 +16,7 @@ class allocator
public:
allocator(uint64_t blocks);
~allocator();
bool get(uint64_t addr);
void set(uint64_t addr, bool value);
uint64_t find_free();
uint64_t get_free_count();

View File

@ -3,9 +3,9 @@
#include "blockstore_impl.h"
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
{
impl = new blockstore_impl_t(config, ringloop);
impl = new blockstore_impl_t(config, ringloop, tfd);
}
blockstore_t::~blockstore_t()
@ -38,6 +38,11 @@ void blockstore_t::enqueue_op(blockstore_op_t *op)
impl->enqueue_op(op);
}
int blockstore_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
{
return impl->read_bitmap(oid, target_version, bitmap, result_version);
}
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
{
return impl->unstable_writes;

View File

@ -16,6 +16,7 @@
#include "object_id.h"
#include "ringloop.h"
#include "timerfd_manager.h"
// Memory alignment for direct I/O (usually 512 bytes)
// All other alignments must be a multiple of this one
@ -158,7 +159,7 @@ class blockstore_t
{
blockstore_impl_t *impl;
public:
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop);
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_t();
// Event loop
@ -179,6 +180,9 @@ public:
// Submission
void enqueue_op(blockstore_op_t *op);
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes();

View File

@ -3,12 +3,13 @@
#include "blockstore_impl.h"
journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
{
this->bs = bs;
this->flusher_count = flusher_count;
this->cur_flusher_count = 1;
this->target_flusher_count = 1;
this->max_flusher_count = bs->max_flusher_count;
this->min_flusher_count = bs->min_flusher_count;
this->cur_flusher_count = bs->min_flusher_count;
this->target_flusher_count = bs->min_flusher_count;
dequeuing = false;
trimming = false;
active_flushers = 0;
@ -16,11 +17,11 @@ journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval
flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
journal_trim_interval = 512;
journal_trim_counter = 0;
trim_wanted = 0;
journal_trim_counter = bs->journal.flush_journal ? 1 : 0;
trim_wanted = bs->journal.flush_journal ? 1 : 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
co = new journal_flusher_co[flusher_count];
for (int i = 0; i < flusher_count; i++)
co = new journal_flusher_co[max_flusher_count];
for (int i = 0; i < max_flusher_count; i++)
{
co[i].bs = bs;
co[i].flusher = this;
@ -71,10 +72,10 @@ bool journal_flusher_t::is_active()
void journal_flusher_t::loop()
{
target_flusher_count = bs->write_iodepth*2;
if (target_flusher_count <= 0)
target_flusher_count = 1;
else if (target_flusher_count > flusher_count)
target_flusher_count = flusher_count;
if (target_flusher_count < min_flusher_count)
target_flusher_count = min_flusher_count;
else if (target_flusher_count > max_flusher_count)
target_flusher_count = max_flusher_count;
if (target_flusher_count > cur_flusher_count)
cur_flusher_count = target_flusher_count;
else if (target_flusher_count < cur_flusher_count)
@ -237,7 +238,8 @@ bool journal_flusher_co::loop()
else if (wait_state == 21)
goto resume_21;
resume_0:
if (!flusher->flush_queue.size() || !flusher->dequeuing)
if (flusher->flush_queue.size() < flusher->min_flusher_count && !flusher->trim_wanted ||
!flusher->flush_queue.size() || !flusher->dequeuing)
{
stop_flusher:
if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
@ -483,6 +485,13 @@ resume_1:
}
if (has_delete)
{
clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
{
printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx) while deleting %lx:%lx\n",
clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, cur.oid.inode, cur.oid.stripe);
exit(1);
}
// zero out new metadata entry
memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
}
@ -593,6 +602,7 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = new_trim_pos,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size };
@ -624,6 +634,12 @@ resume_1:
#endif
flusher->trimming = false;
}
if (bs->journal.flush_journal && !flusher->flush_queue.size())
{
assert(bs->journal.used_start == bs->journal.next_free);
printf("Journal flushed\n");
exit(0);
}
}
// All done
flusher->active_flushers--;
@ -654,7 +670,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: %d",
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: 0x%x",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
@ -783,7 +799,10 @@ void journal_flusher_co::update_clean_db()
if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu (new location is %lu)\n", old_clean_loc >> bs->block_order, clean_loc >> bs->block_order);
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
old_clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version,
clean_loc >> bs->block_order);
#endif
bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
}
@ -791,6 +810,11 @@ void journal_flusher_co::update_clean_db()
{
auto clean_it = bs->clean_db.find(cur.oid);
bs->clean_db.erase(clean_it);
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (delete)\n",
clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version);
#endif
bs->data_alloc->set(clean_loc >> bs->block_order, false);
clean_loc = UINT64_MAX;
}
@ -812,7 +836,7 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
goto resume_1;
else if (wait_state == wait_base+2)
goto resume_2;
if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_journal_fsync))
if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_data_fsync))
{
cur_sync = flusher->syncs.end();
while (cur_sync != flusher->syncs.begin())

View File

@ -79,7 +79,7 @@ class journal_flusher_t
{
int trim_wanted = 0;
bool dequeuing;
int flusher_count, cur_flusher_count, target_flusher_count;
int min_flusher_count, max_flusher_count, cur_flusher_count, target_flusher_count;
int flusher_start_threshold;
journal_flusher_co *co;
blockstore_impl_t *bs;
@ -98,7 +98,7 @@ class journal_flusher_t
std::deque<object_id> flush_queue;
std::map<object_id, uint64_t> flush_versions;
public:
journal_flusher_t(int flusher_count, blockstore_impl_t *bs);
journal_flusher_t(blockstore_impl_t *bs);
~journal_flusher_t();
void loop();
bool is_active();

View File

@ -3,9 +3,10 @@
#include "blockstore_impl.h"
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
{
assert(sizeof(blockstore_op_private_t) <= BS_OP_PRIVATE_DATA_SIZE);
this->tfd = tfd;
this->ringloop = ringloop;
ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&ring_consumer);
@ -31,7 +32,7 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
close(journal.fd);
throw;
}
flusher = new journal_flusher_t(flusher_count, this);
flusher = new journal_flusher_t(this);
}
blockstore_impl_t::~blockstore_impl_t()
@ -92,10 +93,23 @@ void blockstore_impl_t::loop()
{
delete journal_init_reader;
journal_init_reader = NULL;
if (journal.flush_journal)
initialized = 3;
else
initialized = 10;
ringloop->wakeup();
}
}
if (initialized == 3)
{
if (readonly)
{
printf("Can't flush the journal in readonly mode\n");
exit(1);
}
flusher->loop();
ringloop->submit();
}
}
else
{
@ -443,7 +457,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
}
for (; clean_it != clean_end; clean_it++)
{
if (!pg_count || ((clean_it->first.inode + clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg)
if (!pg_count || ((clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
{
if (stable_count >= stable_alloc)
{
@ -488,7 +502,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
}
for (; dirty_it != dirty_end; dirty_it++)
{
if (!pg_count || ((dirty_it->first.oid.inode + dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg)
if (!pg_count || ((dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
{
if (IS_DELETE(dirty_it->second.state))
{

View File

@ -9,6 +9,7 @@
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <linux/fs.h>
@ -77,6 +78,23 @@
#include "blockstore_journal.h"
// "VITAstor"
#define BLOCKSTORE_META_MAGIC 0x726F747341544956l
#define BLOCKSTORE_META_VERSION 1
// metadata header (superblock)
// FIXME: After adding the OSD superblock, add a key to metadata
// and journal headers to check if they belong to the same OSD
struct __attribute__((__packed__)) blockstore_meta_header_t
{
uint64_t zero;
uint64_t magic;
uint64_t version;
uint32_t meta_block_size;
uint32_t data_block_size;
uint32_t bitmap_granularity;
};
// 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
// per "clean" entry on disk with fixed metadata tables
// FIXME: maybe add crc32's to metadata
@ -158,6 +176,7 @@ struct blockstore_op_private_t
struct iovec iov_zerofill[3];
// Warning: must not have a default value here because it's written to before calling constructor in blockstore_write.cpp O_o
uint64_t real_version;
timespec tv_begin;
// Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
@ -199,10 +218,18 @@ class blockstore_impl_t
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
int immediate_commit = IMMEDIATE_NONE;
bool inmemory_meta = false;
// Maximum flusher count
unsigned flusher_count;
// Maximum and minimum flusher count
unsigned max_flusher_count, min_flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
// Enable small (journaled) write throttling, useful for the SSD+HDD case
bool throttle_small_writes = false;
// Target data device iops, bandwidth and parallelism for throttling (100/100/1 is the default for HDD)
int throttle_target_iops = 100;
int throttle_target_mbs = 100;
int throttle_target_parallelism = 1;
// Minimum difference in microseconds between target and real execution times to throttle the response
int throttle_threshold_us = 50;
/******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer;
@ -212,6 +239,7 @@ class blockstore_impl_t
blockstore_dirty_db_t dirty_db;
std::vector<blockstore_op_t*> submit_queue;
std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
int unsynced_big_write_count = 0;
allocator *data_alloc = NULL;
uint8_t *zero_object;
@ -232,6 +260,7 @@ class blockstore_impl_t
bool live = false, queue_stall = false;
ring_loop_t *ringloop;
timerfd_manager_t *tfd;
bool stop_sync_submitted;
@ -286,7 +315,7 @@ class blockstore_impl_t
// Stabilize
int dequeue_stable(blockstore_op_t *op);
int continue_stable(blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov);
void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
void stabilize_object(object_id oid, uint64_t max_ver);
@ -302,7 +331,7 @@ class blockstore_impl_t
public:
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop);
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_impl_t();
// Event loop
@ -323,6 +352,9 @@ public:
// Submission
void enqueue_op(blockstore_op_t *op);
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> unstable_writes;

View File

@ -3,6 +3,20 @@
#include "blockstore_impl.h"
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full during initialization");\
data = ((ring_data_t*)sqe->user_data)
static bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
{
this->bs = bs;
@ -10,7 +24,7 @@ blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
void blockstore_init_meta::handle_event(ring_data_t *data)
{
if (data->res <= 0)
if (data->res < 0)
{
throw std::runtime_error(
std::string("read metadata failed at offset ") + std::to_string(metadata_read) +
@ -28,6 +42,12 @@ int blockstore_init_meta::loop()
{
if (wait_state == 1)
goto resume_1;
else if (wait_state == 2)
goto resume_2;
else if (wait_state == 3)
goto resume_3;
else if (wait_state == 4)
goto resume_4;
printf("Reading blockstore metadata\n");
if (bs->inmemory_meta)
metadata_buffer = bs->metadata_buffer;
@ -35,22 +55,98 @@ int blockstore_init_meta::loop()
metadata_buffer = memalign(MEM_ALIGNMENT, 2*bs->metadata_buf_size);
if (!metadata_buffer)
throw std::runtime_error("Failed to allocate metadata read buffer");
while (1)
{
// Read superblock
GET_SQE();
data->iov = { metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_1:
if (submitted)
{
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)metadata_buffer, bs->meta_block_size / sizeof(uint64_t)))
{
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
hdr->zero = 0;
hdr->magic = BLOCKSTORE_META_MAGIC;
hdr->version = BLOCKSTORE_META_VERSION;
hdr->meta_block_size = bs->meta_block_size;
hdr->data_block_size = bs->block_size;
hdr->bitmap_granularity = bs->bitmap_granularity;
}
if (bs->readonly)
{
printf("Skipping metadata initialization because blockstore is readonly\n");
}
else
{
printf("Initializing metadata area\n");
GET_SQE();
data->iov = (struct iovec){ metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_3:
if (submitted > 0)
{
wait_state = 3;
return 1;
}
zero_on_init = true;
}
}
else
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
if (hdr->zero != 0 ||
hdr->magic != BLOCKSTORE_META_MAGIC ||
hdr->version != BLOCKSTORE_META_VERSION)
{
printf(
"Metadata is corrupt or old version.\n"
" If this is a new OSD please zero out the metadata area before starting it.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n"
);
exit(1);
}
if (hdr->meta_block_size != bs->meta_block_size ||
hdr->data_block_size != bs->block_size ||
hdr->bitmap_granularity != bs->bitmap_granularity)
{
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u)"
" differs from OSD configuration (%lu/%u/%lu).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
bs->meta_block_size, bs->block_size, bs->bitmap_granularity
);
exit(1);
}
}
// Skip superblock
bs->meta_offset += bs->meta_block_size;
prev_done = 0;
done_len = 0;
done_pos = 0;
metadata_read = 0;
// Read the rest of the metadata
while (1)
{
resume_2:
if (submitted)
{
wait_state = 2;
return 1;
}
if (metadata_read < bs->meta_len)
{
sqe = bs->get_sqe();
if (!sqe)
{
throw std::runtime_error("io_uring is full while trying to read metadata");
}
data = ((ring_data_t*)sqe->user_data);
GET_SQE();
data->iov = {
metadata_buffer + (bs->inmemory_meta
? metadata_read
@ -58,7 +154,14 @@ int blockstore_init_meta::loop()
bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
};
data->callback = [this](ring_data_t *data) { handle_event(data); };
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
else
{
// Fill metadata with zeroes
memset(data->iov.iov_base, 0, data->iov.iov_len);
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
}
bs->ringloop->submit();
submitted = (prev == 1 ? 2 : 1);
prev = submitted;
@ -90,6 +193,21 @@ int blockstore_init_meta::loop()
free(metadata_buffer);
metadata_buffer = NULL;
}
if (zero_on_init && !bs->disable_meta_fsync)
{
GET_SQE();
my_uring_prep_fsync(sqe, bs->meta_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this](ring_data_t *data) { handle_event(data); };
submitted = 1;
bs->ringloop->submit();
resume_4:
if (submitted > 0)
{
wait_state = 4;
return 1;
}
}
return 0;
}
@ -111,7 +229,10 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
{
// free the previous block
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i);
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
clean_it->second.location >> block_order,
clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
done_cnt+i);
#endif
bs->data_alloc->set(clean_it->second.location >> block_order, false);
}
@ -153,14 +274,6 @@ blockstore_init_journal::blockstore_init_journal(blockstore_impl_t *bs)
};
}
bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
void blockstore_init_journal::handle_event(ring_data_t *data1)
{
if (data1->res <= 0)
@ -185,12 +298,6 @@ void blockstore_init_journal::handle_event(ring_data_t *data1)
submitted_buf = NULL;
}
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full while trying to read journal");\
data = ((ring_data_t*)sqe->user_data)
int blockstore_init_journal::loop()
{
if (wait_state == 1)
@ -228,7 +335,7 @@ resume_1:
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)submitted_buf, 3))
if (iszero((uint64_t*)submitted_buf, bs->journal.block_size / sizeof(uint64_t)))
{
// Journal is empty
// FIXME handle this wrapping to journal_block_size better (maybe)
@ -243,6 +350,7 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = bs->journal.block_size,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf);
if (bs->readonly)
@ -293,11 +401,21 @@ resume_1:
je_start = (journal_entry_start*)submitted_buf;
if (je_start->magic != JOURNAL_MAGIC ||
je_start->type != JE_START ||
je_start->size != sizeof(journal_entry_start) ||
je_crc32((journal_entry*)je_start) != je_start->crc32)
je_crc32((journal_entry*)je_start) != je_start->crc32 ||
je_start->size != sizeof(journal_entry_start) && je_start->size != JE_START_LEGACY_SIZE)
{
// Entry is corrupt
throw std::runtime_error("first entry of the journal is corrupt");
fprintf(stderr, "First entry of the journal is corrupt\n");
exit(1);
}
if (je_start->size == JE_START_LEGACY_SIZE || je_start->version != JOURNAL_VERSION)
{
fprintf(
stderr, "The code only supports journal version %d, but it is %lu on disk."
" Please use the previous version to flush the journal before upgrading OSD\n",
JOURNAL_VERSION, je_start->size == JE_START_LEGACY_SIZE ? 0 : je_start->version
);
exit(1);
}
next_free = journal_pos = bs->journal.used_start = je_start->journal_start;
if (!bs->journal.inmemory)
@ -403,6 +521,18 @@ resume_1:
}
}
}
for (auto ov: double_allocs)
{
auto dirty_it = bs->dirty_db.find(ov);
if (dirty_it != bs->dirty_db.end() &&
IS_BIG_WRITE(dirty_it->second.state) &&
dirty_it->second.location == UINT64_MAX)
{
printf("Fatal error (bug): %lx:%lx v%lu big_write journal_entry was allocated over another object\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
exit(1);
}
}
bs->flusher->mark_trim_possible();
bs->journal.dirty_start = bs->journal.next_free;
printf(
@ -534,20 +664,20 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->small_write.oid,
.version = je->small_write.version,
};
void *bmp = (void*)je + sizeof(journal_entry_small_write);
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_small_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp, bs->clean_entry_bitmap_size);
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else if (!bs->journal.inmemory)
else
{
// FIXME Using large blockstore objects and not keeping journal in memory
// will result in a lot of small allocations for entry bitmaps. This can
// only be fixed by using a patched map with dynamic entry size, but not
// the btree_map, because it doesn't keep iterators valid all the time.
void *bmp_cp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp_cp, bmp, bs->clean_entry_bitmap_size);
bmp = bmp_cp;
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
@ -569,7 +699,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_SMALL_WRITE_INSTANT)
{
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
}
}
@ -599,32 +729,10 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
// its data and metadata are already flushed.
// We don't know if newer versions are flushed, but
// the previous delete definitely is.
// So we flush previous dirty entries, but retain the clean one.
// So we forget previous dirty entries, but retain the clean one.
// This feature is required for writes happening shortly
// after deletes.
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != je->big_write.oid)
{
dirty_it++;
break;
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
bs->erase_dirty(
dirty_it, dirty_end,
clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX
);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(je->big_write.oid);
erase_dirty_object(dirty_it);
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
@ -636,22 +744,22 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->big_write.oid,
.version = je->big_write.version,
};
void *bmp = (void*)je + sizeof(journal_entry_big_write);
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_big_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp, bs->clean_entry_bitmap_size);
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else if (!bs->journal.inmemory)
else
{
// FIXME Using large blockstore objects and not keeping journal in memory
// will result in a lot of small allocations for entry bitmaps. This can
// only be fixed by using a patched map with dynamic entry size, but not
// the btree_map, because it doesn't keep iterators valid all the time.
void *bmp_cp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp_cp, bmp, bs->clean_entry_bitmap_size);
bmp = bmp_cp;
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
bs->dirty_db.emplace(ov, (dirty_entry){
auto dirty_it = bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
.flags = 0,
.location = je->big_write.location,
@ -659,11 +767,26 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.len = je->big_write.len,
.journal_sector = proc_pos,
.bitmap = bmp,
});
}).first;
if (bs->data_alloc->get(je->big_write.location >> bs->block_order))
{
// This is probably a big_write that's already flushed and freed, but it may
// also indicate a bug. So we remember such entries and recheck them afterwards.
// If it's not a bug they won't be present after reading the whole journal.
dirty_it->second.location = UINT64_MAX;
double_allocs.push_back(ov);
}
else
{
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block %lu\n", je->big_write.location >> bs->block_order);
printf(
"Allocate block (journal) %lu: %lx:%lx v%lu\n",
je->big_write.location >> bs->block_order,
ov.oid.inode, ov.oid.stripe, ov.version
);
#endif
bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
}
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
@ -675,7 +798,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_BIG_WRITE_INSTANT)
{
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
}
}
@ -689,7 +812,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->stable.oid,
.version = je->stable.version,
};
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
else if (je->type == JE_ROLLBACK)
{
@ -708,9 +831,26 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
#ifdef BLOCKSTORE_DEBUG
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif
bool dirty_exists = false;
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){
.oid = je->del.oid,
.version = UINT64_MAX,
});
if (dirty_it != bs->dirty_db.begin())
{
dirty_it--;
dirty_exists = dirty_it->first.oid == je->del.oid;
}
auto clean_it = bs->clean_db.find(je->del.oid);
if (clean_it != bs->clean_db.end() &&
clean_it->second.version < je->del.version)
bool clean_exists = (clean_it != bs->clean_db.end() &&
clean_it->second.version < je->del.version);
if (!clean_exists && dirty_exists)
{
// Clean entry doesn't exist. This means that the delete is already flushed.
// So we must not flush this object anymore.
erase_dirty_object(dirty_it);
}
else if (clean_exists || dirty_exists)
{
// oid, version
obj_ver_id ov = {
@ -728,8 +868,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
// Deletions are treated as immediately stable, because
// "2-phase commit" (write->stabilize) isn't sufficient for them anyway
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
// Ignore delete if neither preceding dirty entries nor the clean one are present
}
started = true;
pos += je->size;
@ -740,3 +881,35 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.next_free = next_free;
return 1;
}
void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it)
{
auto oid = dirty_it->first.oid;
bool exists = !IS_DELETE(dirty_it->second.state);
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != oid)
{
dirty_it++;
break;
}
}
auto clean_it = bs->clean_db.find(oid);
uint64_t clean_loc = clean_it != bs->clean_db.end()
? clean_it->second.location : UINT64_MAX;
if (exists && clean_loc == UINT64_MAX)
{
bs->inode_space_stats[oid.inode] -= bs->block_size;
}
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(oid);
}

View File

@ -7,6 +7,7 @@ class blockstore_init_meta
{
blockstore_impl_t *bs;
int wait_state = 0, wait_count = 0;
bool zero_on_init = false;
void *metadata_buffer = NULL;
uint64_t metadata_read = 0;
int prev = 0, prev_done = 0, done_len = 0, submitted = 0;
@ -36,6 +37,7 @@ class blockstore_init_journal
bool started = false;
uint64_t next_free;
std::vector<bs_init_journal_done> done;
std::vector<obj_ver_id> double_allocs;
uint64_t journal_pos = 0;
uint64_t continue_pos = 0;
void *init_write_buf = NULL;
@ -48,6 +50,7 @@ class blockstore_init_journal
std::function<void(ring_data_t*)> simple_callback;
int handle_journal_part(void *buf, uint64_t done_pos, uint64_t len);
void handle_event(ring_data_t *data);
void erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it);
public:
blockstore_init_journal(blockstore_impl_t* bs);
int loop();

View File

@ -7,6 +7,7 @@
#define MIN_JOURNAL_SIZE 4*1024*1024
#define JOURNAL_MAGIC 0x4A33
#define JOURNAL_VERSION 1
#define JOURNAL_BUFFER_SIZE 4*1024*1024
// We reserve some extra space for future stabilize requests during writes
@ -37,7 +38,9 @@ struct __attribute__((__packed__)) journal_entry_start
uint32_t size;
uint32_t reserved;
uint64_t journal_start;
uint64_t version;
};
#define JE_START_LEGACY_SIZE 24
struct __attribute__((__packed__)) journal_entry_small_write
{
@ -149,6 +152,7 @@ struct journal_t
int fd;
uint64_t device_size;
bool inmemory = false;
bool flush_journal = false;
void *buffer = NULL;
uint64_t block_size;

View File

@ -42,6 +42,11 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
disable_flock = true;
}
if (config["flush_journal"] == "true" || config["flush_journal"] == "1" || config["flush_journal"] == "yes")
{
// Only flush journal and exit
journal.flush_journal = true;
}
if (config["immediate_commit"] == "all")
{
immediate_commit = IMMEDIATE_ALL;
@ -69,8 +74,16 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
max_flusher_count = strtoull(config["max_flusher_count"].c_str(), NULL, 10);
if (!max_flusher_count)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
// Validate
if (!block_size)
{
@ -80,9 +93,13 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
throw std::runtime_error("Bad block size");
}
if (!flusher_count)
if (!max_flusher_count)
{
flusher_count = 32;
max_flusher_count = 256;
}
if (!min_flusher_count || journal.flush_journal)
{
min_flusher_count = 1;
}
if (!max_write_iodepth)
{
@ -168,6 +185,22 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
}
if (!throttle_target_iops)
{
throttle_target_iops = 100;
}
if (!throttle_target_mbs)
{
throttle_target_mbs = 100;
}
if (!throttle_target_parallelism)
{
throttle_target_parallelism = 1;
}
if (!throttle_threshold_us)
{
throttle_threshold_us = 50;
}
// init some fields
clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
@ -224,7 +257,7 @@ void blockstore_impl_t::calc_lengths()
}
// required metadata size
block_count = data_len / block_size;
meta_len = ((block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
meta_len = (1 + (block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
if (meta_area < meta_len)
{
throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes");

View File

@ -268,3 +268,50 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
FINISH_OP(op);
}
}
int blockstore_impl_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
.oid = oid,
.version = UINT64_MAX,
});
if (dirty_it != dirty_db.begin())
dirty_it--;
if (dirty_it != dirty_db.end())
{
while (dirty_it->first.oid == oid)
{
if (target_version >= dirty_it->first.version)
{
if (result_version)
*result_version = dirty_it->first.version;
if (bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (dirty_it == dirty_db.begin())
break;
dirty_it--;
}
}
auto clean_it = clean_db.find(oid);
if (clean_it != clean_db.end())
{
if (result_version)
*result_version = clean_it->second.version;
if (bitmap)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (result_version)
*result_version = 0;
if (bitmap)
memset(bitmap, 0, clean_entry_bitmap_size);
return -ENOENT;
}

View File

@ -248,10 +248,12 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
}
while (1)
{
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc)
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc &&
dirty_it->second.location != UINT64_MAX)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu\n", dirty_it->second.location >> block_order);
printf("Free block %lu from %lx:%lx v%lu\n", dirty_it->second.location >> block_order,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
data_alloc->set(dirty_it->second.location >> block_order, false);
}

View File

@ -168,6 +168,9 @@ resume_5:
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// Mark all dirty_db entries up to op->version as stable
#ifdef BLOCKSTORE_DEBUG
printf("Stabilize %lx:%lx v%lu\n", v->oid.inode, v->oid.stripe, v->version);
#endif
mark_stable(*v);
}
// Acknowledge op
@ -176,31 +179,66 @@ resume_5:
return 2;
}
void blockstore_impl_t::mark_stable(const obj_ver_id & v)
void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
{
auto dirty_it = dirty_db.find(v);
if (dirty_it != dirty_db.end())
{
while (1)
{
bool was_stable = IS_STABLE(dirty_it->second.state);
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
// Allocations and deletions are counted when they're stabilized
if (IS_BIG_WRITE(dirty_it->second.state))
{
int exists = -1;
if (dirty_it != dirty_db.begin())
{
auto prev_it = dirty_it;
prev_it--;
if (prev_it->first.oid == v.oid)
{
exists = IS_DELETE(prev_it->second.state) ? 0 : 1;
}
}
if (exists == -1)
{
auto clean_it = clean_db.find(v.oid);
exists = clean_it != clean_db.end() ? 1 : 0;
}
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += block_size;
}
}
else if (IS_DELETE(dirty_it->second.state))
{
inode_space_stats[dirty_it->first.oid.inode] -= block_size;
}
}
else if (IS_STABLE(dirty_it->second.state))
if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) ||
IS_DELETE(dirty_it->second.state)))
{
// Big write overrides all previous dirty entries
auto erase_end = dirty_it;
while (dirty_it != dirty_db.begin())
{
dirty_it--;
if (dirty_it->first.oid != v.oid)
{
dirty_it++;
break;
}
if (dirty_it == dirty_db.begin())
}
auto clean_it = clean_db.find(v.oid);
uint64_t clean_loc = clean_it != clean_db.end()
? clean_it->second.location : UINT64_MAX;
erase_dirty(dirty_it, erase_end, clean_loc);
break;
}
if (was_stable || dirty_it == dirty_db.begin())
{
break;
}

View File

@ -24,6 +24,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
if (PRIV(op)->op_state == 0)
{
stop_sync_submitted = false;
unsynced_big_write_count -= unsynced_big_writes.size();
PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
PRIV(op)->sync_small_writes.swap(unsynced_small_writes);
PRIV(op)->sync_small_checked = 0;
@ -79,7 +80,8 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
// 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@ -94,7 +96,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
int s = 0, cur_sector = -1;
while (it != PRIV(op)->sync_big_writes.end())
{
if (!journal.entry_fits(sizeof(journal_entry_big_write)) &&
if (!journal.entry_fits(sizeof(journal_entry_big_write) + clean_entry_bitmap_size) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
@ -102,24 +104,27 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
cur_sector = journal.cur_sector;
}
auto & dirty_entry = dirty_db.at(*it);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version,
dirty_entry.journal_sector, it->oid.inode, it->oid.stripe, it->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = it->oid;
je->version = it->version;
je->offset = dirty_db[*it].offset;
je->len = dirty_db[*it].len;
je->location = dirty_db[*it].location;
je->offset = dirty_entry.offset;
je->len = dirty_entry.len;
je->location = dirty_entry.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*)
? dirty_entry.bitmap : &dirty_entry.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
it++;

View File

@ -30,21 +30,27 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? !IS_SYNCED(dirty_it->second.state)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
if (!is_del && !deleted)
{
if (clean_entry_bitmap_size > sizeof(void*))
memcpy(bmp, dirty_it->second.bitmap, clean_entry_bitmap_size);
else
bmp = dirty_it->second.bitmap;
}
}
}
if (!found)
{
auto clean_it = clean_db.find(op->oid);
if (clean_it != clean_db.end())
{
version = clean_it->second.version + 1;
if (!is_del)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy((clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, clean_entry_bitmap_size);
}
}
else
{
deleted = true;
@ -116,6 +122,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
else
{
state = (op->len == block_size || deleted ? BS_ST_BIG_WRITE : BS_ST_SMALL_WRITE);
if (state == BS_ST_SMALL_WRITE && throttle_small_writes)
clock_gettime(CLOCK_REALTIME, &PRIV(op)->tv_begin);
if (wait_del)
state |= BS_ST_WAIT_DEL;
else if (state == BS_ST_SMALL_WRITE && wait_big)
@ -241,7 +249,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, unsynced_big_write_count + 1,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@ -264,7 +273,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
dirty_it->second.location = loc << block_order;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block %lu\n", loc);
printf(
"Allocate block %lu for %lx:%lx v%lu\n",
loc, op->oid.inode, op->oid.stripe, op->version
);
#endif
data_alloc->set(loc, true);
uint64_t stripe_offset = (op->offset % bitmap_granularity);
@ -290,11 +302,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
if (immediate_commit != IMMEDIATE_ALL)
{
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
// Increase the counter, but don't save into unsynced_writes yet (can't sync until the write is finished)
unsynced_big_write_count++;
PRIV(op)->op_state = 3;
}
else
@ -307,8 +316,11 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Small (journaled) write
// First check if the journal has sufficient space
blockstore_journal_check_t space_check(this);
if (unsynced_big_writes.size() && !space_check.check_available(op, unsynced_big_writes.size(), sizeof(journal_entry_big_write), 0)
|| !space_check.check_available(op, 1, sizeof(journal_entry_small_write), op->len + JOURNAL_STABILIZE_RESERVATION))
if (unsynced_big_write_count &&
!space_check.check_available(op, unsynced_big_write_count,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, 0)
|| !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + clean_entry_bitmap_size, op->len + JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@ -316,8 +328,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
journal.sector_info[journal.cur_sector].dirty)
!journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
{
// Write current journal sector only if it's dirty and full, or in the immediate_commit mode
BS_SUBMIT_GET_SQE_DECL(sqe1);
@ -400,14 +411,6 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
journal.next_free = journal_block_size;
}
if (immediate_commit == IMMEDIATE_NONE)
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
@ -423,27 +426,29 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
int blockstore_impl_t::continue_write(blockstore_op_t *op)
{
io_uring_sqe *sqe = NULL;
journal_entry_big_write *je;
int op_state = PRIV(op)->op_state;
if (op_state != 2 && op_state != 4)
if (op_state == 2)
goto resume_2;
else if (op_state == 4)
goto resume_4;
else if (op_state == 6)
goto resume_6;
else
{
// In progress
return 1;
}
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
if (op_state == 2)
goto resume_2;
else if (op_state == 4)
goto resume_4;
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
io_uring_sqe *sqe = NULL;
BS_SUBMIT_GET_SQE_DECL(sqe);
je = (journal_entry_big_write*)prefill_single_journal_entry(
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
@ -470,14 +475,20 @@ resume_2:
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
}
resume_4:
// Switch object state
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = state %x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
printf("Ack write %lx:%lx v%lu = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? (immediate_commit == IMMEDIATE_ALL)
: (immediate_commit != IMMEDIATE_NONE);
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
if (imm)
{
auto & unstab = unstable_writes[op->oid];
@ -487,11 +498,31 @@ resume_4:
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Deletions are treated as immediately stable
// Deletions and 'instant' operations are treated as immediately stable
mark_stable(dirty_it->first);
}
if (immediate_commit == IMMEDIATE_ALL)
if (!imm)
{
if (is_big)
{
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
else
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
}
if (imm && (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
// Unblock small writes
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
@ -502,6 +533,41 @@ resume_4:
dirty_it++;
}
}
// Apply throttling to not fill the journal too fast for the SSD+HDD case
if (!is_big && throttle_small_writes)
{
// Apply throttling
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
uint64_t exec_us =
(tv_end.tv_sec - PRIV(op)->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - PRIV(op)->tv_begin.tv_nsec)/1000;
// Compare with target execution time
// 100% free -> target time = 0
// 0% free -> target time = iodepth/parallelism * (iops + size/bw) / write per second
uint64_t used_start = journal.get_trim_pos();
uint64_t journal_free_space = journal.next_free < used_start
? (used_start - journal.next_free)
: (journal.len - journal.next_free + used_start - journal.block_size);
uint64_t ref_us =
(write_iodepth <= throttle_target_parallelism ? 100 : 100*write_iodepth/throttle_target_parallelism)
* (1000000/throttle_target_iops + op->len*1000000/throttle_target_mbs/1024/1024)
/ 100;
ref_us -= ref_us * journal_free_space / journal.len;
if (ref_us > exec_us + throttle_threshold_us)
{
// Pause reply
tfd->set_timer_us(ref_us-exec_us, false, [this, op](int timer_id)
{
PRIV(op)->op_state++;
ringloop->wakeup();
});
PRIV(op)->op_state = 5;
return 1;
}
}
}
resume_6:
// Acknowledge write
op->retval = op->len;
write_iodepth--;
@ -625,14 +691,6 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
else
{
// Remember delete as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;

File diff suppressed because it is too large Load Diff

View File

@ -8,7 +8,8 @@
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024
struct cluster_op_t;
@ -20,8 +21,7 @@ struct cluster_op_part_t
pg_num_t pg_num;
osd_num_t osd_num;
osd_op_buf_list_t iov;
bool sent;
bool done;
unsigned flags;
osd_op_t op;
};
@ -36,19 +36,28 @@ struct cluster_op_t
std::function<void(cluster_op_t*)> callback;
~cluster_op_t();
protected:
int flags = 0;
int state = 0;
uint64_t cur_inode; // for snapshot reads
void *buf = NULL;
cluster_op_t *orig_op = NULL;
bool is_internal = false;
bool needs_reslice = false;
bool up_wait = false;
int sent_count = 0, done_count = 0;
int inflight_count = 0, done_count = 0;
std::vector<cluster_op_part_t> parts;
void *bitmap_buf = NULL, *part_bitmaps = NULL;
unsigned bitmap_buf_size = 0;
friend class cluster_client_t;
};
struct cluster_buffer_t
{
void *buf;
uint64_t len;
int state;
};
// FIXME: Split into public and private interfaces
class cluster_client_t
{
timerfd_manager_t *tfd;
@ -59,28 +68,27 @@ class cluster_client_t
std::map<pool_id_t, uint64_t> pg_counts;
bool immediate_commit = false;
// FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
uint64_t client_dirty_limit = 0;
uint64_t client_max_dirty_bytes = 0;
uint64_t client_max_dirty_ops = 0;
int log_level;
int up_wait_retry_interval = 500; // ms
uint64_t op_id = 1;
ring_consumer_t consumer;
// operations currently in progress
std::set<cluster_op_t*> cur_ops;
int retry_timeout_id = 0;
// unsynced operations are copied in memory to allow replay when cluster isn't in the immediate_commit mode
// unsynced_writes are replayed in any order (because only the SYNC operation guarantees ordering)
std::vector<cluster_op_t*> unsynced_writes;
std::vector<cluster_op_t*> syncing_writes;
cluster_op_t* cur_sync = NULL;
std::vector<cluster_op_t*> next_writes;
uint64_t op_id = 1;
std::vector<cluster_op_t*> offline_ops;
uint64_t queued_bytes = 0;
std::vector<cluster_op_t*> op_queue;
std::map<object_id, cluster_buffer_t> dirty_buffers;
std::set<osd_num_t> dirty_osds;
uint64_t dirty_bytes = 0, dirty_ops = 0;
void *scrap_buffer = NULL;
unsigned scrap_buffer_size = 0;
bool pgs_loaded = false;
ring_consumer_t consumer;
std::vector<std::function<void(void)>> on_ready_hooks;
int continuing_ops = 0;
int op_queue_pos = 0;
public:
etcd_state_client_t st_cli;
@ -93,19 +101,20 @@ public:
bool is_ready();
void on_ready(std::function<void(void)> fn);
protected:
static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
void continue_ops(bool up_retry = false);
protected:
bool affects_osd(uint64_t inode, uint64_t offset, uint64_t len, osd_num_t osd);
void flush_buffer(const object_id & oid, cluster_buffer_t *wr);
void on_load_config_hook(json11::Json::object & config);
void on_load_pgs_hook(bool success);
void on_change_hook(json11::Json::object & changes);
void on_change_hook(std::map<std::string, etcd_kv_t> & changes);
void on_change_osd_state_hook(uint64_t peer_osd);
cluster_op_t *copy_write(cluster_op_t *op);
void continue_rw(cluster_op_t *op);
int continue_rw(cluster_op_t *op);
void slice_rw(cluster_op_t *op);
bool try_send(cluster_op_t *op, int i);
void execute_sync(cluster_op_t *op);
void continue_sync();
void finish_sync();
int continue_sync(cluster_op_t *op);
void send_sync(cluster_op_t *op, cluster_op_part_t *part);
void handle_op_part(cluster_op_part_t *part);
void copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *part);
};

View File

@ -4,8 +4,10 @@
#include "osd_ops.h"
#include "pg_states.h"
#include "etcd_state_client.h"
#ifndef __MOCK__
#include "http_client.h"
#include "base64.h"
#endif
etcd_state_client_t::~etcd_state_client_t()
{
@ -15,16 +17,19 @@ etcd_state_client_t::~etcd_state_client_t()
}
watches.clear();
etcd_watches_initialised = -1;
#ifndef __MOCK__
if (etcd_watch_ws)
{
etcd_watch_ws->close();
etcd_watch_ws = NULL;
}
#endif
}
json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
#ifndef __MOCK__
etcd_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
{
json_kv_t kv;
etcd_kv_t kv;
kv.key = base64_decode(kv_json["key"].string_value());
std::string json_err, json_text = base64_decode(kv_json["value"].string_value());
kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
@ -33,6 +38,8 @@ json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
kv.key = "";
}
else
kv.mod_revision = kv_json["mod_revision"].uint64_value();
return kv;
}
@ -145,22 +152,22 @@ void etcd_state_client_t::start_etcd_watcher()
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
}
// First gather all changes into a hash to remove multiple overwrites
json11::Json::object changes;
std::map<std::string, etcd_kv_t> changes;
for (auto & ev: data["result"]["events"].array_items())
{
auto kv = parse_etcd_kv(ev["kv"]);
if (kv.key != "")
{
changes[kv.key] = kv.value;
changes[kv.key] = kv;
}
}
for (auto & kv: changes)
{
if (this->log_level > 3)
{
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.value.dump().c_str());
}
parse_state(kv.first, kv.second);
parse_state(kv.second);
}
// React to changes
if (on_change_hook != NULL)
@ -327,16 +334,33 @@ void etcd_state_client_t::load_pgs()
for (auto & kv_json: res["response_range"]["kvs"].array_items())
{
auto kv = parse_etcd_kv(kv_json);
parse_state(kv.key, kv.value);
parse_state(kv);
}
}
on_load_pgs_hook(true);
start_etcd_watcher();
});
}
void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
#else
void etcd_state_client_t::parse_config(json11::Json & config)
{
}
void etcd_state_client_t::load_global_config()
{
json11::Json::object global_config;
on_load_config_hook(global_config);
}
void etcd_state_client_t::load_pgs()
{
}
#endif
void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
const std::string & key = kv.key;
const json11::Json & value = kv.value;
if (key == etcd_prefix+"/config/pools")
{
for (auto & pool_item: this->pool_config)
@ -347,8 +371,10 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
pool_config_t pc;
// ID
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{
printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
@ -460,16 +486,19 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
for (auto & pool_item: value["items"].object_items())
{
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{
printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
for (auto & pg_item: pool_item.second.object_items())
{
pg_num_t pg_num = stoull_full(pg_item.first);
if (!pg_num)
pg_num_t pg_num = 0;
sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
if (!pg_num || null_byte != 0)
{
printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
continue;
@ -682,6 +711,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
.size = value["size"].uint64_value(),
.parent_id = parent_inode_num,
.readonly = value["readonly"].bool_value(),
.mod_revision = kv.mod_revision,
};
this->inode_config[inode_num] = cfg;
if (cfg.name != "")

View File

@ -3,8 +3,8 @@
#pragma once
#include "json11/json11.hpp"
#include "osd_id.h"
#include "http_client.h"
#include "timerfd_manager.h"
#define ETCD_CONFIG_WATCH_ID 1
@ -18,10 +18,11 @@
#define DEFAULT_BLOCK_SIZE 128*1024
struct json_kv_t
struct etcd_kv_t
{
std::string key;
json11::Json value;
uint64_t mod_revision;
};
struct pg_config_t
@ -59,6 +60,8 @@ struct inode_config_t
uint64_t size;
inode_t parent_id;
bool readonly;
// Change revision of the metadata in etcd
uint64_t mod_revision;
};
struct inode_watch_t
@ -67,12 +70,14 @@ struct inode_watch_t
inode_config_t cfg;
};
struct websocket_t;
struct etcd_state_client_t
{
protected:
std::vector<inode_watch_t*> watches;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = 0;
uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
void add_etcd_url(std::string);
public:
std::vector<std::string> etcd_addresses;
@ -87,20 +92,20 @@ public:
std::map<inode_t, inode_config_t> inode_config;
std::map<std::string, inode_t> inode_by_name;
std::function<void(json11::Json::object &)> on_change_hook;
std::function<void(std::map<std::string, etcd_kv_t> &)> on_change_hook;
std::function<void(json11::Json::object &)> on_load_config_hook;
std::function<json11::Json()> load_pgs_checks_hook;
std::function<void(bool)> on_load_pgs_hook;
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;
json_kv_t parse_etcd_kv(const json11::Json & kv_json);
etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
void start_etcd_watcher();
void load_global_config();
void load_pgs();
void parse_state(const std::string & key, const json11::Json & value);
void parse_state(const etcd_kv_t & kv);
void parse_config(json11::Json & config);
inode_watch_t* watch_inode(std::string name);
void close_watch(inode_watch_t* watch);

View File

@ -25,6 +25,7 @@
// -bs_config='{"data_device":"./test_data.bin"}' -size=1000M
#include "blockstore.h"
#include "epoll_manager.h"
#include "fio_headers.h"
#include "json11/json11.hpp"
@ -32,6 +33,7 @@
struct bs_data
{
blockstore_t *bs;
epoll_manager_t *epmgr;
ring_loop_t *ringloop;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
@ -104,6 +106,7 @@ static void bs_cleanup(struct thread_data *td)
}
safe:
delete bsd->bs;
delete bsd->epmgr;
delete bsd->ringloop;
delete bsd;
}
@ -129,7 +132,8 @@ static int bs_init(struct thread_data *td)
}
}
bsd->ringloop = new ring_loop_t(512);
bsd->bs = new blockstore_t(config, bsd->ringloop);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->bs = new blockstore_t(config, bsd->ringloop, bsd->epmgr->tfd);
while (1)
{
bsd->ringloop->loop();

View File

@ -10,30 +10,16 @@
#include "messenger.h"
osd_op_t::~osd_op_t()
{
assert(!bs_op);
assert(!op_data);
if (rmw_buf)
{
free(rmw_buf);
}
if (buf)
{
// Note: reusing osd_op_t WILL currently lead to memory leaks
// So we don't reuse it, but free it every time
free(buf);
}
}
void osd_messenger_t::init()
{
keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
{
for (auto cl_it = clients.begin(); cl_it != clients.end();)
std::vector<int> to_stop;
std::vector<osd_op_t*> to_ping;
for (auto cl_it = clients.begin(); cl_it != clients.end(); cl_it++)
{
auto cl = (cl_it++)->second;
if (!cl->osd_num)
auto cl = cl_it->second;
if (!cl->osd_num || cl->peer_state != PEER_CONNECTED)
{
// Do not run keepalive on regular clients
continue;
@ -44,7 +30,8 @@ void osd_messenger_t::init()
if (!cl->ping_time_remaining)
{
// Ping timed out, stop the client
stop_client(cl->peer_fd, true);
printf("Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
to_stop.push_back(cl->peer_fd);
}
}
else if (cl->idle_time_remaining > 0)
@ -70,10 +57,11 @@ void osd_messenger_t::init()
delete op;
if (fail_fd >= 0)
{
printf("Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
stop_client(fail_fd, true);
}
};
outbox_push(op);
to_ping.push_back(op);
cl->ping_time_remaining = osd_ping_timeout;
cl->idle_time_remaining = osd_idle_timeout;
}
@ -83,6 +71,15 @@ void osd_messenger_t::init()
cl->idle_time_remaining = osd_idle_timeout;
}
}
// Don't stop clients while a 'clients' iterator is still active
for (int peer_fd: to_stop)
{
stop_client(peer_fd, true);
}
for (auto op: to_ping)
{
outbox_push(op);
}
});
}
@ -141,17 +138,14 @@ void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
wanted_peers[peer_osd].port = (int)peer_state["port"].int64_value();
}
wanted_peers[peer_osd].address_changed = true;
if (!wanted_peers[peer_osd].connecting &&
(time(NULL) - wanted_peers[peer_osd].last_connect_attempt) >= peer_connect_interval)
{
try_connect_peer(peer_osd);
}
}
void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
{
auto wp_it = wanted_peers.find(peer_osd);
if (wp_it == wanted_peers.end())
if (wp_it == wanted_peers.end() || wp_it->second.connecting ||
(time(NULL) - wp_it->second.last_connect_attempt) < peer_connect_interval)
{
return;
}
@ -197,10 +191,22 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
on_connect_peer(peer_osd, -errno);
return;
}
int timeout_id = -1;
clients[peer_fd] = new osd_client_t();
clients[peer_fd]->peer_addr = addr;
clients[peer_fd]->peer_port = peer_port;
clients[peer_fd]->peer_fd = peer_fd;
clients[peer_fd]->peer_state = PEER_CONNECTING;
clients[peer_fd]->connect_timeout_id = -1;
clients[peer_fd]->osd_num = peer_osd;
clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
{
// Either OUT (connected) or HUP
handle_connect_epoll(peer_fd);
});
if (peer_connect_timeout > 0)
{
timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
clients[peer_fd]->connect_timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true);
@ -208,20 +214,6 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
return;
});
}
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = peer_port,
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTING,
.connect_timeout_id = timeout_id,
.osd_num = peer_osd,
.in_buf = malloc_or_die(receive_buffer_size),
});
tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
{
// Either OUT (connected) or HUP
handle_connect_epoll(peer_fd);
});
}
void osd_messenger_t::handle_connect_epoll(int peer_fd)
@ -357,6 +349,15 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
err = true;
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
}
else if (config["protocol_version"].uint64_value() != OSD_PROTOCOL_VERSION)
{
err = true;
printf(
"OSD %lu protocol version is %lu, but only version %u is supported.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n",
cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
);
}
}
if (err)
{
@ -373,123 +374,6 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
outbox_push(op);
}
void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
{
for (auto p: cl->sent_ops)
{
cancel_op(p.second);
}
cl->sent_ops.clear();
cl->outbox.clear();
}
void osd_messenger_t::cancel_op(osd_op_t *op)
{
if (op->op_type == OSD_OP_OUT)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = -EPIPE;
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
}
else
{
// This function is only called in stop_client(), so it's fine to destroy the operation
delete op;
}
}
void osd_messenger_t::stop_client(int peer_fd, bool force)
{
assert(peer_fd != 0);
auto it = clients.find(peer_fd);
if (it == clients.end())
{
return;
}
uint64_t repeer_osd = 0;
osd_client_t *cl = it->second;
if (cl->peer_state == PEER_CONNECTED)
{
if (cl->osd_num)
{
// Reload configuration from etcd when the connection is dropped
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
repeer_osd = cl->osd_num;
}
else
{
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
else if (!force)
{
return;
}
cl->peer_state = PEER_STOPPED;
clients.erase(it);
tfd->set_fd_handler(peer_fd, false, NULL);
if (cl->connect_timeout_id >= 0)
{
tfd->clear_timer(cl->connect_timeout_id);
cl->connect_timeout_id = -1;
}
if (cl->osd_num)
{
osd_peer_fds.erase(cl->osd_num);
}
if (cl->read_op)
{
if (cl->read_op->callback)
{
cancel_op(cl->read_op);
}
else
{
delete cl->read_op;
}
cl->read_op = NULL;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
if (*rit == peer_fd)
{
read_ready_clients.erase(rit);
break;
}
}
for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
{
if (*wit == peer_fd)
{
write_ready_clients.erase(wit);
break;
}
}
free(cl->in_buf);
cl->in_buf = NULL;
close(peer_fd);
if (repeer_osd)
{
// First repeer PGs as canceling OSD ops may push new operations
// and we need correct PG states when we do that
repeer_pgs(repeer_osd);
}
if (cl->osd_num)
{
// Cancel outbound operations
cancel_osd_ops(cl);
}
if (cl->refs <= 0)
{
delete cl;
}
}
void osd_messenger_t::accept_connections(int listen_fd)
{
// Accept new connections
@ -505,13 +389,12 @@ void osd_messenger_t::accept_connections(int listen_fd)
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = ntohs(addr.sin_port),
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTED,
.in_buf = malloc_or_die(receive_buffer_size),
});
clients[peer_fd] = new osd_client_t();
clients[peer_fd]->peer_addr = addr;
clients[peer_fd]->peer_port = ntohs(addr.sin_port);
clients[peer_fd]->peer_fd = peer_fd;
clients[peer_fd]->peer_state = PEER_CONNECTED;
clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
// Add FD to epoll
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{

View File

@ -14,19 +14,15 @@
#include "malloc_or_die.h"
#include "json11/json11.hpp"
#include "osd_ops.h"
#include "msgr_op.h"
#include "timerfd_manager.h"
#include "ringloop.h"
#define OSD_OP_IN 0
#define OSD_OP_OUT 1
#include <ringloop.h>
#define CL_READ_HDR 1
#define CL_READ_DATA 2
#define CL_READ_REPLY_DATA 3
#define CL_WRITE_READY 1
#define CL_WRITE_REPLY 2
#define OSD_OP_INLINE_BUF_COUNT 16
#define PEER_CONNECTING 1
#define PEER_CONNECTED 2
@ -37,164 +33,6 @@
#define DEFAULT_OSD_PING_TIMEOUT 5
#define DEFAULT_BITMAP_GRANULARITY 4096
// Kind of a vector with small-list-optimisation
struct osd_op_buf_list_t
{
int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
iovec *buf = NULL;
iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
inline osd_op_buf_list_t()
{
buf = inline_buf;
}
inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
{
buf = inline_buf;
append(other);
}
inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
{
reset();
append(other);
return *this;
}
inline ~osd_op_buf_list_t()
{
if (buf && buf != inline_buf)
{
free(buf);
}
}
inline void reset()
{
count = 0;
done = 0;
}
inline iovec* get_iovec()
{
return buf + done;
}
inline int get_size()
{
return count - done;
}
inline void append(const osd_op_buf_list_t & other)
{
if (count+other.count > alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec) * old);
}
else
{
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
for (int i = 0; i < other.count; i++)
{
buf[count++] = other.buf[i];
}
}
inline void push_back(void *nbuf, size_t len)
{
if (count >= alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec)*old);
}
else
{
alloc = alloc < 16 ? 16 : (alloc+16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
buf[count++] = { .iov_base = nbuf, .iov_len = len };
}
inline void eat(int result)
{
while (result > 0 && done < count)
{
iovec & iov = buf[done];
if (iov.iov_len <= result)
{
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base += result;
break;
}
}
}
};
struct blockstore_op_t;
struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin = { 0 }, tv_end = { 0 };
uint64_t op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
// bitmap, bitmap_len, bmp_data are only meaningful for reads
void *bitmap = NULL;
unsigned bitmap_len = 0;
unsigned bmp_data = 0;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;
osd_op_buf_list_t iov;
~osd_op_t();
};
struct osd_client_t
{
int refs = 0;
@ -233,6 +71,12 @@ struct osd_client_t
int write_state = 0;
std::vector<iovec> send_list, next_send_list;
std::vector<osd_op_t*> outbox, next_outbox;
~osd_client_t()
{
free(in_buf);
in_buf = NULL;
}
};
struct osd_wanted_peer_t
@ -257,12 +101,9 @@ struct osd_op_stats_t
struct osd_messenger_t
{
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
protected:
int keepalive_timer_id = -1;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
// FIXME: make receive_buffer_size configurable
int receive_buffer_size = 64*1024;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
@ -272,19 +113,22 @@ struct osd_messenger_t
int log_level = 0;
bool use_sync_send_recv = false;
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t*> clients;
std::vector<int> read_ready_clients;
std::vector<int> write_ready_clients;
std::vector<std::function<void()>> set_immediate;
public:
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t*> clients;
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
// op statistics
osd_op_stats_t stats;
public:
void init();
void parse_config(const json11::Json & config);
void connect_peer(uint64_t osd_num, json11::Json peer_state);
@ -292,7 +136,6 @@ public:
void outbox_push(osd_op_t *cur_op);
std::function<void(osd_op_t*)> exec_op;
std::function<void(osd_num_t)> repeer_pgs;
void handle_peer_epoll(int peer_fd, int epoll_events);
void read_requests();
void send_replies();
void accept_connections(int listen_fd);
@ -301,6 +144,7 @@ public:
protected:
void try_connect_peer(uint64_t osd_num);
void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
void handle_peer_epoll(int peer_fd, int epoll_events);
void handle_connect_epoll(int peer_fd);
void on_connect_peer(osd_num_t peer_osd, int peer_fd);
void check_peer_config(osd_client_t *cl);

1
src/mock/build.sh Normal file
View File

@ -0,0 +1 @@
g++ -D__MOCK__ -fsanitize=address -g -Wno-pointer-arith pg_states.cpp osd_ops.cpp test_cluster_client.cpp cluster_client.cpp msgr_op.cpp msgr_stop.cpp mock/messenger.cpp etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp -I mock -I . -I ..; ./a.out

44
src/mock/messenger.cpp Normal file
View File

@ -0,0 +1,44 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h>
#include <stdexcept>
#include <assert.h>
#include "messenger.h"
void osd_messenger_t::init()
{
}
osd_messenger_t::~osd_messenger_t()
{
while (clients.size() > 0)
{
stop_client(clients.begin()->first, true);
}
}
void osd_messenger_t::outbox_push(osd_op_t *cur_op)
{
clients[cur_op->peer_fd]->sent_ops[cur_op->req.hdr.id] = cur_op;
}
void osd_messenger_t::parse_config(const json11::Json & config)
{
}
void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
{
wanted_peers[peer_osd] = (osd_wanted_peer_t){
.port = 1,
};
}
void osd_messenger_t::read_requests()
{
}
void osd_messenger_t::send_replies()
{
}

25
src/mock/ringloop.h Normal file
View File

@ -0,0 +1,25 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <functional>
struct ring_consumer_t
{
std::function<void(void)> loop;
};
class ring_loop_t
{
public:
void register_consumer(ring_consumer_t *consumer)
{
}
void unregister_consumer(ring_consumer_t *consumer)
{
}
void submit()
{
}
};

22
src/msgr_op.cpp Normal file
View File

@ -0,0 +1,22 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <assert.h>
#include "msgr_op.h"
osd_op_t::~osd_op_t()
{
assert(!bs_op);
assert(!op_data);
if (rmw_buf)
{
free(rmw_buf);
}
if (buf)
{
// Note: reusing osd_op_t WILL currently lead to memory leaks
// So we don't reuse it, but free it every time
free(buf);
}
}

175
src/msgr_op.h Normal file
View File

@ -0,0 +1,175 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <sys/uio.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "osd_ops.h"
#define OSD_OP_IN 0
#define OSD_OP_OUT 1
#define OSD_OP_INLINE_BUF_COUNT 16
// Kind of a vector with small-list-optimisation
struct osd_op_buf_list_t
{
int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
iovec *buf = NULL;
iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
inline osd_op_buf_list_t()
{
buf = inline_buf;
}
inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
{
buf = inline_buf;
append(other);
}
inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
{
reset();
append(other);
return *this;
}
inline ~osd_op_buf_list_t()
{
if (buf && buf != inline_buf)
{
free(buf);
}
}
inline void reset()
{
count = 0;
done = 0;
}
inline iovec* get_iovec()
{
return buf + done;
}
inline int get_size()
{
return count - done;
}
inline void append(const osd_op_buf_list_t & other)
{
if (count+other.count > alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec) * old);
}
else
{
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
for (int i = 0; i < other.count; i++)
{
buf[count++] = other.buf[i];
}
}
inline void push_back(void *nbuf, size_t len)
{
if (count >= alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec)*old);
}
else
{
alloc = alloc < 16 ? 16 : (alloc+16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
buf[count++] = { .iov_base = nbuf, .iov_len = len };
}
inline void eat(int result)
{
while (result > 0 && done < count)
{
iovec & iov = buf[done];
if (iov.iov_len <= result)
{
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base += result;
break;
}
}
}
};
struct blockstore_op_t;
struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin = { 0 }, tv_end = { 0 };
uint64_t op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
// bitmap, bitmap_len, bmp_data are only meaningful for reads
void *bitmap = NULL;
unsigned bitmap_len = 0;
unsigned bmp_data = 0;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;
osd_op_buf_list_t iov;
~osd_op_t();
};

View File

@ -232,6 +232,15 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
}
cl->read_remaining = cur_op->req.sec_stab.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
if (cur_op->req.sec_read_bmp.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_read_bmp.len);
cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_read_bmp.len);
}
cl->read_remaining = cur_op->req.sec_read_bmp.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_READ)
{
cl->read_remaining = 0;
@ -277,17 +286,19 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
{
// Read data. In this case we assume that the buffer is preallocated by the caller (!)
unsigned bmp_len = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->reply.sec_rw.attr_len : op->reply.rw.bitmap_len);
if (op->reply.hdr.retval != (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len) ||
bmp_len > op->bitmap_len)
unsigned expected_size = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len);
if (op->reply.hdr.retval >= 0 && (op->reply.hdr.retval != expected_size || bmp_len > op->bitmap_len))
{
// Check reply length to not overflow the buffer
printf("Client %d read reply of different length\n", cl->peer_fd);
printf("Client %d read reply of different length: expected %u+%u, got %ld+%u\n",
cl->peer_fd, expected_size, op->bitmap_len, op->reply.hdr.retval, bmp_len);
cl->sent_ops[op->req.hdr.id] = op;
stop_client(cl->peer_fd);
return false;
}
if (bmp_len > 0)
if (op->reply.hdr.retval >= 0 && bmp_len > 0)
{
assert(op->bitmap);
cl->recv_list.push_back(op->bitmap, bmp_len);
}
if (op->reply.hdr.retval > 0)
@ -314,6 +325,16 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
cl->recv_list.push_back(op->buf, cl->read_remaining);
}
else if (op->reply.hdr.opcode == OSD_OP_SEC_READ_BMP && op->reply.hdr.retval > 0)
{
assert(!op->iov.count);
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
cl->recv_list.push_back(op->buf, cl->read_remaining);
}
else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
{
assert(!op->iov.count);

View File

@ -87,6 +87,14 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
to_outbox.push_back(NULL);
}
}
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
if (cur_op->op_type == OSD_OP_IN && cur_op->reply.hdr.retval > 0)
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->reply.hdr.retval });
else if (cur_op->op_type == OSD_OP_OUT && cur_op->req.sec_read_bmp.len > 0)
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->req.sec_read_bmp.len });
to_outbox.push_back(NULL);
}
if (cur_op->op_type == OSD_OP_IN)
{
// To free it later
@ -203,7 +211,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
cl->refs--;
if (cl->peer_state == PEER_STOPPED)
{
if (!cl->refs)
if (cl->refs <= 0)
{
delete cl;
}

137
src/msgr_stop.cpp Normal file
View File

@ -0,0 +1,137 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h>
#include <assert.h>
#include "messenger.h"
void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
{
std::vector<osd_op_t*> cancel_ops;
cancel_ops.resize(cl->sent_ops.size());
int i = 0;
for (auto p: cl->sent_ops)
{
cancel_ops[i++] = p.second;
}
cl->sent_ops.clear();
cl->outbox.clear();
for (auto op: cancel_ops)
{
cancel_op(op);
}
}
void osd_messenger_t::cancel_op(osd_op_t *op)
{
if (op->op_type == OSD_OP_OUT)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = -EPIPE;
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
}
else
{
// This function is only called in stop_client(), so it's fine to destroy the operation
delete op;
}
}
void osd_messenger_t::stop_client(int peer_fd, bool force)
{
assert(peer_fd != 0);
auto it = clients.find(peer_fd);
if (it == clients.end())
{
return;
}
osd_client_t *cl = it->second;
if (cl->peer_state == PEER_CONNECTING && !force || cl->peer_state == PEER_STOPPED)
{
return;
}
if (log_level > 0)
{
if (cl->osd_num)
{
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
}
else
{
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
// First set state to STOPPED so another stop_client() call doesn't try to free it again
cl->refs++;
cl->peer_state = PEER_STOPPED;
if (cl->osd_num)
{
// ...and forget OSD peer
osd_peer_fds.erase(cl->osd_num);
}
#ifndef __MOCK__
// Then remove FD from the eventloop so we don't accidentally read something
tfd->set_fd_handler(peer_fd, false, NULL);
if (cl->connect_timeout_id >= 0)
{
tfd->clear_timer(cl->connect_timeout_id);
cl->connect_timeout_id = -1;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
if (*rit == peer_fd)
{
read_ready_clients.erase(rit);
break;
}
}
for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
{
if (*wit == peer_fd)
{
write_ready_clients.erase(wit);
break;
}
}
#endif
if (cl->osd_num)
{
// Then repeer PGs because cancel_op() callbacks can try to perform
// some actions and we need correct PG states to not do something silly
repeer_pgs(cl->osd_num);
}
// Then cancel all operations
if (cl->read_op)
{
if (!cl->read_op->callback)
{
delete cl->read_op;
}
cl->read_op = NULL;
}
if (cl->osd_num)
{
// Cancel outbound operations
cancel_osd_ops(cl);
}
#ifndef __MOCK__
// And close the FD only when everything is done
// ...because peer_fd number can get reused after close()
close(peer_fd);
#endif
// Find the item again because it can be invalidated at this point
it = clients.find(peer_fd);
if (it != clients.end())
{
clients.erase(it);
}
cl->refs--;
if (cl->refs <= 0)
{
delete cl;
}
}

View File

@ -8,6 +8,7 @@
#include <arpa/inet.h>
#include "osd.h"
#include "http_client.h"
osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop)
{
@ -19,17 +20,22 @@ osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
zero_buffer_size = 1<<20;
zero_buffer = malloc_or_die(zero_buffer_size);
memset(zero_buffer, 0, zero_buffer_size);
this->config = config;
this->ringloop = ringloop;
epmgr = new epoll_manager_t(ringloop);
// FIXME: Use timerfd_interval based directly on io_uring
this->tfd = epmgr->tfd;
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
this->bs = new blockstore_t(config, ringloop);
this->bs = new blockstore_t(config, ringloop, tfd);
parse_config(config);
epmgr = new epoll_manager_t(ringloop);
this->tfd = epmgr->tfd;
this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{
print_stats();
@ -57,6 +63,7 @@ osd_t::~osd_t()
delete epmgr;
delete bs;
close(listen_fd);
free(zero_buffer);
}
void osd_t::parse_config(blockstore_config_t & config)
@ -198,6 +205,8 @@ void osd_t::exec_op(osd_op_t *cur_op)
delete cur_op;
return;
}
// Clear the reply buffer
memset(cur_op->reply.buf, 0, OSD_PACKET_SIZE);
inflight_ops++;
if (cur_op->req.hdr.magic != SECONDARY_OSD_OP_MAGIC ||
cur_op->req.hdr.opcode < OSD_OP_MIN || cur_op->req.hdr.opcode > OSD_OP_MAX ||
@ -228,6 +237,7 @@ void osd_t::exec_op(osd_op_t *cur_op)
cur_op->req.hdr.opcode != OSD_OP_SEC_READ &&
cur_op->req.hdr.opcode != OSD_OP_SEC_LIST &&
cur_op->req.hdr.opcode != OSD_OP_READ &&
cur_op->req.hdr.opcode != OSD_OP_SEC_READ_BMP &&
cur_op->req.hdr.opcode != OSD_OP_SHOW_CONFIG)
{
// Readonly mode

View File

@ -66,6 +66,28 @@ struct inode_stats_t
uint64_t op_bytes[3] = { 0 };
};
struct bitmap_request_t
{
osd_num_t osd_num;
object_id oid;
uint64_t version;
void *bmp_buf;
};
inline bool operator < (const bitmap_request_t & a, const bitmap_request_t & b)
{
return a.osd_num < b.osd_num || a.osd_num == b.osd_num && a.oid < b.oid;
}
struct osd_chain_read_t
{
int chain_pos;
inode_t inode;
uint32_t offset, len;
};
struct osd_rmw_stripe_t;
class osd_t
{
// config
@ -126,6 +148,8 @@ class osd_t
bool stopping = false;
int inflight_ops = 0;
blockstore_t *bs;
void *zero_buffer = NULL;
uint64_t zero_buffer_size = 0;
uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size;
ring_loop_t *ringloop;
timerfd_manager_t *tfd = NULL;
@ -147,7 +171,7 @@ class osd_t
void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
void on_change_etcd_state_hook(json11::Json::object & changes);
void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
void on_load_config_hook(json11::Json::object & changes);
json11::Json on_load_pgs_checks_hook();
void on_load_pgs_hook(bool success);
@ -210,23 +234,37 @@ class osd_t
void continue_primary_del(osd_op_t *cur_op);
bool check_write_queue(osd_op_t *cur_op, pg_t & pg);
void remove_object_from_state(object_id & oid, pg_osd_set_state_t *object_state, pg_t &pg);
void free_object_state(pg_t & pg, pg_osd_set_state_t **object_state);
bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
void handle_primary_bs_subop(osd_op_t *subop);
void add_bs_subop_stats(osd_op_t *subop);
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
int submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read);
void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
void submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count);
void submit_primary_sync_subops(osd_op_t *cur_op);
int submit_primary_sync_subops(osd_op_t *cur_op);
void submit_primary_stab_subops(osd_op_t *cur_op);
uint64_t* get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state);
void continue_chained_read(osd_op_t *cur_op);
int submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op);
void send_chained_read_results(pg_t & pg, osd_op_t *cur_op);
std::vector<osd_chain_read_t> collect_chained_read_requests(osd_op_t *cur_op);
int collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests);
int submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg);
int read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state);
inline pg_num_t map_to_pg(object_id oid, uint64_t pg_stripe_size)
{
uint64_t pg_count = pg_counts[INODE_POOL(oid.inode)];
if (!pg_count)
pg_count = 1;
return (oid.inode + oid.stripe / pg_stripe_size) % pg_count + 1;
return (oid.stripe / pg_stripe_size) % pg_count + 1;
}
public:

View File

@ -4,6 +4,7 @@
#include "osd.h"
#include "base64.h"
#include "etcd_state_client.h"
#include "http_client.h"
#include "osd_rmw.h"
// Startup sequence:
@ -64,7 +65,7 @@ void osd_t::init_cluster()
st_cli.log_level = log_level;
st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
st_cli.load_pgs_checks_hook = [this]() { return on_load_pgs_checks_hook(); };
st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
@ -182,14 +183,38 @@ void osd_t::report_statistics()
// Report space usage statistics as a whole
// Maybe we'll report it using deltas if we tune for a lot of inodes at some point
json11::Json::object inode_space;
json11::Json::object last_stat;
pool_id_t last_pool = 0;
for (auto kv: bs->get_inode_space_stats())
{
inode_space[std::to_string(kv.first)] = kv.second;
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
inode_space[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = pool_id;
}
last_stat[std::to_string(only_inode_num)] = kv.second;
}
if (last_pool)
inode_space[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = 0;
json11::Json::object inode_ops;
for (auto kv: inode_stats)
{
inode_ops[std::to_string(kv.first)] = json11::Json::object {
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
inode_ops[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = pool_id;
}
last_stat[std::to_string(only_inode_num)] = json11::Json::object {
{ "read", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_READ] },
{ "usec", kv.second.op_sum[INODE_STATS_READ] },
@ -207,20 +232,28 @@ void osd_t::report_statistics()
} },
};
}
json11::Json::array txn = { json11::Json::object {
if (last_pool)
inode_ops[std::to_string(last_pool)] = last_stat;
json11::Json::array txn = {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
{ "value", base64_encode(get_statistics().dump()) },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/space/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_space).dump()) },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/inodestats/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_ops).dump()) },
} },
} };
},
};
for (auto & p: pgs)
{
auto & pg = p.second;
@ -271,7 +304,7 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
}
}
void osd_t::on_change_etcd_state_hook(json11::Json::object & changes)
void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
{
// FIXME apply config changes in runtime (maybe, some)
if (run_primary)
@ -593,7 +626,7 @@ void osd_t::apply_pg_config()
}
if (currently_taken)
{
if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING))
if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING | PG_REPEERING))
{
if (pg_it->second.target_set == pg_cfg.target_set)
{

View File

@ -149,10 +149,14 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
{
continue_primary_write(op);
}
if (pg.inflight == 0 && (pg.state & PG_STOPPING))
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
{
finish_stop_pg(pg);
}
else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
}
}
@ -231,7 +235,8 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
{
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
// Don't try to "recover" misplaced objects if "recovery" would make them degraded
if ((pg_it->second.state & (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
{
for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++)
{

View File

@ -20,4 +20,5 @@ const char* osd_op_names[] = {
"primary_sync",
"primary_delete",
"ping",
"sec_read_bmp",
};

View File

@ -28,12 +28,14 @@
#define OSD_OP_SYNC 13
#define OSD_OP_DELETE 14
#define OSD_OP_PING 15
#define OSD_OP_MAX 15
#define OSD_OP_SEC_READ_BMP 16
#define OSD_OP_MAX 16
// Alignment & limit for read/write operations
#ifndef MEM_ALIGNMENT
#define MEM_ALIGNMENT 512
#endif
#define OSD_RW_MAX 64*1024*1024
#define OSD_PROTOCOL_VERSION 1
// common request and reply headers
struct __attribute__((__packed__)) osd_op_header_t
@ -59,7 +61,7 @@ struct __attribute__((__packed__)) osd_reply_header_t
};
// read or write to the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_rw_t
struct __attribute__((__packed__)) osd_op_sec_rw_t
{
osd_op_header_t header;
// object
@ -76,7 +78,7 @@ struct __attribute__((__packed__)) osd_op_secondary_rw_t
uint32_t pad0;
};
struct __attribute__((__packed__)) osd_reply_secondary_rw_t
struct __attribute__((__packed__)) osd_reply_sec_rw_t
{
osd_reply_header_t header;
// for reads and writes: assigned or read version number
@ -87,7 +89,7 @@ struct __attribute__((__packed__)) osd_reply_secondary_rw_t
};
// delete object on the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_del_t
struct __attribute__((__packed__)) osd_op_sec_del_t
{
osd_op_header_t header;
// object
@ -96,37 +98,51 @@ struct __attribute__((__packed__)) osd_op_secondary_del_t
uint64_t version;
};
struct __attribute__((__packed__)) osd_reply_secondary_del_t
struct __attribute__((__packed__)) osd_reply_sec_del_t
{
osd_reply_header_t header;
uint64_t version;
};
// sync to the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_sync_t
struct __attribute__((__packed__)) osd_op_sec_sync_t
{
osd_op_header_t header;
};
struct __attribute__((__packed__)) osd_reply_secondary_sync_t
struct __attribute__((__packed__)) osd_reply_sec_sync_t
{
osd_reply_header_t header;
};
// stabilize or rollback objects on the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_stabilize_t
struct __attribute__((__packed__)) osd_op_sec_stab_t
{
osd_op_header_t header;
// obj_ver_id array length in bytes
uint64_t len;
};
typedef osd_op_secondary_stabilize_t osd_op_secondary_rollback_t;
typedef osd_op_sec_stab_t osd_op_sec_rollback_t;
struct __attribute__((__packed__)) osd_reply_secondary_stabilize_t
struct __attribute__((__packed__)) osd_reply_sec_stab_t
{
osd_reply_header_t header;
};
typedef osd_reply_secondary_stabilize_t osd_reply_secondary_rollback_t;
typedef osd_reply_sec_stab_t osd_reply_sec_rollback_t;
// bulk read bitmaps from a secondary OSD
struct __attribute__((__packed__)) osd_op_sec_read_bmp_t
{
osd_op_header_t header;
// obj_ver_id array length in bytes
uint64_t len;
};
struct __attribute__((__packed__)) osd_reply_sec_read_bmp_t
{
// retval is payload length in bytes. payload is {version,bitmap}[]
osd_reply_header_t header;
};
// show configuration
struct __attribute__((__packed__)) osd_op_show_config_t
@ -140,7 +156,7 @@ struct __attribute__((__packed__)) osd_reply_show_config_t
};
// list objects on replica
struct __attribute__((__packed__)) osd_op_secondary_list_t
struct __attribute__((__packed__)) osd_op_sec_list_t
{
osd_op_header_t header;
// placement group total number and total count
@ -151,7 +167,7 @@ struct __attribute__((__packed__)) osd_op_secondary_list_t
uint64_t min_inode, max_inode;
};
struct __attribute__((__packed__)) osd_reply_secondary_list_t
struct __attribute__((__packed__)) osd_reply_sec_list_t
{
osd_reply_header_t header;
// stable object version count. header.retval = total object version count
@ -169,6 +185,10 @@ struct __attribute__((__packed__)) osd_op_rw_t
uint64_t offset;
// length
uint32_t len;
// flags (for future)
uint32_t flags;
// inode metadata revision
uint64_t meta_revision;
};
struct __attribute__((__packed__)) osd_reply_rw_t
@ -194,11 +214,12 @@ struct __attribute__((__packed__)) osd_reply_sync_t
union osd_any_op_t
{
osd_op_header_t hdr;
osd_op_secondary_rw_t sec_rw;
osd_op_secondary_del_t sec_del;
osd_op_secondary_sync_t sec_sync;
osd_op_secondary_stabilize_t sec_stab;
osd_op_secondary_list_t sec_list;
osd_op_sec_rw_t sec_rw;
osd_op_sec_del_t sec_del;
osd_op_sec_sync_t sec_sync;
osd_op_sec_stab_t sec_stab;
osd_op_sec_read_bmp_t sec_read_bmp;
osd_op_sec_list_t sec_list;
osd_op_show_config_t show_conf;
osd_op_rw_t rw;
osd_op_sync_t sync;
@ -208,11 +229,12 @@ union osd_any_op_t
union osd_any_reply_t
{
osd_reply_header_t hdr;
osd_reply_secondary_rw_t sec_rw;
osd_reply_secondary_del_t sec_del;
osd_reply_secondary_sync_t sec_sync;
osd_reply_secondary_stabilize_t sec_stab;
osd_reply_secondary_list_t sec_list;
osd_reply_sec_rw_t sec_rw;
osd_reply_sec_del_t sec_del;
osd_reply_sec_sync_t sec_sync;
osd_reply_sec_stab_t sec_stab;
osd_reply_sec_read_bmp_t sec_read_bmp;
osd_reply_sec_list_t sec_list;
osd_reply_show_config_t show_conf;
osd_reply_rw_t rw;
osd_reply_sync_t sync;

View File

@ -77,10 +77,11 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
// Re-peer affected PGs
for (auto & p: pgs)
{
auto & pg = p.second;
bool repeer = false;
if (p.second.state & (PG_PEERING | PG_ACTIVE | PG_INCOMPLETE))
if (pg.state & (PG_PEERING | PG_ACTIVE | PG_INCOMPLETE))
{
for (osd_num_t pg_osd: p.second.all_peers)
for (osd_num_t pg_osd: pg.all_peers)
{
if (pg_osd == peer_osd)
{
@ -91,8 +92,17 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
if (repeer)
{
// Repeer this pg
printf("[PG %u/%u] Repeer because of OSD %lu\n", p.second.pool_id, p.second.pg_num, peer_osd);
start_pg_peering(p.second);
printf("[PG %u/%u] Repeer because of OSD %lu\n", pg.pool_id, pg.pg_num, peer_osd);
if (!(pg.state & (PG_ACTIVE | PG_REPEERING)) || pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
else
{
// Stop accepting new operations, wait for current ones to finish or fail
pg.state = pg.state & ~PG_ACTIVE | PG_REPEERING;
report_pg_state(pg);
}
}
}
}
@ -334,9 +344,10 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
{
// FIXME: Mark peer as failed and don't reconnect immediately after dropping the connection
printf("Failed to sync OSD %lu: %ld (%s), disconnecting peer\n", role_osd, op->reply.hdr.retval, strerror(-op->reply.hdr.retval));
int fail_fd = op->peer_fd;
ps->list_ops.erase(role_osd);
c_cli.stop_client(op->peer_fd);
delete op;
c_cli.stop_client(fail_fd);
return;
}
delete op;
@ -413,9 +424,10 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
if (op->reply.hdr.retval < 0)
{
printf("Failed to get object list from OSD %lu (retval=%ld), disconnecting peer\n", role_osd, op->reply.hdr.retval);
int fail_fd = op->peer_fd;
ps->list_ops.erase(role_osd);
c_cli.stop_client(op->peer_fd);
delete op;
c_cli.stop_client(fail_fd);
return;
}
printf(
@ -484,15 +496,13 @@ bool osd_t::stop_pg(pg_t & pg)
{
return false;
}
if (!(pg.state & PG_ACTIVE))
if (!(pg.state & (PG_ACTIVE | PG_REPEERING)))
{
finish_stop_pg(pg);
return true;
}
pg.state = pg.state & ~PG_ACTIVE | PG_STOPPING;
if (pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all PG's unstable writes or wait for it to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
pg.state = pg.state & ~PG_ACTIVE & ~PG_REPEERING | PG_STOPPING;
if (pg.inflight == 0 && !pg.flush_batch)
{
finish_stop_pg(pg);
}

View File

@ -430,12 +430,13 @@ void pg_t::calc_object_states(int log_level)
void pg_t::print_state()
{
printf(
"[PG %u/%u] is %s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pool_id, pg_num,
"[PG %u/%u] is %s%s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pool_id, pg_num,
(state & PG_STARTING) ? "starting" : "",
(state & PG_OFFLINE) ? "offline" : "",
(state & PG_PEERING) ? "peering" : "",
(state & PG_INCOMPLETE) ? "incomplete" : "",
(state & PG_ACTIVE) ? "active" : "",
(state & PG_REPEERING) ? "repeering" : "",
(state & PG_STOPPING) ? "stopping" : "",
(state & PG_DEGRADED) ? " + degraded" : "",
(state & PG_HAS_INCOMPLETE) ? " + has_incomplete" : "",

View File

@ -19,7 +19,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
// Our EC scheme stores data in fixed chunks equal to (K*block size)
// K = (pg_size-parity_chunks) in case of EC/XOR, or 1 for replicated pools
pool_id_t pool_id = INODE_POOL(cur_op->req.rw.inode);
// FIXME: We have to access pool config here, so make sure that it doesn't change while its PGs are active...
// Note: We read pool config here, so we must NOT change it when PGs are active
auto pool_cfg_it = st_cli.pool_config.find(pool_id);
if (pool_cfg_it == st_cli.pool_config.end())
{
@ -28,6 +28,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
return false;
}
auto & pool_cfg = pool_cfg_it->second;
// FIXME: op_data->pg_data_size can probably be removed (there's pg.pg_data_size)
uint64_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
uint64_t pg_block_size = bs_block_size * pg_data_size;
object_id oid = {
@ -35,7 +36,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
// oid.stripe = starting offset of the parity stripe
.stripe = (cur_op->req.rw.offset/pg_block_size)*pg_block_size,
};
pg_num_t pg_num = (cur_op->req.rw.inode + oid.stripe/pool_cfg.pg_stripe_size) % pg_counts[pool_id] + 1;
pg_num_t pg_num = (oid.stripe/pool_cfg.pg_stripe_size) % pg_counts[pool_id] + 1; // like map_to_pg()
auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
if (pg_it == pgs.end() || !(pg_it->second.state & PG_ACTIVE))
{
@ -52,26 +53,87 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
return false;
}
int stripe_count = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size);
int chain_size = 0;
if (cur_op->req.hdr.opcode == OSD_OP_READ && cur_op->req.rw.meta_revision > 0)
{
// Chained read
auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode);
if (inode_it->second.mod_revision != cur_op->req.rw.meta_revision)
{
// Client view of the metadata differs from OSD's view
// Operation can't be completed correctly, client should retry later
finish_op(cur_op, -EPIPE);
return false;
}
// Find parents from the same pool. Optimized reads only work within pools
while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id &&
INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id)
{
chain_size++;
inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
}
if (chain_size)
{
// Add the original inode
chain_size++;
}
}
osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc_or_die(
1, sizeof(osd_primary_op_data_t) + (clean_entry_bitmap_size + sizeof(osd_rmw_stripe_t)) * stripe_count
// Allocate:
// - op_data
1, sizeof(osd_primary_op_data_t) +
// - stripes
// - resulting bitmap buffers
stripe_count * (clean_entry_bitmap_size + sizeof(osd_rmw_stripe_t)) +
chain_size * (
// - copy of the chain
sizeof(inode_t) +
// - bitmap buffers for chained read
stripe_count * clean_entry_bitmap_size +
// - 'missing' flags for chained reads
(pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size)
)
);
void *data_buf = ((void*)op_data) + sizeof(osd_primary_op_data_t);
op_data->pg_num = pg_num;
op_data->oid = oid;
op_data->stripes = ((osd_rmw_stripe_t*)(op_data+1));
op_data->stripes = (osd_rmw_stripe_t*)data_buf;
data_buf += sizeof(osd_rmw_stripe_t) * stripe_count;
op_data->scheme = pool_cfg.scheme;
op_data->pg_data_size = pg_data_size;
op_data->pg_size = pg_it->second.pg_size;
cur_op->op_data = op_data;
split_stripes(pg_data_size, bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes);
// Allocate bitmaps along with stripes to avoid extra allocations and fragmentation
for (int i = 0; i < stripe_count; i++)
{
op_data->stripes[i].bmp_buf = (void*)(op_data->stripes+stripe_count) + clean_entry_bitmap_size*i;
op_data->stripes[i].bmp_buf = data_buf;
data_buf += clean_entry_bitmap_size;
}
op_data->chain_size = chain_size;
if (chain_size > 0)
{
op_data->read_chain = (inode_t*)data_buf;
data_buf += sizeof(inode_t) * chain_size;
op_data->snapshot_bitmaps = data_buf;
data_buf += chain_size * stripe_count * clean_entry_bitmap_size;
op_data->missing_flags = (uint8_t*)data_buf;
data_buf += chain_size * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size);
// Copy chain
int chain_num = 0;
op_data->read_chain[chain_num++] = cur_op->req.rw.inode;
auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode);
while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id)
{
op_data->read_chain[chain_num++] = inode_it->second.parent_id;
inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
}
}
pg_it->second.inflight++;
return true;
}
static uint64_t* get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state)
uint64_t* osd_t::get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state)
{
if (!(pg.state & (PG_HAS_INCOMPLETE | PG_HAS_DEGRADED | PG_HAS_MISPLACED)))
{
@ -106,10 +168,17 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
{
return;
}
cur_op->reply.rw.bitmap_len = 0;
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2;
if (op_data->chain_size)
{
continue_chained_read(cur_op);
return;
}
if (op_data->st == 1)
goto resume_1;
else if (op_data->st == 2)
goto resume_2;
cur_op->reply.rw.bitmap_len = 0;
{
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
for (int role = 0; role < op_data->pg_data_size; role++)
@ -124,8 +193,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
{
// Fast happy-path
cur_op->buf = alloc_read_buffer(op_data->stripes, op_data->pg_data_size, 0);
submit_primary_subops(SUBMIT_READ, op_data->target_ver,
(op_data->scheme == POOL_SCHEME_REPLICATED ? pg.pg_size : op_data->pg_data_size), pg.cur_set.data(), cur_op);
submit_primary_subops(SUBMIT_READ, op_data->target_ver, pg.cur_set.data(), cur_op);
op_data->st = 1;
}
else
@ -142,7 +210,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
op_data->scheme = pg.scheme;
op_data->degraded = 1;
cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_size, 0);
submit_primary_subops(SUBMIT_READ, op_data->target_ver, pg.pg_size, cur_set, cur_op);
submit_primary_subops(SUBMIT_READ, op_data->target_ver, cur_set, cur_op);
op_data->st = 1;
}
}
@ -188,612 +256,6 @@ resume_2:
finish_op(cur_op, cur_op->req.rw.len);
}
bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
// Check if actions are pending for this object
auto act_it = pg.flush_actions.lower_bound((obj_piece_id_t){
.oid = op_data->oid,
.osd_num = 0,
});
if (act_it != pg.flush_actions.end() &&
act_it->first.oid.inode == op_data->oid.inode &&
(act_it->first.oid.stripe & ~STRIPE_MASK) == op_data->oid.stripe)
{
pg.write_queue.emplace(op_data->oid, cur_op);
return false;
}
// Check if there are other write requests to the same object
auto vo_it = pg.write_queue.find(op_data->oid);
if (vo_it != pg.write_queue.end())
{
op_data->st = 1;
pg.write_queue.emplace(op_data->oid, cur_op);
return false;
}
pg.write_queue.emplace(op_data->oid, cur_op);
return true;
}
void osd_t::continue_primary_write(osd_op_t *cur_op)
{
if (!cur_op->op_data && !prepare_primary_rw(cur_op))
{
return;
}
osd_primary_op_data_t *op_data = cur_op->op_data;
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2;
else if (op_data->st == 3) goto resume_3;
else if (op_data->st == 4) goto resume_4;
else if (op_data->st == 5) goto resume_5;
else if (op_data->st == 6) goto resume_6;
else if (op_data->st == 7) goto resume_7;
else if (op_data->st == 8) goto resume_8;
else if (op_data->st == 9) goto resume_9;
else if (op_data->st == 10) goto resume_10;
assert(op_data->st == 0);
if (!check_write_queue(cur_op, pg))
{
return;
}
resume_1:
// Determine blocks to read and write
// Missing chunks are allowed to be overwritten even in incomplete objects
// FIXME: Allow to do small writes to the old (degraded/misplaced) OSD set for lower performance impact
op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
// Simplified algorithm
op_data->stripes[0].write_start = op_data->stripes[0].req_start;
op_data->stripes[0].write_end = op_data->stripes[0].req_end;
op_data->stripes[0].write_buf = cur_op->buf;
op_data->stripes[0].bmp_buf = (void*)(op_data->stripes+1);
if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
op_data->stripes[0].write_end != bs_block_size))
{
// Object is degraded/misplaced and will be moved to <write_osd_set>
op_data->stripes[0].read_start = 0;
op_data->stripes[0].read_end = bs_block_size;
cur_op->rmw_buf = op_data->stripes[0].read_buf = memalign_or_die(MEM_ALIGNMENT, bs_block_size);
}
}
else
{
cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
if (!cur_op->rmw_buf)
{
// Refuse partial overwrite of an incomplete object
cur_op->reply.hdr.retval = -EINVAL;
goto continue_others;
}
}
// Read required blocks
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
resume_2:
op_data->st = 2;
return;
resume_3:
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
// Save version override for parallel reads
pg.ver_override[op_data->oid] = op_data->fact_ver;
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
// Set bitmap bits
bitmap_set(op_data->stripes[0].bmp_buf, op_data->stripes[0].write_start, op_data->stripes[0].write_end, bs_bitmap_granularity);
// Possibly copy new data from the request into the recovery buffer
if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
op_data->stripes[0].write_end != bs_block_size))
{
memcpy(
op_data->stripes[0].read_buf + op_data->stripes[0].req_start,
op_data->stripes[0].write_buf,
op_data->stripes[0].req_end - op_data->stripes[0].req_start
);
op_data->stripes[0].write_buf = op_data->stripes[0].read_buf;
op_data->stripes[0].write_start = 0;
op_data->stripes[0].write_end = bs_block_size;
}
}
else
{
// Recover missing stripes, calculate parity
if (pg.scheme == POOL_SCHEME_XOR)
{
calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
}
else if (pg.scheme == POOL_SCHEME_JERASURE)
{
calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
}
}
// Send writes
if ((op_data->fact_ver >> (64-PG_EPOCH_BITS)) < pg.epoch)
{
op_data->target_ver = ((uint64_t)pg.epoch << (64-PG_EPOCH_BITS)) | 1;
}
else
{
if ((op_data->fact_ver & (1ul<<(64-PG_EPOCH_BITS) - 1)) == (1ul<<(64-PG_EPOCH_BITS) - 1))
{
assert(pg.epoch != ((1ul << PG_EPOCH_BITS)-1));
pg.epoch++;
}
op_data->target_ver = op_data->fact_ver + 1;
}
if (pg.epoch > pg.reported_epoch)
{
// Report newer epoch before writing
// FIXME: We may report only one PG state here...
this->pg_state_dirty.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
pg.history_changed = true;
report_pg_states();
resume_10:
if (pg.epoch > pg.reported_epoch)
{
op_data->st = 10;
return;
}
}
submit_primary_subops(SUBMIT_WRITE, op_data->target_ver, pg.pg_size, pg.cur_set.data(), cur_op);
resume_4:
op_data->st = 4;
return;
resume_5:
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
resume_6:
resume_7:
if (!remember_unstable_write(cur_op, pg, pg.cur_loc_set, 6))
{
// FIXME: Check for immediate_commit == IMMEDIATE_SMALL
return;
}
if (op_data->fact_ver == 1)
{
// Object is created
pg.clean_count++;
pg.total_count++;
}
if (op_data->object_state)
{
{
int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
recovery_stat_count[0][recovery_type]++;
if (!recovery_stat_count[0][recovery_type])
{
recovery_stat_count[0][recovery_type]++;
recovery_stat_bytes[0][recovery_type] = 0;
}
for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
{
recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
}
}
// Any kind of a non-clean object can have extra chunks, because we don't record objects
// as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
if (immediate_commit != IMMEDIATE_ALL)
{
// We can't remove extra chunks yet if fsyncs are explicit, because
// new copies may not be committed to stable storage yet
// We can only remove extra chunks after a successful SYNC for this PG
for (auto & chunk: op_data->object_state->osd_set)
{
// Check is the same as in submit_primary_del_subops()
if (op_data->scheme == POOL_SCHEME_REPLICATED
? !contains_osd(pg.cur_set.data(), pg.pg_size, chunk.osd_num)
: (chunk.osd_num != pg.cur_set[chunk.role]))
{
pg.copies_to_delete_after_sync.push_back((obj_ver_osd_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | (op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role),
},
.version = op_data->fact_ver,
});
copies_to_delete_after_sync_count++;
}
}
}
else
{
submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set);
if (op_data->n_subops > 0)
{
resume_8:
op_data->st = 8;
return;
resume_9:
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
}
}
// Clear object state
remove_object_from_state(op_data->oid, op_data->object_state, pg);
pg.clean_count++;
}
cur_op->reply.hdr.retval = cur_op->req.rw.len;
continue_others:
// Remove version override
pg.ver_override.erase(op_data->oid);
object_id oid = op_data->oid;
// Remove the operation from queue before calling finish_op so it doesn't see the completed operation in queue
auto next_it = pg.write_queue.find(oid);
if (next_it != pg.write_queue.end() && next_it->second == cur_op)
{
pg.write_queue.erase(next_it++);
}
// finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :)
finish_op(cur_op, cur_op->reply.hdr.retval);
// Continue other write operations to the same object
if (next_it != pg.write_queue.end() && next_it->first == oid)
{
osd_op_t *next_op = next_it->second;
continue_primary_write(next_op);
}
}
bool osd_t::remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == base_state)
{
goto resume_6;
}
else if (op_data->st == base_state+1)
{
goto resume_7;
}
// FIXME: Check for immediate_commit == IMMEDIATE_SMALL
if (immediate_commit == IMMEDIATE_ALL)
{
if (op_data->scheme != POOL_SCHEME_REPLICATED)
{
// Send STABILIZE ops immediately
op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
op_data->unstable_writes = new obj_ver_id[loc_set.size()];
{
int last_start = 0;
for (auto & chunk: loc_set)
{
op_data->unstable_writes[last_start] = (obj_ver_id){
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | chunk.role,
},
.version = op_data->fact_ver,
};
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = chunk.osd_num,
.start = last_start,
.len = 1,
});
last_start++;
}
}
submit_primary_stab_subops(cur_op);
resume_6:
op_data->st = 6;
return false;
resume_7:
// FIXME: Free those in the destructor?
delete op_data->unstable_write_osds;
delete[] op_data->unstable_writes;
op_data->unstable_writes = NULL;
op_data->unstable_write_osds = NULL;
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return false;
}
}
}
else
{
if (op_data->scheme != POOL_SCHEME_REPLICATED)
{
// Remember version as unstable for EC/XOR
for (auto & chunk: loc_set)
{
this->dirty_osds.insert(chunk.osd_num);
this->unstable_writes[(osd_object_id_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | chunk.role,
},
}] = op_data->fact_ver;
}
}
else
{
// Only remember to sync OSDs for replicated pools
for (auto & chunk: loc_set)
{
this->dirty_osds.insert(chunk.osd_num);
}
}
// Remember PG as dirty to drop the connection when PG goes offline
// (this is required because of the "lazy sync")
auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end())
{
cl_it->second->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
}
dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
}
return true;
}
// Save and clear unstable_writes -> SYNC all -> STABLE all
void osd_t::continue_primary_sync(osd_op_t *cur_op)
{
if (!cur_op->op_data)
{
cur_op->op_data = (osd_primary_op_data_t*)calloc_or_die(1, sizeof(osd_primary_op_data_t));
}
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2;
else if (op_data->st == 3) goto resume_3;
else if (op_data->st == 4) goto resume_4;
else if (op_data->st == 5) goto resume_5;
else if (op_data->st == 6) goto resume_6;
else if (op_data->st == 7) goto resume_7;
else if (op_data->st == 8) goto resume_8;
assert(op_data->st == 0);
if (syncs_in_progress.size() > 0)
{
// Wait for previous syncs, if any
// FIXME: We may try to execute the current one in parallel, like in Blockstore, but I'm not sure if it matters at all
syncs_in_progress.push_back(cur_op);
op_data->st = 1;
resume_1:
return;
}
else
{
syncs_in_progress.push_back(cur_op);
}
resume_2:
if (dirty_osds.size() == 0)
{
// Nothing to sync
goto finish;
}
// Save and clear unstable_writes
// In theory it is possible to do in on a per-client basis, but this seems to be an unnecessary complication
// It would be cool not to copy these here at all, but someone has to deduplicate them by object IDs anyway
if (unstable_writes.size() > 0)
{
op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
op_data->unstable_writes = new obj_ver_id[this->unstable_writes.size()];
osd_num_t last_osd = 0;
int last_start = 0, last_end = 0;
for (auto it = this->unstable_writes.begin(); it != this->unstable_writes.end(); it++)
{
if (last_osd != it->first.osd_num)
{
if (last_osd != 0)
{
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = last_osd,
.start = last_start,
.len = last_end - last_start,
});
}
last_osd = it->first.osd_num;
last_start = last_end;
}
op_data->unstable_writes[last_end] = (obj_ver_id){
.oid = it->first.oid,
.version = it->second,
};
last_end++;
}
if (last_osd != 0)
{
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = last_osd,
.start = last_start,
.len = last_end - last_start,
});
}
this->unstable_writes.clear();
}
{
void *dirty_buf = malloc_or_die(
sizeof(pool_pg_num_t)*dirty_pgs.size() +
sizeof(osd_num_t)*dirty_osds.size() +
sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
);
op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
op_data->dirty_pg_count = dirty_pgs.size();
op_data->dirty_osd_count = dirty_osds.size();
if (this->copies_to_delete_after_sync_count)
{
op_data->copies_to_delete_count = 0;
op_data->copies_to_delete = (obj_ver_osd_t*)(op_data->dirty_osds + op_data->dirty_osd_count);
for (auto dirty_pg_num: dirty_pgs)
{
auto & pg = pgs.at(dirty_pg_num);
assert(pg.copies_to_delete_after_sync.size() <= this->copies_to_delete_after_sync_count);
memcpy(
op_data->copies_to_delete + op_data->copies_to_delete_count,
pg.copies_to_delete_after_sync.data(),
sizeof(obj_ver_osd_t)*pg.copies_to_delete_after_sync.size()
);
op_data->copies_to_delete_count += pg.copies_to_delete_after_sync.size();
this->copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
pg.copies_to_delete_after_sync.clear();
}
assert(this->copies_to_delete_after_sync_count == 0);
}
int dpg = 0;
for (auto dirty_pg_num: dirty_pgs)
{
pgs.at(dirty_pg_num).inflight++;
op_data->dirty_pgs[dpg++] = dirty_pg_num;
}
dirty_pgs.clear();
dpg = 0;
for (auto osd_num: dirty_osds)
{
op_data->dirty_osds[dpg++] = osd_num;
}
dirty_osds.clear();
}
if (immediate_commit != IMMEDIATE_ALL)
{
// SYNC
submit_primary_sync_subops(cur_op);
resume_3:
op_data->st = 3;
return;
resume_4:
if (op_data->errors > 0)
{
goto resume_6;
}
}
if (op_data->unstable_writes)
{
// Stabilize version sets, if any
submit_primary_stab_subops(cur_op);
resume_5:
op_data->st = 5;
return;
}
resume_6:
if (op_data->errors > 0)
{
// Return PGs and OSDs back into their dirty sets
for (int i = 0; i < op_data->dirty_pg_count; i++)
{
dirty_pgs.insert(op_data->dirty_pgs[i]);
}
for (int i = 0; i < op_data->dirty_osd_count; i++)
{
dirty_osds.insert(op_data->dirty_osds[i]);
}
if (op_data->unstable_writes)
{
// Return objects back into the unstable write set
for (auto unstable_osd: *(op_data->unstable_write_osds))
{
for (int i = 0; i < unstable_osd.len; i++)
{
// Except those from peered PGs
auto & w = op_data->unstable_writes[i];
pool_pg_num_t wpg = {
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
};
if (pgs.at(wpg).state & PG_ACTIVE)
{
uint64_t & dest = this->unstable_writes[(osd_object_id_t){
.osd_num = unstable_osd.osd_num,
.oid = w.oid,
}];
dest = dest < w.version ? w.version : dest;
dirty_pgs.insert(wpg);
}
}
}
}
if (op_data->copies_to_delete)
{
// Return 'copies to delete' back into respective PGs
for (int i = 0; i < op_data->copies_to_delete_count; i++)
{
auto & w = op_data->copies_to_delete[i];
auto & pg = pgs.at((pool_pg_num_t){
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
});
if (pg.state & PG_ACTIVE)
{
pg.copies_to_delete_after_sync.push_back(w);
copies_to_delete_after_sync_count++;
}
}
}
}
else if (op_data->copies_to_delete)
{
// Actually delete copies which we wanted to delete
submit_primary_del_batch(cur_op, op_data->copies_to_delete, op_data->copies_to_delete_count);
resume_7:
op_data->st = 7;
return;
resume_8:
if (op_data->errors > 0)
{
goto resume_6;
}
}
for (int i = 0; i < op_data->dirty_pg_count; i++)
{
auto & pg = pgs.at(op_data->dirty_pgs[i]);
pg.inflight--;
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all PG's unstable writes or wait for it to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
{
finish_stop_pg(pg);
}
}
// FIXME: Free those in the destructor?
free(op_data->dirty_pgs);
op_data->dirty_pgs = NULL;
op_data->dirty_osds = NULL;
if (op_data->unstable_writes)
{
delete op_data->unstable_write_osds;
delete[] op_data->unstable_writes;
op_data->unstable_writes = NULL;
op_data->unstable_write_osds = NULL;
}
if (op_data->errors > 0)
{
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
}
else
{
finish:
if (cur_op->peer_fd)
{
auto it = c_cli.clients.find(cur_op->peer_fd);
if (it != c_cli.clients.end())
it->second->dirty_pgs.clear();
}
finish_op(cur_op, 0);
}
assert(syncs_in_progress.front() == cur_op);
syncs_in_progress.pop_front();
if (syncs_in_progress.size() > 0)
{
cur_op = syncs_in_progress.front();
op_data = cur_op->op_data;
op_data->st++;
goto resume_2;
}
}
// Decrement pg_osd_set_state_t's object_count and change PG state accordingly
void osd_t::remove_object_from_state(object_id & oid, pg_osd_set_state_t *object_state, pg_t & pg)
{
@ -832,10 +294,14 @@ void osd_t::remove_object_from_state(object_id & oid, pg_osd_set_state_t *object
{
throw std::runtime_error("BUG: Invalid object state: "+std::to_string(object_state->state));
}
object_state->object_count--;
if (!object_state->object_count)
}
void osd_t::free_object_state(pg_t & pg, pg_osd_set_state_t **object_state)
{
pg.state_dict.erase(object_state->osd_set);
if (*object_state && !(--(*object_state)->object_count))
{
pg.state_dict.erase((*object_state)->osd_set);
*object_state = NULL;
}
}
@ -867,7 +333,7 @@ resume_1:
// Determine which OSDs contain this object and delete it
op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
// Submit 1 read to determine the actual version number
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, op_data->prev_set, cur_op);
resume_2:
op_data->st = 2;
return;
@ -901,22 +367,21 @@ resume_5:
else
{
remove_object_from_state(op_data->oid, op_data->object_state, pg);
free_object_state(pg, &op_data->object_state);
}
pg.total_count--;
object_id oid = op_data->oid;
osd_op_t *next_op = NULL;
auto next_it = pg.write_queue.find(op_data->oid);
if (next_it != pg.write_queue.end() && next_it->second == cur_op)
{
pg.write_queue.erase(next_it++);
if (next_it != pg.write_queue.end() && next_it->first == op_data->oid)
next_op = next_it->second;
}
finish_op(cur_op, cur_op->req.rw.len);
// Continue other write operations to the same object
auto next_it = pg.write_queue.find(oid);
auto this_it = next_it;
if (this_it != pg.write_queue.end() && this_it->second == cur_op)
if (next_op)
{
next_it++;
pg.write_queue.erase(this_it);
if (next_it != pg.write_queue.end() &&
next_it->first == oid)
{
osd_op_t *next_op = next_it->second;
// Continue next write to the same object
continue_primary_write(next_op);
}
}
}

View File

@ -31,15 +31,31 @@ struct osd_primary_op_data_t
uint64_t *prev_set = NULL;
pg_osd_set_state_t *object_state = NULL;
union
{
struct
{
// for sync. oops, requires freeing
std::vector<unstable_osd_num_t> *unstable_write_osds = NULL;
pool_pg_num_t *dirty_pgs = NULL;
int dirty_pg_count = 0;
osd_num_t *dirty_osds = NULL;
int dirty_osd_count = 0;
obj_ver_id *unstable_writes = NULL;
obj_ver_osd_t *copies_to_delete = NULL;
int copies_to_delete_count = 0;
std::vector<unstable_osd_num_t> *unstable_write_osds;
pool_pg_num_t *dirty_pgs;
int dirty_pg_count;
osd_num_t *dirty_osds;
int dirty_osd_count;
obj_ver_id *unstable_writes;
obj_ver_osd_t *copies_to_delete;
int copies_to_delete_count;
};
struct
{
// for read_bitmaps
void *snapshot_bitmaps;
inode_t *read_chain;
uint8_t *missing_flags;
int chain_size;
osd_chain_read_t *chain_reads;
int chain_read_count;
};
};
};
bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num);

549
src/osd_primary_chain.cpp Normal file
View File

@ -0,0 +1,549 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h"
#include "allocator.h"
void osd_t::continue_chained_read(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
if (op_data->st == 1)
goto resume_1;
else if (op_data->st == 2)
goto resume_2;
else if (op_data->st == 3)
goto resume_3;
else if (op_data->st == 4)
goto resume_4;
cur_op->reply.rw.bitmap_len = 0;
for (int role = 0; role < op_data->pg_data_size; role++)
{
op_data->stripes[role].read_start = op_data->stripes[role].req_start;
op_data->stripes[role].read_end = op_data->stripes[role].req_end;
}
resume_1:
resume_2:
// Read bitmaps
if (read_bitmaps(cur_op, pg, 1) != 0)
return;
// Prepare & submit reads
if (submit_chained_read_requests(pg, cur_op) != 0)
return;
if (op_data->n_subops > 0)
{
// Wait for reads
op_data->st = 3;
resume_3:
return;
}
resume_4:
if (op_data->errors > 0)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
send_chained_read_results(pg, cur_op);
finish_op(cur_op, cur_op->req.rw.len);
}
int osd_t::read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == base_state)
goto resume_0;
else if (op_data->st == base_state+1)
goto resume_1;
if (pg.state == PG_ACTIVE && pg.scheme == POOL_SCHEME_REPLICATED)
{
// Happy path for clean replicated PGs (all bitmaps are available locally)
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
auto read_version = (vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX);
// Read bitmap synchronously from the local database
bs->read_bitmap(cur_oid, read_version, op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size, NULL);
}
}
else
{
if (submit_bitmap_subops(cur_op, pg) < 0)
{
// Failure
finish_op(cur_op, -EIO);
return -1;
}
resume_0:
if (op_data->n_subops > 0)
{
// Wait for subops
op_data->st = base_state;
return 1;
}
resume_1:
if (pg.scheme != POOL_SCHEME_REPLICATED)
{
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
// Check if we need to reconstruct any bitmaps
for (int i = 0; i < pg.pg_size; i++)
{
if (op_data->missing_flags[chain_num*pg.pg_size + i])
{
osd_rmw_stripe_t local_stripes[pg.pg_size] = { 0 };
for (i = 0; i < pg.pg_size; i++)
{
local_stripes[i].missing = op_data->missing_flags[chain_num*pg.pg_size + i] && true;
local_stripes[i].bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size;
local_stripes[i].read_start = local_stripes[i].read_end = 1;
}
if (pg.scheme == POOL_SCHEME_XOR)
{
reconstruct_stripes_xor(local_stripes, pg.pg_size, clean_entry_bitmap_size);
}
else if (pg.scheme == POOL_SCHEME_JERASURE)
{
reconstruct_stripes_jerasure(local_stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
}
break;
}
}
}
}
}
return 0;
}
int osd_t::collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
uint64_t target_version = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
pg_osd_set_state_t *object_state;
uint64_t* cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
if (pg.scheme == POOL_SCHEME_REPLICATED)
{
osd_num_t read_target = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (cur_set[i] == this->osd_num || cur_set[i] != 0 && read_target == 0)
{
// Select local or any other available OSD for reading
read_target = cur_set[i];
}
}
assert(read_target != 0);
bitmap_requests.push_back((bitmap_request_t){
.osd_num = read_target,
.oid = cur_oid,
.version = target_version,
.bmp_buf = op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
});
}
else
{
osd_rmw_stripe_t local_stripes[pg.pg_size];
memcpy(local_stripes, op_data->stripes, sizeof(osd_rmw_stripe_t) * pg.pg_size);
if (extend_missing_stripes(local_stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
{
free(op_data->snapshot_bitmaps);
return -1;
}
int need_at_least = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (local_stripes[i].read_end != 0 && cur_set[i] == 0)
{
// We need this part of the bitmap, but it's unavailable
need_at_least = pg.pg_data_size;
op_data->missing_flags[chain_num*pg.pg_size + i] = 1;
}
else
{
op_data->missing_flags[chain_num*pg.pg_size + i] = 0;
}
}
int found = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (cur_set[i] != 0 && (local_stripes[i].read_end != 0 || found < need_at_least))
{
// Read part of the bitmap
bitmap_requests.push_back((bitmap_request_t){
.osd_num = cur_set[i],
.oid = {
.inode = cur_oid.inode,
.stripe = cur_oid.stripe | i,
},
.version = target_version,
.bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size,
});
found++;
}
}
// Already checked by extend_missing_stripes, so it's fine to use assert
assert(found >= need_at_least);
}
}
std::sort(bitmap_requests.begin(), bitmap_requests.end());
return 0;
}
int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
std::vector<bitmap_request_t> *bitmap_requests = new std::vector<bitmap_request_t>();
if (collect_bitmap_requests(cur_op, pg, *bitmap_requests) < 0)
{
return -1;
}
op_data->n_subops = 0;
for (int i = 0; i < bitmap_requests->size(); i++)
{
if ((i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num) &&
(*bitmap_requests)[i].osd_num != this->osd_num)
{
op_data->n_subops++;
}
}
if (op_data->n_subops)
{
op_data->fact_ver = 0;
op_data->done = op_data->errors = 0;
op_data->subops = new osd_op_t[op_data->n_subops];
}
for (int i = 0, subop_idx = 0, prev = 0; i < bitmap_requests->size(); i++)
{
if (i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num)
{
osd_num_t subop_osd_num = (*bitmap_requests)[i].osd_num;
if (subop_osd_num == this->osd_num)
{
// Read bitmap synchronously from the local database
for (int j = prev; j <= i; j++)
{
bs->read_bitmap((*bitmap_requests)[j].oid, (*bitmap_requests)[j].version, (*bitmap_requests)[j].bmp_buf, NULL);
}
}
else
{
// Send to a remote OSD
osd_op_t *subop = op_data->subops+subop_idx;
subop->op_type = OSD_OP_OUT;
subop->peer_fd = c_cli.osd_peer_fds.at(subop_osd_num);
// FIXME: Use the pre-allocated buffer
subop->buf = malloc_or_die(sizeof(obj_ver_id)*(i+1-prev));
subop->req = (osd_any_op_t){
.sec_read_bmp = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_READ_BMP,
},
.len = sizeof(obj_ver_id)*(i+1-prev),
}
};
obj_ver_id *ov = (obj_ver_id*)subop->buf;
for (int j = prev; j <= i; j++, ov++)
{
ov->oid = (*bitmap_requests)[j].oid;
ov->version = (*bitmap_requests)[j].version;
}
subop->callback = [cur_op, bitmap_requests, prev, i, this](osd_op_t *subop)
{
int requested_count = subop->req.sec_read_bmp.len / sizeof(obj_ver_id);
if (subop->reply.hdr.retval == requested_count * (8 + clean_entry_bitmap_size))
{
void *cur_buf = subop->buf + 8;
for (int j = prev; j <= i; j++)
{
memcpy((*bitmap_requests)[j].bmp_buf, cur_buf, clean_entry_bitmap_size);
cur_buf += 8 + clean_entry_bitmap_size;
}
}
if ((cur_op->op_data->errors + cur_op->op_data->done + 1) >= cur_op->op_data->n_subops)
{
delete bitmap_requests;
}
handle_primary_subop(subop, cur_op);
};
c_cli.outbox_push(subop);
subop_idx++;
}
prev = i+1;
}
}
if (!op_data->n_subops)
{
delete bitmap_requests;
}
return 0;
}
std::vector<osd_chain_read_t> osd_t::collect_chained_read_requests(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
std::vector<osd_chain_read_t> chain_reads;
int stripe_count = (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : op_data->pg_size);
memset(op_data->stripes[0].bmp_buf, 0, stripe_count * clean_entry_bitmap_size);
uint8_t *global_bitmap = (uint8_t*)op_data->stripes[0].bmp_buf;
// We always use at most 1 read request per layer
for (int chain_pos = 0; chain_pos < op_data->chain_size; chain_pos++)
{
uint8_t *part_bitmap = ((uint8_t*)op_data->snapshot_bitmaps) + chain_pos*stripe_count*clean_entry_bitmap_size;
int start = (cur_op->req.rw.offset - op_data->oid.stripe)/bs_bitmap_granularity;
int end = start + cur_op->req.rw.len/bs_bitmap_granularity;
// Skip unneeded part in the beginning
while (start < end && (
((global_bitmap[start>>3] >> (start&7)) & 1) ||
!((part_bitmap[start>>3] >> (start&7)) & 1)))
{
start++;
}
// Skip unneeded part in the end
while (start < end && (
((global_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1) ||
!((part_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1)))
{
end--;
}
if (start < end)
{
// Copy (OR) bits in between
int cur = start;
for (; cur < end && (cur & 0x7); cur++)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
}
for (; cur <= end-8; cur += 8)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | part_bitmap[cur>>3];
}
for (; cur < end; cur++)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
}
// Add request
chain_reads.push_back((osd_chain_read_t){
.chain_pos = chain_pos,
.inode = op_data->read_chain[chain_pos],
.offset = start*bs_bitmap_granularity,
.len = (end-start)*bs_bitmap_granularity,
});
}
}
return chain_reads;
}
int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
{
// Decide which parts of which objects we need to read based on bitmaps
osd_primary_op_data_t *op_data = cur_op->op_data;
auto chain_reads = collect_chained_read_requests(cur_op);
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
op_data->chain_read_count = chain_reads.size();
op_data->chain_reads = (osd_chain_read_t*)calloc_or_die(
1, (sizeof(osd_chain_read_t) + sizeof(osd_rmw_stripe_t)*stripe_count) * chain_reads.size()
);
osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count);
// Now process each subrequest as a separate read, including reconstruction if needed
// Prepare reads
int n_subops = 0;
uint64_t read_buffer_size = 0;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
op_data->chain_reads[cri] = chain_reads[cri];
object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
// FIXME: maybe introduce split_read_stripes to shorten these lines and to remove read_start=req_start
osd_rmw_stripe_t *stripes = chain_stripes + cri*stripe_count;
split_stripes(pg.pg_data_size, bs_block_size, chain_reads[cri].offset, chain_reads[cri].len, stripes);
if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
{
continue;
}
for (int role = 0; role < op_data->pg_data_size; role++)
{
stripes[role].read_start = stripes[role].req_start;
stripes[role].read_end = stripes[role].req_end;
}
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
if (extend_missing_stripes(stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, -EIO);
return -1;
}
op_data->degraded = 1;
}
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
n_subops++;
read_buffer_size += stripes[0].read_end - stripes[0].read_start;
}
else
{
for (int role = 0; role < pg.pg_size; role++)
{
if (stripes[role].read_end > 0 && cur_set[role] != 0)
n_subops++;
if (stripes[role].read_end > 0)
read_buffer_size += stripes[role].read_end - stripes[role].read_start;
}
}
}
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, read_buffer_size);
void *cur_buf = cur_op->buf;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
osd_rmw_stripe_t *stripes = chain_stripes + cri*stripe_count;
for (int role = 0; role < stripe_count; role++)
{
if (stripes[role].read_end > 0)
{
stripes[role].read_buf = cur_buf;
stripes[role].bmp_buf = op_data->snapshot_bitmaps + (chain_reads[cri].chain_pos*stripe_count + role)*clean_entry_bitmap_size;
cur_buf += stripes[role].read_end - stripes[role].read_start;
}
}
}
// Submit all reads
op_data->fact_ver = UINT64_MAX;
op_data->done = op_data->errors = 0;
op_data->n_subops = n_subops;
if (!n_subops)
{
return 0;
}
op_data->subops = new osd_op_t[n_subops];
int cur_subops = 0;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
osd_rmw_stripe_t *stripes = chain_stripes + cri*stripe_count;
if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
{
continue;
}
object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
uint64_t target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
}
int zero_read = -1;
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
for (int role = 0; role < op_data->pg_size; role++)
if (cur_set[role] == this->osd_num || zero_read == -1)
zero_read = role;
}
cur_subops += submit_primary_subop_batch(SUBMIT_READ, chain_reads[cri].inode, target_ver, stripes, cur_set, cur_op, cur_subops, zero_read);
}
assert(cur_subops == n_subops);
return 0;
}
void osd_t::send_chained_read_results(pg_t & pg, osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count);
// Reconstruct parts if needed
if (op_data->degraded)
{
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
for (int cri = 0; cri < op_data->chain_read_count; cri++)
{
// Reconstruct missing stripes
osd_rmw_stripe_t *stripes = chain_stripes + cri*stripe_count;
if (op_data->scheme == POOL_SCHEME_XOR)
{
reconstruct_stripes_xor(stripes, pg.pg_size, clean_entry_bitmap_size);
}
else if (op_data->scheme == POOL_SCHEME_JERASURE)
{
reconstruct_stripes_jerasure(stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
}
}
}
// Send bitmap
cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
// And finally compose the result
uint64_t sent = 0;
int prev_pos = 0, pos = 0;
bool prev_set = false;
int prev = (cur_op->req.rw.offset - op_data->oid.stripe) / bs_bitmap_granularity;
int end = prev + cur_op->req.rw.len/bs_bitmap_granularity;
int cur = prev;
while (cur <= end)
{
bool has_bit = false;
if (cur < end)
{
for (pos = 0; pos < op_data->chain_size; pos++)
{
has_bit = (((uint8_t*)op_data->snapshot_bitmaps)[pos*stripe_count*clean_entry_bitmap_size + cur/8] >> (cur%8)) & 1;
if (has_bit)
break;
}
}
if (has_bit != prev_set || pos != prev_pos || cur == end)
{
if (cur > prev)
{
// Send buffer in parts to avoid copying
if (!prev_set)
{
while ((cur-prev) > zero_buffer_size/bs_bitmap_granularity)
{
cur_op->iov.push_back(zero_buffer, zero_buffer_size);
sent += zero_buffer_size;
prev += zero_buffer_size/bs_bitmap_granularity;
}
cur_op->iov.push_back(zero_buffer, (cur-prev)*bs_bitmap_granularity);
sent += (cur-prev)*bs_bitmap_granularity;
}
else
{
osd_rmw_stripe_t *stripes = chain_stripes + prev_pos*stripe_count;
while (cur > prev)
{
int role = prev*bs_bitmap_granularity/bs_block_size;
int role_start = prev*bs_bitmap_granularity - role*bs_block_size;
int role_end = cur*bs_bitmap_granularity - role*bs_block_size;
if (role_end > bs_block_size)
role_end = bs_block_size;
assert(stripes[role].read_buf);
cur_op->iov.push_back(
stripes[role].read_buf + (role_start - stripes[role].read_start),
role_end - role_start
);
sent += role_end - role_start;
prev += (role_end - role_start)/bs_bitmap_granularity;
}
}
}
prev = cur;
prev_pos = pos;
prev_set = has_bit;
}
cur++;
}
assert(sent == cur_op->req.rw.len);
free(op_data->chain_reads);
op_data->chain_reads = NULL;
}

View File

@ -66,17 +66,16 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
auto & pg = pgs.at({ .pool_id = INODE_POOL(cur_op->op_data->oid.inode), .pg_num = cur_op->op_data->pg_num });
pg.inflight--;
assert(pg.inflight >= 0);
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all PG's unstable writes or wait for it to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
{
finish_stop_pg(pg);
}
else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
}
assert(!cur_op->op_data->subops);
assert(!cur_op->op_data->unstable_write_osds);
assert(!cur_op->op_data->unstable_writes);
assert(!cur_op->op_data->dirty_pgs);
free(cur_op->op_data);
cur_op->op_data = NULL;
}
@ -104,7 +103,7 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
}
}
void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op)
{
bool wr = submit_type == SUBMIT_WRITE;
osd_primary_op_data_t *op_data = cur_op->op_data;
@ -112,32 +111,34 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
// Allocate subops
int n_subops = 0, zero_read = -1;
for (int role = 0; role < pg_size; role++)
for (int role = 0; role < op_data->pg_size; role++)
{
if (osd_set[role] == this->osd_num || osd_set[role] != 0 && zero_read == -1)
{
zero_read = role;
}
if (osd_set[role] != 0 && (wr || !rep && stripes[role].read_end != 0))
{
n_subops++;
}
}
if (!n_subops && (submit_type == SUBMIT_RMW_READ || rep))
{
n_subops = 1;
}
else
{
zero_read = -1;
}
osd_op_t *subops = new osd_op_t[n_subops];
op_data->fact_ver = 0;
op_data->done = op_data->errors = 0;
op_data->n_subops = n_subops;
op_data->subops = subops;
int i = 0;
for (int role = 0; role < pg_size; role++)
int sent = submit_primary_subop_batch(submit_type, op_data->oid.inode, op_version, op_data->stripes, osd_set, cur_op, 0, zero_read);
assert(sent == n_subops);
}
int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read)
{
bool wr = submit_type == SUBMIT_WRITE;
osd_primary_op_data_t *op_data = cur_op->op_data;
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
int i = subop_idx;
for (int role = 0; role < op_data->pg_size; role++)
{
// We always submit zero-length writes to all replicas, even if the stripe is not modified
if (!(wr || !rep && stripes[role].read_end != 0 || zero_read == role))
@ -148,20 +149,21 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
if (role_osd_num != 0)
{
int stripe_num = rep ? 0 : role;
osd_op_t *subop = op_data->subops + i;
if (role_osd_num == this->osd_num)
{
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op;
subops[i].bitmap = stripes[stripe_num].bmp_buf;
subops[i].bitmap_len = clean_entry_bitmap_size;
subops[i].bs_op = new blockstore_op_t({
clock_gettime(CLOCK_REALTIME, &subop->tv_begin);
subop->op_type = (uint64_t)cur_op;
subop->bitmap = stripes[stripe_num].bmp_buf;
subop->bitmap_len = clean_entry_bitmap_size;
subop->bs_op = new blockstore_op_t({
.opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ),
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
.callback = [subop, this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
.oid = {
.inode = op_data->oid.inode,
.inode = inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
@ -173,26 +175,26 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
#ifdef OSD_DEBUG
printf(
"Submit %s to local: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read",
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].bs_op->offset, subops[i].bs_op->len
inode, op_data->oid.stripe | stripe_num, op_version,
subop->bs_op->offset, subop->bs_op->len
);
#endif
bs->enqueue_op(subops[i].bs_op);
bs->enqueue_op(subop->bs_op);
}
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
subops[i].bitmap = stripes[stripe_num].bmp_buf;
subops[i].bitmap_len = clean_entry_bitmap_size;
subops[i].req.sec_rw = {
subop->op_type = OSD_OP_OUT;
subop->peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
subop->bitmap = stripes[stripe_num].bmp_buf;
subop->bitmap_len = clean_entry_bitmap_size;
subop->req.sec_rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = (uint64_t)(wr ? (rep ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
},
.oid = {
.inode = op_data->oid.inode,
.inode = inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
@ -203,40 +205,34 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
#ifdef OSD_DEBUG
printf(
"Submit %s to osd %lu: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read", role_osd_num,
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].req.sec_rw.offset, subops[i].req.sec_rw.len
inode, op_data->oid.stripe | stripe_num, op_version,
subop->req.sec_rw.offset, subop->req.sec_rw.len
);
#endif
if (wr)
{
if (stripes[stripe_num].write_end > stripes[stripe_num].write_start)
{
subops[i].iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
subop->iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
}
}
else
{
if (stripes[stripe_num].read_end > stripes[stripe_num].read_start)
{
subops[i].iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
subop->iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
}
}
subops[i].callback = [cur_op, this](osd_op_t *subop)
subop->callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->req.hdr.opcode == OSD_OP_SEC_WRITE &&
subop->reply.hdr.retval != subop->req.sec_rw.len ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// write operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
c_cli.outbox_push(subop);
}
i++;
}
}
return i-subop_idx;
}
static uint64_t bs_op_to_osd_op[] = {
@ -276,6 +272,7 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
}
delete bs_op;
subop->bs_op = NULL;
subop->peer_fd = -1;
handle_primary_subop(subop, cur_op);
}
@ -306,8 +303,13 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
{
uint64_t opcode = subop->req.hdr.opcode;
int retval = subop->reply.hdr.retval;
int expected = opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE
|| opcode == OSD_OP_SEC_WRITE_STABLE ? subop->req.sec_rw.len : 0;
int expected;
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE || opcode == OSD_OP_SEC_WRITE_STABLE)
expected = subop->req.sec_rw.len;
else if (opcode == OSD_OP_SEC_READ_BMP)
expected = subop->req.sec_read_bmp.len / sizeof(obj_ver_id) * (8 + clean_entry_bitmap_size);
else
expected = 0;
osd_primary_op_data_t *op_data = cur_op->op_data;
if (retval != expected)
{
@ -317,6 +319,11 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
op_data->epipe++;
}
op_data->errors++;
if (subop->peer_fd >= 0)
{
// Drop connection on any error
c_cli.stop_client(subop->peer_fd);
}
}
else
{
@ -329,6 +336,8 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
? c_cli.clients[subop->peer_fd]->osd_num : osd_num;
printf("subop %lu from osd %lu: version = %lu\n", opcode, peer_osd, version);
#endif
if (op_data->fact_ver != UINT64_MAX)
{
if (op_data->fact_ver != 0 && op_data->fact_ver != version)
{
throw std::runtime_error(
@ -339,6 +348,7 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
op_data->fact_ver = version;
}
}
}
if ((op_data->errors + op_data->done) >= op_data->n_subops)
{
delete[] op_data->subops;
@ -456,7 +466,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
subops[i].req.sec_del = {
subops[i].req = (osd_any_op_t){ .sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
@ -464,23 +474,17 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
},
.oid = chunk.oid,
.version = chunk.version,
};
} };
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// delete operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
}
}
void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
int n_osds = op_data->dirty_osd_count;
@ -488,6 +492,7 @@ void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
op_data->done = op_data->errors = 0;
op_data->n_subops = n_osds;
op_data->subops = subops;
std::map<uint64_t, int>::iterator peer_it;
for (int i = 0; i < n_osds; i++)
{
osd_num_t sync_osd = op_data->dirty_osds[i];
@ -504,31 +509,36 @@ void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
});
bs->enqueue_op(subops[i].bs_op);
}
else
else if ((peer_it = c_cli.osd_peer_fds.find(sync_osd)) != c_cli.osd_peer_fds.end())
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(sync_osd);
subops[i].req.sec_sync = {
subops[i].peer_fd = peer_it->second;
subops[i].req = (osd_any_op_t){ .sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
};
} };
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// sync operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
else
{
op_data->done++;
}
}
if (op_data->done >= op_data->n_subops)
{
delete[] op_data->subops;
op_data->subops = NULL;
return 0;
}
return 1;
}
void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
{
@ -560,24 +570,18 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(stab_osd.osd_num);
subops[i].req.sec_stab = {
subops[i].req = (osd_any_op_t){ .sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_STABILIZE,
},
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
};
} };
subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// sync operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
@ -595,7 +599,7 @@ void osd_t::pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid,
return;
}
std::vector<osd_op_t*> cancel_ops;
while (it != pg.write_queue.end())
while (it != pg.write_queue.end() && it->first == oid)
{
cancel_ops.push_back(it->second);
it++;

265
src/osd_primary_sync.cpp Normal file
View File

@ -0,0 +1,265 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h"
// Save and clear unstable_writes -> SYNC all -> STABLE all
void osd_t::continue_primary_sync(osd_op_t *cur_op)
{
if (!cur_op->op_data)
{
cur_op->op_data = (osd_primary_op_data_t*)calloc_or_die(1, sizeof(osd_primary_op_data_t));
}
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2;
else if (op_data->st == 3) goto resume_3;
else if (op_data->st == 4) goto resume_4;
else if (op_data->st == 5) goto resume_5;
else if (op_data->st == 6) goto resume_6;
else if (op_data->st == 7) goto resume_7;
else if (op_data->st == 8) goto resume_8;
assert(op_data->st == 0);
if (syncs_in_progress.size() > 0)
{
// Wait for previous syncs, if any
// FIXME: We may try to execute the current one in parallel, like in Blockstore, but I'm not sure if it matters at all
syncs_in_progress.push_back(cur_op);
op_data->st = 1;
resume_1:
return;
}
else
{
syncs_in_progress.push_back(cur_op);
}
resume_2:
if (dirty_osds.size() == 0)
{
// Nothing to sync
goto finish;
}
// Save and clear unstable_writes
// In theory it is possible to do in on a per-client basis, but this seems to be an unnecessary complication
// It would be cool not to copy these here at all, but someone has to deduplicate them by object IDs anyway
if (unstable_writes.size() > 0)
{
op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
op_data->unstable_writes = new obj_ver_id[this->unstable_writes.size()];
osd_num_t last_osd = 0;
int last_start = 0, last_end = 0;
for (auto it = this->unstable_writes.begin(); it != this->unstable_writes.end(); it++)
{
if (last_osd != it->first.osd_num)
{
if (last_osd != 0)
{
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = last_osd,
.start = last_start,
.len = last_end - last_start,
});
}
last_osd = it->first.osd_num;
last_start = last_end;
}
op_data->unstable_writes[last_end] = (obj_ver_id){
.oid = it->first.oid,
.version = it->second,
};
last_end++;
}
if (last_osd != 0)
{
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = last_osd,
.start = last_start,
.len = last_end - last_start,
});
}
this->unstable_writes.clear();
}
{
void *dirty_buf = malloc_or_die(
sizeof(pool_pg_num_t)*dirty_pgs.size() +
sizeof(osd_num_t)*dirty_osds.size() +
sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
);
op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
op_data->dirty_pg_count = dirty_pgs.size();
op_data->dirty_osd_count = dirty_osds.size();
if (this->copies_to_delete_after_sync_count)
{
op_data->copies_to_delete_count = 0;
op_data->copies_to_delete = (obj_ver_osd_t*)(op_data->dirty_osds + op_data->dirty_osd_count);
for (auto dirty_pg_num: dirty_pgs)
{
auto & pg = pgs.at(dirty_pg_num);
assert(pg.copies_to_delete_after_sync.size() <= this->copies_to_delete_after_sync_count);
memcpy(
op_data->copies_to_delete + op_data->copies_to_delete_count,
pg.copies_to_delete_after_sync.data(),
sizeof(obj_ver_osd_t)*pg.copies_to_delete_after_sync.size()
);
op_data->copies_to_delete_count += pg.copies_to_delete_after_sync.size();
this->copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
pg.copies_to_delete_after_sync.clear();
}
assert(this->copies_to_delete_after_sync_count == 0);
}
int dpg = 0;
for (auto dirty_pg_num: dirty_pgs)
{
pgs.at(dirty_pg_num).inflight++;
op_data->dirty_pgs[dpg++] = dirty_pg_num;
}
dirty_pgs.clear();
dpg = 0;
for (auto osd_num: dirty_osds)
{
op_data->dirty_osds[dpg++] = osd_num;
}
dirty_osds.clear();
}
if (immediate_commit != IMMEDIATE_ALL)
{
// SYNC
if (!submit_primary_sync_subops(cur_op))
{
goto resume_4;
}
resume_3:
op_data->st = 3;
return;
resume_4:
if (op_data->errors > 0)
{
goto resume_6;
}
}
if (op_data->unstable_writes)
{
// Stabilize version sets, if any
submit_primary_stab_subops(cur_op);
resume_5:
op_data->st = 5;
return;
}
resume_6:
if (op_data->errors > 0)
{
// Return PGs and OSDs back into their dirty sets
for (int i = 0; i < op_data->dirty_pg_count; i++)
{
dirty_pgs.insert(op_data->dirty_pgs[i]);
}
for (int i = 0; i < op_data->dirty_osd_count; i++)
{
dirty_osds.insert(op_data->dirty_osds[i]);
}
if (op_data->unstable_writes)
{
// Return objects back into the unstable write set
for (auto unstable_osd: *(op_data->unstable_write_osds))
{
for (int i = 0; i < unstable_osd.len; i++)
{
// Except those from peered PGs
auto & w = op_data->unstable_writes[i];
pool_pg_num_t wpg = {
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
};
if (pgs.at(wpg).state & PG_ACTIVE)
{
uint64_t & dest = this->unstable_writes[(osd_object_id_t){
.osd_num = unstable_osd.osd_num,
.oid = w.oid,
}];
dest = dest < w.version ? w.version : dest;
dirty_pgs.insert(wpg);
}
}
}
}
if (op_data->copies_to_delete)
{
// Return 'copies to delete' back into respective PGs
for (int i = 0; i < op_data->copies_to_delete_count; i++)
{
auto & w = op_data->copies_to_delete[i];
auto & pg = pgs.at((pool_pg_num_t){
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
});
if (pg.state & PG_ACTIVE)
{
pg.copies_to_delete_after_sync.push_back(w);
copies_to_delete_after_sync_count++;
}
}
}
}
else if (op_data->copies_to_delete)
{
// Actually delete copies which we wanted to delete
submit_primary_del_batch(cur_op, op_data->copies_to_delete, op_data->copies_to_delete_count);
resume_7:
op_data->st = 7;
return;
resume_8:
if (op_data->errors > 0)
{
goto resume_6;
}
}
for (int i = 0; i < op_data->dirty_pg_count; i++)
{
auto & pg = pgs.at(op_data->dirty_pgs[i]);
pg.inflight--;
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
{
finish_stop_pg(pg);
}
else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
}
// FIXME: Free those in the destructor?
free(op_data->dirty_pgs);
op_data->dirty_pgs = NULL;
op_data->dirty_osds = NULL;
if (op_data->unstable_writes)
{
delete op_data->unstable_write_osds;
delete[] op_data->unstable_writes;
op_data->unstable_writes = NULL;
op_data->unstable_write_osds = NULL;
}
if (op_data->errors > 0)
{
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
}
else
{
finish:
if (cur_op->peer_fd)
{
auto it = c_cli.clients.find(cur_op->peer_fd);
if (it != c_cli.clients.end())
it->second->dirty_pgs.clear();
}
finish_op(cur_op, 0);
}
assert(syncs_in_progress.front() == cur_op);
syncs_in_progress.pop_front();
if (syncs_in_progress.size() > 0)
{
cur_op = syncs_in_progress.front();
op_data = cur_op->op_data;
op_data->st++;
goto resume_2;
}
}

381
src/osd_primary_write.cpp Normal file
View File

@ -0,0 +1,381 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h"
#include "allocator.h"
bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
// Check if actions are pending for this object
auto act_it = pg.flush_actions.lower_bound((obj_piece_id_t){
.oid = op_data->oid,
.osd_num = 0,
});
if (act_it != pg.flush_actions.end() &&
act_it->first.oid.inode == op_data->oid.inode &&
(act_it->first.oid.stripe & ~STRIPE_MASK) == op_data->oid.stripe)
{
pg.write_queue.emplace(op_data->oid, cur_op);
return false;
}
// Check if there are other write requests to the same object
auto vo_it = pg.write_queue.find(op_data->oid);
if (vo_it != pg.write_queue.end())
{
op_data->st = 1;
pg.write_queue.emplace(op_data->oid, cur_op);
return false;
}
pg.write_queue.emplace(op_data->oid, cur_op);
return true;
}
void osd_t::continue_primary_write(osd_op_t *cur_op)
{
if (!cur_op->op_data && !prepare_primary_rw(cur_op))
{
return;
}
osd_primary_op_data_t *op_data = cur_op->op_data;
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2;
else if (op_data->st == 3) goto resume_3;
else if (op_data->st == 4) goto resume_4;
else if (op_data->st == 5) goto resume_5;
else if (op_data->st == 6) goto resume_6;
else if (op_data->st == 7) goto resume_7;
else if (op_data->st == 8) goto resume_8;
else if (op_data->st == 9) goto resume_9;
else if (op_data->st == 10) goto resume_10;
assert(op_data->st == 0);
if (!check_write_queue(cur_op, pg))
{
return;
}
resume_1:
// Determine blocks to read and write
// Missing chunks are allowed to be overwritten even in incomplete objects
// FIXME: Allow to do small writes to the old (degraded/misplaced) OSD set for lower performance impact
op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
// Simplified algorithm
op_data->stripes[0].write_start = op_data->stripes[0].req_start;
op_data->stripes[0].write_end = op_data->stripes[0].req_end;
op_data->stripes[0].write_buf = cur_op->buf;
if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
op_data->stripes[0].write_end != bs_block_size))
{
// Object is degraded/misplaced and will be moved to <write_osd_set>
op_data->stripes[0].read_start = 0;
op_data->stripes[0].read_end = bs_block_size;
cur_op->rmw_buf = op_data->stripes[0].read_buf = memalign_or_die(MEM_ALIGNMENT, bs_block_size);
}
}
else
{
cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
if (!cur_op->rmw_buf)
{
// Refuse partial overwrite of an incomplete object
cur_op->reply.hdr.retval = -EINVAL;
goto continue_others;
}
}
// Read required blocks
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, op_data->prev_set, cur_op);
resume_2:
op_data->st = 2;
return;
resume_3:
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
// Set bitmap bits
bitmap_set(op_data->stripes[0].bmp_buf, op_data->stripes[0].write_start,
op_data->stripes[0].write_end-op_data->stripes[0].write_start, bs_bitmap_granularity);
// Possibly copy new data from the request into the recovery buffer
if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
op_data->stripes[0].write_end != bs_block_size))
{
memcpy(
op_data->stripes[0].read_buf + op_data->stripes[0].req_start,
op_data->stripes[0].write_buf,
op_data->stripes[0].req_end - op_data->stripes[0].req_start
);
op_data->stripes[0].write_buf = op_data->stripes[0].read_buf;
op_data->stripes[0].write_start = 0;
op_data->stripes[0].write_end = bs_block_size;
}
}
else
{
// For EC/XOR pools, save version override to make it impossible
// for parallel reads to read different versions of data and parity
pg.ver_override[op_data->oid] = op_data->fact_ver;
// Recover missing stripes, calculate parity
if (pg.scheme == POOL_SCHEME_XOR)
{
calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
}
else if (pg.scheme == POOL_SCHEME_JERASURE)
{
calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
}
}
// Send writes
if ((op_data->fact_ver >> (64-PG_EPOCH_BITS)) < pg.epoch)
{
op_data->target_ver = ((uint64_t)pg.epoch << (64-PG_EPOCH_BITS)) | 1;
}
else
{
if ((op_data->fact_ver & (1ul<<(64-PG_EPOCH_BITS) - 1)) == (1ul<<(64-PG_EPOCH_BITS) - 1))
{
assert(pg.epoch != ((1ul << PG_EPOCH_BITS)-1));
pg.epoch++;
}
op_data->target_ver = op_data->fact_ver + 1;
}
if (pg.epoch > pg.reported_epoch)
{
// Report newer epoch before writing
// FIXME: We may report only one PG state here...
this->pg_state_dirty.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
pg.history_changed = true;
report_pg_states();
resume_10:
if (pg.epoch > pg.reported_epoch)
{
op_data->st = 10;
return;
}
}
submit_primary_subops(SUBMIT_WRITE, op_data->target_ver, pg.cur_set.data(), cur_op);
resume_4:
op_data->st = 4;
return;
resume_5:
if (op_data->scheme != POOL_SCHEME_REPLICATED)
{
// Remove version override just after the write, but before stabilizing
pg.ver_override.erase(op_data->oid);
}
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
if (op_data->object_state)
{
// We must forget the unclean state of the object before deleting it
// so the next reads don't accidentally read a deleted version
// And it should be done at the same time as the removal of the version override
remove_object_from_state(op_data->oid, op_data->object_state, pg);
pg.clean_count++;
}
resume_6:
resume_7:
if (!remember_unstable_write(cur_op, pg, pg.cur_loc_set, 6))
{
return;
}
if (op_data->fact_ver == 1)
{
// Object is created
pg.clean_count++;
pg.total_count++;
}
if (op_data->object_state)
{
{
int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
recovery_stat_count[0][recovery_type]++;
if (!recovery_stat_count[0][recovery_type])
{
recovery_stat_count[0][recovery_type]++;
recovery_stat_bytes[0][recovery_type] = 0;
}
for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
{
recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
}
}
// Any kind of a non-clean object can have extra chunks, because we don't record objects
// as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
if (immediate_commit != IMMEDIATE_ALL)
{
// We can't remove extra chunks yet if fsyncs are explicit, because
// new copies may not be committed to stable storage yet
// We can only remove extra chunks after a successful SYNC for this PG
for (auto & chunk: op_data->object_state->osd_set)
{
// Check is the same as in submit_primary_del_subops()
if (op_data->scheme == POOL_SCHEME_REPLICATED
? !contains_osd(pg.cur_set.data(), pg.pg_size, chunk.osd_num)
: (chunk.osd_num != pg.cur_set[chunk.role]))
{
pg.copies_to_delete_after_sync.push_back((obj_ver_osd_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | (op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role),
},
.version = op_data->fact_ver,
});
copies_to_delete_after_sync_count++;
}
}
free_object_state(pg, &op_data->object_state);
}
else
{
submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set);
free_object_state(pg, &op_data->object_state);
if (op_data->n_subops > 0)
{
resume_8:
op_data->st = 8;
return;
resume_9:
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
}
}
}
cur_op->reply.hdr.retval = cur_op->req.rw.len;
continue_others:
osd_op_t *next_op = NULL;
auto next_it = pg.write_queue.find(op_data->oid);
// Remove the operation from queue before calling finish_op so it doesn't see the completed operation in queue
if (next_it != pg.write_queue.end() && next_it->second == cur_op)
{
pg.write_queue.erase(next_it++);
if (next_it != pg.write_queue.end() && next_it->first == op_data->oid)
next_op = next_it->second;
}
// finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :)
finish_op(cur_op, cur_op->req.rw.len);
if (next_op)
{
// Continue next write to the same object
continue_primary_write(next_op);
}
}
bool osd_t::remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == base_state)
{
goto resume_6;
}
else if (op_data->st == base_state+1)
{
goto resume_7;
}
if (immediate_commit == IMMEDIATE_ALL)
{
immediate:
if (op_data->scheme != POOL_SCHEME_REPLICATED)
{
// Send STABILIZE ops immediately
op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
op_data->unstable_writes = new obj_ver_id[loc_set.size()];
{
int last_start = 0;
for (auto & chunk: loc_set)
{
op_data->unstable_writes[last_start] = (obj_ver_id){
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | chunk.role,
},
.version = op_data->fact_ver,
};
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = chunk.osd_num,
.start = last_start,
.len = 1,
});
last_start++;
}
}
submit_primary_stab_subops(cur_op);
resume_6:
op_data->st = 6;
return false;
resume_7:
// FIXME: Free those in the destructor?
delete op_data->unstable_write_osds;
delete[] op_data->unstable_writes;
op_data->unstable_writes = NULL;
op_data->unstable_write_osds = NULL;
if (op_data->errors > 0)
{
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return false;
}
}
}
else if (immediate_commit == IMMEDIATE_SMALL)
{
int stripe_count = (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : op_data->pg_size);
for (int role = 0; role < stripe_count; role++)
{
if (op_data->stripes[role].write_start == 0 &&
op_data->stripes[role].write_end == bs_block_size)
{
// Big write. Treat write as unsynced
goto lazy;
}
}
goto immediate;
}
else
{
lazy:
if (op_data->scheme != POOL_SCHEME_REPLICATED)
{
// Remember version as unstable for EC/XOR
for (auto & chunk: loc_set)
{
this->dirty_osds.insert(chunk.osd_num);
this->unstable_writes[(osd_object_id_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | chunk.role,
},
}] = op_data->fact_ver;
}
}
else
{
// Only remember to sync OSDs for replicated pools
for (auto & chunk: loc_set)
{
this->dirty_osds.insert(chunk.osd_num);
}
}
// Remember PG as dirty to drop the connection when PG goes offline
// (this is required because of the "lazy sync")
auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end())
{
cl_it->second->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
}
dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
}
return true;
}

View File

@ -245,6 +245,8 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
for (int role = 0; role < pg_minsize; role++)
{
if (stripes[role].read_end != 0 && stripes[role].missing)
{
if (stripes[role].read_end > stripes[role].read_start)
{
for (int other = 0; other < pg_size; other++)
{
@ -260,6 +262,7 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
pg_minsize, OSD_JERASURE_W, decoding_matrix+(role*pg_minsize), dm_ids, role,
data_ptrs, data_ptrs+pg_minsize, stripes[role].read_end - stripes[role].read_start
);
}
for (int other = 0; other < pg_size; other++)
{
if (stripes[other].read_end != 0 && !stripes[other].missing)

View File

@ -44,6 +44,25 @@ void osd_t::secondary_op_callback(osd_op_t *op)
void osd_t::exec_secondary(osd_op_t *cur_op)
{
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
int n = cur_op->req.sec_read_bmp.len / sizeof(obj_ver_id);
if (n > 0)
{
obj_ver_id *ov = (obj_ver_id*)cur_op->buf;
void *reply_buf = malloc_or_die(n * (8 + clean_entry_bitmap_size));
void *cur_buf = reply_buf;
for (int i = 0; i < n; i++)
{
bs->read_bitmap(ov[i].oid, ov[i].version, cur_buf + sizeof(uint64_t), (uint64_t*)cur_buf);
cur_buf += (8 + clean_entry_bitmap_size);
}
free(cur_op->buf);
cur_op->buf = reply_buf;
}
finish_op(cur_op, n * (8 + clean_entry_bitmap_size));
return;
}
cur_op->bs_op = new blockstore_op_t();
cur_op->bs_op->callback = [this, cur_op](blockstore_op_t* bs_op) { secondary_op_callback(cur_op); };
cur_op->bs_op->opcode = (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ? BS_OP_READ
@ -126,7 +145,9 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
void osd_t::exec_show_config(osd_op_t *cur_op)
{
// FIXME: Send the real config, not its source
std::string cfg_str = json11::Json(config).dump();
auto cfg_copy = config;
cfg_copy["protocol_version"] = std::to_string(OSD_PROTOCOL_VERSION);
std::string cfg_str = json11::Json(cfg_copy).dump();
cur_op->buf = malloc_or_die(cfg_str.size()+1);
memcpy(cur_op->buf, cfg_str.c_str(), cfg_str.size()+1);
cur_op->iov.push_back(cur_op->buf, cfg_str.size()+1);

View File

@ -3,13 +3,14 @@
#include "pg_states.h"
const int pg_state_bit_count = 14;
const int pg_state_bit_count = 15;
const int pg_state_bits[14] = {
const int pg_state_bits[15] = {
PG_STARTING,
PG_PEERING,
PG_INCOMPLETE,
PG_ACTIVE,
PG_REPEERING,
PG_STOPPING,
PG_OFFLINE,
PG_DEGRADED,
@ -21,11 +22,12 @@ const int pg_state_bits[14] = {
PG_LEFT_ON_DEAD,
};
const char *pg_state_names[14] = {
const char *pg_state_names[15] = {
"starting",
"peering",
"incomplete",
"active",
"repeering",
"stopping",
"offline",
"degraded",

View File

@ -10,16 +10,17 @@
#define PG_PEERING (1<<1)
#define PG_INCOMPLETE (1<<2)
#define PG_ACTIVE (1<<3)
#define PG_STOPPING (1<<4)
#define PG_OFFLINE (1<<5)
#define PG_REPEERING (1<<4)
#define PG_STOPPING (1<<5)
#define PG_OFFLINE (1<<6)
// Plus any of these:
#define PG_DEGRADED (1<<6)
#define PG_HAS_INCOMPLETE (1<<7)
#define PG_HAS_DEGRADED (1<<8)
#define PG_HAS_MISPLACED (1<<9)
#define PG_HAS_UNCLEAN (1<<10)
#define PG_HAS_INVALID (1<<11)
#define PG_LEFT_ON_DEAD (1<<12)
#define PG_DEGRADED (1<<7)
#define PG_HAS_INCOMPLETE (1<<8)
#define PG_HAS_DEGRADED (1<<9)
#define PG_HAS_MISPLACED (1<<10)
#define PG_HAS_UNCLEAN (1<<11)
#define PG_HAS_INVALID (1<<12)
#define PG_LEFT_ON_DEAD (1<<13)
// Lower bits that represent object role (EC 0/1/2... or always 0 with replication)
// 12 bits is a safe default that doesn't depend on pg_stripe_size or pg_block_size

View File

@ -20,7 +20,15 @@ void alloc_all(int size)
{
printf("incorrect block allocated: expected %d, got %lu\n", i, x);
}
if (a->get(x))
{
printf("not free before set at %d\n", i);
}
a->set(x, true);
if (!a->get(x))
{
printf("free after set at %d\n", i);
}
}
uint64_t x = a->find_free();
if (x != UINT64_MAX)

View File

@ -2,8 +2,8 @@
// License: VNPL-1.1 (see README.md for details)
#include <malloc.h>
#include "timerfd_interval.h"
#include "blockstore.h"
#include "epoll_manager.h"
int main(int narg, char *args[])
{
@ -12,11 +12,8 @@ int main(int narg, char *args[])
config["journal_device"] = "./test_journal.bin";
config["data_device"] = "./test_data.bin";
ring_loop_t *ringloop = new ring_loop_t(512);
blockstore_t *bs = new blockstore_t(config, ringloop);
timerfd_interval tick_tfd(ringloop, 1, []()
{
printf("tick 1s\n");
});
epoll_manager_t *epmgr = new epoll_manager_t(ringloop);
blockstore_t *bs = new blockstore_t(config, ringloop, epmgr->tfd);
blockstore_op_t op;
int main_state = 0;
@ -125,6 +122,7 @@ int main(int narg, char *args[])
ringloop->wait();
}
delete bs;
delete epmgr;
delete ringloop;
return 0;
}

407
src/test_cluster_client.cpp Normal file
View File

@ -0,0 +1,407 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "cluster_client.h"
void configure_single_pg_pool(cluster_client_t *cli)
{
cli->st_cli.on_load_pgs_hook(true);
cli->st_cli.parse_state((etcd_kv_t){
.key = "/config/pools",
.value = json11::Json::object {
{ "1", json11::Json::object {
{ "name", "hddpool" },
{ "scheme", "replicated" },
{ "pg_size", 2 },
{ "pg_minsize", 1 },
{ "pg_count", 1 },
{ "failure_domain", "osd" },
} }
},
});
cli->st_cli.parse_state((etcd_kv_t){
.key = "/config/pgs",
.value = json11::Json::object {
{ "items", json11::Json::object {
{ "1", json11::Json::object {
{ "1", json11::Json::object {
{ "osd_set", json11::Json::array { 1, 2 } },
{ "primary", 1 },
} }
} }
} }
},
});
cli->st_cli.parse_state((etcd_kv_t){
.key = "/pg/state/1/1",
.value = json11::Json::object {
{ "peers", json11::Json::array { 1, 2 } },
{ "primary", 1 },
{ "state", json11::Json::array { "active" } },
},
});
std::map<std::string, etcd_kv_t> changes;
cli->st_cli.on_change_hook(changes);
}
int *test_write(cluster_client_t *cli, uint64_t offset, uint64_t len, uint8_t c, std::function<void()> cb = NULL)
{
printf("Post write %lx+%lx\n", offset, len);
int *r = new int;
*r = -1;
cluster_op_t *op = new cluster_op_t();
op->opcode = OSD_OP_WRITE;
op->inode = 0x1000000000001;
op->offset = offset;
op->len = len;
op->iov.push_back(malloc_or_die(len), len);
memset(op->iov.buf[0].iov_base, c, len);
op->callback = [r, cb](cluster_op_t *op)
{
if (*r == -1)
printf("Error: Not allowed to complete yet\n");
assert(*r != -1);
*r = op->retval == op->len ? 1 : 0;
free(op->iov.buf[0].iov_base);
printf("Done write %lx+%lx r=%d\n", op->offset, op->len, op->retval);
delete op;
if (cb != NULL)
cb();
};
cli->execute(op);
return r;
}
int *test_sync(cluster_client_t *cli)
{
printf("Post sync\n");
int *r = new int;
*r = -1;
cluster_op_t *op = new cluster_op_t();
op->opcode = OSD_OP_SYNC;
op->callback = [r](cluster_op_t *op)
{
if (*r == -1)
printf("Error: Not allowed to complete yet\n");
assert(*r != -1);
*r = op->retval == 0 ? 1 : 0;
printf("Done sync r=%d\n", op->retval);
delete op;
};
cli->execute(op);
return r;
}
void can_complete(int *r)
{
// Allow the operation to proceed so the test verifies
// that it doesn't complete earlier than expected
*r = -2;
}
void check_completed(int *r)
{
assert(*r == 1);
delete r;
}
void pretend_connected(cluster_client_t *cli, osd_num_t osd_num)
{
printf("OSD %lu connected\n", osd_num);
int peer_fd = cli->msgr.clients.size() ? std::prev(cli->msgr.clients.end())->first+1 : 10;
cli->msgr.osd_peer_fds[osd_num] = peer_fd;
cli->msgr.clients[peer_fd] = new osd_client_t();
cli->msgr.clients[peer_fd]->osd_num = osd_num;
cli->msgr.clients[peer_fd]->peer_state = PEER_CONNECTED;
cli->msgr.wanted_peers.erase(osd_num);
cli->msgr.repeer_pgs(osd_num);
}
void pretend_disconnected(cluster_client_t *cli, osd_num_t osd_num)
{
printf("OSD %lu disconnected\n", osd_num);
cli->msgr.stop_client(cli->msgr.osd_peer_fds.at(osd_num));
}
void check_disconnected(cluster_client_t *cli, osd_num_t osd_num)
{
if (cli->msgr.osd_peer_fds.find(osd_num) != cli->msgr.osd_peer_fds.end())
{
printf("OSD %lu not disconnected as it ought to be\n", osd_num);
assert(0);
}
}
void check_op_count(cluster_client_t *cli, osd_num_t osd_num, int ops)
{
int peer_fd = cli->msgr.osd_peer_fds.at(osd_num);
int real_ops = cli->msgr.clients[peer_fd]->sent_ops.size();
if (real_ops != ops)
{
printf("error: %d ops expected, but %d queued\n", ops, real_ops);
assert(0);
}
}
osd_op_t *find_op(cluster_client_t *cli, osd_num_t osd_num, uint64_t opcode, uint64_t offset, uint64_t len)
{
int peer_fd = cli->msgr.osd_peer_fds.at(osd_num);
auto op_it = cli->msgr.clients[peer_fd]->sent_ops.begin();
while (op_it != cli->msgr.clients[peer_fd]->sent_ops.end())
{
auto op = op_it->second;
if (op->req.hdr.opcode == opcode && (opcode == OSD_OP_SYNC ||
op->req.rw.inode == 0x1000000000001 && op->req.rw.offset == offset && op->req.rw.len == len))
{
return op;
}
op_it++;
}
return NULL;
}
void pretend_op_completed(cluster_client_t *cli, osd_op_t *op, int64_t retval)
{
assert(op);
printf("Pretend completed %s %lx+%x\n", op->req.hdr.opcode == OSD_OP_SYNC
? "sync" : (op->req.hdr.opcode == OSD_OP_WRITE ? "write" : "read"), op->req.rw.offset, op->req.rw.len);
uint64_t op_id = op->req.hdr.id;
int peer_fd = op->peer_fd;
cli->msgr.clients[peer_fd]->sent_ops.erase(op_id);
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = retval < 0 ? retval : (op->req.hdr.opcode == OSD_OP_SYNC ? 0 : op->req.rw.len);
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
}
void test1()
{
json11::Json config;
timerfd_manager_t *tfd = new timerfd_manager_t([](int fd, bool wr, std::function<void(int, int)> callback){});
cluster_client_t *cli = new cluster_client_t(NULL, tfd, config);
int *r1 = test_write(cli, 0, 4096, 0x55);
configure_single_pg_pool(cli);
pretend_connected(cli, 1);
cli->continue_ops(true);
can_complete(r1);
check_op_count(cli, 1, 1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 4096), 0);
check_completed(r1);
pretend_disconnected(cli, 1);
int *r2 = test_sync(cli);
pretend_connected(cli, 1);
check_op_count(cli, 1, 0);
cli->continue_ops(true);
check_op_count(cli, 1, 1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 4096), 0);
check_op_count(cli, 1, 1);
can_complete(r2);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
check_completed(r2);
// Check that the client doesn't repeat operations once more
pretend_disconnected(cli, 1);
pretend_connected(cli, 1);
check_op_count(cli, 1, 0);
// Case:
// Write(1) -> Complete Write(1) -> Overwrite(2) -> Complete Write(2)
// -> Overwrite(3) -> Drop OSD connection -> Reestablish OSD connection
// -> Complete All Posted Writes -> Sync -> Complete Sync
// The resulting state of the block must be (3) over (2) over (1).
// I.e. the part overwritten by (3) must remain as in (3) and so on.
// More interesting case:
// Same, but both Write(2) and Write(3) must consist of two parts:
// one from an OSD 2 that drops connection and other from OSD 1 that doesn't.
// The idea is that if the whole Write(2) is repeated when OSD 2 drops connection
// then it may also overwrite a part in OSD 1 which shouldn't be overwritten.
// Another interesting case:
// A new operation added during replay (would also break with the previous implementation)
r1 = test_write(cli, 0, 0x10000, 0x56);
can_complete(r1);
check_op_count(cli, 1, 1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x10000), 0);
check_completed(r1);
r1 = test_write(cli, 0xE000, 0x4000, 0x57);
can_complete(r1);
check_op_count(cli, 1, 1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0xE000, 0x4000), 0);
check_completed(r1);
r1 = test_write(cli, 0x10000, 0x4000, 0x58);
pretend_disconnected(cli, 1);
pretend_connected(cli, 1);
cli->continue_ops(true);
// Check replay
{
uint64_t replay_start = UINT64_MAX;
uint64_t replay_end = 0;
std::vector<osd_op_t*> replay_ops;
auto osd_cl = cli->msgr.clients.at(cli->msgr.osd_peer_fds.at(1));
for (auto & op_p: osd_cl->sent_ops)
{
auto op = op_p.second;
assert(op->req.hdr.opcode == OSD_OP_WRITE);
uint64_t offset = op->req.rw.offset;
if (op->req.rw.offset < replay_start)
replay_start = op->req.rw.offset;
if (op->req.rw.offset+op->req.rw.len > replay_end)
replay_end = op->req.rw.offset+op->req.rw.len;
for (int buf_idx = 0; buf_idx < op->iov.count; buf_idx++)
{
for (int i = 0; i < op->iov.buf[buf_idx].iov_len; i++, offset++)
{
uint8_t c = offset < 0xE000 ? 0x56 : (offset < 0x10000 ? 0x57 : 0x58);
if (((uint8_t*)op->iov.buf[buf_idx].iov_base)[i] != c)
{
printf("Write replay: mismatch at %lu\n", offset-op->req.rw.offset);
goto fail;
}
}
}
fail:
assert(offset == op->req.rw.offset+op->req.rw.len);
replay_ops.push_back(op);
}
if (replay_start != 0 || replay_end != 0x14000)
{
printf("Write replay: range mismatch: %lx-%lx\n", replay_start, replay_end);
assert(0);
}
for (auto op: replay_ops)
{
pretend_op_completed(cli, op, 0);
}
}
// Check that the following write finally proceeds
check_op_count(cli, 1, 1);
can_complete(r1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x10000, 0x4000), 0);
check_completed(r1);
check_op_count(cli, 1, 0);
// Check sync
r2 = test_sync(cli);
can_complete(r2);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
check_completed(r2);
// Check disconnect during write
r1 = test_write(cli, 0, 4096, 0x59);
check_op_count(cli, 1, 1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), -EPIPE);
check_disconnected(cli, 1);
pretend_connected(cli, 1);
check_op_count(cli, 1, 0);
cli->continue_ops(true);
check_op_count(cli, 1, 1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
check_op_count(cli, 1, 1);
can_complete(r1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
check_completed(r1);
// Check disconnect inside operation callback (reenterability)
// Probably doesn't happen too often, but possible in theory
r1 = test_write(cli, 0, 0x1000, 0x60, [cli]()
{
pretend_disconnected(cli, 1);
});
r2 = test_write(cli, 0x1000, 0x1000, 0x61);
check_op_count(cli, 1, 2);
can_complete(r1);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
check_completed(r1);
check_disconnected(cli, 1);
pretend_connected(cli, 1);
cli->continue_ops(true);
check_op_count(cli, 1, 2);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x1000, 0x1000), 0);
check_op_count(cli, 1, 1);
can_complete(r2);
pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x1000, 0x1000), 0);
check_completed(r2);
// Free client
delete cli;
delete tfd;
printf("[ok] write replay test\n");
}
void test2()
{
std::map<object_id, cluster_buffer_t> unsynced_writes;
cluster_op_t *op = new cluster_op_t();
op->opcode = OSD_OP_WRITE;
op->inode = 1;
op->offset = 0;
op->len = 4096;
op->iov.push_back(malloc_or_die(4096*1024), 4096);
// 0-4k = 0x55
memset(op->iov.buf[0].iov_base, 0x55, op->iov.buf[0].iov_len);
cluster_client_t::copy_write(op, unsynced_writes);
// 8k-12k = 0x66
op->offset = 8192;
memset(op->iov.buf[0].iov_base, 0x66, op->iov.buf[0].iov_len);
cluster_client_t::copy_write(op, unsynced_writes);
// 4k-1M+4k = 0x77
op->len = op->iov.buf[0].iov_len = 1048576;
op->offset = 4096;
memset(op->iov.buf[0].iov_base, 0x77, op->iov.buf[0].iov_len);
cluster_client_t::copy_write(op, unsynced_writes);
// check it
assert(unsynced_writes.size() == 4);
auto uit = unsynced_writes.begin();
int i;
assert(uit->first.inode == 1);
assert(uit->first.stripe == 0);
assert(uit->second.len == 4096);
for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x55; i++) {}
assert(i == uit->second.len);
uit++;
assert(uit->first.inode == 1);
assert(uit->first.stripe == 4096);
assert(uit->second.len == 4096);
for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
assert(i == uit->second.len);
uit++;
assert(uit->first.inode == 1);
assert(uit->first.stripe == 8192);
assert(uit->second.len == 4096);
for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
assert(i == uit->second.len);
uit++;
assert(uit->first.inode == 1);
assert(uit->first.stripe == 12*1024);
assert(uit->second.len == 1016*1024);
for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
assert(i == uit->second.len);
uit++;
// free memory
free(op->iov.buf[0].iov_base);
delete op;
for (auto p: unsynced_writes)
{
free(p.second.buf);
}
printf("[ok] copy_write test\n");
}
int main(int narg, char *args[])
{
test1();
test2();
return 0;
}

View File

@ -1,64 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <sys/timerfd.h>
#include <sys/poll.h>
#include <unistd.h>
#include "timerfd_interval.h"
timerfd_interval::timerfd_interval(ring_loop_t *ringloop, int seconds, std::function<void(void)> cb)
{
wait_state = 0;
timerfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
if (timerfd < 0)
{
throw std::runtime_error(std::string("timerfd_create: ") + strerror(errno));
}
struct itimerspec exp = {
.it_interval = { seconds, 0 },
.it_value = { seconds, 0 },
};
if (timerfd_settime(timerfd, 0, &exp, NULL))
{
throw std::runtime_error(std::string("timerfd_settime: ") + strerror(errno));
}
consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&consumer);
this->ringloop = ringloop;
this->callback = cb;
}
timerfd_interval::~timerfd_interval()
{
ringloop->unregister_consumer(&consumer);
close(timerfd);
}
void timerfd_interval::loop()
{
if (wait_state == 1)
{
return;
}
struct io_uring_sqe *sqe = ringloop->get_sqe();
if (!sqe)
{
wait_state = 0;
return;
}
struct ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_poll_add(sqe, timerfd, POLLIN);
data->callback = [&](ring_data_t *data)
{
if (data->res < 0)
{
throw std::runtime_error(std::string("waiting for timer failed: ") + strerror(-data->res));
}
uint64_t n;
read(timerfd, &n, 8);
wait_state = 0;
callback();
};
wait_state = 1;
ringloop->submit();
}

View File

@ -1,19 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include "ringloop.h"
class timerfd_interval
{
int wait_state;
int timerfd;
ring_loop_t *ringloop;
ring_consumer_t consumer;
std::function<void(void)> callback;
public:
timerfd_interval(ring_loop_t *ringloop, int seconds, std::function<void(void)> cb);
~timerfd_interval();
void loop();
};

View File

@ -34,8 +34,8 @@ timerfd_manager_t::~timerfd_manager_t()
void timerfd_manager_t::inc_timer(timerfd_timer_t & t)
{
t.next.tv_sec += t.millis/1000;
t.next.tv_nsec += (t.millis%1000)*1000000;
t.next.tv_sec += t.micros/1000000;
t.next.tv_nsec += (t.micros%1000000)*1000;
if (t.next.tv_nsec > 1000000000)
{
t.next.tv_sec++;
@ -44,13 +44,18 @@ void timerfd_manager_t::inc_timer(timerfd_timer_t & t)
}
int timerfd_manager_t::set_timer(uint64_t millis, bool repeat, std::function<void(int)> callback)
{
return set_timer_us(millis*1000, repeat, callback);
}
int timerfd_manager_t::set_timer_us(uint64_t micros, bool repeat, std::function<void(int)> callback)
{
int timer_id = id++;
timespec start;
clock_gettime(CLOCK_MONOTONIC, &start);
timers.push_back({
.id = timer_id,
.millis = millis,
.micros = micros,
.start = start,
.next = start,
.repeat = repeat,
@ -121,7 +126,7 @@ again:
exp.it_value.tv_sec--;
exp.it_value.tv_nsec += 1000000000;
}
if (exp.it_value.tv_sec < 0 || !exp.it_value.tv_sec && !exp.it_value.tv_nsec)
if (exp.it_value.tv_sec < 0 || exp.it_value.tv_sec == 0 && exp.it_value.tv_nsec <= 0)
{
// It already happened
trigger_nearest();
@ -159,6 +164,6 @@ void timerfd_manager_t::trigger_nearest()
{
timers.erase(timers.begin()+nearest, timers.begin()+nearest+1);
}
cb(nearest_id);
nearest = -1;
cb(nearest_id);
}

View File

@ -10,7 +10,7 @@
struct timerfd_timer_t
{
int id;
uint64_t millis;
uint64_t micros;
timespec start, next;
bool repeat;
std::function<void(int)> callback;
@ -34,5 +34,6 @@ public:
timerfd_manager_t(std::function<void(int, bool, std::function<void(int, int)>)> set_fd_handler);
~timerfd_manager_t();
int set_timer(uint64_t millis, bool repeat, std::function<void(int)> callback);
int set_timer_us(uint64_t micros, bool repeat, std::function<void(int)> callback);
void clear_timer(int timer_id);
};

View File

@ -2,6 +2,14 @@
. `dirname $0`/common.sh
if [ "$EC" != "" ]; then
POOLCFG='"scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1'
NOBJ=512
else
POOLCFG='"scheme":"replicated","pg_size":2,"pg_minsize":2'
NOBJ=1024
fi
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
@ -28,7 +36,7 @@ cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":2,"pg_count":16,"failure_domain":"osd"}}'
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool",'$POOLCFG',"pg_count":16,"failure_domain":"osd"}}'
sleep 2
@ -52,7 +60,7 @@ try_change()
echo --- Change PG count to $n --- >>testdata/osd$i.log
done
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":2,"pg_count":'$n',"failure_domain":"osd"}}'
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool",'$POOLCFG',"pg_count":'$n',"failure_domain":"osd"}}'
for i in {1..10}; do
($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(.[0].items["1"] | map((.osd_set | select(. > 0)) | length == 2) | length) == '$n) && \
@ -82,8 +90,8 @@ try_change()
# Check that no objects are lost !
nobj=`$ETCDCTL get --prefix '/vitastor/pg/stats' --print-value-only | jq -s '[ .[].object_count ] | reduce .[] as $num (0; .+$num)'`
if [ "$nobj" -ne 1024 ]; then
format_error "Data lost after changing PG count to $n: 1024 objects expected, but got $nobj"
if [ "$nobj" -ne $NOBJ ]; then
format_error "Data lost after changing PG count to $n: $NOBJ objects expected, but got $nobj"
fi
}

75
tests/test_snapshot.sh Executable file
View File

@ -0,0 +1,75 @@
#!/bin/bash -ex
. `dirname $0`/common.sh
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
OSD1_PID=$!
build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
OSD2_PID=$!
build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
OSD3_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
format_error "FAILED: 1 PG NOT CONFIGURED"
fi
if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
format_error "FAILED: 1 PG NOT UP"
fi
if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
fi
# Test basic write and snapshot
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg","size":'$((32*1024*1024))'}'
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg@0","size":'$((32*1024*1024))'}'
$ETCDCTL put /vitastor/config/inode/1/3 '{"parent_id":2,"name":"testimg","size":'$((32*1024*1024))'}'
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=32 -buffer_pattern=0xdeadface \
-rw=randwrite -etcd=$ETCD_URL -image=testimg -number_ios=1024
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -rw=read -etcd=$ETCD_URL -pool=1 -inode=3 -size=32M
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024))" \
-O raw ./testdata/merged.bin
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg@0" \
-O raw ./testdata/layer0.bin
$ETCDCTL put /vitastor/config/inode/1/3 '{"name":"testimg","size":'$((32*1024*1024))'}'
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \
-O raw ./testdata/layer1.bin
node mon/merge.js ./testdata/layer0.bin ./testdata/layer1.bin ./testdata/check.bin
cmp ./testdata/merged.bin ./testdata/check.bin
format_green OK

View File

@ -34,40 +34,24 @@ fi
#LD_PRELOAD=libasan.so.5 \
# fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M
# Test basic write and snapshot
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg","size":'$((32*1024*1024))'}'
if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
fi
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg@0","size":'$((32*1024*1024))'}'
$ETCDCTL put /vitastor/config/inode/1/3 '{"parent_id":2,"name":"testimg","size":'$((32*1024*1024))'}'
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=32 -buffer_pattern=0xdeadface \
-rw=randwrite -etcd=$ETCD_URL -image=testimg -number_ios=1024
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -rw=read -etcd=$ETCD_URL -pool=1 -inode=3 -size=32M
-rw=randwrite -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -number_ios=1024
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024))" \
-O raw ./testdata/merged.bin
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=1:size=$((128*1024*1024))" \
-O raw ./testdata/read.bin
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg@0" \
-O raw ./testdata/layer0.bin
$ETCDCTL put /vitastor/config/inode/1/3 '{"name":"testimg","size":'$((32*1024*1024))'}'
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \
-O raw ./testdata/layer1.bin
node mon/merge.js ./testdata/layer0.bin ./testdata/layer1.bin ./testdata/check.bin
cmp ./testdata/merged.bin ./testdata/check.bin
-f raw ./testdata/read.bin \
-O raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=1:size=$((128*1024*1024))"
format_green OK