Compare commits

38 Commits

Author SHA1 Message Date
Vitaliy Filippov a8d744ca0e Fix wording 2021-03-16 12:48:36 +03:00
Vitaliy Filippov b5ff44fb6f Change Telegram chat link 2021-03-16 12:48:36 +03:00
Vitaliy Filippov f918bc4543 Fix Russian README for CMake build 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 6875a838e0 Capture all by value in qemu_proxy 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 20781abd3d Add LICENSE 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 1f02f645c0 Add Russian version of the README 2021-03-16 12:48:36 +03:00
Vitaliy Filippov ee44f64927 Introduce image names and metadata storage in etcd
Each inode has: image name, parent inode number & pool, size and readonly flag

Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-03-16 12:48:36 +03:00
Vitaliy Filippov abf0611d93 Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling 2021-03-16 12:48:36 +03:00
Vitaliy Filippov edbf0eb040 Add a test for snapshots, fix bugs. Now the test passes 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 09725038e7 Begin snapshot test 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 18f71b059a Fix part bitmap addresses 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 2db2ed22ea Fix several snapshot I/O bugs 2021-03-16 12:48:36 +03:00
Vitaliy Filippov aa7699da24 Fix subop generation for snapshot implementation 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 853ecba780 Actual snapshot support (untested) 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 2f9c76b8fc Report inode I/O statistics, aggregate it in the monitor 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 8da7f26459 Report inode space usage statistics to etcd, aggregate it in the monitor 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 9998b50c7e Add inode space usage statistics tracking to blockstore 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 0422d94a70 Send bitmaps with primary-reads, actually read bitmaps for READ ops 2021-03-16 12:48:36 +03:00
Vitaliy Filippov ff2208ae70 Allocate bitmaps along with stripes to avoid memory fragmentation 2021-03-16 12:48:36 +03:00
Vitaliy Filippov ae54dddb0c Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code 2021-03-16 12:48:36 +03:00
Vitaliy Filippov bfc175fe0f Add "external" bitmap support to the secondary OSD protocol 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 07e10210b6 Use bitmap granularity for alignment checks 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 221b728fc9 Add "external" bitmap support to blockstore 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 6625aaae00 Add "external" bitmap support to osd_rmw 2021-03-16 12:48:36 +03:00
Vitaliy Filippov 7e6e1a5a82 Release 0.5.10
The version seems to be stable after this bunch of fixes :)

- Fix delete & write operation ordering during rebalance to not lose objects in the immediate_commit=off mode
- Fix a possible crash caused by very high iodepths
- Re-distribute PG primaries over OSDs that come up after a short downtime
- Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio
- Fix a journal flushing deadlock which sometimes occurred in the immediate_commit=off mode
- Fix a bug where OSDs could hang if the data device filled up
- Fix an allocator bug where it was unable to allocate up to last (n%64) data device blocks
- Fix monitor crash that occurred on removal of some etcd keys
- Fix a bug where PGs could remain incomplete due to incorrect PG history with just zeroes in osd_sets
2021-03-16 12:48:26 +03:00
Vitaliy Filippov 435045751d Delete objects only after a SYNC during rebalance in the non-immediate_commit mode
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects
2021-03-16 12:48:26 +03:00
Vitaliy Filippov c5fb1d5987 Do not duplicate blockstore operations when io_uring fills up
This bug was leading to OSDs dying with "Assertion `fulfilled == read_op->len' failed"
when testing fio -rw=randread -numjobs=8 -iodepth=128
2021-03-16 12:48:26 +03:00
Vitaliy Filippov 9f59381bea Re-distribute PG primaries over OSDs that come up after a short downtime 2021-03-16 12:48:26 +03:00
Vitaliy Filippov 9ac7e75178 Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio 2021-03-16 12:48:26 +03:00
Vitaliy Filippov 88671cf745 Fix a bug causing all flushers to wait for an fsync without actually trying to do it
This happened because flusher_count became dynamic and fsync_batch() was comparing the number
of flushers currently ready to do an fsync with the maximum number of flushers. Also the number
wasn't rechecked on every loop which was also incorrect.

Now the interrupted_rebalance test passes even without IMMEDIATE_COMMIT=1.
2021-03-13 17:27:29 +03:00
Vitaliy Filippov fe1749c427 Fix the multiple_interrupted_rebalance test 2021-03-13 17:19:45 +03:00
Vitaliy Filippov ceb9c28de7 Set default log_level before passing config to etcd_state_client 2021-03-13 17:19:45 +03:00
Vitaliy Filippov 299d7d7c95 Use common macro for get_sqe 2021-03-13 17:19:45 +03:00
Vitaliy Filippov d1526b415f Correctly resume writes when OSD is full to return an error 2021-03-13 17:19:45 +03:00
Vitaliy Filippov f49fd53d55 Fix a bug where allocator was unable to allocate up to last (n%64) blocks, add tests for it 2021-03-13 02:19:02 +03:00
Vitaliy Filippov dd76eda5e5 Test multiple interrupted rebalancings
Currently only passes with immediate_commit=all configuration
(env variable IMMEDIATE_COMMIT=1 for the bash script)
2021-03-12 12:55:44 +03:00
Vitaliy Filippov 87dbd8fa57 Use empty hash as the default value for some etcd keys in the monitor 2021-03-12 12:40:15 +03:00
Vitaliy Filippov b44f49aab2 Ignore zero OSDs in history osd_sets 2021-03-12 12:40:15 +03:00
60 changed files with 2397 additions and 620 deletions

27
LICENSE Normal file

@@ -0,0 +1,27 @@
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.
Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.
Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.
Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.
You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

491
README-ru.md Normal file

@@ -0,0 +1,491 @@
## Vitastor
[Read English version](README.md)
## The Idea
I just want to build a good block SDS!

Vitastor is a distributed block SDS, a direct counterpart of Ceph RBD and of the internal storage
systems of popular cloud providers. Unlike them, however, Vitastor is fast and simple at the same time.
It is just still small for now :-).

Architectural similarity to Ceph means strong consistency built into the write algorithms,
replication through a primary OSD, symmetric clustering without a single point of failure,
and automatic data distribution over any number of drives of any size with configurable redundancy
schemes: replication or arbitrary erasure codes.
## Features
Vitastor is currently in pre-release status: advanced features are still missing,
and breaking changes are likely in future versions.

However, the following is already implemented:
- Basic part: a reliable clustered block storage without a single point of failure
- Performance ;-D
- Multiple redundancy schemes: replication, XOR n+1 (1 parity drive), Reed-Solomon erasure codes
  based on the jerasure library with any number of data and parity drives in a group
- Configuration via simple human-readable JSON structures in etcd
- Automatic data distribution over OSDs, with support for:
  - Mathematical optimization for better distribution uniformity and less data movement
  - Multiple pools with different redundancy schemes
  - A placement tree, OSD selection by tags / device classes (SSD only, HDD only) and by subtree
  - Configurable failure domains (drive/server/rack etc.)
- Recovery of degraded blocks
- Rebalancing, that is, data movement between OSDs (drives)
- "Lazy" fsync support (fsync not on every operation)
- I/O statistics collection in etcd
- User-space client library for I/O
- QEMU disk driver (built outside the QEMU source tree)
- Disk driver for the fio benchmarking tool (also built outside the fio source tree)
- NBD proxy for mounting images by the kernel ("block device in userspace")
- Image/inode removal tool (vitastor-rm)
- Packages for Debian and CentOS
- Per-inode I/O and space usage statistics
- Inode naming via storing inode metadata in etcd
- Snapshots and copy-on-write clones
## Roadmap
- More correct disk partitioning and automatic OSD start-up scripts
- Other administrative tools
- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Operation timeouts and faster failure detection
- Background integrity checks without checksums (replica verification)
- Checksums
- Optimizations for hybrid SSD+HDD storage
- RDMA and NVDIMM support
- Web interface
- Possibly compression
- Possibly data caching via the system page cache
## Architecture
Just like in Ceph, in Vitastor:
- There are pools, PGs, OSDs, monitors, failure domains and a placement tree (a counterpart of the CRUSH tree).
- Images are split into fixed-size blocks (objects), and these objects are distributed over OSDs.
- OSDs have a journal and metadata, and these can also be placed on separate fast drives.
- All write operations are also transactional. Vitastor, though, has a deferred/lazy fsync
  (commit) mode in which fsync is not called on every write operation, which makes it more
  suitable for "bad" (desktop) SSDs. All write operations are atomic in any case.
- The client library also tries to wait for recovery after any cluster failure, so
  you can also reboot even the whole cluster at once, and clients will only hang for a while,
  but won't disconnect.

Some basic terms for those who are not familiar with Ceph:
- OSD (Object Storage Daemon) - a process that stores data on a single drive and handles
  read/write requests from clients.
- Pool - a container for data that shares the same redundancy scheme and rules of distribution over OSDs.
- PG (Placement Group) - a group of objects stored on the same set of replicas (OSDs).
  Several PGs may be stored on the same set of replicas, but objects of one PG
  are normally not stored on different OSD sets.
- Monitor - a daemon that stores the cluster state.
- Failure Domain - a group of OSDs that you allow to fail all together.
  In other words, it is a group of OSDs in which the storage system never places different copies
  of the same data block. For example, if the failure domain is a server, then two drives of one server
  will never hold 2 or more copies of the same data block, so even if all drives
  in that server fail, it is equivalent to losing only 1 copy
  of any data block.
- Placement Tree (CRUSH Tree) - a hierarchical grouping of OSDs
  into nodes that can then be used as failure domains. That is, a drive (OSD) belongs to
  a server, a server belongs to a rack, a rack to a row, a row to a datacenter and so on.
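
The failure domain rule described above can be illustrated with a small check. This is only a sketch: the OSD-to-host mapping and the function name are made up for the example and are not part of Vitastor.

```python
from collections import Counter

# Hypothetical topology (illustrative only): OSD number -> host name.
OSD_HOST = {1: "host1", 2: "host1", 3: "host2", 4: "host2", 5: "host3"}


def violates_failure_domain(osd_set, osd_to_domain):
    """Return True if two OSDs of one PG belong to the same failure domain."""
    domains = Counter(osd_to_domain[osd] for osd in osd_set)
    return any(count > 1 for count in domains.values())


# OSDs 1, 3 and 5 sit on three different hosts: losing a host loses one copy.
print(violates_failure_domain([1, 3, 5], OSD_HOST))
# OSDs 1 and 2 share host1: losing host1 would lose two copies at once.
print(violates_failure_domain([1, 2, 5], OSD_HOST))
```

With "host" as the failure domain, the first OSD set is acceptable and the second is not.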
How Vitastor differs from Ceph:
- Vitastor is primarily focused on SSDs. Vitastor should probably also work reasonably well
  with a combination of SSD and HDD through bcache, and native ways of optimizing for SSD+HDD
  may be added in the future. However, storage based on hard drives alone, without any SSDs,
  is not a priority, so optimizations for that case may never happen.
- The Vitastor OSD is single-threaded and will always remain so, because this is the most optimal way to work.
  If 1 core per drive is not enough for you, just split the drive into partitions and run several OSDs on it.
  Most likely, though, 1 core will be enough: Vitastor is not as CPU-hungry as Ceph.
- The journal and metadata are always kept in memory, so no extra time is ever spent
  on reading metadata from disk. Metadata size depends linearly on the drive size and on the data block size,
  which is set in the cluster configuration and is 128 KB by default. With 128 KB blocks, metadata
  takes approximately 512 MB of memory per 1 TB of disk space (which is still less than Ceph needs).
  The journal does not have to be large at all; for example, the benchmarks in this document were run
  with a journal of only 16 MB. A large journal is probably even harmful, because "dirty" entries (entries
  not yet flushed from the journal) also take memory and may slow things down a little.
- Vitastor has no internal copy-on-write. I believe that implementing a CoW storage is much harder,
  so it is harder to achieve consistently good results. Perhaps one fine day
  I will come up with a beautiful algorithm for a CoW storage, but until then Vitastor will not have internal CoW.
  None of this applies to "external" CoW (snapshots and clones).
- The base layer of Vitastor is a simple block storage with fixed-size blocks, not a complex
  object storage with extended features as in Ceph (RADOS).
- Vitastor has a "lazy fsync" mode in which the OSD batches write requests before flushing them
  to disk, which gives better performance with cheap desktop SSDs without capacitors
  ("Advanced Power Loss Protection" / "Capacitor-Based Power Loss Protection").
  Nevertheless, this mode is still slower than using proper server SSDs with immediate
  fsync, because it leads to additional network round-trips, so it is still recommended
  to use good server drives, especially since they cost almost the same as desktop ones.
- PGs are ephemeral. This means they are not stored on disks and only exist in the memory of running OSDs.
- Recovery processes operate on individual objects rather than whole PGs.
- There are no PGLOGs.
- "Monitors" don't store data. Cluster configuration and state are stored in etcd in simple human-readable
  JSON structures. Vitastor monitors only watch the cluster state and control data movement.
  In this sense the Vitastor monitor is not a critical component of the system and is more like the Ceph
  manager (MGR). The Vitastor monitor is written in node.js.
- PG distribution is not based on consistent hashes. Instead, all PG mappings are stored directly in etcd
  (there is no problem with keeping several hundred or thousand entries in memory instead of calculating hashes every time).
  PGs are redistributed over OSDs via mathematical optimization,
  namely by reducing the problem to an LP (linear programming problem) and solving it with the
  lp_solve utility. This approach usually evens out the space distribution almost perfectly: uniformity
  is usually 96-99%, unlike Ceph, where bare CRUSH without the balancer usually yields 80-90%.
  It also minimizes the amount of data movement and the randomness of links between OSDs, and lets you change
  the distribution manually without fear of breaking the rebalancing logic. The approach has a potential
  drawback: it is suspected that it may break down in a very large cluster. However, up to
  several hundred OSDs it definitely works fine. And, if needed, consistent hashes
  are easy to implement too.
- There is no separate layer like the "CRUSH rules" layer. You configure redundancy schemes,
  failure domains and OSD selection rules directly in the pool configuration.
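
The metadata memory figure quoted in the list above (about 512 MB per 1 TB with 128 KB blocks) can be turned into a rough estimator. The sketch below assumes binary units and derives a per-block cost of 64 bytes purely from those two numbers; the real in-memory entry size in Vitastor may differ, and the function name is hypothetical.

```python
KB = 1024
MB = 1024 ** 2
TB = 1024 ** 4

# Derived from the README figures only: 512 MB per 1 TB at 128 KB blocks.
# This is an assumption for estimation, not Vitastor's actual entry layout.
BYTES_PER_BLOCK_ENTRY = (512 * MB) // (TB // (128 * KB))


def metadata_memory_bytes(disk_bytes, block_bytes=128 * KB):
    """Rough estimate: one fixed-size metadata entry per data block."""
    return (disk_bytes // block_bytes) * BYTES_PER_BLOCK_ENTRY


print(BYTES_PER_BLOCK_ENTRY)                    # bytes per block entry
print(metadata_memory_bytes(TB) // MB)          # MB for 1 TB at 128 KB blocks
print(metadata_memory_bytes(TB, 64 * KB) // MB) # MB for 1 TB at 64 KB blocks
```

Under this model, halving the block size doubles the number of entries and therefore the estimated memory use.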
## Understanding Storage Performance
In short: for a fast storage system, latency matters more than peak iops.

The best possible latency is achieved when testing in 1 thread with queue depth 1,
which roughly corresponds to a minimally loaded cluster. In this case
IOPS = 1/latency. Latency does not scale with the number of servers, drives or
server processes/threads... It depends only on how fast a single
server process (and the client) handles a single operation.
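
The IOPS = 1/latency relation for the T1Q1 case can be checked with a few lines of Python. This is a minimal sketch; the helper names are made up, and the sample figures are the T1Q1 write results reported in the comparison section of this README.

```python
def t1q1_latency_ms(iops):
    """Latency in milliseconds at queue depth 1, given operations per second."""
    return 1000.0 / iops


def t1q1_iops(latency_ms):
    """Operations per second at queue depth 1, given latency in milliseconds."""
    return 1000.0 / latency_ms


# T1Q1 write results quoted in the comparison section of this README.
for label, iops in [("bare drive", 27000), ("Vitastor", 7087), ("Ceph", 1000)]:
    print(f"{label}: {iops} iops = {t1q1_latency_ms(iops):.3f} ms per operation")
```

The same conversion explains why no amount of added hardware raises single-threaded iops: only a lower per-operation latency does.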
Why is latency important? Because some applications *cannot* use a queue depth
greater than 1, since their task is not parallelizable. An important example is every DBMS
with consistency support (ACID), because all of them provide it through
journaling, and journals are written sequentially with an fsync() after every operation.

fsync, by the way, is another very important thing that is almost always forgotten in benchmarks.
The point is that all modern drives have write caches/buffers and do not guarantee that
data is physically written to the media until you issue fsync(),
which the operating system translates into a cache flush command.

Cheap SSDs for desktops and laptops are very fast without fsync: NVMe drives, for example,
can handle around 80000 write operations per second at queue depth 1 without fsync.
With fsync, however, when they really have to write every data block to flash memory,
they only manage 1000-2000 write operations per second (the number is practically constant
across all SSD models).

Server SSDs often have supercapacitors that act as a built-in uninterruptible
power supply and let the drive flush its DRAM cache to persistent
flash memory on power loss. Thanks to this, such drives *ignore fsync*
with a clear conscience, because they know for sure that data from the cache will reach persistent
memory.

All the best-known software-defined storage systems, for example Ceph and the internal storage systems used
by cloud providers such as Amazon, Google and Yandex, are slow in terms of latency.
At best they deliver latencies from 0.3ms for reads and 0.6ms for writes with 4 KB blocks,
even when the best possible hardware is used.

And this is in the SSD era, when you can go to the market and buy an SSD with a read latency
of 0.1ms and a write latency of 0.04ms for $100 or even less.

When I need to quickly test the performance of a disk subsystem, I
use the following 6 commands, with small variations:
- Linear write:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Linear read:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Single-threaded write (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Single-threaded read (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Parallel write (numjobs is used when 1 CPU core cannot saturate the drive):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Parallel read (numjobs likewise):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
## Theoretical Maximum Performance of Vitastor
With replication:
- Single-threaded (T1Q1) read latency: 1 network RTT + 1 disk read.
- Single-threaded write+fsync latency:
  - With immediate commit: 2 RTT + 1 write.
  - With deferred ("lazy") commit: 4 RTT + 1 write + 1 fsync.
- Parallel read: sum of IOPS of all drives, or network throughput if the network saturates first.
- Parallel write: sum of IOPS of all drives / number of replicas / WA, or network throughput if the network saturates first.

With erasure codes (EC):
- Single-threaded (T1Q1) read latency: 1.5 RTT + 1 read.
- Single-threaded write+fsync latency:
  - With immediate commit: 3.5 RTT + 1 read + 2 writes.
  - With deferred ("lazy") commit: 5.5 RTT + 1 read + 2 writes + 2 fsyncs.
- The 0.5 actually stands for (k-1)/k, where k is the number of data drives,
  which reflects that the additional network round-trip is not needed when the read
  is served locally.
- Parallel read: sum of IOPS of all drives, or network throughput if the network saturates first.
- Parallel write: sum of IOPS of all drives / total number of data and parity drives / WA, or network throughput if the network saturates first.

Note: drive IOPS here should be measured in a mixed read/write mode, in the same proportion as in the formulas above.
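
The latency formulas above are easy to evaluate for a given network and drive. The sketch below just encodes them; the function names and the sample input values (RTT and per-operation drive latencies in milliseconds) are illustrative assumptions, not measurements.

```python
def replicated_latency(rtt, read, write, fsync):
    """Best-case T1Q1 latencies for replicated pools, per the formulas above."""
    return {
        "read": 1 * rtt + read,
        "write_immediate": 2 * rtt + write,
        "write_lazy": 4 * rtt + write + fsync,
    }


def ec_latency(rtt, read, write, fsync, k):
    """Best-case T1Q1 latencies for EC pools; k is the number of data drives."""
    frac = (k - 1) / k  # the "0.5" in the formulas, exact for k = 2
    return {
        "read": (1 + frac) * rtt + read,
        "write_immediate": (3 + frac) * rtt + read + 2 * write,
        "write_lazy": (5 + frac) * rtt + read + 2 * write + 2 * fsync,
    }


# Illustrative inputs in milliseconds (assumed, not taken from the benchmarks).
params = dict(rtt=0.03, read=0.1, write=0.04, fsync=0.5)
for name, value in replicated_latency(**params).items():
    print(f"replicated {name}: {value:.3f} ms")
for name, value in ec_latency(k=2, **params).items():
    print(f"EC 2+1 {name}: {value:.3f} ms")
```

Dividing 1000 by any of these values gives the corresponding single-threaded iops ceiling.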
WA (write amplification) for 4 KB blocks in Vitastor is usually 3-5:
1. Metadata write to the journal
2. Data block write to the journal
3. Metadata write to the DB
4. One more metadata write to the journal when using EC
5. Data block write to the data device

If you find an SSD that works well with 512-byte blocks (Optane?),
then 1, 3 and 4 can be reduced to 512 bytes (1/8 of the data size), giving a WA of just 2.375.

Besides, WA is lower with deferred/lazy commit under parallel
load, because journal blocks are only written to disk when they fill up or when
fsync is explicitly requested.
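
The 2.375 figure follows directly from the five steps listed above: two full data block writes plus three metadata writes of 1/8 block each. A small sketch of that arithmetic, with a hypothetical helper name:

```python
from fractions import Fraction


def write_amplification(meta_fraction, use_ec=True):
    """WA for one small write.

    meta_fraction is the size of one metadata write relative to the data block.
    Steps 2 and 5 write the data block itself; steps 1, 3 and (with EC) 4
    write metadata.
    """
    data_writes = 2
    meta_writes = 3 if use_ec else 2
    return data_writes + meta_writes * meta_fraction


# 4 KB metadata blocks with 4 KB data: every step costs a full block.
print(float(write_amplification(Fraction(1, 1))))
# 512-byte metadata blocks with 4 KB data: each metadata write costs 1/8.
print(float(write_amplification(Fraction(1, 8))))
```

The second call reproduces the 2.375 value; the first gives the upper end of the 3-5 range quoted above.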
## Example Comparison with Ceph
Hardware: 4 servers, each with:
- 6x SATA SSD Intel D3-S4510 3.84 TB
- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
- 384 GB RAM
- 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch

CPU power saving was disabled. In the tests, both Vitastor and Ceph were deployed with 2 OSDs per SSD.

All results below refer to random 4 KB block workloads (unless explicitly stated otherwise).

Bare drive performance:
- T1Q1 write ~27000 iops (~0.037ms latency)
- T1Q1 read ~9800 iops (~0.101ms latency)
- T1Q32 write ~60000 iops
- T1Q32 read ~81700 iops

Ceph 15.2.4 (Bluestore):
- T1Q1 write ~1000 iops (~1ms latency)
- T1Q1 read ~1750 iops (~0.57ms latency)
- T8Q64 write ~100000 iops, OSD processes consume about 40 CPU cores on each server
- T8Q64 read ~480000 iops, OSD processes consume about 40 CPU cores on each server

The 8-thread tests were run on 8 RBD images of 400 GB each from all hosts (2 fio processes were started on every host).
This is necessary because in Ceph, multiple RBD clients writing to 1 image slow down very badly.

RocksDB and Bluestore settings in Ceph were left unchanged; the only change was disabling cephx_sign_messages.

In fact, these test results are not that bad for Ceph (it could have been worse).
These servers are actually well balanced for Ceph: 6 SATA SSDs just about
saturate the 25 GbE network, and without 2 powerful CPUs Ceph would not have enough cores
to deliver a decent result. This is exactly what the 40 cores consumed during the
parallel test show.

Vitastor:
- T1Q1 write: 7087 iops (0.14ms latency)
- T1Q1 read: 6838 iops (0.145ms latency)
- T2Q64 write: 162000 iops, CPU consumption is 3 cores on each server
- T8Q64 read: 895000 iops, CPU consumption is 4 cores on each server
- Linear write (4M T1Q32): 2800 MB/s
- Linear read (4M T1Q32): 1500 MB/s

The 8-thread read test was run on 1 large image (3.2 TB) from all hosts (again, 2 fio processes per host).
In Vitastor there is no difference between 1 image and 8. Naturally, about 1/4 of the read requests
in this configuration, as in the Ceph tests above, were served from the local machine. When the test
was run so that all operations always reached primary OSDs over the network, it was more
network-bound and the result was approximately 689000 iops.

Vitastor settings: `--disable_data_fsync true --immediate_commit all --flusher_count 8
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
--journal_size 16777216`.
### EC/XOR 2+1
Vitastor:
- T1Q1 write: 2808 iops (~0.355ms latency)
- T1Q1 read: 6190 iops (~0.16ms latency)
- T2Q64 write: 85500 iops, CPU consumption is 3.4 cores on each server
- T8Q64 read: 812000 iops, CPU consumption is 4.7 cores on each server
- Linear write (4M T1Q32): 3200 MB/s
- Linear read (4M T1Q32): 1800 MB/s

Ceph:
- T1Q1 write: 730 iops (~1.37ms latency)
- T1Q1 read: 1500 iops with a cold metadata cache (~0.66ms latency), 2300 iops after 2 minutes of warm-up (~0.435ms latency)
- T4Q128 write (4 RBD images): 45300 iops, CPU consumption is 30 cores on each server
- T8Q64 read (4 RBD images): 278600 iops, CPU consumption is 40 cores on each server
- Linear write (4M T1Q32): 1950 MB/s into an empty image, 2500 MB/s into a filled image
- Linear read (4M T1Q32): 2400 MB/s
### NBD
NBD is currently the only way to mount Vitastor with the Linux kernel, but it
causes additional data copying, so it degrades performance a little.
This mostly affects linear I/O; random I/O is affected only slightly.

NBD stands for "Network Block Device", but in fact it also
works simply as a counterpart of FUSE for block devices, that is, as a
"block device in userspace".

Vitastor with a single-threaded NBD proxy on the same test stand:
- T1Q1 write: 6000 iops (0.166ms latency)
- T1Q1 read: 5518 iops (0.18ms latency)
- T1Q128 write: 94400 iops
- T1Q128 read: 103000 iops
- Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
- Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)
## Installation
### Debian
- Add the Vitastor repository key:
  `wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
- Add the Vitastor repository to /etc/apt/sources.list:
  - Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- For Debian 10 (Buster), also enable the backports repository:
  `deb http://deb.debian.org/debian buster-backports main`
- Install the packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
### CentOS
- Add the Vitastor repository to the system:
  - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
  - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
- Enable EPEL: `yum/dnf install epel-release`
- Enable additional CentOS repositories:
  - CentOS 7: `yum install centos-release-scl`
  - CentOS 8: `dnf install centos-release-advanced-virtualization`
- Enable elrepo-kernel:
  - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
  - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
- Install the packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
### Building from Source
- Install Linux kernel 5.4 or newer for io_uring support. 5.8 or even newer is preferable,
  because 5.4 has at least 1 known bug that leads to hangs with io_uring and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd. Attention: you need a version with the fix from https://github.com/vitalif/etcd/,
  branch release-3.4, because etcd has a bug that [will](https://github.com/etcd-io/etcd/pull/12402)
  only be fixed in 3.4.15. The bug makes Vitastor unable to start PGs when there are 500 or more of them.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone this repository with submodules: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.
- It is advisable to rebuild QEMU with the patch that makes running via LD_PRELOAD unnecessary.
  See `qemu-*.*-vitastor.patch` and pick the version closest to your QEMU version.
- Install QEMU 3.0 or newer, get the source code of the installed package, start rebuilding it,
  stop the build after a while and copy the following headers:
  - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
  - Debian:
    * Take qemu from the main repository
    * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
    * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
  - CentOS 8:
    * Take qemu from the Advanced-Virtualization repository. To enable it, run
      `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
    * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
    * For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
    * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
  - `config-host.h` and `qapi` are needed because they contain auto-generated headers
- Install fio 3.7 or newer, get the package sources and make a symlink to them from `<vitastor>/fio`.
- Build and install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
  Pay attention to the `QEMU_PLUGINDIR` cmake variable: under RHEL it must be set to `qemu-kvm`.
## Запуск
Внимание: процедура пока что достаточно нетривиальная, задавать конфигурацию и смещения
на диске нужно почти вручную. Это будет исправлено в ближайшем будущем.
- Желательны SATA SSD или NVMe диски с конденсаторами (серверные SSD). Можно использовать и
десктопные SSD, включив режим отложенного fsync, но производительность однопоточной записи
в этом случае пострадает.
- Быстрая сеть, минимум 10 гбит/с
- Для наилучшей производительности нужно отключить энергосбережение CPU: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Пропишите нужные вам значения вверху файлов `/usr/lib/vitastor/mon/make-units.sh` и `/usr/lib/vitastor/mon/make-osd.sh`.
- Создайте юниты systemd для etcd и мониторов: `/usr/lib/vitastor/mon/make-units.sh`
- Создайте юниты для OSD: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- Вы можете поменять параметры OSD в юнитах systemd. Смысл некоторых параметров:
- `disable_data_fsync 1` - отключает fsync, используется с SSD с конденсаторами.
- `immediate_commit all` - используется с SSD с конденсаторами.
- `disable_device_lock 1` - отключает блокировку файла устройства, нужно, только если вы запускаете
несколько OSD на одном блочном устройстве.
- `flusher_count 256` - "flusher" - микропоток, удаляющий старые данные из журнала.
Не волнуйтесь об этой настройке, 256 теперь достаточно практически всегда.
- `disk_alignment`, `journal_block_size`, `meta_block_size` следует установить равными размеру
внутреннего блока SSD. Это почти всегда 4096.
- `journal_no_same_sector_overwrites true` запрещает перезапись одного и того же сектора журнала подряд
много раз в процессе записи. Большинство (99%) SSD не нуждаются в данной опции. Однако выяснилось, что
диски, используемые на одном из тестовых стендов - Intel D3-S4510 - очень сильно не любят такую
перезапись, и для них была добавлена эта опция. Когда данный режим включён, также нужно поднимать
значение `journal_sector_buffer_count`, так как иначе Vitastor не хватит буферов для записи в журнал.
- Start all etcd instances: `systemctl start etcd`
- Create the global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives are server-grade drives with capacitors).
- Create pools: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
For jerasure EC pools the configuration should look like this: `"2":{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- Start all OSDs: `systemctl start vitastor.target`
- Your cluster should now be ready: one of the monitors should already have configured the PGs, and the OSDs should have started them.
- You can check PG states directly in etcd: `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should be 'active'.
- Example command for running tests: `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Example command for uploading a VM image into Vitastor via qemu-img:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```
If you use an unmodified QEMU, this command needs the environment variable `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so`.
- Example QEMU launch command:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
- Example command for removing an image (inode) from Vitastor:
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
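The pool definitions from the steps above can also be generated rather than typed by hand. A minimal sketch, assuming Node.js; `checkPool` and its rules are hypothetical sanity checks for illustration, not the monitor's actual validation:

```javascript
// Pool definitions copied from the examples above, keyed by pool ID.
const pools = {
    1: { name: 'testpool', scheme: 'replicated', pg_size: 2, pg_minsize: 1,
        pg_count: 256, failure_domain: 'host' },
    2: { name: 'ecpool', scheme: 'jerasure', pg_size: 4, parity_chunks: 2,
        pg_minsize: 2, pg_count: 256, failure_domain: 'host' },
};

// Hypothetical sanity checks (assumptions, not Vitastor's own rules).
function checkPool(pool) {
    const problems = [];
    if (pool.pg_minsize > pool.pg_size)
        problems.push('pg_minsize exceeds pg_size');
    if (pool.scheme === 'jerasure') {
        const dataChunks = pool.pg_size - (pool.parity_chunks || 0);
        if (dataChunks < 1)
            problems.push('no data chunks left');
        else if (pool.pg_minsize < dataChunks)
            problems.push('pg_minsize is below the number of data chunks');
    }
    return problems;
}

for (const id in pools) {
    const problems = checkPool(pools[id]);
    if (problems.length)
        throw new Error('pool ' + id + ': ' + problems.join(', '));
}

// JSON.stringify quotes the numeric keys, as JSON requires.
const poolsJson = JSON.stringify(pools);
console.log(poolsJson);
```

The printed string is the value to pass to the `put /vitastor/config/pools` command.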
## Known Problems
- Object deletion requests may currently lead to "incomplete" objects in EC pools
if OSDs or servers fail during deletion, because correct handling of
deletion requests in a cluster has to be "three-phase", and this is not implemented yet. If you
run into this situation, just repeat the deletion request.
## Implementation Principles
- I like architecturally simple solutions. Vitastor is designed exactly that way, and I intend
to keep following this principle.
- If you came here for perfect C++ code, you are probably in the wrong place. "Generally accepted"
C++ coding practices don't concern me much, because they often, again, lead to
unnecessary complexity, and the code ends up beautiful... but slow.
- For the same reason you will sometimes find reinvented wheels in the code, such as a custom simplified
HTTP client for talking to etcd. On the other hand, these wheels are small and compact and don't
require a dozen external libraries.
- node.js for the monitor is not an accidental choice. It is very fast, has a built-in event
loop, a pleasant, neutral C-like programming language and a well-developed ecosystem.
## Author and License
Author: Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
Join the Vitastor Telegram chat: https://t.me/vitastor
License: VNPL 1.1 for the server-side code and dual VNPL 1.1 + GPL 2.0+ for the client code.
VNPL is a "network copyleft": Vitastor's own free copyleft license,
Vitastor Network Public License 1.1, based on GNU GPL 3.0 with an additional
"Network Interaction" clause that requires all programs
specifically designed to be used with Vitastor and interacting
with it over the network to be distributed under VNPL or any other free license.
The idea of VNPL is to extend copyleft not only to modules that are explicitly
linked with Vitastor code, but also to modules built as microservices
that interact with it over the network.
So, if you want to build a service on top of Vitastor that contains
closed-source components interacting with Vitastor, you need a commercial
license from the author 😀.
No restrictions are imposed on Windows or any other software that was not developed
*specifically* for use with Vitastor.
Client libraries are distributed under a dual license: VNPL 1.1
and also GNU GPL 2.0 or later. This is done for
compatibility with software such as QEMU and fio.
You can find the full text of VNPL 1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt),
and GPL 2.0 in the file [GPL-2.0.txt](GPL-2.0.txt).

View File

@ -1,5 +1,7 @@
## Vitastor ## Vitastor
[Читать на русском](README-ru.md)
## The Idea ## The Idea
Make Software-Defined Block Storage Great Again. Make Software-Defined Block Storage Great Again.
@ -34,16 +36,16 @@ breaking changes in the future. However, the following is implemented:
- NBD proxy for kernel mounts - NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm) - Inode removal tool (vitastor-rm)
- Packaging for Debian and CentOS - Packaging for Debian and CentOS
- Per-inode I/O and space usage statistics
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
## Roadmap ## Roadmap
- OSD creation tool (OSDs currently have to be created by hand) - Better OSD creation and auto-start tools
- Other administrative tools - Other administrative tools
- Per-inode I/O and space usage statistics - Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
- Proxmox and OpenNebula plugins
- iSCSI proxy - iSCSI proxy
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Operation timeouts and better failure detection - Operation timeouts and better failure detection
- Scrubbing without checksums (verification of replicas) - Scrubbing without checksums (verification of replicas)
- Checksums - Checksums
@ -291,7 +293,7 @@ Vitastor with single-thread NBD on the same hardware:
- Debian 10 (Buster): `deb https://vitastor.io/debian buster main` - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- For Debian 10 (Buster) also enable backports repository: - For Debian 10 (Buster) also enable backports repository:
`deb http://deb.debian.org/debian buster-backports main` `deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64` - Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
### CentOS ### CentOS
@ -395,13 +397,15 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
## Known Problems ## Known Problems
- Object deletion requests may currently lead to 'incomplete' objects if your OSDs crash during - Object deletion requests may currently lead to 'incomplete' objects in EC pools
deletion because proper handling of object cleanup in a cluster should be "three-phase" if your OSDs crash during deletion because proper handling of object cleanup
and it's currently not implemented. Just to repeat the removal again in this case. in a cluster should be "three-phase" and it's currently not implemented.
Just repeat the removal request again in this case.
## Implementation Principles ## Implementation Principles
- I like simple and stupid solutions, so expect Vitastor to stay simple. - I like architecturally simple solutions. Vitastor is and will always be designed
exactly like that.
- I also like reinventing the wheel to some extent, like writing my own HTTP client - I also like reinventing the wheel to some extent, like writing my own HTTP client
for etcd interaction instead of using prebuilt libraries, because in this case for etcd interaction instead of using prebuilt libraries, because in this case
I'm confident about what my code does and what it doesn't do. I'm confident about what my code does and what it doesn't do.
@ -416,7 +420,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+ Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru Join Vitastor Telegram Chat: https://t.me/vitastor
All server-side code (OSD, Monitor and so on) is licensed under the terms of All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on

2
debian/changelog vendored
View File

@ -1,4 +1,4 @@
vitastor (0.5.9-1) unstable; urgency=medium vitastor (0.5.10-1) unstable; urgency=medium
* Bugfixes * Bugfixes

View File

@ -40,10 +40,10 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \ mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \ rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.5.9; \ cp -r /root/vitastor vitastor-0.5.10; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.9/qemu; \ ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.10/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.5.9/fio; \ ln -s /root/fio-build/fio-*/ vitastor-0.5.10/fio; \
cd vitastor-0.5.9; \ cd vitastor-0.5.10; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \ sh copy-qemu-includes.sh; \
@ -59,8 +59,8 @@ RUN set -e -x; \
echo "dep:fio=$FIO" > debian/substvars; \ echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \ echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.9.orig.tar.xz vitastor-0.5.9; \ tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.10.orig.tar.xz vitastor-0.5.10; \
cd vitastor-0.5.9; \ cd vitastor-0.5.10; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \ DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \ DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

23
mon/merge.js Normal file
View File

@ -0,0 +1,23 @@
const fsp = require('fs').promises;
async function merge(file1, file2, out)
{
if (!out)
{
console.error('USAGE: nodejs merge.js layer1 layer2 output');
process.exit();
}
const layer1 = await fsp.readFile(file1);
const layer2 = await fsp.readFile(file2);
const zero = Buffer.alloc(4096);
for (let i = 0; i < layer2.length; i += 4096)
{
if (zero.compare(layer2, i, i+4096) != 0)
{
layer2.copy(layer1, i, i, i+4096);
}
}
await fsp.writeFile(out, layer1);
}
merge(process.argv[2], process.argv[3], process.argv[4]);
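
The merge loop is easier to follow on tiny buffers. A sketch of the same idea, assuming both layers have the same length and that it is a multiple of the block size; `mergeLayers` and its `blockSize` parameter are hypothetical names, and 4-byte blocks stand in for the 4096-byte ones:

```javascript
// Overlay layer2 on top of layer1: every block of layer2 that is not
// all zeroes replaces the corresponding block of layer1.
function mergeLayers(layer1, layer2, blockSize) {
    const out = Buffer.from(layer1); // copy, so the input stays untouched
    const zero = Buffer.alloc(blockSize);
    for (let i = 0; i < layer2.length; i += blockSize) {
        // compare(target, targetStart, targetEnd) checks `zero` against a slice of layer2
        if (zero.compare(layer2, i, i + blockSize) !== 0) {
            // copy(target, targetStart, sourceStart, sourceEnd)
            layer2.copy(out, i, i, i + blockSize);
        }
    }
    return out;
}

const base = Buffer.from([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]);
const top = Buffer.from([0, 0, 0, 0, 9, 9, 9, 9, 0, 0, 0, 0]);
const merged = mergeLayers(base, top, 4);
console.log(merged);
```

Only the middle block of `top` contains data, so only the middle block of `base` is replaced in the result.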

View File

@ -10,19 +10,31 @@ const stableStringify = require('./stable-stringify.js');
const PGUtil = require('./PGUtil.js'); const PGUtil = require('./PGUtil.js');
// FIXME document all etcd keys and config variables in the form of JSON schema or similar // FIXME document all etcd keys and config variables in the form of JSON schema or similar
const etcd_nonempty_keys = {
'config/global': 1,
'config/node_placement': 1,
'config/pools': 1,
'config/pgs': 1,
'history/last_clean_pgs': 1,
'stats': 1,
};
const etcd_allow = new RegExp('^'+[ const etcd_allow = new RegExp('^'+[
'config/global', 'config/global',
'config/node_placement', 'config/node_placement',
'config/pools', 'config/pools',
'config/osd/[1-9]\\d*', 'config/osd/[1-9]\\d*',
'config/pgs', 'config/pgs',
'config/inode/[1-9]\\d*/[1-9]\\d*',
'osd/state/[1-9]\\d*', 'osd/state/[1-9]\\d*',
'osd/stats/[1-9]\\d*', 'osd/stats/[1-9]\\d*',
'osd/inodestats/[1-9]\\d*',
'osd/space/[1-9]\\d*',
'mon/master', 'mon/master',
'pg/state/[1-9]\\d*/[1-9]\\d*', 'pg/state/[1-9]\\d*/[1-9]\\d*',
'pg/stats/[1-9]\\d*/[1-9]\\d*', 'pg/stats/[1-9]\\d*/[1-9]\\d*',
'pg/history/[1-9]\\d*/[1-9]\\d*', 'pg/history/[1-9]\\d*/[1-9]\\d*',
'history/last_clean_pgs', 'history/last_clean_pgs',
'inode/stats/[1-9]\\d*',
'stats', 'stats',
].join('$|^')+'$'); ].join('$|^')+'$');
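
The allow-list is a set of anchored alternatives joined into one regular expression. A small sketch with a few of the patterns from this hunk shows which keys pass; `isAllowed` is a hypothetical helper and the list is deliberately incomplete:

```javascript
// A subset of the key patterns from the hunk above.
const patterns = [
    'config/inode/[1-9]\\d*/[1-9]\\d*',
    'osd/inodestats/[1-9]\\d*',
    'osd/space/[1-9]\\d*',
    'inode/stats/[1-9]\\d*',
    'stats',
];

// Joining with '$|^' anchors every alternative at both ends.
const allow = new RegExp('^' + patterns.join('$|^') + '$');

function isAllowed(key) {
    return allow.test(key);
}

for (const key of ['config/inode/1/42', 'config/inode/0/42', 'inode/stats/7', 'stats/extra']) {
    console.log(key, isAllowed(key));
}
```

Numeric components must start with a non-zero digit, so pool `0` and empty IDs are rejected, and anything appended after a complete key fails the trailing anchor.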
@ -58,6 +70,7 @@ const etcd_tree = {
autosync_interval: 5, autosync_interval: 5,
client_queue_depth: 128, // unused client_queue_depth: 128, // unused
recovery_queue_depth: 4, recovery_queue_depth: 4,
recovery_sync_batch: 16,
readonly: false, readonly: false,
no_recovery: false, no_recovery: false,
no_rebalance: false, no_rebalance: false,
@ -131,6 +144,18 @@ const etcd_tree = {
} }
}, */ }, */
pgs: {}, pgs: {},
/* inode: {
<pool_id>: {
<inode_t>: {
name: string,
size?: uint64_t, // bytes
parent_pool?: <pool_id>,
parent_id?: <inode_t>,
readonly?: boolean,
}
}
}, */
inode: {},
}, },
osd: { osd: {
state: { state: {
@ -162,6 +187,18 @@ const etcd_tree = {
}, },
}, */ }, */
}, },
inodestats: {
/* <inode_t>: {
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
}, */
},
space: {
/* <osd_num_t>: {
<inode_t>: uint64_t, // bytes
}, */
},
}, },
mon: { mon: {
master: { master: {
@ -201,6 +238,16 @@ const etcd_tree = {
}, */ }, */
}, },
}, },
inode: {
stats: {
/* <inode_t>: {
raw_used: uint64_t, // raw used bytes on OSDs
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
}, */
},
},
stats: { stats: {
/* op_stats: { /* op_stats: {
<string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t }, <string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
@ -385,7 +432,7 @@ class Mon
{ {
this.parse_kv(e.kv); this.parse_kv(e.kv);
const key = e.kv.key.substr(this.etcd_prefix.length); const key = e.kv.key.substr(this.etcd_prefix.length);
if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/') if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
{ {
stats_changed = true; stats_changed = true;
} }
@ -393,7 +440,7 @@ class Mon
{ {
pg_states_changed = true; pg_states_changed = true;
} }
else if (key != '/stats') else if (key != '/stats' && key.substr(0, 13) != '/inode/stats/')
{ {
changed = true; changed = true;
} }
@ -633,29 +680,51 @@ class Mon
return !has_online; return !has_online;
} }
save_new_pgs_txn(request, pool_id, up_osds, prev_pgs, new_pgs, pg_history) reset_rng()
{ {
const replicated = new_pgs.length && this.state.config.pools[pool_id].scheme === 'replicated'; this.seed = 0x5f020e43;
const pg_minsize = new_pgs.length && this.state.config.pools[pool_id].pg_minsize; }
const pg_items = {};
new_pgs.map((osd_set, i) => rng()
{
this.seed ^= this.seed << 13;
this.seed ^= this.seed >> 17;
this.seed ^= this.seed << 5;
return this.seed + 2147483648;
}
pick_primary(pool_id, osd_set, up_osds)
{ {
osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
let alive_set; let alive_set;
if (replicated) if (this.state.config.pools[pool_id].scheme === 'replicated')
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]); alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else else
{ {
// Prefer data OSDs for EC because they can actually read something without an additional network hop // Prefer data OSDs for EC because they can actually read something without an additional network hop
alive_set = osd_set.slice(0, pg_minsize).filter(osd_num => osd_num && up_osds[osd_num]); const pg_data_size = (this.state.config.pools[pool_id].pg_size||0) -
(this.state.config.pools[pool_id].parity_chunks||0);
alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length) if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]); alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
} }
if (!alive_set.length)
return 0;
return alive_set[this.rng() % alive_set.length];
}
save_new_pgs_txn(request, pool_id, up_osds, prev_pgs, new_pgs, pg_history)
{
const pg_items = {};
this.reset_rng();
new_pgs.map((osd_set, i) =>
{
osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
pg_items[i+1] = { pg_items[i+1] = {
osd_set, osd_set,
primary: alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0, primary: this.pick_primary(pool_id, osd_set, up_osds),
}; };
if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' ')) if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' ') &&
prev_pgs[i].filter(osd_num => osd_num).length > 0)
{ {
pg_history[i] = pg_history[i] || {}; pg_history[i] = pg_history[i] || {};
pg_history[i].osd_sets = pg_history[i].osd_sets || []; pg_history[i].osd_sets = pg_history[i].osd_sets || [];
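
Replacing `Math.random()` with a generator that is reset before each pass presumably means that the same inputs yield the same primaries every time. A standalone sketch of that idea, with the shift steps copied from the hunk; `SeededRng`, `pickPrimary` and `assignPrimaries` are hypothetical, and `dataSize` stands in for `pg_size - parity_chunks`:

```javascript
class SeededRng {
    constructor(seed = 0x5f020e43) {
        this.seed = seed;
    }
    next() {
        // JavaScript bitwise operators work on signed 32-bit integers
        this.seed ^= this.seed << 13;
        this.seed ^= this.seed >> 17;
        this.seed ^= this.seed << 5;
        return this.seed + 2147483648; // shift into the non-negative range
    }
}

function pickPrimary(rng, osdSet, upOsds, dataSize) {
    const alive = set => set.filter(n => n && upOsds[n]);
    // Prefer the first dataSize OSDs, fall back to the whole set
    let candidates = alive(osdSet.slice(0, dataSize));
    if (!candidates.length)
        candidates = alive(osdSet);
    return candidates.length ? candidates[rng.next() % candidates.length] : 0;
}

function assignPrimaries(pgs, upOsds, dataSize) {
    const rng = new SeededRng(); // reset before every pass
    return pgs.map(osdSet => pickPrimary(rng, osdSet, upOsds, dataSize));
}

const pgs = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2]];
const up = { 1: true, 2: true, 3: true, 4: true };
console.log(assignPrimaries(pgs, up, 2), assignPrimaries(pgs, up, 2));
```

Both calls print the same array, because each pass starts from the same seed and consumes the generator in the same order.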
@ -935,7 +1004,7 @@ class Mon
} }
else else
{ {
// Nothing changed, but we still want to check for down OSDs // Nothing changed, but we still want to recheck the distribution of primaries
let changed = false; let changed = false;
for (const pool_id in this.state.config.pools) for (const pool_id in this.state.config.pools)
{ {
@ -945,22 +1014,13 @@ class Mon
continue; continue;
} }
const replicated = pool_cfg.scheme === 'replicated'; const replicated = pool_cfg.scheme === 'replicated';
for (const pg_num in ((this.state.config.pgs.items||{})[pool_id]||{})||{}) this.reset_rng();
for (let pg_num = 1; pg_num <= pool_cfg.pg_count; pg_num++)
{ {
const pg_cfg = this.state.config.pgs.items[pool_id][pg_num]; const pg_cfg = this.state.config.pgs.items[pool_id][pg_num];
if (!Number(pg_cfg.primary) || !up_osds[pg_cfg.primary]) if (pg_cfg)
{ {
let alive_set; const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds);
if (replicated)
alive_set = pg_cfg.osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
alive_set = pg_cfg.osd_set.slice(0, pool_cfg.pg_minsize).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = pg_cfg.osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
const new_primary = alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0;
if (pg_cfg.primary != new_primary) if (pg_cfg.primary != new_primary)
{ {
console.log( console.log(
@ -1045,8 +1105,6 @@ class Mon
sum_stats() sum_stats()
{ {
let overflow = false;
this.prev_stats = this.prev_stats || { op_stats: {}, subop_stats: {}, recovery_stats: {} };
const op_stats = {}, subop_stats = {}, recovery_stats = {}; const op_stats = {}, subop_stats = {}, recovery_stats = {};
for (const osd in this.state.osd.stats) for (const osd in this.state.osd.stats)
{ {
@ -1071,52 +1129,11 @@ class Mon
recovery_stats[op].bytes += BigInt(st.recovery_stats[op].bytes||0); recovery_stats[op].bytes += BigInt(st.recovery_stats[op].bytes||0);
} }
} }
for (const op in op_stats) return { op_stats, subop_stats, recovery_stats };
{
if (op_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.op_stats[op])
{
overflow = true;
} }
else
sum_object_counts()
{ {
op_stats[op].count -= this.prev_stats.op_stats[op].count;
op_stats[op].usec -= this.prev_stats.op_stats[op].usec;
op_stats[op].bytes -= this.prev_stats.op_stats[op].bytes;
}
}
}
for (const op in subop_stats)
{
if (subop_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.subop_stats[op])
{
overflow = true;
}
else
{
subop_stats[op].count -= this.prev_stats.subop_stats[op].count;
subop_stats[op].usec -= this.prev_stats.subop_stats[op].usec;
}
}
}
for (const op in recovery_stats)
{
if (recovery_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.recovery_stats[op])
{
overflow = true;
}
else
{
recovery_stats[op].count -= this.prev_stats.recovery_stats[op].count;
recovery_stats[op].bytes -= this.prev_stats.recovery_stats[op].bytes;
}
}
}
const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n }; const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
for (const pool_id in this.state.pg.stats) for (const pool_id in this.state.pg.stats)
{ {
@ -1135,36 +1152,112 @@ class Mon
} }
} }
} }
return (this.prev_stats = { overflow, op_stats, subop_stats, recovery_stats, object_counts }); return object_counts;
}
sum_inode_stats()
{
const inode_stats = {};
const inode_stub = () => ({
raw_used: 0n,
read: { count: 0n, usec: 0n, bytes: 0n },
write: { count: 0n, usec: 0n, bytes: 0n },
delete: { count: 0n, usec: 0n, bytes: 0n },
});
for (const osd_num in this.state.osd.space)
{
for (const inode_num in this.state.osd.space[osd_num])
{
inode_stats[inode_num] = inode_stats[inode_num] || inode_stub();
inode_stats[inode_num].raw_used += BigInt(this.state.osd.space[osd_num][inode_num]||0);
}
}
for (const osd_num in this.state.osd.inodestats)
{
const ist = this.state.osd.inodestats[osd_num];
for (const inode_num in ist)
{
inode_stats[inode_num] = inode_stats[inode_num] || inode_stub();
for (const op of [ 'read', 'write', 'delete' ])
{
inode_stats[inode_num][op].count += BigInt(ist[inode_num][op].count||0);
inode_stats[inode_num][op].usec += BigInt(ist[inode_num][op].usec||0);
inode_stats[inode_num][op].bytes += BigInt(ist[inode_num][op].bytes||0);
}
}
}
return inode_stats;
}
fix_stat_overflows(obj, scratch)
{
for (const k in obj)
{
if (typeof obj[k] == 'bigint')
{
if (obj[k] >= 0x10000000000000000n)
{
if (scratch[k])
{
for (const k2 in scratch)
{
obj[k2] -= scratch[k2];
scratch[k2] = 0n;
}
}
else
{
for (const k2 in obj)
{
scratch[k2] = obj[k2];
}
}
}
}
else if (typeof obj[k] == 'object')
{
this.fix_stat_overflows(obj[k], scratch[k] = (scratch[k] || {}));
}
}
}
serialize_bigints(obj)
{
for (const k in obj)
{
if (typeof obj[k] == 'bigint')
{
obj[k] = ''+obj[k];
}
else if (typeof obj[k] == 'object')
{
this.serialize_bigints(obj[k]);
}
}
} }
async update_total_stats() async update_total_stats()
{ {
const txn = [];
const stats = this.sum_stats(); const stats = this.sum_stats();
if (!stats.overflow) const object_counts = this.sum_object_counts();
const inode_stats = this.sum_inode_stats();
this.fix_stat_overflows(stats, (this.prev_stats = this.prev_stats || {}));
this.fix_stat_overflows(inode_stats, (this.prev_inode_stats = this.prev_inode_stats || {}));
stats.object_counts = object_counts;
this.serialize_bigints(stats);
this.serialize_bigints(inode_stats);
txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });
for (const inode_num in inode_stats)
{ {
// Convert to strings, serialize and save txn.push({ requestPut: {
const ser = {}; key: b64(this.etcd_prefix+'/inode/stats/'+inode_num),
for (const st of [ 'op_stats', 'subop_stats', 'recovery_stats' ]) value: b64(JSON.stringify(inode_stats[inode_num])),
{ } });
ser[st] = {};
for (const op in stats[st])
{
ser[st][op] = {};
for (const k in stats[st][op])
{
ser[st][op][k] = ''+stats[st][op][k];
} }
} if (txn.length)
}
ser.object_counts = {};
for (const k in stats.object_counts)
{ {
ser.object_counts[k] = ''+stats.object_counts[k]; await this.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
}
await this.etcd_call('/kv/txn', {
success: [ { requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(ser)) } } ],
}, this.config.etcd_mon_timeout, 0);
} }
} }
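
`JSON.stringify` throws a `TypeError` on `BigInt` values, which is why the counters are turned into strings before being written. A minimal sketch of that conversion; `serializeBigints` mirrors the method in this hunk but is a standalone illustration, and the sample numbers are made up:

```javascript
// Recursively replace every BigInt in obj with its decimal string.
function serializeBigints(obj) {
    for (const k in obj) {
        if (typeof obj[k] === 'bigint')
            obj[k] = String(obj[k]);
        else if (obj[k] !== null && typeof obj[k] === 'object')
            serializeBigints(obj[k]);
    }
    return obj;
}

// Made-up counters; the sum exceeds what a 64-bit counter can hold.
const stats = { read: { count: 2n ** 63n + 2n ** 63n, usec: 15n, bytes: 4096n } };

let threw = false;
try {
    JSON.stringify(stats);
} catch (e) {
    threw = e instanceof TypeError;
}

const json = JSON.stringify(serializeBigints(stats));
console.log(threw, json);
```

Keeping the values as strings in the stored JSON avoids losing precision for counters above `Number.MAX_SAFE_INTEGER`.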
@ -1205,16 +1298,20 @@ class Mon
console.log('Bad value in etcd: '+kv.key+' = '+kv.value); console.log('Bad value in etcd: '+kv.key+' = '+kv.value);
return; return;
} }
key = key.split('/'); let key_parts = key.split('/');
let cur = this.state; let cur = this.state;
for (let i = 0; i < key.length-1; i++) for (let i = 0; i < key_parts.length-1; i++)
{ {
cur = (cur[key[i]] = cur[key[i]] || {}); cur = (cur[key_parts[i]] = cur[key_parts[i]] || {});
} }
cur[key[key.length-1]] = kv.value; if (etcd_nonempty_keys[key])
if (key.join('/') === 'config/global') {
// Do not clear these to null
kv.value = kv.value || {};
}
cur[key_parts[key_parts.length-1]] = kv.value;
if (key === 'config/global')
{ {
this.state.config.global = this.state.config.global || {};
this.config = this.state.config.global; this.config = this.state.config.global;
this.check_config(); this.check_config();
for (const osd_num in this.state.osd.stats) for (const osd_num in this.state.osd.stats)
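
The key handling in this hunk walks the slash-separated key into a nested state object and keeps certain keys from being reset to `null`. A reduced sketch; `storeKey` and the two-entry `nonemptyKeys` table are illustrative only:

```javascript
// Keys that should hold an empty object instead of null (subset for illustration).
const nonemptyKeys = { 'config/global': 1, 'stats': 1 };

function storeKey(state, key, value) {
    const parts = key.split('/');
    let cur = state;
    // Create intermediate objects along the path as needed
    for (let i = 0; i < parts.length - 1; i++) {
        cur = (cur[parts[i]] = cur[parts[i]] || {});
    }
    if (nonemptyKeys[key]) {
        value = value || {};
    }
    cur[parts[parts.length - 1]] = value;
    return state;
}

const state = {};
storeKey(state, 'config/global', null);              // deleted key: stays an object
storeKey(state, 'osd/state/1', null);                // deleted key: becomes null
storeKey(state, 'config/inode/1/42', { name: 'testimg' });
console.log(JSON.stringify(state));
```

Keeping the full key as a string next to the split parts lets the lookup table and the later comparisons use plain string equality.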
@ -1225,7 +1322,7 @@ class Mon
); );
} }
} }
else if (key.join('/') === 'config/pools') else if (key === 'config/pools')
{ {
for (const pool_id in this.state.config.pools) for (const pool_id in this.state.config.pools)
{ {
@ -1234,7 +1331,7 @@ class Mon
this.validate_pool_cfg(pool_id, pool_cfg, true); this.validate_pool_cfg(pool_id, pool_cfg, true);
} }
} }
else if (key[0] === 'osd' && key[1] === 'stats') else if (key_parts[0] === 'osd' && key_parts[1] === 'stats')
{ {
// Recheck PGs <osd_out_time> later // Recheck PGs <osd_out_time> later
this.schedule_next_recheck_at( this.schedule_next_recheck_at(

View File

@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'` QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.5.9/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.9$(rpm --eval '%dist').tar.gz * tar --transform 's#^#vitastor-0.5.10/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.10$(rpm --eval '%dist').tar.gz *

View File

@ -37,7 +37,7 @@ ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.5.9.el7.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.5.10.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \

View File

@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.5.9 Version: 0.5.10
Release: 1%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.5.9.el7.tar.gz Source0: vitastor-0.5.10.el7.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel

View File

@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.5.9.el8.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.5.10.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \

View File

@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.5.9 Version: 0.5.10
Release: 1%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.5.9.el8.tar.gz Source0: vitastor-0.5.10.el8.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel

View File

@ -13,19 +13,19 @@ allocator::allocator(uint64_t blocks)
{ {
throw std::invalid_argument("blocks"); throw std::invalid_argument("blocks");
} }
uint64_t p2 = 1, total = 1; uint64_t p2 = 1;
total = 0;
while (p2 * 64 < blocks) while (p2 * 64 < blocks)
{ {
p2 = p2 * 64;
total += p2; total += p2;
p2 = p2 * 64;
} }
total -= p2;
total += (blocks+63) / 64; total += (blocks+63) / 64;
mask = new uint64_t[2 + total]; mask = new uint64_t[total];
size = free = blocks; size = free = blocks;
last_one_mask = (blocks % 64) == 0 last_one_mask = (blocks % 64) == 0
? UINT64_MAX ? UINT64_MAX
: ~(UINT64_MAX << (64 - blocks % 64)); : ((1l << (blocks % 64)) - 1);
for (uint64_t i = 0; i < total; i++) for (uint64_t i = 0; i < total; i++)
{ {
mask[i] = 0; mask[i] = 0;
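
The word count computed in this constructor can be checked by hand for small sizes. The sketch below ports the new arithmetic to JavaScript purely for illustration (the original is C++); `maskWords` and `lastOneMask` are hypothetical names:

```javascript
// Number of 64-bit words needed for all levels of the hierarchical bitmap.
function maskWords(blocks) {
    let p2 = 1, total = 0;
    while (p2 * 64 < blocks) {
        total += p2; // words of the current upper level
        p2 *= 64;
    }
    return total + Math.floor((blocks + 63) / 64); // plus the leaf level
}

// Mask of the valid bits in the last leaf word.
function lastOneMask(blocks) {
    const rem = BigInt(blocks % 64);
    return rem === 0n ? (1n << 64n) - 1n : (1n << rem) - 1n;
}

for (const blocks of [1, 64, 65, 4096]) {
    console.log(blocks, maskWords(blocks), lastOneMask(blocks).toString(16));
}
```

For 4096 blocks this gives one top-level word plus 64 leaf words, and for 65 blocks the last leaf word has only its lowest bit valid.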
@ -99,6 +99,10 @@ uint64_t allocator::find_free()
uint64_t p2 = 1, offset = 0, addr = 0, f, i; uint64_t p2 = 1, offset = 0, addr = 0, f, i;
while (p2 < size) while (p2 < size)
{ {
if (offset+addr >= total)
{
return UINT64_MAX;
}
uint64_t m = mask[offset + addr]; uint64_t m = mask[offset + addr];
for (i = 0, f = 1; i < 64; i++, f <<= 1) for (i = 0, f = 1; i < 64; i++, f <<= 1)
{ {
@ -113,11 +117,6 @@ uint64_t allocator::find_free()
return UINT64_MAX; return UINT64_MAX;
} }
addr = (addr * 64) | i; addr = (addr * 64) | i;
if (addr >= size)
{
// No space
return UINT64_MAX;
}
offset += p2; offset += p2;
p2 = p2 * 64; p2 = p2 * 64;
} }
@ -128,3 +127,35 @@ uint64_t allocator::get_free_count()
{ {
return free; return free;
} }
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
if (start == 0)
{
if (len == 32*bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}
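
To see which bits `bitmap_set` marks, here is a JavaScript port of its general loop (the original is C++, and the 32/64-bit fast paths are left out); `bitmapSet` is a hypothetical name and the bitmap is a `Uint8Array`:

```javascript
// Set the bits covering the byte range [start, start+len), one bit per granule.
function bitmapSet(bitmap, start, len, granularity) {
    let bit = Math.floor(start / granularity);
    const end = Math.floor((start + len + granularity - 1) / granularity);
    while (bit < end) {
        if ((bit & 7) === 0 && end >= bit + 8) {
            bitmap[bit >> 3] = 0xff; // a whole byte at once
            bit += 8;
        } else {
            bitmap[bit >> 3] |= 1 << (bit % 8);
            bit++;
        }
    }
    return bitmap;
}

// 8 KiB written at offset 4 KiB with 4 KiB granularity: bits 1 and 2
const partial = bitmapSet(new Uint8Array(2), 4096, 8192, 4096);
// 64 KiB written at offset 0: bits 0..15
const full = bitmapSet(new Uint8Array(2), 0, 65536, 4096);
console.log(Array.from(partial), Array.from(full));
```

The end of the range is rounded up, so a write that touches only part of a granule still marks that granule's bit.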

View File

@ -8,6 +8,7 @@
// Hierarchical bitmap allocator // Hierarchical bitmap allocator
class allocator class allocator
{ {
uint64_t total;
uint64_t size; uint64_t size;
uint64_t free; uint64_t free;
uint64_t last_one_mask; uint64_t last_one_mask;
@ -19,3 +20,5 @@ public:
uint64_t find_free(); uint64_t find_free();
uint64_t get_free_count(); uint64_t get_free_count();
}; };
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);

View File

@ -43,6 +43,11 @@ std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
return impl->unstable_writes; return impl->unstable_writes;
} }
std::map<uint64_t, uint64_t> & blockstore_t::get_inode_space_stats()
{
return impl->inode_space_stats;
}
uint32_t blockstore_t::get_block_size() uint32_t blockstore_t::get_block_size()
{ {
return impl->get_block_size(); return impl->get_block_size();
@ -58,7 +63,7 @@ uint64_t blockstore_t::get_free_block_count()
return impl->get_free_block_count(); return impl->get_free_block_count();
} }
uint32_t blockstore_t::get_disk_alignment() uint32_t blockstore_t::get_bitmap_granularity()
{ {
return impl->get_disk_alignment(); return impl->get_bitmap_granularity();
} }


@ -27,6 +27,7 @@
#define DEFAULT_ORDER 17 #define DEFAULT_ORDER 17
#define MIN_BLOCK_SIZE 4*1024 #define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024 #define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_BITMAP_GRANULARITY 4096
#define BS_OP_MIN 1 #define BS_OP_MIN 1
#define BS_OP_READ 1 #define BS_OP_READ 1
@ -64,6 +65,8 @@ Input:
- offset, len = offset and length within object. length may be zero, in that case - offset, len = offset and length within object. length may be zero, in that case
read operation only returns the version / write operation only bumps the version read operation only returns the version / write operation only bumps the version
- buf = pre-allocated buffer for data (read) / with data (write). may be NULL if len == 0. - buf = pre-allocated buffer for data (read) / with data (write). may be NULL if len == 0.
- bitmap = pointer to the new 'external' object bitmap data. Its part which is respective to the
write request is copied into the metadata area bitwise and stored there.
Output: Output:
- retval = number of bytes actually read/written or negative error number (-EINVAL or -ENOSPC) - retval = number of bytes actually read/written or negative error number (-EINVAL or -ENOSPC)
@ -141,6 +144,7 @@ struct blockstore_op_t
uint32_t offset; uint32_t offset;
uint32_t len; uint32_t len;
void *buf; void *buf;
void *bitmap;
int retval; int retval;
uint8_t private_data[BS_OP_PRIVATE_DATA_SIZE]; uint8_t private_data[BS_OP_PRIVATE_DATA_SIZE];
@ -178,10 +182,13 @@ public:
// Unstable writes are added here (map of object_id -> version) // Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes(); std::unordered_map<object_id, uint64_t> & get_unstable_writes();
// Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats();
// FIXME rename to object_size // FIXME rename to object_size
uint32_t get_block_size(); uint32_t get_block_size();
uint64_t get_block_count(); uint64_t get_block_count();
uint64_t get_free_block_count(); uint64_t get_free_block_count();
uint32_t get_disk_alignment(); uint32_t get_bitmap_granularity();
}; };


@ -426,18 +426,18 @@ resume_1:
{ {
new_clean_bitmap = (bs->inmemory_meta new_clean_bitmap = (bs->inmemory_meta
? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry) ? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
: bs->clean_bitmap + (clean_loc >> bs->block_order)*bs->clean_entry_bitmap_size); : bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
if (clean_init_bitmap) if (clean_init_bitmap)
{ {
memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size); memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size);
bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len); bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len, bs->bitmap_granularity);
} }
} }
for (it = v.begin(); it != v.end(); it++) for (it = v.begin(); it != v.end(); it++)
{ {
if (new_clean_bitmap) if (new_clean_bitmap)
{ {
bitmap_set(new_clean_bitmap, it->offset, it->len); bitmap_set(new_clean_bitmap, it->offset, it->len, bs->bitmap_granularity);
} }
await_sqe(4); await_sqe(4);
data->iov = (struct iovec){ it->buf, (size_t)it->len }; data->iov = (struct iovec){ it->buf, (size_t)it->len };
@ -471,6 +471,7 @@ resume_1:
wait_state = 5; wait_state = 5;
return false; return false;
} }
// zero out old metadata entry
memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size); memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
await_sqe(15); await_sqe(15);
data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size }; data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size };
@ -482,6 +483,7 @@ resume_1:
} }
if (has_delete) if (has_delete)
{ {
// zero out new metadata entry
memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size); memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
} }
else else
@ -499,6 +501,12 @@ resume_1:
{ {
memcpy(&new_entry->bitmap, new_clean_bitmap, bs->clean_entry_bitmap_size); memcpy(&new_entry->bitmap, new_clean_bitmap, bs->clean_entry_bitmap_size);
} }
// copy latest external bitmap/attributes
if (bs->clean_entry_bitmap_size)
{
void *bmp_ptr = bs->clean_entry_bitmap_size > sizeof(void*) ? dirty_end->second.bitmap : &dirty_end->second.bitmap;
memcpy((void*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
}
} }
await_sqe(6); await_sqe(6);
data->iov = (struct iovec){ meta_new.buf, bs->meta_block_size }; data->iov = (struct iovec){ meta_new.buf, bs->meta_block_size };
@ -823,7 +831,10 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
sync_found: sync_found:
cur_sync->ready_count++; cur_sync->ready_count++;
flusher->syncing_flushers++; flusher->syncing_flushers++;
if (flusher->syncing_flushers >= flusher->flusher_count || !flusher->flush_queue.size()) resume_1:
if (!cur_sync->state)
{
if (flusher->syncing_flushers >= flusher->cur_flusher_count || !flusher->flush_queue.size())
{ {
// Sync batch is ready. Do it. // Sync batch is ready. Do it.
await_sqe(0); await_sqe(0);
@ -832,23 +843,23 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1; cur_sync->state = 1;
wait_count++; wait_count++;
resume_1: resume_2:
if (wait_count > 0) if (wait_count > 0)
{ {
wait_state = 1; wait_state = 2;
return false; return false;
} }
// Sync completed. All previous coroutines waiting for it must be resumed // Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2; cur_sync->state = 2;
bs->ringloop->wakeup(); bs->ringloop->wakeup();
} }
// Wait until someone else sends and completes a sync. else
resume_2:
if (!cur_sync->state)
{ {
wait_state = 2; // Wait until someone else sends and completes a sync.
wait_state = 1;
return false; return false;
} }
}
flusher->syncing_flushers--; flusher->syncing_flushers--;
cur_sync->ready_count--; cur_sync->ready_count--;
if (cur_sync->ready_count == 0) if (cur_sync->ready_count == 0)
@ -858,35 +869,3 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
} }
return true; return true;
} }
void journal_flusher_co::bitmap_set(void *bitmap, uint64_t start, uint64_t len)
{
if (start == 0)
{
if (len == 32*bs->bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bs->bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bs->bitmap_granularity;
unsigned bit_end = ((start + len) + bs->bitmap_granularity - 1) / bs->bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}


@ -69,7 +69,6 @@ class journal_flusher_co
bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base); bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base);
void update_clean_db(); void update_clean_db();
bool fsync_batch(bool fsync_meta, int wait_base); bool fsync_batch(bool fsync_meta, int wait_base);
void bitmap_set(void *bitmap, uint64_t start, uint64_t len);
public: public:
journal_flusher_co(); journal_flusher_co();
bool loop(); bool loop();


@ -10,9 +10,9 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
ring_consumer.loop = [this]() { loop(); }; ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&ring_consumer); ringloop->register_consumer(&ring_consumer);
initialized = 0; initialized = 0;
zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
data_fd = meta_fd = journal.fd = -1; data_fd = meta_fd = journal.fd = -1;
parse_config(config); parse_config(config);
zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
try try
{ {
open_data(); open_data();
@ -105,10 +105,10 @@ void blockstore_impl_t::loop()
// has_writes == 1 - some writes in progress // has_writes == 1 - some writes in progress
// has_writes == 2 - tried to submit some writes, but failed // has_writes == 2 - tried to submit some writes, but failed
int has_writes = 0, op_idx = 0, new_idx = 0; int has_writes = 0, op_idx = 0, new_idx = 0;
for (; op_idx < submit_queue.size(); op_idx++) for (; op_idx < submit_queue.size(); op_idx++, new_idx++)
{ {
auto op = submit_queue[op_idx]; auto op = submit_queue[op_idx];
submit_queue[new_idx++] = op; submit_queue[new_idx] = op;
// FIXME: This needs some simplification // FIXME: This needs some simplification
// Writes should not block reads if the ring is not full and reads don't depend on them // Writes should not block reads if the ring is not full and reads don't depend on them
// In all other cases we should stop submission // In all other cases we should stop submission
@ -301,7 +301,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
} }
else if (PRIV(op)->wait_for == WAIT_FREE) else if (PRIV(op)->wait_for == WAIT_FREE)
{ {
if (!data_alloc->get_free_count() && !flusher->is_active()) if (!data_alloc->get_free_count() && flusher->is_active())
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Still waiting for free space on the data device\n"); printf("Still waiting for free space on the data device\n");


@ -77,7 +77,8 @@
#include "blockstore_journal.h" #include "blockstore_journal.h"
// 24 bytes + block bitmap per "clean" entry on disk with fixed metadata tables // 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
// per "clean" entry on disk with fixed metadata tables
// FIXME: maybe add crc32's to metadata // FIXME: maybe add crc32's to metadata
struct __attribute__((__packed__)) clean_disk_entry struct __attribute__((__packed__)) clean_disk_entry
{ {
@ -93,7 +94,7 @@ struct __attribute__((__packed__)) clean_entry
uint64_t location; uint64_t location;
}; };
// 56 = 24 + 32 bytes per dirty entry in memory (obj_ver_id => dirty_entry) // 64 = 24 + 40 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
struct __attribute__((__packed__)) dirty_entry struct __attribute__((__packed__)) dirty_entry
{ {
uint32_t state; uint32_t state;
@ -102,6 +103,7 @@ struct __attribute__((__packed__)) dirty_entry
uint32_t offset; // data offset within object (stripe) uint32_t offset; // data offset within object (stripe)
uint32_t len; // data length uint32_t len; // data length
uint64_t journal_sector; // journal sector used for this entry uint64_t journal_sector; // journal sector used for this entry
void* bitmap; // either external bitmap itself when it fits, or a pointer to it when it doesn't
}; };
// - Sync must be submitted after previous writes/deletes (not before!) // - Sync must be submitted after previous writes/deletes (not before!)
@ -249,6 +251,7 @@ class blockstore_impl_t
void open_data(); void open_data();
void open_meta(); void open_meta();
void open_journal(); void open_journal();
uint8_t* get_clean_entry_bitmap(uint64_t block_loc, int offset);
// Asynchronous init // Asynchronous init
int initialized; int initialized;
@ -323,8 +326,11 @@ public:
// Unstable writes are added here (map of object_id -> version) // Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> unstable_writes; std::unordered_map<object_id, uint64_t> unstable_writes;
// Space usage statistics
std::map<uint64_t, uint64_t> inode_space_stats;
inline uint32_t get_block_size() { return block_size; } inline uint32_t get_block_size() { return block_size; }
inline uint64_t get_block_count() { return block_count; } inline uint64_t get_block_count() { return block_count; }
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); } inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
inline uint32_t get_disk_alignment() { return disk_alignment; } inline uint32_t get_bitmap_granularity() { return disk_alignment; }
}; };
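
The new `dirty_entry::bitmap` field is described as holding "either external bitmap itself when it fits, or a pointer to it when it doesn't". The following sketch illustrates that convention in isolation. All helper names (`bitmap_bytes`, `bitmap_store`, `bitmap_release`, `demo_roundtrip`) are hypothetical, and plain `malloc` stands in for the project's `malloc_or_die`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// `slot` plays the role of dirty_entry::bitmap in this sketch.
// Returns the address where the bitmap bytes actually live:
// inside the pointer-sized slot itself, or in the heap block it points to.
static uint8_t *bitmap_bytes(void **slot, size_t bitmap_size)
{
    return bitmap_size > sizeof(void*) ? (uint8_t*)*slot : (uint8_t*)slot;
}

// Stores <bitmap_size> bytes, allocating only when they do not fit in the slot
static void bitmap_store(void **slot, const uint8_t *src, size_t bitmap_size)
{
    *slot = NULL;
    if (bitmap_size > sizeof(void*))
    {
        *slot = malloc(bitmap_size);
        if (!*slot)
            abort();
    }
    memcpy(bitmap_bytes(slot, bitmap_size), src, bitmap_size);
}

// Frees the heap block only in the "pointer" case
static void bitmap_release(void **slot, size_t bitmap_size)
{
    if (bitmap_size > sizeof(void*))
        free(*slot);
    *slot = NULL;
}

// Round-trips a byte pattern through the slot; true if it is preserved
static bool demo_roundtrip(size_t bitmap_size)
{
    uint8_t src[64], out[64];
    if (bitmap_size > sizeof(src))
        return false;
    for (size_t i = 0; i < bitmap_size; i++)
        src[i] = (uint8_t)(i * 37 + 1);
    void *slot;
    bitmap_store(&slot, src, bitmap_size);
    memcpy(out, bitmap_bytes(&slot, bitmap_size), bitmap_size);
    bitmap_release(&slot, bitmap_size);
    return memcmp(src, out, bitmap_size) == 0;
}
```

With the default 4-byte bitmap the data stays inline and no allocation happens. Larger bitmaps take the heap path, which is the case that needs a matching `free()` whenever a dirty entry is erased.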


@ -100,7 +100,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size); clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size);
if (!bs->inmemory_meta && bs->clean_entry_bitmap_size) if (!bs->inmemory_meta && bs->clean_entry_bitmap_size)
{ {
memcpy(bs->clean_bitmap + (done_cnt+i)*bs->clean_entry_bitmap_size, &entry->bitmap, bs->clean_entry_bitmap_size); memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->clean_entry_bitmap_size, &entry->bitmap, 2*bs->clean_entry_bitmap_size);
} }
if (entry->oid.inode > 0) if (entry->oid.inode > 0)
{ {
@ -115,6 +115,10 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
#endif #endif
bs->data_alloc->set(clean_it->second.location >> block_order, false); bs->data_alloc->set(clean_it->second.location >> block_order, false);
} }
else
{
bs->inode_space_stats[entry->oid.inode] += bs->block_size;
}
entries_loaded++; entries_loaded++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version); printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
@ -530,6 +534,21 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->small_write.oid, .oid = je->small_write.oid,
.version = je->small_write.version, .version = je->small_write.version,
}; };
void *bmp = (void*)je + sizeof(journal_entry_small_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp, bs->clean_entry_bitmap_size);
}
else if (!bs->journal.inmemory)
{
// FIXME Using large blockstore objects and not keeping journal in memory
// will result in a lot of small allocations for entry bitmaps. This can
// only be fixed by using a patched map with dynamic entry size, but not
// the btree_map, because it doesn't keep iterators valid all the time.
void *bmp_cp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp_cp, bmp, bs->clean_entry_bitmap_size);
bmp = bmp_cp;
}
bs->dirty_db.emplace(ov, (dirty_entry){ bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED), .state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
.flags = 0, .flags = 0,
@ -537,6 +556,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.offset = je->small_write.offset, .offset = je->small_write.offset,
.len = je->small_write.len, .len = je->small_write.len,
.journal_sector = proc_pos, .journal_sector = proc_pos,
.bitmap = bmp,
}); });
bs->journal.used_sectors[proc_pos]++; bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
@ -616,6 +636,21 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->big_write.oid, .oid = je->big_write.oid,
.version = je->big_write.version, .version = je->big_write.version,
}; };
void *bmp = (void*)je + sizeof(journal_entry_big_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp, bs->clean_entry_bitmap_size);
}
else if (!bs->journal.inmemory)
{
// FIXME Using large blockstore objects and not keeping journal in memory
// will result in a lot of small allocations for entry bitmaps. This can
// only be fixed by using a patched map with dynamic entry size, but not
// the btree_map, because it doesn't keep iterators valid all the time.
void *bmp_cp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp_cp, bmp, bs->clean_entry_bitmap_size);
bmp = bmp_cp;
}
bs->dirty_db.emplace(ov, (dirty_entry){ bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED), .state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
.flags = 0, .flags = 0,
@ -623,6 +658,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.offset = je->big_write.offset, .offset = je->big_write.offset,
.len = je->big_write.len, .len = je->big_write.len,
.journal_sector = proc_pos, .journal_sector = proc_pos,
.bitmap = bmp,
}); });
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Allocate block %lu\n", je->big_write.location >> bs->block_order); printf("Allocate block %lu\n", je->big_write.location >> bs->block_order);
@ -673,7 +709,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version); printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif #endif
auto clean_it = bs->clean_db.find(je->del.oid); auto clean_it = bs->clean_db.find(je->del.oid);
if (clean_it == bs->clean_db.end() || if (clean_it != bs->clean_db.end() &&
clean_it->second.version < je->del.version) clean_it->second.version < je->del.version)
{ {
// oid, version // oid, version


@ -54,6 +54,9 @@ struct __attribute__((__packed__)) journal_entry_small_write
// data_offset is its offset within journal // data_offset is its offset within journal
uint64_t data_offset; uint64_t data_offset;
uint32_t crc32_data; uint32_t crc32_data;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
}; };
struct __attribute__((__packed__)) journal_entry_big_write struct __attribute__((__packed__)) journal_entry_big_write
@ -68,6 +71,9 @@ struct __attribute__((__packed__)) journal_entry_big_write
uint32_t offset; uint32_t offset;
uint32_t len; uint32_t len;
uint64_t location; uint64_t location;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
}; };
struct __attribute__((__packed__)) journal_entry_stable struct __attribute__((__packed__)) journal_entry_stable


@ -94,7 +94,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
} }
else if (disk_alignment % MEM_ALIGNMENT) else if (disk_alignment % MEM_ALIGNMENT)
{ {
throw std::runtime_error("disk_alingment must be a multiple of "+std::to_string(MEM_ALIGNMENT)); throw std::runtime_error("disk_alignment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
} }
if (!journal_block_size) if (!journal_block_size)
{ {
@ -118,7 +118,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
} }
if (!bitmap_granularity) if (!bitmap_granularity)
{ {
bitmap_granularity = 4096; bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
} }
else if (bitmap_granularity % disk_alignment) else if (bitmap_granularity % disk_alignment)
{ {
@ -170,7 +170,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
} }
// init some fields // init some fields
clean_entry_bitmap_size = block_size / bitmap_granularity / 8; clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
clean_entry_size = sizeof(clean_disk_entry) + clean_entry_bitmap_size; clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
journal.block_size = journal_block_size; journal.block_size = journal_block_size;
journal.next_free = journal_block_size; journal.next_free = journal_block_size;
journal.used_start = journal_block_size; journal.used_start = journal_block_size;
@ -237,7 +237,7 @@ void blockstore_impl_t::calc_lengths()
} }
else if (clean_entry_bitmap_size) else if (clean_entry_bitmap_size)
{ {
clean_bitmap = (uint8_t*)malloc(block_count * clean_entry_bitmap_size); clean_bitmap = (uint8_t*)malloc(block_count * 2*clean_entry_bitmap_size);
if (!clean_bitmap) if (!clean_bitmap)
throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap"); throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap");
} }


@ -94,6 +94,21 @@ endwhile:
return 1; return 1;
} }
uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offset)
{
uint8_t *clean_entry_bitmap;
uint64_t meta_loc = block_loc >> block_order;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
}
else
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*clean_entry_bitmap_size + offset);
return clean_entry_bitmap;
}
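
The address arithmetic in `get_clean_entry_bitmap()` can be checked with plain numbers. The sketch below reproduces it as offsets rather than pointers. The 24-byte header size comes from the comment on `clean_disk_entry`, while the function names and the 4096-byte metadata block size used in the example are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Size of the fixed part of clean_disk_entry, per the comment in the diff
static const uint64_t CLEAN_DISK_ENTRY_HEADER = 24;

// One bit per <bitmap_granularity> bytes of an object
static uint64_t clean_entry_bitmap_size(uint64_t block_size, uint64_t bitmap_granularity)
{
    return block_size / bitmap_granularity / 8;
}

// Header + internal bitmap + external bitmap
static uint64_t clean_entry_size(uint64_t block_size, uint64_t bitmap_granularity)
{
    return CLEAN_DISK_ENTRY_HEADER + 2 * clean_entry_bitmap_size(block_size, bitmap_granularity);
}

// Offset of a bitmap inside the in-memory metadata buffer.
// <offset> is 0 for the internal bitmap or the bitmap size for the external one.
static uint64_t inmemory_bitmap_offset(uint64_t meta_loc, uint64_t meta_block_size,
    uint64_t entry_size, uint64_t offset)
{
    uint64_t per_block = meta_block_size / entry_size;
    uint64_t sector = (meta_loc / per_block) * meta_block_size;
    uint64_t pos = meta_loc % per_block;
    return sector + pos * entry_size + CLEAN_DISK_ENTRY_HEADER + offset;
}

// Offset of a bitmap inside the compact clean_bitmap array (no in-memory metadata)
static uint64_t compact_bitmap_offset(uint64_t meta_loc, uint64_t bitmap_size, uint64_t offset)
{
    return meta_loc * 2 * bitmap_size + offset;
}
```

With 128 KiB objects (`DEFAULT_ORDER 17`) and 4096-byte granularity, each bitmap is 4 bytes and a clean entry is 32 bytes, which matches the updated comment on `clean_disk_entry`.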
int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op) int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{ {
auto clean_it = clean_db.find(read_op->oid); auto clean_it = clean_db.find(read_op->oid);
@ -134,6 +149,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (!result_version) if (!result_version)
{ {
result_version = dirty_it->first.version; result_version = dirty_it->first.version;
if (read_op->bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
}
} }
if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len, if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset))) dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset)))
@ -155,6 +175,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (!result_version) if (!result_version)
{ {
result_version = clean_it->second.version; result_version = clean_it->second.version;
if (read_op->bitmap)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
}
} }
if (fulfilled < read_op->len) if (fulfilled < read_op->len)
{ {
@ -169,18 +194,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
} }
else else
{ {
uint64_t meta_loc = clean_it->second.location >> block_order; uint8_t *clean_entry_bitmap = get_clean_entry_bitmap(clean_it->second.location, 0);
uint8_t *clean_entry_bitmap;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry));
}
else
{
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*clean_entry_bitmap_size);
}
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity; uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity;
while (bmp_start < bmp_size) while (bmp_start < bmp_size)
{ {
@ -191,8 +205,8 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (bmp_end > bmp_start) if (bmp_end > bmp_start)
{ {
// fill with zeroes // fill with zeroes
fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity, assert(fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0); bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
} }
bmp_start = bmp_end; bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size) while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
@ -218,7 +232,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
else if (fulfilled < read_op->len) else if (fulfilled < read_op->len)
{ {
// fill remaining parts with zeroes // fill remaining parts with zeroes
fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0); assert(fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
} }
assert(fulfilled == read_op->len); assert(fulfilled == read_op->len);
read_op->version = result_version; read_op->version = result_version;


@ -126,11 +126,8 @@ resume_2:
resume_3: resume_3:
if (!disable_journal_fsync) if (!disable_journal_fsync)
{ {
io_uring_sqe *sqe = get_sqe(); io_uring_sqe *sqe;
if (!sqe) BS_SUBMIT_GET_SQE_DECL(sqe);
{
return 0;
}
ring_data_t *data = ((ring_data_t*)sqe->user_data); ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 }; data->iov = { 0 };
@ -166,10 +163,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
auto rm_start = it; auto rm_start = it;
auto rm_end = it; auto rm_end = it;
it--; it--;
while (it->first.oid == ov.oid && while (1)
it->first.version > ov.version &&
!IS_IN_FLIGHT(it->second.state) &&
!IS_STABLE(it->second.state))
{ {
if (it->first.oid != ov.oid) if (it->first.oid != ov.oid)
break; break;
@ -179,7 +173,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
max_unstable = it->first.version; max_unstable = it->first.version;
break; break;
} }
else if (IS_STABLE(it->second.state)) else if (IS_IN_FLIGHT(it->second.state) || IS_STABLE(it->second.state))
break; break;
// Remove entry // Remove entry
rm_start = it; rm_start = it;
@ -190,7 +184,6 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
if (rm_start != rm_end) if (rm_start != rm_end)
{ {
erase_dirty(rm_start, rm_end, UINT64_MAX); erase_dirty(rm_start, rm_end, UINT64_MAX);
}
auto unstab_it = unstable_writes.find(ov.oid); auto unstab_it = unstable_writes.find(ov.oid);
if (unstab_it != unstable_writes.end()) if (unstab_it != unstable_writes.end())
{ {
@ -200,6 +193,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
unstab_it->second = max_unstable; unstab_it->second = max_unstable;
} }
} }
}
} }
void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t *op) void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t *op)
@ -272,6 +266,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
{ {
journal.used_sectors.erase(dirty_it->second.journal_sector); journal.used_sectors.erase(dirty_it->second.journal_sector);
} }
if (clean_entry_bitmap_size > sizeof(void*))
{
free(dirty_it->second.bitmap);
dirty_it->second.bitmap = NULL;
}
if (dirty_it == dirty_start) if (dirty_it == dirty_start)
{ {
break; break;


@ -150,11 +150,8 @@ resume_2:
resume_3: resume_3:
if (!disable_journal_fsync) if (!disable_journal_fsync)
{ {
io_uring_sqe *sqe = get_sqe(); io_uring_sqe *sqe;
if (!sqe) BS_SUBMIT_GET_SQE_DECL(sqe);
{
return 0;
}
ring_data_t *data = ((ring_data_t*)sqe->user_data); ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 }; data->iov = { 0 };
@ -189,6 +186,15 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v)
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED) if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
{ {
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE; dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
// Allocations and deletions are counted when they're stabilized
if (IS_BIG_WRITE(dirty_it->second.state))
{
inode_space_stats[dirty_it->first.oid.inode] += block_size;
}
else if (IS_DELETE(dirty_it->second.state))
{
inode_space_stats[dirty_it->first.oid.inode] -= block_size;
}
} }
else if (IS_STABLE(dirty_it->second.state)) else if (IS_STABLE(dirty_it->second.state))
{ {
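
The accounting rule added to `mark_stable()` (allocations and deletions are counted when they are stabilized) can be sketched without the rest of the blockstore. The enum and function names below are hypothetical stand-ins for the `IS_BIG_WRITE` / `IS_DELETE` state checks, and the 128 KiB block size is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the entry type encoded in dirty_entry::state
enum sketch_entry_type { SKETCH_SMALL_WRITE, SKETCH_BIG_WRITE, SKETCH_DELETE };

// Applies the per-inode accounting rule for one stabilized entry:
// big writes allocate a block, deletes release one, small writes change nothing
static void account_stabilized(std::map<uint64_t, uint64_t> & stats, uint64_t inode,
    sketch_entry_type type, uint64_t block_size)
{
    if (type == SKETCH_BIG_WRITE)
        stats[inode] += block_size;
    else if (type == SKETCH_DELETE)
        stats[inode] -= block_size;
}

// Two big writes, one small write and one delete on inode 1
static uint64_t demo_inode_usage()
{
    std::map<uint64_t, uint64_t> stats;
    const uint64_t block_size = 128 * 1024;
    account_stabilized(stats, 1, SKETCH_BIG_WRITE, block_size);
    account_stabilized(stats, 1, SKETCH_BIG_WRITE, block_size);
    account_stabilized(stats, 1, SKETCH_SMALL_WRITE, block_size);
    account_stabilized(stats, 1, SKETCH_DELETE, block_size);
    return stats[1];
}
```

Small writes go to the journal and do not allocate a data block of their own, which is presumably why they leave the counter unchanged.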


@ -8,7 +8,12 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// Check or assign version number // Check or assign version number
bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE); bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
bool wait_big = false, wait_del = false; bool wait_big = false, wait_del = false;
void *bmp = NULL;
uint64_t version = 1; uint64_t version = 1;
if (!is_del && clean_entry_bitmap_size > sizeof(void*))
{
bmp = calloc_or_die(1, clean_entry_bitmap_size);
}
if (dirty_db.size() > 0) if (dirty_db.size() > 0)
{ {
auto dirty_it = dirty_db.upper_bound((obj_ver_id){ auto dirty_it = dirty_db.upper_bound((obj_ver_id){
@ -25,6 +30,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? !IS_SYNCED(dirty_it->second.state) ? !IS_SYNCED(dirty_it->second.state)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG); : ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
if (clean_entry_bitmap_size > sizeof(void*))
memcpy(bmp, dirty_it->second.bitmap, clean_entry_bitmap_size);
else
bmp = dirty_it->second.bitmap;
} }
} }
if (!found) if (!found)
@ -33,6 +42,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
if (clean_it != clean_db.end()) if (clean_it != clean_db.end())
{ {
version = clean_it->second.version + 1; version = clean_it->second.version + 1;
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy((clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, clean_entry_bitmap_size);
} }
else else
{ {
@ -72,6 +83,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{ {
// Invalid version requested // Invalid version requested
op->retval = -EEXIST; op->retval = -EEXIST;
if (!is_del && clean_entry_bitmap_size > sizeof(void*))
{
free(bmp);
}
return false; return false;
} }
} }
@ -109,6 +124,28 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
state |= BS_ST_IN_FLIGHT; state |= BS_ST_IN_FLIGHT;
if (op->opcode == BS_OP_WRITE_STABLE) if (op->opcode == BS_OP_WRITE_STABLE)
state |= BS_ST_INSTANT; state |= BS_ST_INSTANT;
if (op->bitmap)
{
// Only allow to overwrite part of the object bitmap respective to the write's offset/len
uint8_t *bmp_ptr = (uint8_t*)(clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp);
uint32_t bit = op->offset/bitmap_granularity;
uint32_t bits_left = op->len/bitmap_granularity;
while (!(bit % 8) && bits_left > 8)
{
// Copy bytes
bmp_ptr[bit/8] = ((uint8_t*)op->bitmap)[bit/8];
bit += 8;
bits_left -= 8;
}
while (bits_left > 0)
{
// Copy bits
bmp_ptr[bit/8] = (bmp_ptr[bit/8] & ~(1 << (bit%8)))
| (((uint8_t*)op->bitmap)[bit/8] & (1 << bit%8));
bit++;
bits_left--;
}
}
} }
dirty_db.emplace((obj_ver_id){ dirty_db.emplace((obj_ver_id){
.oid = op->oid, .oid = op->oid,
@ -120,6 +157,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
.offset = is_del ? 0 : op->offset, .offset = is_del ? 0 : op->offset,
.len = is_del ? 0 : op->len, .len = is_del ? 0 : op->len,
.journal_sector = 0, .journal_sector = 0,
.bitmap = bmp,
}); });
return true; return true;
} }
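
The bitmap merge in `enqueue_write()` only overwrites the bits that correspond to the write's offset and length. Below is a standalone variant of that idea, not the commit's exact loop: it also copies whole bytes after an unaligned head. The names `merge_bitmap_range` and `demo_merge` are hypothetical, and a granularity of 4096 bytes is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies bits of <src> into <dst>, but only those covering [offset, offset+len)
static void merge_bitmap_range(uint8_t *dst, const uint8_t *src,
    uint32_t offset, uint32_t len, uint32_t granularity)
{
    uint32_t bit = offset / granularity;
    uint32_t bits_left = len / granularity;
    while (bits_left > 0)
    {
        if (!(bit % 8) && bits_left >= 8)
        {
            // Aligned full byte: copy it as a whole
            dst[bit / 8] = src[bit / 8];
            bit += 8;
            bits_left -= 8;
        }
        else
        {
            // Single bit: keep the other bits of the destination byte
            uint8_t mask = (uint8_t)(1 << (bit % 8));
            dst[bit / 8] = (uint8_t)((dst[bit / 8] & ~mask) | (src[bit / 8] & mask));
            bit++;
            bits_left--;
        }
    }
}

// Merges an all-zero source into an all-ones 32-bit bitmap
// and returns the result as a little-endian integer
static uint32_t demo_merge(uint32_t offset, uint32_t len)
{
    uint8_t dst[4], src[4];
    memset(dst, 0xFF, sizeof(dst));
    memset(src, 0, sizeof(src));
    merge_bitmap_range(dst, src, offset, len, 4096);
    return (uint32_t)dst[0] | ((uint32_t)dst[1] << 8) |
        ((uint32_t)dst[2] << 16) | ((uint32_t)dst[3] << 24);
}
```

In `demo_merge(4096, 8192)` only bits 1 and 2 are taken from the source, so the rest of the previously stored bitmap survives the write.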
@ -128,6 +166,8 @@ void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_
{ {
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid) while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{ {
if (clean_entry_bitmap_size > sizeof(void*))
free(dirty_it->second.bitmap);
dirty_db.erase(dirty_it++); dirty_db.erase(dirty_it++);
} }
bool found = false; bool found = false;
@ -305,7 +345,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Then pre-fill journal entry // Then pre-fill journal entry
journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry( journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE, journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
sizeof(journal_entry_small_write) sizeof(journal_entry_small_write) + clean_entry_bitmap_size
); );
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset; dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++; journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@ -324,6 +364,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
je->len = op->len; je->len = op->len;
je->data_offset = journal.next_free; je->data_offset = journal.next_free;
je->crc32_data = crc32c(0, op->buf, op->len); je->crc32_data = crc32c(0, op->buf, op->len);
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je); je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32; journal.crc32_last = je->crc32;
if (immediate_commit != IMMEDIATE_NONE) if (immediate_commit != IMMEDIATE_NONE)
@ -401,14 +442,10 @@ int blockstore_impl_t::continue_write(blockstore_op_t *op)
goto resume_4; goto resume_4;
resume_2: resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry // Only for the immediate_commit mode: prepare and submit big_write journal entry
sqe = get_sqe(); BS_SUBMIT_GET_SQE_DECL(sqe);
if (!sqe)
{
return 0;
}
je = (journal_entry_big_write*)prefill_single_journal_entry( je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE, journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) sizeof(journal_entry_big_write) + clean_entry_bitmap_size
); );
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset; dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++; journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@ -424,6 +461,7 @@ resume_2:
je->offset = op->offset; je->offset = op->offset;
je->len = op->len; je->len = op->len;
je->location = dirty_it->second.location; je->location = dirty_it->second.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je); je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32; journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe, prepare_journal_sector_write(journal, journal.cur_sector, sqe,
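The `enqueue_write` hunk above restricts a write to the bitmap bits covered by its own offset/len. A minimal standalone sketch of that idea follows. The helper name `copy_bitmap_range` is hypothetical and the loop is structured differently from the diff (leading bits, whole bytes, trailing bits), so treat it as an illustration rather than the blockstore's code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the blockstore API): copy `count` bits
// starting at bit index `first` from `src` into `dst`, leaving every other
// bit of `dst` untouched. Bit i lives in byte i/8 at position i%8.
static void copy_bitmap_range(uint8_t *dst, const uint8_t *src, uint32_t first, uint32_t count)
{
    assert(dst && src);
    uint32_t bit = first;
    // Leading bits, one at a time, until we reach a byte boundary
    while (count > 0 && (bit % 8) != 0)
    {
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        dst[bit / 8] = (uint8_t)((dst[bit / 8] & ~mask) | (src[bit / 8] & mask));
        bit++;
        count--;
    }
    // Whole bytes
    while (count >= 8)
    {
        dst[bit / 8] = src[bit / 8];
        bit += 8;
        count -= 8;
    }
    // Trailing bits, one at a time
    while (count > 0)
    {
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        dst[bit / 8] = (uint8_t)((dst[bit / 8] & ~mask) | (src[bit / 8] & mask));
        bit++;
        count--;
    }
}
```

With a 4 KiB granularity (an assumed value here), a 64 KiB write at offset 16 KiB would map to `first = 4`, `count = 16`.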


@ -4,6 +4,8 @@
#include <stdexcept> #include <stdexcept>
#include "cluster_client.h" #include "cluster_client.h"
#define SCRAP_BUFFER_SIZE 4*1024*1024
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config) cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{ {
this->ringloop = ringloop; this->ringloop = ringloop;
@ -76,6 +78,9 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
st_cli.parse_config(config); st_cli.parse_config(config);
st_cli.load_global_config(); st_cli.load_global_config();
scrap_buffer_size = SCRAP_BUFFER_SIZE;
scrap_buffer = malloc_or_die(scrap_buffer_size);
if (ringloop) if (ringloop)
{ {
consumer.loop = [this]() consumer.loop = [this]()
@ -94,13 +99,21 @@ cluster_client_t::~cluster_client_t()
{ {
ringloop->unregister_consumer(&consumer); ringloop->unregister_consumer(&consumer);
} }
free(scrap_buffer);
} }
void cluster_client_t::stop() cluster_op_t::~cluster_op_t()
{ {
while (msgr.clients.size() > 0) if (buf)
{ {
msgr.stop_client(msgr.clients.begin()->first); free(buf);
buf = NULL;
}
if (bitmap_buf)
{
free(bitmap_buf);
part_bitmaps = NULL;
bitmap_buf = NULL;
} }
} }
@ -141,20 +154,16 @@ static uint32_t is_power_of_two(uint64_t value)
void cluster_client_t::on_load_config_hook(json11::Json::object & config) void cluster_client_t::on_load_config_hook(json11::Json::object & config)
{ {
bs_block_size = config["block_size"].uint64_value(); bs_block_size = config["block_size"].uint64_value();
bs_disk_alignment = config["disk_alignment"].uint64_value();
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value(); bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_block_size) if (!bs_block_size)
{ {
bs_block_size = DEFAULT_BLOCK_SIZE; bs_block_size = DEFAULT_BLOCK_SIZE;
} }
if (!bs_disk_alignment)
{
bs_disk_alignment = DEFAULT_DISK_ALIGNMENT;
}
if (!bs_bitmap_granularity) if (!bs_bitmap_granularity)
{ {
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY; bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
} }
bs_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
uint32_t block_order; uint32_t block_order;
if ((block_order = is_power_of_two(bs_block_size)) >= 64 || bs_block_size < MIN_BLOCK_SIZE || bs_block_size >= MAX_BLOCK_SIZE) if ((block_order = is_power_of_two(bs_block_size)) >= 64 || bs_block_size < MIN_BLOCK_SIZE || bs_block_size >= MAX_BLOCK_SIZE)
{ {
@ -217,21 +226,21 @@ void cluster_client_t::on_change_hook(json11::Json::object & changes)
// And now they have to be resliced! // And now they have to be resliced!
for (auto op: cur_ops) for (auto op: cur_ops)
{ {
if (INODE_POOL(op->inode) == pool_item.first) if (INODE_POOL(op->cur_inode) == pool_item.first)
{ {
op->needs_reslice = true; op->needs_reslice = true;
} }
} }
for (auto op: unsynced_writes) for (auto op: unsynced_writes)
{ {
if (INODE_POOL(op->inode) == pool_item.first) if (INODE_POOL(op->cur_inode) == pool_item.first)
{ {
op->needs_reslice = true; op->needs_reslice = true;
} }
} }
for (auto op: syncing_writes) for (auto op: syncing_writes)
{ {
if (INODE_POOL(op->inode) == pool_item.first) if (INODE_POOL(op->cur_inode) == pool_item.first)
{ {
op->needs_reslice = true; op->needs_reslice = true;
} }
@ -250,6 +259,11 @@ void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
} }
} }
bool cluster_client_t::is_ready()
{
return pgs_loaded;
}
void cluster_client_t::on_ready(std::function<void(void)> fn) void cluster_client_t::on_ready(std::function<void(void)> fn)
{ {
if (pgs_loaded) if (pgs_loaded)
@ -301,7 +315,7 @@ void cluster_client_t::execute(cluster_op_t *op)
op->retval = 0; op->retval = 0;
if (op->opcode != OSD_OP_SYNC && op->opcode != OSD_OP_READ && op->opcode != OSD_OP_WRITE || if (op->opcode != OSD_OP_SYNC && op->opcode != OSD_OP_READ && op->opcode != OSD_OP_WRITE ||
(op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && (!op->inode || !op->len || (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && (!op->inode || !op->len ||
op->offset % bs_disk_alignment || op->len % bs_disk_alignment)) op->offset % bs_bitmap_granularity || op->len % bs_bitmap_granularity))
{ {
op->retval = -EINVAL; op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op); std::function<void(cluster_op_t*)>(op->callback)(op);
@ -312,7 +326,17 @@ void cluster_client_t::execute(cluster_op_t *op)
execute_sync(op); execute_sync(op);
return; return;
} }
if (op->opcode == OSD_OP_WRITE && !immediate_commit) op->cur_inode = op->inode;
if (op->opcode == OSD_OP_WRITE)
{
auto ino_it = st_cli.inode_config.find(op->inode);
if (ino_it != st_cli.inode_config.end() && ino_it->second.readonly)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
if (!immediate_commit)
{ {
if (next_writes.size() > 0) if (next_writes.size() > 0)
{ {
@ -332,35 +356,24 @@ void cluster_client_t::execute(cluster_op_t *op)
return; return;
} }
queued_bytes += op->len; queued_bytes += op->len;
op = copy_write(op);
unsynced_writes.push_back(op);
}
} }
cur_ops.insert(op); cur_ops.insert(op);
continue_rw(op); continue_rw(op);
} }
void cluster_client_t::continue_rw(cluster_op_t *op) cluster_op_t *cluster_client_t::copy_write(cluster_op_t *op)
{ {
pool_id_t pool_id = INODE_POOL(op->inode); // Save operation for replay when one of PGs goes out of sync
if (!pool_id)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
if (st_cli.pool_config.find(pool_id) == st_cli.pool_config.end() ||
st_cli.pool_config[pool_id].real_pg_count == 0)
{
// Postpone operations to unknown pools
return;
}
if (op->opcode == OSD_OP_WRITE && !immediate_commit && !op->is_internal)
{
// Save operation for replay when PG goes out of sync
// (primary OSD drops our connection in this case) // (primary OSD drops our connection in this case)
cluster_op_t *op_copy = new cluster_op_t(); cluster_op_t *op_copy = new cluster_op_t();
op_copy->is_internal = true; op_copy->is_internal = true;
op_copy->orig_op = op; op_copy->orig_op = op;
op_copy->opcode = op->opcode; op_copy->opcode = op->opcode;
op_copy->inode = op->inode; op_copy->inode = op->inode;
op_copy->cur_inode = op->inode;
op_copy->offset = op->offset; op_copy->offset = op->offset;
op_copy->len = op->len; op_copy->len = op->len;
op_copy->buf = malloc_or_die(op->len); op_copy->buf = malloc_or_die(op->len);
@ -381,10 +394,24 @@ void cluster_client_t::continue_rw(cluster_op_t *op)
memcpy(cur_buf, op->iov.buf[i].iov_base, op->iov.buf[i].iov_len); memcpy(cur_buf, op->iov.buf[i].iov_base, op->iov.buf[i].iov_len);
cur_buf += op->iov.buf[i].iov_len; cur_buf += op->iov.buf[i].iov_len;
} }
unsynced_writes.push_back(op_copy); return op_copy;
cur_ops.erase(op); }
cur_ops.insert(op_copy);
op = op_copy; // FIXME Reimplement it using "coroutine emulation"
void cluster_client_t::continue_rw(cluster_op_t *op)
{
pool_id_t pool_id = INODE_POOL(op->cur_inode);
if (!pool_id)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
if (st_cli.pool_config.find(pool_id) == st_cli.pool_config.end() ||
st_cli.pool_config[pool_id].real_pg_count == 0)
{
// Postpone operations to unknown pools
return;
} }
if (!op->parts.size()) if (!op->parts.size())
{ {
@ -394,11 +421,11 @@ void cluster_client_t::continue_rw(cluster_op_t *op)
if (!op->needs_reslice) if (!op->needs_reslice)
{ {
// Send unsent parts, if they're not subject to change // Send unsent parts, if they're not subject to change
for (auto & op_part: op->parts) for (int i = 0; i < op->parts.size(); i++)
{ {
if (!op_part.sent && !op_part.done) if (!op->parts[i].sent && !op->parts[i].done)
{ {
try_send(op, &op_part); try_send(op, i);
} }
} }
} }
@ -409,15 +436,37 @@ void cluster_client_t::continue_rw(cluster_op_t *op)
// Finished successfully // Finished successfully
// Even if the PG count has changed in meanwhile we treat it as success // Even if the PG count has changed in meanwhile we treat it as success
// because if some operations were invalid for the new PG count we'd get errors // because if some operations were invalid for the new PG count we'd get errors
bool is_read = op->opcode == OSD_OP_READ;
if (is_read)
{
// Check parent inode
auto ino_it = st_cli.inode_config.find(op->cur_inode);
if (ino_it != st_cli.inode_config.end() &&
ino_it->second.parent_id)
{
// Continue reading from the parent inode
// FIXME: This obviously requires optimizations for long snapshot chains
op->cur_inode = ino_it->second.parent_id;
op->parts.clear();
op->done_count = 0;
op->needs_reslice = true;
continue_rw(op);
return;
}
}
cur_ops.erase(op); cur_ops.erase(op);
op->retval = op->len; op->retval = op->len;
std::function<void(cluster_op_t*)>(op->callback)(op); std::function<void(cluster_op_t*)>(op->callback)(op);
if (!is_read)
{
continue_sync(); continue_sync();
}
return; return;
} }
else if (op->retval != 0 && op->retval != -EPIPE) else if (op->retval != 0 && op->retval != -EPIPE)
{ {
// Fatal error (not -EPIPE) // Fatal error (not -EPIPE)
bool is_read = op->opcode == OSD_OP_READ;
cur_ops.erase(op); cur_ops.erase(op);
if (!immediate_commit && op->opcode == OSD_OP_WRITE) if (!immediate_commit && op->opcode == OSD_OP_WRITE)
{ {
@ -434,11 +483,12 @@ void cluster_client_t::continue_rw(cluster_op_t *op)
std::function<void(cluster_op_t*)>(op->callback)(op); std::function<void(cluster_op_t*)>(op->callback)(op);
if (del) if (del)
{ {
if (op->buf)
free(op->buf);
delete op; delete op;
} }
if (!is_read)
{
continue_sync(); continue_sync();
}
return; return;
} }
else else
@ -456,60 +506,145 @@ void cluster_client_t::continue_rw(cluster_op_t *op)
} }
} }
void cluster_client_t::slice_rw(cluster_op_t *op) static void add_iov(int size, bool skip, cluster_op_t *op, int &iov_idx, size_t &iov_pos, osd_op_buf_list_t &iov, void *scrap, int scrap_len)
{ {
// Slice the request into individual object stripe requests int left = size;
// Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
uint64_t pg_block_size = bs_block_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
op->retval = 0;
op->parts.resize((last_stripe - first_stripe) / pg_block_size + 1);
int iov_idx = 0;
size_t iov_pos = 0;
int i = 0;
for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
{
pg_num_t pg_num = (op->inode + stripe/pool_cfg.pg_stripe_size) % pool_cfg.real_pg_count + 1;
uint64_t begin = (op->offset < stripe ? stripe : op->offset);
uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
? (stripe + pg_block_size) : (op->offset + op->len);
op->parts[i] = (cluster_op_part_t){
.parent = op,
.offset = begin,
.len = (uint32_t)(end - begin),
.pg_num = pg_num,
.sent = false,
.done = false,
};
int left = end-begin;
while (left > 0 && iov_idx < op->iov.count) while (left > 0 && iov_idx < op->iov.count)
{ {
if (op->iov.buf[iov_idx].iov_len - iov_pos < left) int cur_left = op->iov.buf[iov_idx].iov_len - iov_pos;
if (cur_left < left)
{ {
op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, op->iov.buf[iov_idx].iov_len - iov_pos); if (!skip)
left -= (op->iov.buf[iov_idx].iov_len - iov_pos); {
iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, cur_left);
}
left -= cur_left;
iov_pos = 0; iov_pos = 0;
iov_idx++; iov_idx++;
} }
else else
{ {
op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, left); if (!skip)
{
iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, left);
}
iov_pos += left; iov_pos += left;
left = 0; left = 0;
} }
} }
assert(left == 0); assert(left == 0);
if (skip && scrap_len > 0)
{
// All skipped ranges are read into the same useless buffer
left = size;
while (left > 0)
{
int cur_left = scrap_len < left ? scrap_len : left;
iov.push_back(scrap, cur_left);
left -= cur_left;
}
}
}
void cluster_client_t::slice_rw(cluster_op_t *op)
{
// Slice the request into individual object stripe requests
// Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->cur_inode)];
uint32_t pg_data_size = (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint64_t pg_block_size = bs_block_size * pg_data_size;
uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
op->retval = 0;
op->parts.resize((last_stripe - first_stripe) / pg_block_size + 1);
if (op->opcode == OSD_OP_READ)
{
// Allocate memory for the bitmap
unsigned object_bitmap_size = ((op->len / bs_bitmap_granularity + 7) / 8);
object_bitmap_size = (object_bitmap_size < 8 ? 8 : object_bitmap_size);
unsigned bitmap_mem = object_bitmap_size + (bs_bitmap_size * pg_data_size) * op->parts.size();
if (op->bitmap_buf_size < bitmap_mem)
{
op->bitmap_buf = realloc_or_die(op->bitmap_buf, bitmap_mem);
if (!op->bitmap_buf_size)
{
// First allocation
memset(op->bitmap_buf, 0, object_bitmap_size);
}
op->part_bitmaps = op->bitmap_buf + object_bitmap_size;
op->bitmap_buf_size = bitmap_mem;
}
}
int iov_idx = 0;
size_t iov_pos = 0;
int i = 0;
for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
{
pg_num_t pg_num = (op->cur_inode + stripe/pool_cfg.pg_stripe_size) % pool_cfg.real_pg_count + 1; // like map_to_pg()
uint64_t begin = (op->offset < stripe ? stripe : op->offset);
uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
? (stripe + pg_block_size) : (op->offset + op->len);
op->parts[i].iov.reset();
if (op->cur_inode != op->inode)
{
// Read remaining parts from upper layers
uint64_t prev = begin, cur = begin;
bool skip_prev = true;
while (cur < end)
{
unsigned bmp_loc = (cur - op->offset)/bs_bitmap_granularity;
bool skip = (((*(uint8_t*)(op->bitmap_buf + bmp_loc/8)) >> (bmp_loc%8)) & 0x1);
if (skip_prev != skip)
{
if (cur > prev)
{
if (prev == begin && skip_prev)
{
begin = cur;
// Just advance iov_idx & iov_pos
add_iov(cur-prev, true, op, iov_idx, iov_pos, op->parts[i].iov, NULL, 0);
}
else
add_iov(cur-prev, skip_prev, op, iov_idx, iov_pos, op->parts[i].iov, scrap_buffer, scrap_buffer_size);
}
skip_prev = skip;
prev = cur;
}
cur += bs_bitmap_granularity;
}
assert(cur > prev);
if (skip_prev)
{
// Just advance iov_idx & iov_pos
add_iov(end-prev, true, op, iov_idx, iov_pos, op->parts[i].iov, NULL, 0);
end = prev;
}
else
add_iov(cur-prev, skip_prev, op, iov_idx, iov_pos, op->parts[i].iov, scrap_buffer, scrap_buffer_size);
if (end == begin)
op->done_count++;
}
else
{
add_iov(end-begin, false, op, iov_idx, iov_pos, op->parts[i].iov, NULL, 0);
}
op->parts[i].parent = op;
op->parts[i].offset = begin;
op->parts[i].len = (uint32_t)(end - begin);
op->parts[i].pg_num = pg_num;
op->parts[i].osd_num = 0;
op->parts[i].sent = end <= begin;
op->parts[i].done = end <= begin;
i++; i++;
} }
} }
bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part) bool cluster_client_t::try_send(cluster_op_t *op, int i)
{ {
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)]; cluster_op_part_t *part = &op->parts[i];
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->cur_inode)];
auto pg_it = pool_cfg.pg_config.find(part->pg_num); auto pg_it = pool_cfg.pg_config.find(part->pg_num);
if (pg_it != pool_cfg.pg_config.end() && if (pg_it != pool_cfg.pg_config.end() &&
!pg_it->second.pause && pg_it->second.cur_primary) !pg_it->second.pause && pg_it->second.cur_primary)
@ -522,6 +657,9 @@ bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part)
part->osd_num = primary_osd; part->osd_num = primary_osd;
part->sent = true; part->sent = true;
op->sent_count++; op->sent_count++;
uint64_t pg_bitmap_size = bs_bitmap_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
part->op = (osd_op_t){ part->op = (osd_op_t){
.op_type = OSD_OP_OUT, .op_type = OSD_OP_OUT,
.peer_fd = peer_fd, .peer_fd = peer_fd,
@ -531,10 +669,12 @@ bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part)
.id = op_id++, .id = op_id++,
.opcode = op->opcode, .opcode = op->opcode,
}, },
.inode = op->inode, .inode = op->cur_inode,
.offset = part->offset, .offset = part->offset,
.len = part->len, .len = part->len,
} }, } },
.bitmap = op->opcode == OSD_OP_WRITE ? NULL : op->part_bitmaps + pg_bitmap_size*i,
.bitmap_len = (unsigned)(op->opcode == OSD_OP_WRITE ? 0 : pg_bitmap_size),
.callback = [this, part](osd_op_t *op_part) .callback = [this, part](osd_op_t *op_part)
{ {
handle_op_part(part); handle_op_part(part);
@ -660,8 +800,6 @@ void cluster_client_t::finish_sync()
assert(op->sent_count == 0); assert(op->sent_count == 0);
if (op->is_internal) if (op->is_internal)
{ {
if (op->buf)
free(op->buf);
delete op; delete op;
} }
} }
@ -701,6 +839,16 @@ void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
msgr.outbox_push(&part->op); msgr.outbox_push(&part->op);
} }
static inline void mem_or(void *res, const void *r2, unsigned int len)
{
unsigned int i;
for (i = 0; i < len; ++i)
{
// Hope the compiler vectorizes this
((uint8_t*)res)[i] = ((uint8_t*)res)[i] | ((uint8_t*)r2)[i];
}
}
void cluster_client_t::handle_op_part(cluster_op_part_t *part) void cluster_client_t::handle_op_part(cluster_op_part_t *part)
{ {
cluster_op_t *op = part->parent; cluster_op_t *op = part->parent;
@ -738,6 +886,35 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
// OK // OK
part->done = true; part->done = true;
op->done_count++; op->done_count++;
if (op->opcode == OSD_OP_READ)
{
// Copy (OR) bitmap
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->cur_inode)];
uint32_t pg_block_size = bs_block_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint32_t object_offset = (part->op.req.rw.offset - op->offset) / bs_bitmap_granularity;
uint32_t part_offset = (part->op.req.rw.offset % pg_block_size) / bs_bitmap_granularity;
uint32_t part_len = part->op.req.rw.len / bs_bitmap_granularity;
if (!(object_offset & 0x7) && !(part_offset & 0x7) && (part_len >= 8))
{
// Copy bytes
mem_or(op->bitmap_buf + object_offset/8, part->op.bitmap + part_offset/8, part_len/8);
object_offset += (part_len & ~0x7);
part_offset += (part_len & ~0x7);
part_len = (part_len & 0x7);
}
while (part_len > 0)
{
// Copy bits
(*(uint8_t*)(op->bitmap_buf + (object_offset >> 3))) |= (
(((*(uint8_t*)(part->op.bitmap + (part_offset >> 3))) >> (part_offset & 0x7)) & 0x1) << (object_offset & 0x7)
);
part_offset++;
object_offset++;
part_len--;
}
}
} }
if (op->sent_count == 0) if (op->sent_count == 0)
{ {
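The stripe arithmetic in `slice_rw` can be tried in isolation. The sketch below reuses the same `first_stripe` / `last_stripe` / `begin` / `end` formulas from the diff; `stripe_part_t` and `slice_request` are hypothetical names introduced only for this example, and PG mapping and iovec handling are deliberately left out:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical result type: one entry per object stripe touched by a request
struct stripe_part_t
{
    uint64_t stripe; // stripe start offset, a multiple of pg_block_size
    uint64_t begin;  // first requested byte inside this stripe
    uint64_t end;    // one past the last requested byte inside this stripe
};

// Split the byte range [offset, offset+len) into per-stripe parts,
// using the same formulas as slice_rw() in the diff above.
static std::vector<stripe_part_t> slice_request(uint64_t offset, uint64_t len, uint64_t pg_block_size)
{
    assert(len > 0 && pg_block_size > 0);
    std::vector<stripe_part_t> parts;
    uint64_t first_stripe = (offset / pg_block_size) * pg_block_size;
    uint64_t last_stripe = ((offset + len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
    for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
    {
        uint64_t begin = offset < stripe ? stripe : offset;
        uint64_t end = (offset + len) > (stripe + pg_block_size)
            ? (stripe + pg_block_size) : (offset + len);
        parts.push_back({ stripe, begin, end });
    }
    return parts;
}
```

For example, with a 128 KiB `pg_block_size`, a 12 KiB request starting at 124 KiB crosses one stripe boundary and is split into a 4 KiB part and an 8 KiB part.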


@ -8,8 +8,6 @@
#define MIN_BLOCK_SIZE 4*1024 #define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024 #define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_DISK_ALIGNMENT 4096
#define DEFAULT_BITMAP_GRANULARITY 4096
#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024 #define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024
struct cluster_op_t; struct cluster_op_t;
@ -36,7 +34,9 @@ struct cluster_op_t
int retval; int retval;
osd_op_buf_list_t iov; osd_op_buf_list_t iov;
std::function<void(cluster_op_t*)> callback; std::function<void(cluster_op_t*)> callback;
~cluster_op_t();
protected: protected:
uint64_t cur_inode; // for snapshot reads
void *buf = NULL; void *buf = NULL;
cluster_op_t *orig_op = NULL; cluster_op_t *orig_op = NULL;
bool is_internal = false; bool is_internal = false;
@ -44,6 +44,8 @@ protected:
bool up_wait = false; bool up_wait = false;
int sent_count = 0, done_count = 0; int sent_count = 0, done_count = 0;
std::vector<cluster_op_part_t> parts; std::vector<cluster_op_part_t> parts;
void *bitmap_buf = NULL, *part_bitmaps = NULL;
unsigned bitmap_buf_size = 0;
friend class cluster_client_t; friend class cluster_client_t;
}; };
@ -53,8 +55,7 @@ class cluster_client_t
ring_loop_t *ringloop; ring_loop_t *ringloop;
uint64_t bs_block_size = 0; uint64_t bs_block_size = 0;
uint64_t bs_disk_alignment = 0; uint32_t bs_bitmap_granularity = 0, bs_bitmap_size = 0;
uint64_t bs_bitmap_granularity = 0;
std::map<pool_id_t, uint64_t> pg_counts; std::map<pool_id_t, uint64_t> pg_counts;
bool immediate_commit = false; bool immediate_commit = false;
// FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory. // FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
@ -75,6 +76,8 @@ class cluster_client_t
std::vector<cluster_op_t*> next_writes; std::vector<cluster_op_t*> next_writes;
std::vector<cluster_op_t*> offline_ops; std::vector<cluster_op_t*> offline_ops;
uint64_t queued_bytes = 0; uint64_t queued_bytes = 0;
void *scrap_buffer = NULL;
unsigned scrap_buffer_size = 0;
bool pgs_loaded = false; bool pgs_loaded = false;
std::vector<std::function<void(void)>> on_ready_hooks; std::vector<std::function<void(void)>> on_ready_hooks;
@ -87,8 +90,8 @@ public:
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config); cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
~cluster_client_t(); ~cluster_client_t();
void execute(cluster_op_t *op); void execute(cluster_op_t *op);
bool is_ready();
void on_ready(std::function<void(void)> fn); void on_ready(std::function<void(void)> fn);
void stop();
protected: protected:
void continue_ops(bool up_retry = false); void continue_ops(bool up_retry = false);
@ -96,9 +99,10 @@ protected:
void on_load_pgs_hook(bool success); void on_load_pgs_hook(bool success);
void on_change_hook(json11::Json::object & changes); void on_change_hook(json11::Json::object & changes);
void on_change_osd_state_hook(uint64_t peer_osd); void on_change_osd_state_hook(uint64_t peer_osd);
cluster_op_t *copy_write(cluster_op_t *op);
void continue_rw(cluster_op_t *op); void continue_rw(cluster_op_t *op);
void slice_rw(cluster_op_t *op); void slice_rw(cluster_op_t *op);
bool try_send(cluster_op_t *op, cluster_op_part_t *part); bool try_send(cluster_op_t *op, int i);
void execute_sync(cluster_op_t *op); void execute_sync(cluster_op_t *op);
void continue_sync(); void continue_sync();
void finish_sync(); void finish_sync();
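The `cur_inode` and `bitmap_buf` fields exist so that a read can fall through to parent inodes for granules the child does not hold. The following is a simplified in-memory model of that behaviour, assuming each layer exposes a presence bitmap and a flat data buffer. `layer_t`, `test_bit` and `read_through_chain` are hypothetical names, and the real client does this asynchronously over the network rather than in one loop:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one inode layer (an image or one of its parents)
struct layer_t
{
    std::vector<uint8_t> bitmap; // 1 bit per granule, set if this layer holds the granule
    std::vector<uint8_t> data;   // granule_count * granule_size bytes
};

static bool test_bit(const std::vector<uint8_t> &bmp, uint32_t i)
{
    return (bmp[i / 8] >> (i % 8)) & 1;
}

// chain[0] is the image being read, chain[1] its parent, and so on.
// Each granule is taken from the first layer that holds it; granules that
// no layer holds stay zero-filled. Returns the merged (OR-ed) bitmap.
static std::vector<uint8_t> read_through_chain(const std::vector<layer_t> &chain,
    uint32_t granule_count, uint32_t granule_size, uint8_t *out)
{
    assert(out && granule_size > 0);
    std::vector<uint8_t> covered((granule_count + 7) / 8, 0);
    std::memset(out, 0, (size_t)granule_count * granule_size);
    for (const layer_t &layer : chain)
    {
        for (uint32_t g = 0; g < granule_count; g++)
        {
            if (!test_bit(covered, g) && test_bit(layer.bitmap, g))
            {
                std::memcpy(out + (size_t)g * granule_size,
                    layer.data.data() + (size_t)g * granule_size, granule_size);
                covered[g / 8] |= (uint8_t)(1u << (g % 8));
            }
        }
    }
    return covered;
}
```

The cost grows with the chain length, which is the concern raised by the FIXME about long snapshot chains in `continue_rw`.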


@ -9,6 +9,11 @@
etcd_state_client_t::~etcd_state_client_t() etcd_state_client_t::~etcd_state_client_t()
{ {
for (auto watch: watches)
{
delete watch;
}
watches.clear();
etcd_watches_initialised = -1; etcd_watches_initialised = -1;
if (etcd_watch_ws) if (etcd_watch_ws)
{ {
@ -56,6 +61,23 @@ void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int t
http_request_json(tfd, etcd_address, req, timeout, callback); http_request_json(tfd, etcd_address, req, timeout, callback);
} }
void etcd_state_client_t::add_etcd_url(std::string addr)
{
if (addr.length() > 0)
{
if (strtolower(addr.substr(0, 7)) == "http://")
addr = addr.substr(7);
else if (strtolower(addr.substr(0, 8)) == "https://")
{
printf("HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
exit(1);
}
if (addr.find('/') == std::string::npos)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
}
void etcd_state_client_t::parse_config(json11::Json & config) void etcd_state_client_t::parse_config(json11::Json & config)
{ {
this->etcd_addresses.clear(); this->etcd_addresses.clear();
@ -65,13 +87,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
while (1) while (1)
{ {
int pos = ea.find(','); int pos = ea.find(',');
std::string addr = pos >= 0 ? ea.substr(0, pos) : ea; add_etcd_url(pos >= 0 ? ea.substr(0, pos) : ea);
if (addr.length() > 0)
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
if (pos >= 0) if (pos >= 0)
ea = ea.substr(pos+1); ea = ea.substr(pos+1);
else else
@ -82,13 +98,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
{ {
for (auto & ea: config["etcd_address"].array_items()) for (auto & ea: config["etcd_address"].array_items())
{ {
std::string addr = ea.string_value(); add_etcd_url(ea.string_value());
if (addr != "")
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
} }
} }
this->etcd_prefix = config["etcd_prefix"].string_value(); this->etcd_prefix = config["etcd_prefix"].string_value();
@ -261,6 +271,12 @@ void etcd_state_client_t::load_pgs()
{ "key", base64_encode(etcd_prefix+"/config/pgs") }, { "key", base64_encode(etcd_prefix+"/config/pgs") },
} } } }
}, },
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/config/inode/") },
{ "range_end", base64_encode(etcd_prefix+"/config/inode0") },
} }
},
json11::Json::object { json11::Json::object {
{ "request_range", json11::Json::object { { "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/pg/history/") }, { "key", base64_encode(etcd_prefix+"/pg/history/") },
@ -607,4 +623,105 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
} }
} }
} }
else if (key.substr(0, etcd_prefix.length()+14) == etcd_prefix+"/config/inode/")
{
// <etcd_prefix>/config/inode/%d/%d
uint64_t pool_id = 0;
uint64_t inode_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
{
printf("Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
inode_num |= (pool_id << (64-POOL_ID_BITS));
auto it = this->inode_config.find(inode_num);
if (it != this->inode_config.end() && it->second.name != "")
{
auto n_it = this->inode_by_name.find(it->second.name);
if (n_it->second == inode_num)
{
this->inode_by_name.erase(n_it);
for (auto w: watches)
{
if (w->name == it->second.name)
{
w->cfg = { 0 };
}
}
}
}
if (!value.is_object())
{
this->inode_config.erase(inode_num);
}
else
{
inode_t parent_inode_num = value["parent_id"].uint64_value();
if (parent_inode_num && !(parent_inode_num >> (64-POOL_ID_BITS)))
{
uint64_t parent_pool_id = value["parent_pool"].uint64_value();
if (!parent_pool_id)
parent_inode_num |= pool_id << (64-POOL_ID_BITS);
else if (parent_pool_id >= POOL_ID_MAX)
{
printf(
"Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
);
parent_inode_num = 0;
}
else
parent_inode_num |= parent_pool_id << (64-POOL_ID_BITS);
}
inode_config_t cfg = (inode_config_t){
.num = inode_num,
.name = value["name"].string_value(),
.size = value["size"].uint64_value(),
.parent_id = parent_inode_num,
.readonly = value["readonly"].bool_value(),
};
this->inode_config[inode_num] = cfg;
if (cfg.name != "")
{
this->inode_by_name[cfg.name] = inode_num;
for (auto w: watches)
{
if (w->name == value["name"].string_value())
{
w->cfg = cfg;
}
}
}
}
}
}
}
inode_watch_t* etcd_state_client_t::watch_inode(std::string name)
{
inode_watch_t *watch = new inode_watch_t;
watch->name = name;
watches.push_back(watch);
auto it = inode_by_name.find(name);
if (it != inode_by_name.end())
{
watch->cfg = inode_config[it->second];
}
return watch;
}
void etcd_state_client_t::close_watch(inode_watch_t* watch)
{
for (int i = 0; i < watches.size(); i++)
{
if (watches[i] == watch)
{
watches.erase(watches.begin()+i, watches.begin()+i+1);
break;
}
}
delete watch;
} }
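The `parse_state` branch above turns keys of the form `<etcd_prefix>/config/inode/<pool>/<inode>` into a single 64-bit inode number with the pool ID in the top bits. A minimal sketch of that parsing step follows. `parse_inode_key` is a hypothetical helper, and `EXAMPLE_POOL_ID_BITS = 16` is an assumed value for illustration, since the real `POOL_ID_BITS` constant is defined outside this diff:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption for illustration only; the real POOL_ID_BITS is not shown in the diff
static const int EXAMPLE_POOL_ID_BITS = 16;

// Parse "<prefix>/config/inode/<pool>/<inode>" into a combined inode number.
// Returns 0 when the key does not match or a component is out of range.
static uint64_t parse_inode_key(const std::string &key, const std::string &prefix)
{
    const std::string head = prefix + "/config/inode/";
    if (key.compare(0, head.length(), head) != 0)
        return 0;
    unsigned long long pool_id = 0, inode_num = 0;
    char extra = 0;
    // A trailing character makes sscanf return 3, so exactly 2 means a clean match
    int n = std::sscanf(key.c_str() + head.length(), "%llu/%llu%c", &pool_id, &inode_num, &extra);
    const int shift = 64 - EXAMPLE_POOL_ID_BITS;
    if (n != 2 || !pool_id || pool_id >= (1ull << EXAMPLE_POOL_ID_BITS) ||
        !inode_num || (inode_num >> shift))
    {
        return 0;
    }
    return (uint64_t)(inode_num | (pool_id << shift));
}
```

Checking the `sscanf` return value, rather than only the trailing byte, also rejects keys where one of the numbers is missing.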


@ -52,8 +52,29 @@ struct pool_config_t
std::map<pg_num_t, pg_config_t> pg_config; std::map<pg_num_t, pg_config_t> pg_config;
}; };
struct inode_config_t
{
uint64_t num;
std::string name;
uint64_t size;
inode_t parent_id;
bool readonly;
};
struct inode_watch_t
{
std::string name;
inode_config_t cfg;
};
struct etcd_state_client_t struct etcd_state_client_t
{ {
protected:
std::vector<inode_watch_t*> watches;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = 0;
void add_etcd_url(std::string);
public:
std::vector<std::string> etcd_addresses; std::vector<std::string> etcd_addresses;
std::string etcd_prefix; std::string etcd_prefix;
int log_level = 0; int log_level = 0;
@ -61,10 +82,10 @@ struct etcd_state_client_t
int etcd_watches_initialised = 0; int etcd_watches_initialised = 0;
uint64_t etcd_watch_revision = 0; uint64_t etcd_watch_revision = 0;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = 0;
std::map<pool_id_t, pool_config_t> pool_config; std::map<pool_id_t, pool_config_t> pool_config;
std::map<osd_num_t, json11::Json> peer_states; std::map<osd_num_t, json11::Json> peer_states;
std::map<inode_t, inode_config_t> inode_config;
std::map<std::string, inode_t> inode_by_name;
std::function<void(json11::Json::object &)> on_change_hook; std::function<void(json11::Json::object &)> on_change_hook;
std::function<void(json11::Json::object &)> on_load_config_hook; std::function<void(json11::Json::object &)> on_load_config_hook;
@ -81,5 +102,7 @@ struct etcd_state_client_t
void load_pgs(); void load_pgs();
void parse_state(const std::string & key, const json11::Json & value); void parse_state(const std::string & key, const json11::Json & value);
void parse_config(json11::Json & config); void parse_config(json11::Json & config);
inode_watch_t* watch_inode(std::string name);
void close_watch(inode_watch_t* watch);
~etcd_state_client_t(); ~etcd_state_client_t();
}; };


@ -6,17 +6,17 @@
// Random write: // Random write:
// //
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \ // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M // -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] (-image=testimg | -pool=1 -inode=1 -size=1000M)
// //
// Linear write: // Linear write:
// //
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=128k -direct=1 -fsync=32 -iodepth=32 -rw=write \ // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=128k -direct=1 -fsync=32 -iodepth=32 -rw=write \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M // -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg
// //
// Random read (run with -iodepth=32 or -iodepth=1): // Random read (run with -iodepth=32 or -iodepth=1):
// //
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=32 -rw=randread \ // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=32 -rw=randread \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M // -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg
#include <sys/types.h> #include <sys/types.h>
#include <sys/socket.h> #include <sys/socket.h>
@ -35,6 +35,7 @@ struct sec_data
ring_loop_t *ringloop = NULL; ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL; epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL; cluster_client_t *cli = NULL;
inode_watch_t *watch = NULL;
bool last_sync = false; bool last_sync = false;
/* The list of completed io_u structs. */ /* The list of completed io_u structs. */
std::vector<io_u*> completed; std::vector<io_u*> completed;
@ -47,6 +48,7 @@ struct sec_options
int __pad; int __pad;
char *etcd_host = NULL; char *etcd_host = NULL;
char *etcd_prefix = NULL; char *etcd_prefix = NULL;
char *image = NULL;
uint64_t pool = 0; uint64_t pool = 0;
uint64_t inode = 0; uint64_t inode = 0;
int cluster_log = 0; int cluster_log = 0;
@ -64,7 +66,7 @@ static struct fio_option options[] = {
.group = FIO_OPT_G_FILENAME, .group = FIO_OPT_G_FILENAME,
}, },
{ {
.name = "etcd", .name = "etcd_prefix",
.lname = "etcd key prefix", .lname = "etcd key prefix",
.type = FIO_OPT_STR_STORE, .type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, etcd_prefix), .off1 = offsetof(struct sec_options, etcd_prefix),
@ -72,6 +74,15 @@ static struct fio_option options[] = {
.category = FIO_OPT_C_ENGINE, .category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME, .group = FIO_OPT_G_FILENAME,
}, },
{
.name = "image",
.lname = "Vitastor image name",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, image),
.help = "Vitastor image name to run tests on",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{ {
.name = "pool", .name = "pool",
.lname = "pool number for the inode", .lname = "pool number for the inode",
@ -86,7 +97,7 @@ static struct fio_option options[] = {
.lname = "inode to run tests on", .lname = "inode to run tests on",
.type = FIO_OPT_INT, .type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, inode), .off1 = offsetof(struct sec_options, inode),
.help = "inode to run tests on (1 by default)", .help = "inode number to run tests on",
.category = FIO_OPT_C_ENGINE, .category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME, .group = FIO_OPT_G_FILENAME,
}, },
@ -117,8 +128,15 @@ static struct fio_option options[] = {
static int sec_setup(struct thread_data *td) static int sec_setup(struct thread_data *td)
{ {
sec_options *o = (sec_options*)td->eo;
sec_data *bsd; sec_data *bsd;
if (!o->etcd_host)
{
td_verror(td, EINVAL, "etcd address is missing");
return 1;
}
bsd = new sec_data; bsd = new sec_data;
if (!bsd) if (!bsd)
{ {
@ -134,6 +152,51 @@ static int sec_setup(struct thread_data *td)
td->o.open_files++; td->o.open_files++;
} }
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(o->etcd_host) },
{ "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
{ "log_level", o->cluster_log },
};
if (!o->image)
{
if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
{
td_verror(td, EINVAL, "inode number is missing");
return 1;
}
if (o->pool)
{
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
}
if (!(o->inode >> (64-POOL_ID_BITS)))
{
td_verror(td, EINVAL, "pool is missing");
return 1;
}
}
else
{
o->inode = 0;
}
bsd->ringloop = new ring_loop_t(512);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
if (o->image)
{
while (!bsd->cli->is_ready())
{
bsd->ringloop->loop();
if (bsd->cli->is_ready())
break;
bsd->ringloop->wait();
}
bsd->watch = bsd->cli->st_cli.watch_inode(std::string(o->image));
td->files[0]->real_file_size = bsd->watch->cfg.size;
}
bsd->trace = o->trace ? true : false;
return 0; return 0;
} }
@ -142,6 +205,10 @@ static void sec_cleanup(struct thread_data *td)
sec_data *bsd = (sec_data*)td->io_ops_data; sec_data *bsd = (sec_data*)td->io_ops_data;
if (bsd) if (bsd)
{ {
if (bsd->watch)
{
bsd->cli->st_cli.close_watch(bsd->watch);
}
delete bsd->cli; delete bsd->cli;
delete bsd->epmgr; delete bsd->epmgr;
delete bsd->ringloop; delete bsd->ringloop;
@ -152,28 +219,6 @@ static void sec_cleanup(struct thread_data *td)
/* Connect to the server from each thread. */ /* Connect to the server from each thread. */
static int sec_init(struct thread_data *td) static int sec_init(struct thread_data *td)
{ {
sec_options *o = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(o->etcd_host) },
{ "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
{ "log_level", o->cluster_log },
};
if (o->pool)
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
if (!(o->inode >> (64-POOL_ID_BITS)))
{
td_verror(td, EINVAL, "pool is missing");
return 1;
}
bsd->ringloop = new ring_loop_t(512);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
bsd->trace = o->trace ? true : false;
return 0; return 0;
} }
@ -193,19 +238,23 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
io->engine_data = bsd; io->engine_data = bsd;
cluster_op_t *op = new cluster_op_t; cluster_op_t *op = new cluster_op_t;
op->inode = opt->image ? bsd->watch->cfg.num : opt->inode;
switch (io->ddir) switch (io->ddir)
{ {
case DDIR_READ: case DDIR_READ:
op->opcode = OSD_OP_READ; op->opcode = OSD_OP_READ;
op->inode = opt->inode;
op->offset = io->offset; op->offset = io->offset;
op->len = io->xfer_buflen; op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen); op->iov.push_back(io->xfer_buf, io->xfer_buflen);
bsd->last_sync = false; bsd->last_sync = false;
break; break;
case DDIR_WRITE: case DDIR_WRITE:
if (opt->image && bsd->watch->cfg.readonly)
{
io->error = EROFS;
return FIO_Q_COMPLETED;
}
op->opcode = OSD_OP_WRITE; op->opcode = OSD_OP_WRITE;
op->inode = opt->inode;
op->offset = io->offset; op->offset = io->offset;
op->len = io->xfer_buflen; op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen); op->iov.push_back(io->xfer_buf, io->xfer_buflen);
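
Reviewer sketch (not part of the commit): when no `-image` is given, `sec_setup()` packs the pool ID into the top bits of the inode number. The helpers below show that packing in isolation. `POOL_ID_BITS` is not defined anywhere in this diff, so the value 16 is an assumption chosen purely for illustration; `make_inode`, `pool_of` and `number_of` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: POOL_ID_BITS is not shown in this diff; 16 is illustrative only.
constexpr unsigned POOL_ID_BITS = 16;
// Mask selecting the low (64 - POOL_ID_BITS) bits, i.e. the inode number within a pool.
constexpr uint64_t INODE_NO_MASK = (UINT64_C(1) << (64 - POOL_ID_BITS)) - 1;

// Put the pool ID into the top POOL_ID_BITS bits of the 64-bit inode ID.
inline uint64_t make_inode(uint64_t pool, uint64_t inode_no)
{
    assert(pool < (UINT64_C(1) << POOL_ID_BITS));
    return (inode_no & INODE_NO_MASK) | (pool << (64 - POOL_ID_BITS));
}

// Extract the pool ID (0 means "pool is missing" in the checks above).
inline uint64_t pool_of(uint64_t inode)
{
    return inode >> (64 - POOL_ID_BITS);
}

// Extract the inode number within the pool (0 means "inode number is missing").
inline uint64_t number_of(uint64_t inode)
{
    return inode & INODE_NO_MASK;
}
```

Using `UINT64_C(1)` rather than `1l` keeps the shift well defined on platforms where `long` is 32 bits wide.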


@ -22,7 +22,6 @@
#define READ_BUFFER_SIZE 9000 #define READ_BUFFER_SIZE 9000
static int extract_port(std::string & host); static int extract_port(std::string & host);
static std::string strtolower(const std::string & in);
static std::string trim(const std::string & in); static std::string trim(const std::string & in);
static std::string ws_format_frame(int type, uint64_t size); static std::string ws_format_frame(int type, uint64_t size);
static bool ws_parse_frame(std::string & buf, int & type, std::string & res); static bool ws_parse_frame(std::string & buf, int & type, std::string & res);
@ -673,7 +672,7 @@ static int extract_port(std::string & host)
return port; return port;
} }
static std::string strtolower(const std::string & in) std::string strtolower(const std::string & in)
{ {
std::string s = in; std::string s = in;
for (int i = 0; i < s.length(); i++) for (int i = 0; i < s.length(); i++)


@ -49,6 +49,8 @@ std::vector<std::string> getifaddr_list(bool include_v6 = false);
uint64_t stoull_full(const std::string & str, int base = 10); uint64_t stoull_full(const std::string & str, int base = 10);
std::string strtolower(const std::string & in);
void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request, void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
const http_options_t & options, std::function<void(const http_response_t *response)> callback); const http_options_t & options, std::function<void(const http_response_t *response)> callback);


@ -35,6 +35,7 @@
#define DEFAULT_PEER_CONNECT_INTERVAL 5 #define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5 #define DEFAULT_PEER_CONNECT_TIMEOUT 5
#define DEFAULT_OSD_PING_TIMEOUT 5 #define DEFAULT_OSD_PING_TIMEOUT 5
#define DEFAULT_BITMAP_GRANULARITY 4096
// Kind of a vector with small-list-optimisation // Kind of a vector with small-list-optimisation
struct osd_op_buf_list_t struct osd_op_buf_list_t
@ -174,13 +175,17 @@ struct osd_primary_op_data_t;
struct osd_op_t struct osd_op_t
{ {
timespec tv_begin; timespec tv_begin = { 0 }, tv_end = { 0 };
uint64_t op_type = OSD_OP_IN; uint64_t op_type = OSD_OP_IN;
int peer_fd; int peer_fd;
osd_any_op_t req; osd_any_op_t req;
osd_any_reply_t reply; osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL; blockstore_op_t *bs_op = NULL;
void *buf = NULL; void *buf = NULL;
// bitmap, bitmap_len, bmp_data are only meaningful for reads
void *bitmap = NULL;
unsigned bitmap_len = 0;
unsigned bmp_data = 0;
void *rmw_buf = NULL; void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL; osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback; std::function<void(osd_op_t*)> callback;


@ -202,22 +202,34 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
osd_op_t *cur_op = cl->read_op; osd_op_t *cur_op = cl->read_op;
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ) if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ)
{ {
if (cur_op->req.sec_rw.len > 0)
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
cl->read_remaining = 0; cl->read_remaining = 0;
} }
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || else if (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{ {
if (cur_op->req.sec_rw.attr_len > 0)
{
if (cur_op->req.sec_rw.attr_len > sizeof(unsigned))
cur_op->bitmap = cur_op->rmw_buf = malloc_or_die(cur_op->req.sec_rw.attr_len);
else
cur_op->bitmap = &cur_op->bmp_data;
cl->recv_list.push_back(cur_op->bitmap, cur_op->req.sec_rw.attr_len);
}
if (cur_op->req.sec_rw.len > 0) if (cur_op->req.sec_rw.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len); cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
cl->read_remaining = cur_op->req.sec_rw.len; cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_rw.len);
}
cl->read_remaining = cur_op->req.sec_rw.len + cur_op->req.sec_rw.attr_len;
} }
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK) cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
{ {
if (cur_op->req.sec_stab.len > 0) if (cur_op->req.sec_stab.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_stab.len); cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_stab.len);
cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_stab.len);
}
cl->read_remaining = cur_op->req.sec_stab.len; cl->read_remaining = cur_op->req.sec_stab.len;
} }
else if (cur_op->req.hdr.opcode == OSD_OP_READ) else if (cur_op->req.hdr.opcode == OSD_OP_READ)
@ -227,13 +239,15 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
else if (cur_op->req.hdr.opcode == OSD_OP_WRITE) else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
{ {
if (cur_op->req.rw.len > 0) if (cur_op->req.rw.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.rw.len); cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.rw.len);
cl->recv_list.push_back(cur_op->buf, cur_op->req.rw.len);
}
cl->read_remaining = cur_op->req.rw.len; cl->read_remaining = cur_op->req.rw.len;
} }
if (cl->read_remaining > 0) if (cl->read_remaining > 0)
{ {
// Read data // Read data
cl->recv_list.push_back(cur_op->buf, cl->read_remaining);
cl->read_state = CL_READ_DATA; cl->read_state = CL_READ_DATA;
} }
else else
@ -259,12 +273,12 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
osd_op_t *op = req_it->second; osd_op_t *op = req_it->second;
memcpy(op->reply.buf, cl->read_op->req.buf, OSD_PACKET_SIZE); memcpy(op->reply.buf, cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->sent_ops.erase(req_it); cl->sent_ops.erase(req_it);
if ((op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ) && if (op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ)
op->reply.hdr.retval > 0)
{ {
// Read data. In this case we assume that the buffer is preallocated by the caller (!) // Read data. In this case we assume that the buffer is preallocated by the caller (!)
assert(op->iov.count > 0); unsigned bmp_len = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->reply.sec_rw.attr_len : op->reply.rw.bitmap_len);
if (op->reply.hdr.retval != (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len)) if (op->reply.hdr.retval != (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len) ||
bmp_len > op->bitmap_len)
{ {
// Check reply length to not overflow the buffer // Check reply length to not overflow the buffer
printf("Client %d read reply of different length\n", cl->peer_fd); printf("Client %d read reply of different length\n", cl->peer_fd);
@ -272,11 +286,23 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
stop_client(cl->peer_fd); stop_client(cl->peer_fd);
return false; return false;
} }
if (bmp_len > 0)
{
cl->recv_list.push_back(op->bitmap, bmp_len);
}
if (op->reply.hdr.retval > 0)
{
assert(op->iov.count > 0);
cl->recv_list.append(op->iov); cl->recv_list.append(op->iov);
}
cl->read_remaining = op->reply.hdr.retval + bmp_len;
if (cl->read_remaining == 0)
{
goto reuse;
}
delete cl->read_op; delete cl->read_op;
cl->read_op = op; cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA; cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
} }
else if (op->reply.hdr.opcode == OSD_OP_SEC_LIST && op->reply.hdr.retval > 0) else if (op->reply.hdr.opcode == OSD_OP_SEC_LIST && op->reply.hdr.retval > 0)
{ {
@ -300,6 +326,7 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
} }
else else
{ {
reuse:
// It's fine to reuse cl->read_op for the next reply // It's fine to reuse cl->read_op for the next reply
handle_reply_ready(op); handle_reply_ready(op);
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE); cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
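
Reviewer sketch (not part of the commit): `handle_op_hdr()` above stores small bitmaps inline in `bmp_data` and only allocates when `attr_len` exceeds `sizeof(unsigned)`. The snippet below isolates that choice. `op_bitmap_t`, `attach_bitmap` and `release_bitmap` are hypothetical names; in the diff the heap buffer is also assigned to `rmw_buf`, presumably so that it is released together with the operation.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical reduction of the bitmap-related fields of osd_op_t.
struct op_bitmap_t
{
    void *bitmap = nullptr;    // points either at bmp_data or at heap_buf
    void *heap_buf = nullptr;  // owned allocation, only for large bitmaps
    unsigned bmp_data = 0;     // inline storage for small bitmaps
};

// Choose inline or heap storage depending on the bitmap length in bytes.
inline void attach_bitmap(op_bitmap_t & op, unsigned attr_len)
{
    if (attr_len == 0)
    {
        op.bitmap = nullptr;
        return;
    }
    if (attr_len > sizeof(unsigned))
    {
        op.heap_buf = std::malloc(attr_len);
        assert(op.heap_buf != nullptr);
        op.bitmap = op.heap_buf;
    }
    else
    {
        op.bitmap = &op.bmp_data;
    }
}

// Free the heap buffer, if any; safe to call for inline bitmaps too.
inline void release_bitmap(op_bitmap_t & op)
{
    std::free(op.heap_buf);
    op.heap_buf = nullptr;
    op.bitmap = nullptr;
}
```

With a 4-byte `unsigned`, the inline path covers up to 32 bitmap bits, so only larger bitmaps pay for an allocation.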


@ -47,6 +47,27 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
cl->sent_ops[cur_op->req.hdr.id] = cur_op; cl->sent_ops[cur_op->req.hdr.id] = cur_op;
} }
to_outbox.push_back(NULL); to_outbox.push_back(NULL);
// Bitmap
if (cur_op->op_type == OSD_OP_IN &&
cur_op->req.hdr.opcode == OSD_OP_SEC_READ &&
cur_op->reply.sec_rw.attr_len > 0)
{
to_send_list.push_back((iovec){
.iov_base = cur_op->bitmap,
.iov_len = cur_op->reply.sec_rw.attr_len,
});
to_outbox.push_back(NULL);
}
else if (cur_op->op_type == OSD_OP_OUT &&
(cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
cur_op->req.sec_rw.attr_len > 0)
{
to_send_list.push_back((iovec){
.iov_base = cur_op->bitmap,
.iov_len = cur_op->req.sec_rw.attr_len,
});
to_outbox.push_back(NULL);
}
// Operation data // Operation data
if ((cur_op->op_type == OSD_OP_IN if ((cur_op->op_type == OSD_OP_IN
? (cur_op->req.hdr.opcode == OSD_OP_READ || ? (cur_op->req.hdr.opcode == OSD_OP_READ ||
@ -97,8 +118,10 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{ {
return; return;
} }
timespec tv_end; if (!cur_op->tv_end.tv_sec)
clock_gettime(CLOCK_REALTIME, &tv_end); {
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
}
stats.op_stat_count[cur_op->req.hdr.opcode]++; stats.op_stat_count[cur_op->req.hdr.opcode]++;
if (!stats.op_stat_count[cur_op->req.hdr.opcode]) if (!stats.op_stat_count[cur_op->req.hdr.opcode])
{ {
@ -107,8 +130,8 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0; stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
} }
stats.op_stat_sum[cur_op->req.hdr.opcode] += ( stats.op_stat_sum[cur_op->req.hdr.opcode] += (
(tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 + (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000 (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
); );
if (cur_op->req.hdr.opcode == OSD_OP_READ || if (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE) cur_op->req.hdr.opcode == OSD_OP_WRITE)
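
Reviewer sketch (not part of the commit): `measure_exec()` now keeps the end timestamp in the operation itself and only takes a new one when it has not been set yet. The two helpers below, with hypothetical names, show the timestamp reuse and the microsecond arithmetic on their own.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Take the end timestamp only if nobody has recorded one yet.
inline void stamp_end_if_unset(timespec & tv_end)
{
    if (!tv_end.tv_sec)
        clock_gettime(CLOCK_REALTIME, &tv_end);
}

// Elapsed time in microseconds. The nanosecond difference may be negative;
// the seconds term compensates for it.
inline int64_t elapsed_usec(const timespec & begin, const timespec & end)
{
    assert(end.tv_sec >= begin.tv_sec);
    return (int64_t)(end.tv_sec - begin.tv_sec) * 1000000
        + (end.tv_nsec - begin.tv_nsec) / 1000;
}
```

Because `CLOCK_REALTIME` can jump when the system clock is adjusted, a monotonic clock may be a safer basis for latency statistics; the diff keeps `CLOCK_REALTIME`, so the sketch does too.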


@ -6,12 +6,14 @@
#include <stdint.h> #include <stdint.h>
#include <functional> #include <functional>
typedef uint64_t inode_t;
// 16 bytes per object/stripe id // 16 bytes per object/stripe id
// stripe = (start of the parity stripe + peer role) // stripe = (start of the parity stripe + peer role)
// i.e. for example (256KB + one of 0,1,2) // i.e. for example (256KB + one of 0,1,2)
struct __attribute__((__packed__)) object_id struct __attribute__((__packed__)) object_id
{ {
uint64_t inode; inode_t inode;
uint64_t stripe; uint64_t stripe;
}; };


@ -9,15 +9,21 @@
#include "osd.h" #include "osd.h"
osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop) osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop)
{ {
bs_block_size = strtoull(config["block_size"].c_str(), NULL, 10);
bs_bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
if (!bs_block_size)
bs_block_size = DEFAULT_BLOCK_SIZE;
if (!bs_bitmap_granularity)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
this->config = config; this->config = config;
this->bs = bs;
this->ringloop = ringloop; this->ringloop = ringloop;
this->bs_block_size = bs->get_block_size(); // FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
// FIXME: use bitmap granularity instead this->bs = new blockstore_t(config, ringloop);
this->bs_disk_alignment = bs->get_disk_alignment();
parse_config(config); parse_config(config);
@ -49,17 +55,18 @@ osd_t::~osd_t()
{ {
ringloop->unregister_consumer(&consumer); ringloop->unregister_consumer(&consumer);
delete epmgr; delete epmgr;
delete bs;
close(listen_fd); close(listen_fd);
} }
void osd_t::parse_config(blockstore_config_t & config) void osd_t::parse_config(blockstore_config_t & config)
{ {
// Initial startup configuration
json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config);
if (config.find("log_level") == config.end()) if (config.find("log_level") == config.end())
config["log_level"] = "1"; config["log_level"] = "1";
log_level = strtoull(config["log_level"].c_str(), NULL, 10); log_level = strtoull(config["log_level"].c_str(), NULL, 10);
// Initial startup configuration
json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config);
etcd_report_interval = strtoull(config["etcd_report_interval"].c_str(), NULL, 10); etcd_report_interval = strtoull(config["etcd_report_interval"].c_str(), NULL, 10);
if (etcd_report_interval <= 0) if (etcd_report_interval <= 0)
etcd_report_interval = 30; etcd_report_interval = 30;
@ -96,6 +103,9 @@ void osd_t::parse_config(blockstore_config_t & config)
recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10); recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE) if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sync_batch = strtoull(config["recovery_sync_batch"].c_str(), NULL, 10);
if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE)
recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes") if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes")
readonly = true; readonly = true;
print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10); print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
@ -168,7 +178,7 @@ bool osd_t::shutdown()
{ {
return false; return false;
} }
return bs->is_safe_to_stop(); return !bs || bs->is_safe_to_stop();
} }
void osd_t::loop() void osd_t::loop()
@ -195,14 +205,14 @@ void osd_t::exec_op(osd_op_t *cur_op)
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) && cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
(cur_op->req.sec_rw.len > OSD_RW_MAX || (cur_op->req.sec_rw.len > OSD_RW_MAX ||
cur_op->req.sec_rw.len % bs_disk_alignment || cur_op->req.sec_rw.len % bs_bitmap_granularity ||
cur_op->req.sec_rw.offset % bs_disk_alignment)) || cur_op->req.sec_rw.offset % bs_bitmap_granularity)) ||
((cur_op->req.hdr.opcode == OSD_OP_READ || ((cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE || cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_DELETE) && cur_op->req.hdr.opcode == OSD_OP_DELETE) &&
(cur_op->req.rw.len > OSD_RW_MAX || (cur_op->req.rw.len > OSD_RW_MAX ||
cur_op->req.rw.len % bs_disk_alignment || cur_op->req.rw.len % bs_bitmap_granularity ||
cur_op->req.rw.offset % bs_disk_alignment))) cur_op->req.rw.offset % bs_bitmap_granularity)))
{ {
// Bad command // Bad command
finish_op(cur_op, -EINVAL); finish_op(cur_op, -EINVAL);
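
Reviewer sketch (not part of the commit): the constructor above derives `clean_entry_bitmap_size` as one bit per granularity unit, and `exec_op()` now validates offsets and lengths against the bitmap granularity instead of the disk alignment. The snippet below works through both. `DEFAULT_BITMAP_GRANULARITY` comes from the diff; the block size of 128 KiB and the helper names are assumptions, because `DEFAULT_BLOCK_SIZE` and `OSD_RW_MAX` are not shown here.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t DEFAULT_BITMAP_GRANULARITY = 4096; // value from the diff
// Assumption: DEFAULT_BLOCK_SIZE is not shown in this diff; 128 KiB is illustrative.
constexpr uint32_t ASSUMED_BLOCK_SIZE = 128*1024;

// One bit per granularity unit, packed 8 bits per byte.
inline uint32_t bitmap_bytes(uint32_t block_size, uint32_t granularity)
{
    assert(granularity != 0 && block_size % granularity == 0);
    return block_size / granularity / 8;
}

// Mirrors the request check in exec_op(): the length must not exceed rw_max,
// and both offset and length must be multiples of the bitmap granularity.
// rw_max is passed in because the real limit is not visible in the diff.
inline bool is_valid_rw(uint32_t offset, uint32_t len, uint32_t granularity, uint32_t rw_max)
{
    assert(granularity != 0);
    return len <= rw_max && len % granularity == 0 && offset % granularity == 0;
}
```

Under these assumptions a 128 KiB block with 4 KiB granularity has 32 bitmap bits, i.e. 4 bytes per object.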


@ -37,6 +37,7 @@
#define DEFAULT_AUTOSYNC_INTERVAL 5 #define DEFAULT_AUTOSYNC_INTERVAL 5
#define MAX_RECOVERY_QUEUE 2048 #define MAX_RECOVERY_QUEUE 2048
#define DEFAULT_RECOVERY_QUEUE 4 #define DEFAULT_RECOVERY_QUEUE 4
#define DEFAULT_RECOVERY_BATCH 16
//#define OSD_STUB //#define OSD_STUB
@ -54,6 +55,17 @@ struct osd_recovery_op_t
osd_op_t *osd_op = NULL; osd_op_t *osd_op = NULL;
}; };
// Posted as /osd/inodestats/$osd, then accumulated by the monitor
#define INODE_STATS_READ 0
#define INODE_STATS_WRITE 1
#define INODE_STATS_DELETE 2
struct inode_stats_t
{
uint64_t op_sum[3] = { 0 };
uint64_t op_count[3] = { 0 };
uint64_t op_bytes[3] = { 0 };
};
class osd_t class osd_t
{ {
// config // config
@ -76,6 +88,7 @@ class osd_t
int immediate_commit = IMMEDIATE_NONE; int immediate_commit = IMMEDIATE_NONE;
int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // sync every 5 seconds int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // sync every 5 seconds
int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
int log_level = 0; int log_level = 0;
// cluster state // cluster state
@ -97,9 +110,11 @@ class osd_t
std::map<pool_pg_num_t, pg_t> pgs; std::map<pool_pg_num_t, pg_t> pgs;
std::set<pool_pg_num_t> dirty_pgs; std::set<pool_pg_num_t> dirty_pgs;
std::set<osd_num_t> dirty_osds; std::set<osd_num_t> dirty_osds;
int copies_to_delete_after_sync_count = 0;
uint64_t misplaced_objects = 0, degraded_objects = 0, incomplete_objects = 0; uint64_t misplaced_objects = 0, degraded_objects = 0, incomplete_objects = 0;
int peering_state = 0; int peering_state = 0;
std::map<object_id, osd_recovery_op_t> recovery_ops; std::map<object_id, osd_recovery_op_t> recovery_ops;
int recovery_done = 0;
osd_op_t *autosync_op = NULL; osd_op_t *autosync_op = NULL;
// Unstable writes // Unstable writes
@ -111,7 +126,7 @@ class osd_t
bool stopping = false; bool stopping = false;
int inflight_ops = 0; int inflight_ops = 0;
blockstore_t *bs; blockstore_t *bs;
uint32_t bs_block_size, bs_disk_alignment; uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size;
ring_loop_t *ringloop; ring_loop_t *ringloop;
timerfd_manager_t *tfd = NULL; timerfd_manager_t *tfd = NULL;
epoll_manager_t *epmgr = NULL; epoll_manager_t *epmgr = NULL;
@ -122,6 +137,7 @@ class osd_t
// op statistics // op statistics
osd_op_stats_t prev_stats; osd_op_stats_t prev_stats;
std::map<uint64_t, inode_stats_t> inode_stats;
const char* recovery_stat_names[2] = { "degraded", "misplaced" }; const char* recovery_stat_names[2] = { "degraded", "misplaced" };
uint64_t recovery_stat_count[2][2] = { 0 }; uint64_t recovery_stat_count[2][2] = { 0 };
uint64_t recovery_stat_bytes[2][2] = { 0 }; uint64_t recovery_stat_bytes[2][2] = { 0 };
@ -201,6 +217,7 @@ class osd_t
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval); void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op); void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set); void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
void submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count);
void submit_primary_sync_subops(osd_op_t *cur_op); void submit_primary_sync_subops(osd_op_t *cur_op);
void submit_primary_stab_subops(osd_op_t *cur_op); void submit_primary_stab_subops(osd_op_t *cur_op);
@ -213,7 +230,7 @@ class osd_t
} }
public: public:
osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop); osd_t(blockstore_config_t & config, ring_loop_t *ringloop);
~osd_t(); ~osd_t();
void force_stop(int exitcode); void force_stop(int exitcode);
bool shutdown(); bool shutdown();


@ -179,11 +179,47 @@ void osd_t::report_statistics()
return; return;
} }
etcd_reporting_stats = true; etcd_reporting_stats = true;
// Report space usage statistics as a whole
// Maybe we'll report it using deltas if we tune for a lot of inodes at some point
json11::Json::object inode_space;
for (auto kv: bs->get_inode_space_stats())
{
inode_space[std::to_string(kv.first)] = kv.second;
}
json11::Json::object inode_ops;
for (auto kv: inode_stats)
{
inode_ops[std::to_string(kv.first)] = json11::Json::object {
{ "read", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_READ] },
{ "usec", kv.second.op_sum[INODE_STATS_READ] },
{ "bytes", kv.second.op_bytes[INODE_STATS_READ] },
} },
{ "write", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_WRITE] },
{ "usec", kv.second.op_sum[INODE_STATS_WRITE] },
{ "bytes", kv.second.op_bytes[INODE_STATS_WRITE] },
} },
{ "delete", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_DELETE] },
{ "usec", kv.second.op_sum[INODE_STATS_DELETE] },
{ "bytes", kv.second.op_bytes[INODE_STATS_DELETE] },
} },
};
}
json11::Json::array txn = { json11::Json::object { json11::Json::array txn = { json11::Json::object {
{ "request_put", json11::Json::object { { "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) }, { "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
{ "value", base64_encode(get_statistics().dump()) }, { "value", base64_encode(get_statistics().dump()) },
} } } },
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/space/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_space).dump()) },
} },
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/inodestats/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_ops).dump()) },
} },
} }; } };
for (auto & p: pgs) for (auto & p: pgs)
{ {


@ -270,7 +270,6 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
} }
op->osd_op->callback = [this, op](osd_op_t *osd_op) op->osd_op->callback = [this, op](osd_op_t *osd_op)
{ {
// Don't sync the write, it will be synced by our regular sync coroutine
if (osd_op->reply.hdr.retval < 0) if (osd_op->reply.hdr.retval < 0)
{ {
// Error recovering object // Error recovering object
@ -292,6 +291,17 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
op->osd_op = NULL; op->osd_op = NULL;
recovery_ops.erase(op->oid); recovery_ops.erase(op->oid);
delete osd_op; delete osd_op;
if (immediate_commit != IMMEDIATE_ALL)
{
recovery_done++;
if (recovery_done >= recovery_sync_batch)
{
// Force sync every <recovery_sync_batch> operations
// This is required not to pile up an excessive amount of delete operations
autosync();
recovery_done = 0;
}
}
continue_recovery(); continue_recovery();
}; };
exec_op(op->osd_op); exec_op(op->osd_op);
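
Reviewer sketch (not part of the commit): the recovery callback above forces a sync after every `recovery_sync_batch` completed operations unless every write is committed immediately. The counter logic can be modelled on its own as below. `recovery_batcher_t` is a hypothetical name, and the `sync` callback stands in for `autosync()`.

```cpp
#include <cassert>
#include <functional>

struct recovery_batcher_t
{
    int sync_batch = 16;         // DEFAULT_RECOVERY_BATCH in the diff
    int done = 0;                // completed recovery ops since the last forced sync
    std::function<void()> sync;  // stands in for autosync()

    // Call once per successfully completed recovery operation.
    void on_recovery_op_done(bool immediate_commit_all)
    {
        assert(sync_batch > 0 && sync);
        if (immediate_commit_all)
            return; // nothing accumulates when every write is committed immediately
        done++;
        if (done >= sync_batch)
        {
            sync();
            done = 0;
        }
    }
};
```

With a batch of 4, nine completed operations trigger two syncs and leave one operation counted towards the next batch.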


@ -41,16 +41,13 @@ int main(int narg, char *args[])
signal(SIGINT, handle_sigint); signal(SIGINT, handle_sigint);
signal(SIGTERM, handle_sigint); signal(SIGTERM, handle_sigint);
ring_loop_t *ringloop = new ring_loop_t(512); ring_loop_t *ringloop = new ring_loop_t(512);
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config osd = new osd_t(config, ringloop);
blockstore_t *bs = new blockstore_t(config, ringloop);
osd = new osd_t(config, bs, ringloop);
while (1) while (1)
{ {
ringloop->loop(); ringloop->loop();
ringloop->wait(); ringloop->wait();
} }
delete osd; delete osd;
delete bs;
delete ringloop; delete ringloop;
return 0; return 0;
} }


@ -71,6 +71,9 @@ struct __attribute__((__packed__)) osd_op_secondary_rw_t
uint32_t offset; uint32_t offset;
// length // length
uint32_t len; uint32_t len;
// bitmap/attribute length - bitmap comes after header, but before data
uint32_t attr_len;
uint32_t pad0;
}; };
struct __attribute__((__packed__)) osd_reply_secondary_rw_t struct __attribute__((__packed__)) osd_reply_secondary_rw_t
@ -78,6 +81,9 @@ struct __attribute__((__packed__)) osd_reply_secondary_rw_t
osd_reply_header_t header; osd_reply_header_t header;
// for reads and writes: assigned or read version number // for reads and writes: assigned or read version number
uint64_t version; uint64_t version;
// for reads: bitmap/attribute length (just to double-check)
uint32_t attr_len;
uint32_t pad0;
}; };
// delete object on the secondary OSD // delete object on the secondary OSD
@ -154,7 +160,6 @@ struct __attribute__((__packed__)) osd_reply_secondary_list_t
}; };
// read or write to the primary OSD (must be within individual stripe) // read or write to the primary OSD (must be within individual stripe)
// FIXME: allow to return used block bitmap (required for snapshots)
struct __attribute__((__packed__)) osd_op_rw_t struct __attribute__((__packed__)) osd_op_rw_t
{ {
osd_op_header_t header; osd_op_header_t header;
@ -169,6 +174,9 @@ struct __attribute__((__packed__)) osd_op_rw_t
struct __attribute__((__packed__)) osd_reply_rw_t struct __attribute__((__packed__)) osd_reply_rw_t
{ {
osd_reply_header_t header; osd_reply_header_t header;
// for reads: bitmap length
uint32_t bitmap_len;
uint32_t pad0;
}; };
// sync to the primary OSD // sync to the primary OSD
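Note on the hunks above: the comment on `attr_len` says the bitmap travels after the fixed-size header but before the data. The following is a minimal sketch of that ordering only. `HEADER_SIZE`, `payload_layout_t` and `layout_payload()` are names and values invented for illustration; they are not taken from the OSD protocol headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical value for illustration; the real header size is defined by the OSD protocol
constexpr size_t HEADER_SIZE = 128;

struct payload_layout_t
{
    size_t bitmap_offset; // where the bitmap/attribute bytes start
    size_t data_offset;   // where the data bytes start
    size_t total_size;    // header + bitmap + data
};

// Bitmap comes after the header, but before the data
inline payload_layout_t layout_payload(uint32_t attr_len, uint32_t len)
{
    payload_layout_t l;
    l.bitmap_offset = HEADER_SIZE;
    l.data_offset = HEADER_SIZE + attr_len;
    l.total_size = l.data_offset + len;
    return l;
}
```

With `attr_len == 0` the layout degenerates to the old header-then-data form, which matches the read request path where the subop sets `.attr_len` to 0.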


@ -103,6 +103,8 @@ void osd_t::reset_pg(pg_t & pg)
{ {
pg.cur_peers.clear(); pg.cur_peers.clear();
pg.state_dict.clear(); pg.state_dict.clear();
copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
pg.copies_to_delete_after_sync.clear();
incomplete_objects -= pg.incomplete_objects.size(); incomplete_objects -= pg.incomplete_objects.size();
misplaced_objects -= pg.misplaced_objects.size(); misplaced_objects -= pg.misplaced_objects.size();
degraded_objects -= pg.degraded_objects.size(); degraded_objects -= pg.degraded_objects.size();
@ -180,16 +182,20 @@ void osd_t::start_pg_peering(pg_t & pg)
// (PG history is kept up to the latest active+clean state) // (PG history is kept up to the latest active+clean state)
for (auto & history_set: pg.target_history) for (auto & history_set: pg.target_history)
{ {
bool found = false; bool found = true;
for (auto history_osd: history_set) for (auto history_osd: history_set)
{ {
if (history_osd != 0 && (history_osd == this->osd_num || if (history_osd != 0)
c_cli.osd_peer_fds.find(history_osd) != c_cli.osd_peer_fds.end())) {
found = false;
if (history_osd == this->osd_num ||
c_cli.osd_peer_fds.find(history_osd) != c_cli.osd_peer_fds.end())
{ {
found = true; found = true;
break; break;
} }
} }
}
if (!found) if (!found)
{ {
pg.state = PG_INCOMPLETE; pg.state = PG_INCOMPLETE;
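Note on the `start_pg_peering()` hunk above: with the new side of the diff, a history set marks the PG incomplete only when it contains at least one non-zero OSD number and none of those OSDs is reachable. A set made of zeroes no longer blocks peering. A standalone sketch of that predicate, assuming a hypothetical helper name `history_set_reachable()` and a plain `std::set` in place of `c_cli.osd_peer_fds`:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Returns true when a history set does not block peering:
// either it holds no OSD numbers other than 0 (empty slots),
// or at least one of its OSDs is this OSD or a connected peer.
inline bool history_set_reachable(const std::vector<uint64_t> & history_set,
    uint64_t self_osd, const std::set<uint64_t> & connected)
{
    bool found = true;
    for (uint64_t history_osd: history_set)
    {
        if (history_osd != 0)
        {
            found = false;
            if (history_osd == self_osd || connected.count(history_osd))
            {
                found = true;
                break;
            }
        }
    }
    return found;
}
```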


@ -56,6 +56,13 @@ struct obj_piece_id_t
uint64_t osd_num; uint64_t osd_num;
}; };
struct obj_ver_osd_t
{
uint64_t osd_num;
object_id oid;
uint64_t version;
};
struct flush_action_t struct flush_action_t
{ {
bool rollback = false, make_stable = false; bool rollback = false, make_stable = false;
@ -101,6 +108,7 @@ struct pg_t
std::map<pg_osd_set_t, pg_osd_set_state_t> state_dict; std::map<pg_osd_set_t, pg_osd_set_state_t> state_dict;
btree::btree_map<object_id, pg_osd_set_state_t*> incomplete_objects, misplaced_objects, degraded_objects; btree::btree_map<object_id, pg_osd_set_state_t*> incomplete_objects, misplaced_objects, degraded_objects;
std::map<obj_piece_id_t, flush_action_t> flush_actions; std::map<obj_piece_id_t, flush_action_t> flush_actions;
std::vector<obj_ver_osd_t> copies_to_delete_after_sync;
btree::btree_map<object_id, uint64_t> ver_override; btree::btree_map<object_id, uint64_t> ver_override;
pg_peering_state_t *peering_state = NULL; pg_peering_state_t *peering_state = NULL;
pg_flush_batch_t *flush_batch = NULL; pg_flush_batch_t *flush_batch = NULL;


@ -2,6 +2,7 @@
// License: VNPL-1.1 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h" #include "osd_primary.h"
#include "allocator.h"
// read: read directly or read paired stripe(s), reconstruct, return // read: read directly or read paired stripe(s), reconstruct, return
// write: read paired stripe(s), reconstruct, modify, calculate parity, write // write: read paired stripe(s), reconstruct, modify, calculate parity, write
@ -44,14 +45,15 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
return false; return false;
} }
if ((cur_op->req.rw.offset + cur_op->req.rw.len) > (oid.stripe + pg_block_size) || if ((cur_op->req.rw.offset + cur_op->req.rw.len) > (oid.stripe + pg_block_size) ||
(cur_op->req.rw.offset % bs_disk_alignment) != 0 || (cur_op->req.rw.offset % bs_bitmap_granularity) != 0 ||
(cur_op->req.rw.len % bs_disk_alignment) != 0) (cur_op->req.rw.len % bs_bitmap_granularity) != 0)
{ {
finish_op(cur_op, -EINVAL); finish_op(cur_op, -EINVAL);
return false; return false;
} }
int stripe_count = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size);
osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc_or_die( osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc_or_die(
1, sizeof(osd_primary_op_data_t) + sizeof(osd_rmw_stripe_t) * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size) 1, sizeof(osd_primary_op_data_t) + (clean_entry_bitmap_size + sizeof(osd_rmw_stripe_t)) * stripe_count
); );
op_data->pg_num = pg_num; op_data->pg_num = pg_num;
op_data->oid = oid; op_data->oid = oid;
@ -60,6 +62,11 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
op_data->pg_data_size = pg_data_size; op_data->pg_data_size = pg_data_size;
cur_op->op_data = op_data; cur_op->op_data = op_data;
split_stripes(pg_data_size, bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes); split_stripes(pg_data_size, bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes);
// Allocate bitmaps along with stripes to avoid extra allocations and fragmentation
for (int i = 0; i < stripe_count; i++)
{
op_data->stripes[i].bmp_buf = (void*)(op_data->stripes+stripe_count) + clean_entry_bitmap_size*i;
}
pg_it->second.inflight++; pg_it->second.inflight++;
return true; return true;
} }
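Note on the `prepare_primary_rw()` hunks above: the bitmaps are carved out of the same allocation as the stripe descriptors, placed right after the last descriptor. The sketch below shows the same layout in isolation. `stripe_sketch_t` and `alloc_stripes_with_bitmaps()` are hypothetical names, and the sketch does byte arithmetic on `char*`, whereas the diff relies on `void*` arithmetic, which is a GNU extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct stripe_sketch_t
{
    void *bmp_buf;
    uint32_t req_start, req_end;
};

// Allocates stripe_count descriptors followed by stripe_count bitmaps
// in one zero-filled block. Release the result with free().
inline stripe_sketch_t *alloc_stripes_with_bitmaps(int stripe_count, size_t bitmap_size)
{
    assert(stripe_count > 0);
    void *block = calloc(stripe_count, sizeof(stripe_sketch_t) + bitmap_size);
    if (!block)
        return nullptr;
    stripe_sketch_t *stripes = (stripe_sketch_t*)block;
    // Bitmaps start right after the last descriptor
    char *bitmaps = (char*)(stripes + stripe_count);
    for (int i = 0; i < stripe_count; i++)
        stripes[i].bmp_buf = bitmaps + bitmap_size*i;
    return stripes;
}
```

One allocation means one `free()` and no per-stripe heap fragmentation, which is the motivation stated in the diff's own comment.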
@ -99,6 +106,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
{ {
return; return;
} }
cur_op->reply.rw.bitmap_len = 0;
osd_primary_op_data_t *op_data = cur_op->op_data; osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == 1) goto resume_1; if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2; else if (op_data->st == 2) goto resume_2;
@ -146,18 +154,20 @@ resume_2:
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO); finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
return; return;
} }
cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
if (op_data->degraded) if (op_data->degraded)
{ {
// Reconstruct missing stripes // Reconstruct missing stripes
osd_rmw_stripe_t *stripes = op_data->stripes; osd_rmw_stripe_t *stripes = op_data->stripes;
if (op_data->scheme == POOL_SCHEME_XOR) if (op_data->scheme == POOL_SCHEME_XOR)
{ {
reconstruct_stripes_xor(stripes, op_data->pg_size); reconstruct_stripes_xor(stripes, op_data->pg_size, clean_entry_bitmap_size);
} }
else if (op_data->scheme == POOL_SCHEME_JERASURE) else if (op_data->scheme == POOL_SCHEME_JERASURE)
{ {
reconstruct_stripes_jerasure(stripes, op_data->pg_size, op_data->pg_data_size); reconstruct_stripes_jerasure(stripes, op_data->pg_size, op_data->pg_data_size, clean_entry_bitmap_size);
} }
cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
for (int role = 0; role < op_data->pg_size; role++) for (int role = 0; role < op_data->pg_size; role++)
{ {
if (stripes[role].req_end != 0) if (stripes[role].req_end != 0)
@ -172,6 +182,7 @@ resume_2:
} }
else else
{ {
cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
cur_op->iov.push_back(cur_op->buf, cur_op->req.rw.len); cur_op->iov.push_back(cur_op->buf, cur_op->req.rw.len);
} }
finish_op(cur_op, cur_op->req.rw.len); finish_op(cur_op, cur_op->req.rw.len);
@ -238,6 +249,7 @@ resume_1:
op_data->stripes[0].write_start = op_data->stripes[0].req_start; op_data->stripes[0].write_start = op_data->stripes[0].req_start;
op_data->stripes[0].write_end = op_data->stripes[0].req_end; op_data->stripes[0].write_end = op_data->stripes[0].req_end;
op_data->stripes[0].write_buf = cur_op->buf; op_data->stripes[0].write_buf = cur_op->buf;
op_data->stripes[0].bmp_buf = (void*)(op_data->stripes+1);
if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 || if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
op_data->stripes[0].write_end != bs_block_size)) op_data->stripes[0].write_end != bs_block_size))
{ {
@ -250,7 +262,7 @@ resume_1:
else else
{ {
cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set, cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size); pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
if (!cur_op->rmw_buf) if (!cur_op->rmw_buf)
{ {
// Refuse partial overwrite of an incomplete object // Refuse partial overwrite of an incomplete object
@ -273,7 +285,9 @@ resume_3:
pg.ver_override[op_data->oid] = op_data->fact_ver; pg.ver_override[op_data->oid] = op_data->fact_ver;
if (op_data->scheme == POOL_SCHEME_REPLICATED) if (op_data->scheme == POOL_SCHEME_REPLICATED)
{ {
// Only (possibly) copy new data from the request into the recovery buffer // Set bitmap bits
bitmap_set(op_data->stripes[0].bmp_buf, op_data->stripes[0].write_start, op_data->stripes[0].write_end-op_data->stripes[0].write_start, bs_bitmap_granularity);
// Possibly copy new data from the request into the recovery buffer
if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 || if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
op_data->stripes[0].write_end != bs_block_size)) op_data->stripes[0].write_end != bs_block_size))
{ {
@ -292,11 +306,11 @@ resume_3:
// Recover missing stripes, calculate parity // Recover missing stripes, calculate parity
if (pg.scheme == POOL_SCHEME_XOR) if (pg.scheme == POOL_SCHEME_XOR)
{ {
calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size); calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
} }
else if (pg.scheme == POOL_SCHEME_JERASURE) else if (pg.scheme == POOL_SCHEME_JERASURE)
{ {
calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size); calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
} }
} }
// Send writes // Send writes
@ -367,6 +381,32 @@ resume_7:
} }
// Any kind of a non-clean object can have extra chunks, because we don't record objects // Any kind of a non-clean object can have extra chunks, because we don't record objects
// as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks // as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
if (immediate_commit != IMMEDIATE_ALL)
{
// We can't remove extra chunks yet if fsyncs are explicit, because
// new copies may not be committed to stable storage yet
// We can only remove extra chunks after a successful SYNC for this PG
for (auto & chunk: op_data->object_state->osd_set)
{
// Check is the same as in submit_primary_del_subops()
if (op_data->scheme == POOL_SCHEME_REPLICATED
? !contains_osd(pg.cur_set.data(), pg.pg_size, chunk.osd_num)
: (chunk.osd_num != pg.cur_set[chunk.role]))
{
pg.copies_to_delete_after_sync.push_back((obj_ver_osd_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | (op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role),
},
.version = op_data->fact_ver,
});
copies_to_delete_after_sync_count++;
}
}
}
else
{
submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set); submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set);
if (op_data->n_subops > 0) if (op_data->n_subops > 0)
{ {
@ -380,6 +420,7 @@ resume_9:
return; return;
} }
} }
}
// Clear object state // Clear object state
remove_object_from_state(op_data->oid, op_data->object_state, pg); remove_object_from_state(op_data->oid, op_data->object_state, pg);
pg.clean_count++; pg.clean_count++;
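Note on the hunk above: when fsyncs are explicit, extra copies are not deleted right away but queued per PG, with a global counter kept in step. The sync path then takes the queued entries for the dirty PGs, pushes them back if the sync fails, and deletes them if it succeeds. A reduced sketch of that bookkeeping, with all type and method names (`copy_to_delete_t`, `deferred_deletes_t`, `defer()`, `take()`) invented for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct copy_to_delete_t { uint64_t osd_num, inode, stripe, version; };

struct deferred_deletes_t
{
    // Queues keyed by PG number
    std::map<uint64_t, std::vector<copy_to_delete_t>> per_pg;
    // Kept equal to the sum of all per-PG queue sizes
    uint64_t total = 0;

    void defer(uint64_t pg_num, copy_to_delete_t c)
    {
        per_pg[pg_num].push_back(c);
        total++;
    }

    // Moves out everything queued for the given PGs.
    // The caller deletes the copies after a successful sync
    // or calls defer() again for each entry if the sync fails.
    std::vector<copy_to_delete_t> take(const std::vector<uint64_t> & dirty_pgs)
    {
        std::vector<copy_to_delete_t> out;
        for (uint64_t pg_num: dirty_pgs)
        {
            auto it = per_pg.find(pg_num);
            if (it == per_pg.end())
                continue;
            out.insert(out.end(), it->second.begin(), it->second.end());
            total -= it->second.size();
            it->second.clear();
        }
        return out;
    }
};
```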
@ -511,6 +552,8 @@ void osd_t::continue_primary_sync(osd_op_t *cur_op)
else if (op_data->st == 4) goto resume_4; else if (op_data->st == 4) goto resume_4;
else if (op_data->st == 5) goto resume_5; else if (op_data->st == 5) goto resume_5;
else if (op_data->st == 6) goto resume_6; else if (op_data->st == 6) goto resume_6;
else if (op_data->st == 7) goto resume_7;
else if (op_data->st == 8) goto resume_8;
assert(op_data->st == 0); assert(op_data->st == 0);
if (syncs_in_progress.size() > 0) if (syncs_in_progress.size() > 0)
{ {
@ -572,11 +615,34 @@ resume_2:
this->unstable_writes.clear(); this->unstable_writes.clear();
} }
{ {
void *dirty_buf = malloc_or_die(sizeof(pool_pg_num_t)*dirty_pgs.size() + sizeof(osd_num_t)*dirty_osds.size()); void *dirty_buf = malloc_or_die(
sizeof(pool_pg_num_t)*dirty_pgs.size() +
sizeof(osd_num_t)*dirty_osds.size() +
sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
);
op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf; op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size()); op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
op_data->dirty_pg_count = dirty_pgs.size(); op_data->dirty_pg_count = dirty_pgs.size();
op_data->dirty_osd_count = dirty_osds.size(); op_data->dirty_osd_count = dirty_osds.size();
if (this->copies_to_delete_after_sync_count)
{
op_data->copies_to_delete_count = 0;
op_data->copies_to_delete = (obj_ver_osd_t*)(op_data->dirty_osds + op_data->dirty_osd_count);
for (auto dirty_pg_num: dirty_pgs)
{
auto & pg = pgs.at(dirty_pg_num);
assert(pg.copies_to_delete_after_sync.size() <= this->copies_to_delete_after_sync_count);
memcpy(
op_data->copies_to_delete + op_data->copies_to_delete_count,
pg.copies_to_delete_after_sync.data(),
sizeof(obj_ver_osd_t)*pg.copies_to_delete_after_sync.size()
);
op_data->copies_to_delete_count += pg.copies_to_delete_after_sync.size();
this->copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
pg.copies_to_delete_after_sync.clear();
}
assert(this->copies_to_delete_after_sync_count == 0);
}
int dpg = 0; int dpg = 0;
for (auto dirty_pg_num: dirty_pgs) for (auto dirty_pg_num: dirty_pgs)
{ {
@ -649,6 +715,36 @@ resume_6:
} }
} }
} }
if (op_data->copies_to_delete)
{
// Return 'copies to delete' back into respective PGs
for (int i = 0; i < op_data->copies_to_delete_count; i++)
{
auto & w = op_data->copies_to_delete[i];
auto & pg = pgs.at((pool_pg_num_t){
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
});
if (pg.state & PG_ACTIVE)
{
pg.copies_to_delete_after_sync.push_back(w);
copies_to_delete_after_sync_count++;
}
}
}
}
else if (op_data->copies_to_delete)
{
// Actually delete copies which we wanted to delete
submit_primary_del_batch(cur_op, op_data->copies_to_delete, op_data->copies_to_delete_count);
resume_7:
op_data->st = 7;
return;
resume_8:
if (op_data->errors > 0)
{
goto resume_6;
}
} }
for (int i = 0; i < op_data->dirty_pg_count; i++) for (int i = 0; i < op_data->dirty_pg_count; i++)
{ {


@ -38,4 +38,8 @@ struct osd_primary_op_data_t
osd_num_t *dirty_osds = NULL; osd_num_t *dirty_osds = NULL;
int dirty_osd_count = 0; int dirty_osd_count = 0;
obj_ver_id *unstable_writes = NULL; obj_ver_id *unstable_writes = NULL;
obj_ver_osd_t *copies_to_delete = NULL;
int copies_to_delete_count = 0;
}; };
bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num);


@ -36,6 +36,29 @@ void osd_t::autosync()
void osd_t::finish_op(osd_op_t *cur_op, int retval) void osd_t::finish_op(osd_op_t *cur_op, int retval)
{ {
inflight_ops--; inflight_ops--;
if (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_DELETE)
{
// Track inode statistics
if (!cur_op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
}
uint64_t usec = (
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
int inode_st_op = cur_op->req.hdr.opcode == OSD_OP_DELETE
? INODE_STATS_DELETE
: (cur_op->req.hdr.opcode == OSD_OP_READ ? INODE_STATS_READ : INODE_STATS_WRITE);
inode_stats[cur_op->req.rw.inode].op_count[inode_st_op]++;
inode_stats[cur_op->req.rw.inode].op_sum[inode_st_op] += usec;
if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->op_data->pg_data_size * bs_block_size;
else
inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->req.rw.len;
}
if (cur_op->op_data) if (cur_op->op_data)
{ {
if (cur_op->op_data->pg_num > 0) if (cur_op->op_data->pg_num > 0)
@ -64,7 +87,7 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
} }
else else
{ {
// FIXME add separate magic number // FIXME add separate magic number for primary ops
auto cl_it = c_cli.clients.find(cur_op->peer_fd); auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end()) if (cl_it != c_cli.clients.end())
{ {
@ -129,6 +152,8 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
{ {
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin); clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op; subops[i].op_type = (uint64_t)cur_op;
subops[i].bitmap = stripes[stripe_num].bmp_buf;
subops[i].bitmap_len = clean_entry_bitmap_size;
subops[i].bs_op = new blockstore_op_t({ subops[i].bs_op = new blockstore_op_t({
.opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ), .opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ),
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop) .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
@ -143,6 +168,7 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start, .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start, .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
.buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf, .buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
.bitmap = stripes[stripe_num].bmp_buf,
}); });
#ifdef OSD_DEBUG #ifdef OSD_DEBUG
printf( printf(
@ -157,6 +183,8 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
{ {
subops[i].op_type = OSD_OP_OUT; subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num); subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
subops[i].bitmap = stripes[stripe_num].bmp_buf;
subops[i].bitmap_len = clean_entry_bitmap_size;
subops[i].req.sec_rw = { subops[i].req.sec_rw = {
.header = { .header = {
.magic = SECONDARY_OSD_OP_MAGIC, .magic = SECONDARY_OSD_OP_MAGIC,
@ -170,6 +198,7 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
.version = op_version, .version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start, .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start, .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
.attr_len = wr ? clean_entry_bitmap_size : 0,
}; };
#ifdef OSD_DEBUG #ifdef OSD_DEBUG
printf( printf(
@ -355,7 +384,7 @@ void osd_t::cancel_primary_write(osd_op_t *cur_op)
} }
} }
static bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num) bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num)
{ {
for (uint64_t i = 0; i < size; i++) for (uint64_t i = 0; i < size; i++)
{ {
@ -371,29 +400,43 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, osd_num_t *cur_set, uint
{ {
osd_primary_op_data_t *op_data = cur_op->op_data; osd_primary_op_data_t *op_data = cur_op->op_data;
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED; bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
int extra_chunks = 0; obj_ver_osd_t extra_chunks[loc_set.size()];
// ordered comparison for EC/XOR, unordered for replicated pools int chunks_to_del = 0;
for (auto & chunk: loc_set) for (auto & chunk: loc_set)
{ {
if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role])) // ordered comparison for EC/XOR, unordered for replicated pools
if (!cur_set || (rep
? !contains_osd(cur_set, set_size, chunk.osd_num)
: (chunk.osd_num != cur_set[chunk.role])))
{ {
extra_chunks++; extra_chunks[chunks_to_del++] = (obj_ver_osd_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | (rep ? 0 : chunk.role),
},
// Same version as write
.version = op_data->fact_ver,
};
} }
} }
op_data->n_subops = extra_chunks; submit_primary_del_batch(cur_op, extra_chunks, chunks_to_del);
}
void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
op_data->n_subops = chunks_to_delete_count;
op_data->done = op_data->errors = 0; op_data->done = op_data->errors = 0;
if (!extra_chunks) if (!op_data->n_subops)
{ {
return; return;
} }
osd_op_t *subops = new osd_op_t[extra_chunks]; osd_op_t *subops = new osd_op_t[chunks_to_delete_count];
op_data->subops = subops; op_data->subops = subops;
int i = 0; for (int i = 0; i < chunks_to_delete_count; i++)
for (auto & chunk: loc_set)
{ {
if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role])) auto & chunk = chunks_to_delete[i];
{
int stripe_num = op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role;
if (chunk.osd_num == this->osd_num) if (chunk.osd_num == this->osd_num)
{ {
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin); clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
@ -404,12 +447,8 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, osd_num_t *cur_set, uint
{ {
handle_primary_bs_subop(subop); handle_primary_bs_subop(subop);
}, },
.oid = { .oid = chunk.oid,
.inode = op_data->oid.inode, .version = chunk.version,
.stripe = op_data->oid.stripe | stripe_num,
},
// Same version as write
.version = op_data->fact_ver,
}); });
bs->enqueue_op(subops[i].bs_op); bs->enqueue_op(subops[i].bs_op);
} }
@ -423,12 +462,8 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, osd_num_t *cur_set, uint
.id = c_cli.next_subop_id++, .id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_DELETE, .opcode = OSD_OP_SEC_DELETE,
}, },
.oid = { .oid = chunk.oid,
.inode = op_data->oid.inode, .version = chunk.version,
.stripe = op_data->oid.stripe | stripe_num,
},
// Same version as write
.version = op_data->fact_ver,
}; };
subops[i].callback = [cur_op, this](osd_op_t *subop) subops[i].callback = [cur_op, this](osd_op_t *subop)
{ {
@ -442,8 +477,6 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, osd_num_t *cur_set, uint
}; };
c_cli.outbox_push(&subops[i]); c_cli.outbox_push(&subops[i]);
} }
i++;
}
} }
} }
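Note on the `finish_op()` statistics hunk in this file: the operation latency is computed from two `timespec` values as whole seconds scaled to microseconds plus the nanosecond difference divided by 1000. The helper below isolates that arithmetic. `elapsed_usec()` is a hypothetical name, and the signed return type is a choice of this sketch so that a negative nanosecond difference is handled without wraparound.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Elapsed microseconds between two timestamps,
// using the same formula as the statistics code in finish_op()
inline int64_t elapsed_usec(const timespec & begin, const timespec & end)
{
    return (int64_t)(end.tv_sec - begin.tv_sec)*1000000 +
        (int64_t)(end.tv_nsec - begin.tv_nsec)/1000;
}
```

The diff timestamps with `CLOCK_REALTIME`, which can jump when the wall clock is adjusted. `CLOCK_MONOTONIC` is the usual choice for measuring intervals, though whether that matters for these statistics is a judgment call.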


@ -7,6 +7,7 @@
#include <jerasure/reed_sol.h> #include <jerasure/reed_sol.h>
#include <jerasure.h> #include <jerasure.h>
#include <map> #include <map>
#include "allocator.h"
#include "xor.h" #include "xor.h"
#include "osd_rmw.h" #include "osd_rmw.h"
#include "malloc_or_die.h" #include "malloc_or_die.h"
@ -81,7 +82,7 @@ void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start,
} }
} }
void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size) void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bitmap_size)
{ {
for (int role = 0; role < pg_size; role++) for (int role = 0; role < pg_size; role++)
{ {
@ -106,6 +107,7 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size)
stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start), stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
); );
memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
prev = -1; prev = -1;
} }
else else
@ -116,6 +118,7 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size)
stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start), stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
); );
memxor(stripes[role].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
} }
} }
} }
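Note on the `reconstruct_stripes_xor()` hunks above: the bitmap of a missing chunk is rebuilt the same way as its data, by XOR-ing the bitmaps of the remaining chunks. A self-contained sketch of the idea on plain byte vectors, assuming all bitmaps have equal length; `xor_rebuild_bitmap()` is a hypothetical helper and not the `memxor()` used in the diff:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rebuilds the bitmap of one missing chunk as the XOR of all other
// chunks' bitmaps (data and parity). All bitmaps must have equal length.
inline std::vector<uint8_t> xor_rebuild_bitmap(
    const std::vector<std::vector<uint8_t>> & bitmaps, size_t missing)
{
    assert(missing < bitmaps.size());
    std::vector<uint8_t> result(bitmaps[missing].size(), 0);
    for (size_t role = 0; role < bitmaps.size(); role++)
    {
        if (role == missing)
            continue;
        assert(bitmaps[role].size() == result.size());
        for (size_t i = 0; i < result.size(); i++)
            result[i] ^= bitmaps[role][i];
    }
    return result;
}
```

For example, with data bitmaps `0x0F` and `0xF0` and parity bitmap `0xFF`, losing the second data chunk gives `0x0F ^ 0xFF = 0xF0`, which is the original bitmap.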
@ -212,7 +215,7 @@ int* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg
auto dec_it = matrix->decodings.find((reed_sol_erased_t){ .data = erased, .size = pg_size }); auto dec_it = matrix->decodings.find((reed_sol_erased_t){ .data = erased, .size = pg_size });
if (dec_it == matrix->decodings.end()) if (dec_it == matrix->decodings.end())
{ {
int *dm_ids = (int*)malloc(sizeof(int)*(pg_minsize + pg_minsize*pg_minsize + pg_size)); int *dm_ids = (int*)malloc_or_die(sizeof(int)*(pg_minsize + pg_minsize*pg_minsize + pg_size));
int *decoding_matrix = dm_ids + pg_minsize; int *decoding_matrix = dm_ids + pg_minsize;
if (!dm_ids) if (!dm_ids)
throw std::bad_alloc(); throw std::bad_alloc();
@ -230,7 +233,7 @@ int* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg
return dec_it->second; return dec_it->second;
} }
void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize) void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
{ {
int *dm_ids = get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize); int *dm_ids = get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize);
if (!dm_ids) if (!dm_ids)
@ -257,6 +260,18 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
pg_minsize, OSD_JERASURE_W, decoding_matrix+(role*pg_minsize), dm_ids, role, pg_minsize, OSD_JERASURE_W, decoding_matrix+(role*pg_minsize), dm_ids, role,
data_ptrs, data_ptrs+pg_minsize, stripes[role].read_end - stripes[role].read_start data_ptrs, data_ptrs+pg_minsize, stripes[role].read_end - stripes[role].read_start
); );
for (int other = 0; other < pg_size; other++)
{
if (stripes[other].read_end != 0 && !stripes[other].missing)
{
data_ptrs[other] = (char*)(stripes[other].bmp_buf);
}
}
data_ptrs[role] = (char*)stripes[role].bmp_buf;
jerasure_matrix_dotprod(
pg_minsize, OSD_JERASURE_W, decoding_matrix+(role*pg_minsize), dm_ids, role,
data_ptrs, data_ptrs+pg_minsize, bitmap_size
);
} }
} }
} }
@ -320,7 +335,8 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
} }
void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set, void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size) uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set,
uint64_t chunk_size, uint32_t bitmap_size)
{ {
// Generic parity modification (read-modify-write) algorithm // Generic parity modification (read-modify-write) algorithm
// Read -> Reconstruct missing chunks -> Calc parity chunks -> Write // Read -> Reconstruct missing chunks -> Calc parity chunks -> Write
@ -521,11 +537,12 @@ static void xor_multiple_buffers(buf_len_t *xor1, int n1, buf_len_t *xor2, int n
} }
static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t &start, uint32_t &end) uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_granularity,
uint32_t &start, uint32_t &end)
{ {
if (write_osd_set[pg_minsize] != 0 || write_osd_set != read_osd_set) if (write_osd_set[pg_minsize] != 0 || write_osd_set != read_osd_set)
{ {
// Required for the next two if()s // start & end are required for calc_rmw_parity
for (int role = 0; role < pg_minsize; role++) for (int role = 0; role < pg_minsize; role++)
{ {
if (stripes[role].req_end != 0) if (stripes[role].req_end != 0)
@ -543,6 +560,20 @@ static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int
} }
} }
} }
// Set bitmap bits accordingly
if (bitmap_granularity > 0)
{
for (int role = 0; role < pg_minsize; role++)
{
if (stripes[role].req_end != 0)
{
bitmap_set(
stripes[role].bmp_buf, stripes[role].req_start,
stripes[role].req_end-stripes[role].req_start, bitmap_granularity
);
}
}
}
if (write_osd_set != read_osd_set) if (write_osd_set != read_osd_set)
{ {
for (int role = 0; role < pg_minsize; role++) for (int role = 0; role < pg_minsize; role++)
@ -603,12 +634,14 @@ static void calc_rmw_parity_copy_parity(osd_rmw_stripe_t *stripes, int pg_size,
#endif #endif
} }
void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size) void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set,
uint32_t chunk_size, uint32_t bitmap_size)
{ {
uint32_t bitmap_granularity = bitmap_size > 0 ? chunk_size / bitmap_size / 8 : 0;
int pg_minsize = pg_size-1; int pg_minsize = pg_size-1;
reconstruct_stripes_xor(stripes, pg_size); reconstruct_stripes_xor(stripes, pg_size, bitmap_size);
uint32_t start = 0, end = 0; uint32_t start = 0, end = 0;
calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, start, end); calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, bitmap_granularity, start, end);
if (write_osd_set[pg_minsize] != 0 && end != 0) if (write_osd_set[pg_minsize] != 0 && end != 0)
{ {
// Calculate new parity (XOR k+1) // Calculate new parity (XOR k+1)
@ -626,9 +659,11 @@ void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_
if (prev == -1) if (prev == -1)
{ {
xor1[n1++] = { .buf = stripes[parity].write_buf, .len = end-start }; xor1[n1++] = { .buf = stripes[parity].write_buf, .len = end-start };
memxor(stripes[parity].bmp_buf, stripes[other].bmp_buf, stripes[parity].bmp_buf, bitmap_size);
} }
else else
{ {
memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[parity].bmp_buf, bitmap_size);
get_old_new_buffers(stripes[prev], start, end, xor1, n1); get_old_new_buffers(stripes[prev], start, end, xor1, n1);
prev = -1; prev = -1;
} }
@ -641,12 +676,13 @@ void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_
} }
void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size) uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_size)
{ {
uint32_t bitmap_granularity = bitmap_size > 0 ? chunk_size / bitmap_size / 8 : 0;
reed_sol_matrix_t *matrix = get_jerasure_matrix(pg_size, pg_minsize); reed_sol_matrix_t *matrix = get_jerasure_matrix(pg_size, pg_minsize);
reconstruct_stripes_jerasure(stripes, pg_size, pg_minsize); reconstruct_stripes_jerasure(stripes, pg_size, pg_minsize, bitmap_size);
uint32_t start = 0, end = 0; uint32_t start = 0, end = 0;
calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, start, end); calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, bitmap_granularity, start, end);
if (end != 0) if (end != 0)
{ {
int i; int i;
@ -701,6 +737,14 @@ void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_min
); );
pos = next_end; pos = next_end;
} }
for (int i = 0; i < pg_size; i++)
{
data_ptrs[i] = stripes[i].bmp_buf;
}
jerasure_matrix_encode(
pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W, matrix->data,
(char**)data_ptrs, (char**)data_ptrs+pg_minsize, bitmap_size
);
} }
} }
calc_rmw_parity_copy_parity(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, start, end); calc_rmw_parity_copy_parity(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, start, end);
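
The hunks above derive a bitmap granularity from `chunk_size` and `bitmap_size` and XOR the per-stripe bitmaps into the parity bitmap. A minimal sketch of both steps follows. `bitmap_granularity_for`, `xor_bytes` and `xor_parity_bitmap` are hypothetical names used only for illustration; the `(a, b, dst, len)` argument order of `xor_bytes` is inferred from the `memxor` call sites in the diff, not taken from its definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of data bytes covered by one bitmap bit.
// Mirrors the expression added to calc_rmw_parity_jerasure(); 0 means "no bitmap".
static uint32_t bitmap_granularity_for(uint32_t chunk_size, uint32_t bitmap_size)
{
    return bitmap_size > 0 ? chunk_size / bitmap_size / 8 : 0;
}

// Hypothetical stand-in for memxor(): dst = a XOR b, byte by byte.
// dst may alias a or b, because each byte is read before it is written.
static void xor_bytes(const void *a, const void *b, void *dst, size_t len)
{
    const uint8_t *pa = static_cast<const uint8_t*>(a);
    const uint8_t *pb = static_cast<const uint8_t*>(b);
    uint8_t *pd = static_cast<uint8_t*>(dst);
    for (size_t i = 0; i < len; i++)
        pd[i] = pa[i] ^ pb[i];
}

// For a 2+1 XOR group, the parity bitmap is the XOR of the two data bitmaps.
static uint32_t xor_parity_bitmap(uint32_t data0, uint32_t data1)
{
    uint32_t parity = 0;
    xor_bytes(&data0, &data1, &parity, sizeof(parity));
    return parity;
}
```

With a 128 KiB chunk and a 4-byte bitmap, each bit covers 4096 bytes, which matches the 4 KiB write ranges used in the tests in this comparison.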


@ -20,6 +20,7 @@ struct buf_len_t
struct osd_rmw_stripe_t struct osd_rmw_stripe_t
{ {
void *read_buf, *write_buf; void *read_buf, *write_buf;
void *bmp_buf;
uint32_t req_start, req_end; uint32_t req_start, req_end;
uint32_t read_start, read_end; uint32_t read_start, read_end;
uint32_t write_start, write_end; uint32_t write_start, write_end;
@ -30,20 +31,22 @@ struct osd_rmw_stripe_t
void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t len, osd_rmw_stripe_t *stripes); void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t len, osd_rmw_stripe_t *stripes);
void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size); void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bitmap_size);
int extend_missing_stripes(osd_rmw_stripe_t *stripes, osd_num_t *osd_set, int pg_minsize, int pg_size); int extend_missing_stripes(osd_rmw_stripe_t *stripes, osd_num_t *osd_set, int pg_minsize, int pg_size);
void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t add_size); void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t add_size);
void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set, void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size); uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set,
uint64_t chunk_size, uint32_t bitmap_size);
void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size); void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set,
uint32_t chunk_size, uint32_t bitmap_size);
void use_jerasure(int pg_size, int pg_minsize, bool use); void use_jerasure(int pg_size, int pg_minsize, bool use);
void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize); void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size);
void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size); uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_size);


@ -126,12 +126,16 @@ void test1()
void test4() void test4()
{ {
const uint32_t bmp = 4;
unsigned bitmaps[3] = { 0 };
osd_num_t osd_set[3] = { 1, 0, 3 }; osd_num_t osd_set[3] = { 1, 0, 3 };
osd_rmw_stripe_t stripes[3] = { 0 }; osd_rmw_stripe_t stripes[3] = { 0 };
// Test 4.1 // Test 4.1
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes); split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
for (int i = 0; i < 3; i++)
stripes[i].bmp_buf = bitmaps+i;
void* write_buf = malloc(8192); void* write_buf = malloc(8192);
void* rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024); void* rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024, bmp);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024); assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 4096 && stripes[2].read_end == 128*1024); assert(stripes[2].read_start == 4096 && stripes[2].read_end == 128*1024);
@ -149,7 +153,13 @@ void test4()
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
set_pattern(stripes[1].read_buf, 128*1024-4096, UINT64_MAX); // didn't read it, it's missing set_pattern(stripes[1].read_buf, 128*1024-4096, UINT64_MAX); // didn't read it, it's missing
set_pattern(stripes[2].read_buf, 128*1024-4096, 0); // old parity = 0 set_pattern(stripes[2].read_buf, 128*1024-4096, 0); // old parity = 0
calc_rmw_parity_xor(stripes, 3, osd_set, osd_set, 128*1024); memset(stripes[0].bmp_buf, 0, bmp);
memset(stripes[1].bmp_buf, 0, bmp);
memset(stripes[2].bmp_buf, 0, bmp);
calc_rmw_parity_xor(stripes, 3, osd_set, osd_set, 128*1024, bmp);
assert(*(uint32_t*)stripes[0].bmp_buf == 0x80000000);
assert(*(uint32_t*)stripes[1].bmp_buf == 0x00000001);
assert(*(uint32_t*)stripes[2].bmp_buf == 0x80000001); // XOR
check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
@ -181,7 +191,7 @@ void test5()
assert(stripes[2].req_end == 0); assert(stripes[2].req_end == 0);
// Test 5.2 // Test 5.2
void *write_buf = malloc(64*1024*3); void *write_buf = malloc(64*1024*3);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024, 0);
assert(stripes[0].read_start == 64*1024 && stripes[0].read_end == 128*1024); assert(stripes[0].read_start == 64*1024 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 64*1024 && stripes[2].read_end == 128*1024); assert(stripes[2].read_start == 64*1024 && stripes[2].read_end == 128*1024);
@ -218,7 +228,7 @@ void test6()
// Test 6.1 // Test 6.1
split_stripes(2, 128*1024, 0, 64*1024*3, stripes); split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
void *write_buf = malloc(64*1024*3); void *write_buf = malloc(64*1024*3);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, osd_set, 128*1024, 0);
assert(stripes[0].read_end == 0); assert(stripes[0].read_end == 0);
assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_end == 0); assert(stripes[2].read_end == 0);
@ -261,7 +271,7 @@ void test7()
// Test 7.1 // Test 7.1
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes); split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
void *write_buf = malloc(8192); void *write_buf = malloc(8192);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024, 0);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024); assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024); assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
@ -279,7 +289,7 @@ void test7()
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
set_pattern(stripes[1].read_buf, 128*1024, UINT64_MAX); // didn't read it, it's missing set_pattern(stripes[1].read_buf, 128*1024, UINT64_MAX); // didn't read it, it's missing
set_pattern(stripes[2].read_buf, 128*1024, 0); // old parity = 0 set_pattern(stripes[2].read_buf, 128*1024, 0); // old parity = 0
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024); calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024); assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024); assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -314,7 +324,7 @@ void test8()
// Test 8.1 // Test 8.1
split_stripes(2, 128*1024, 0, 128*1024+4096, stripes); split_stripes(2, 128*1024, 0, 128*1024+4096, stripes);
void *write_buf = malloc(128*1024+4096); void *write_buf = malloc(128*1024+4096);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024, 0);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 0); assert(stripes[0].read_start == 0 && stripes[0].read_end == 0);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 0); assert(stripes[2].read_start == 0 && stripes[2].read_end == 0);
@ -330,7 +340,7 @@ void test8()
// Test 8.2 // Test 8.2
set_pattern(write_buf, 128*1024+4096, PATTERN0); set_pattern(write_buf, 128*1024+4096, PATTERN0);
set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN1); set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN1);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024); calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); // recheck again assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); // recheck again
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096); // recheck again assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096); // recheck again
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again
@ -373,7 +383,7 @@ void test9()
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0); assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 9.1 // Test 9.1
void *write_buf = NULL; void *write_buf = NULL;
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024); assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024); assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
@ -389,7 +399,7 @@ void test9()
// Test 9.2 // Test 9.2
set_pattern(stripes[1].read_buf, 128*1024, 0); set_pattern(stripes[1].read_buf, 128*1024, 0);
set_pattern(stripes[2].read_buf, 128*1024, PATTERN1); set_pattern(stripes[2].read_buf, 128*1024, PATTERN1);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024); calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0); assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 0); assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
@ -428,7 +438,7 @@ void test10()
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0); assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 10.1 // Test 10.1
void *write_buf = malloc(256*1024); void *write_buf = malloc(256*1024);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
assert(rmw_buf); assert(rmw_buf);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 0); assert(stripes[0].read_start == 0 && stripes[0].read_end == 0);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 0); assert(stripes[1].read_start == 0 && stripes[1].read_end == 0);
@ -445,7 +455,7 @@ void test10()
// Test 10.2 // Test 10.2
set_pattern(stripes[0].write_buf, 128*1024, PATTERN1); set_pattern(stripes[0].write_buf, 128*1024, PATTERN1);
set_pattern(stripes[1].write_buf, 128*1024, PATTERN2); set_pattern(stripes[1].write_buf, 128*1024, PATTERN2);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024); calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024); assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -484,7 +494,7 @@ void test11()
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0); assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 11.1 // Test 11.1
void *write_buf = malloc(256*1024); void *write_buf = malloc(256*1024);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
assert(rmw_buf); assert(rmw_buf);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024); assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 0); assert(stripes[1].read_start == 0 && stripes[1].read_end == 0);
@ -501,7 +511,7 @@ void test11()
// Test 11.2 // Test 11.2
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
set_pattern(stripes[1].write_buf, 128*1024, PATTERN2); set_pattern(stripes[1].write_buf, 128*1024, PATTERN2);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024); calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0); assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024); assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -539,7 +549,7 @@ void test12()
assert(stripes[1].req_start == 0 && stripes[1].req_end == 0); assert(stripes[1].req_start == 0 && stripes[1].req_end == 0);
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0); assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 12.1 // Test 12.1
void *rmw_buf = calc_rmw(NULL, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(NULL, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
assert(rmw_buf); assert(rmw_buf);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024); assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
@ -556,7 +566,7 @@ void test12()
// Test 12.2 // Test 12.2
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
set_pattern(stripes[1].read_buf, 128*1024, PATTERN2); set_pattern(stripes[1].read_buf, 128*1024, PATTERN2);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024); calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0); assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0); assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -596,7 +606,7 @@ void test13()
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0); assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
assert(stripes[3].req_start == 0 && stripes[3].req_end == 0); assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
// Test 13.1 // Test 13.1
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 4, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 4, write_osd_set, 128*1024, 0);
assert(rmw_buf); assert(rmw_buf);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096); assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
@ -618,7 +628,7 @@ void test13()
set_pattern(write_buf, 8192, PATTERN3); set_pattern(write_buf, 8192, PATTERN3);
set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1); set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN2); set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN2);
calc_rmw_parity_jerasure(stripes, 4, 2, osd_set, write_osd_set, 128*1024); calc_rmw_parity_jerasure(stripes, 4, 2, osd_set, write_osd_set, 128*1024, 0);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024); assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096); assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -653,7 +663,7 @@ void test13()
assert(stripes[3].read_buf == read_buf+3*128*1024); assert(stripes[3].read_buf == read_buf+3*128*1024);
memcpy(read_buf+2*128*1024, rmw_buf, 128*1024); memcpy(read_buf+2*128*1024, rmw_buf, 128*1024);
memcpy(read_buf+3*128*1024, rmw_buf+128*1024, 128*1024); memcpy(read_buf+3*128*1024, rmw_buf+128*1024, 128*1024);
reconstruct_stripes_jerasure(stripes, 4, 2); reconstruct_stripes_jerasure(stripes, 4, 2, 0);
check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1); check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3); check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
check_pattern(stripes[1].read_buf, 4096, PATTERN3); check_pattern(stripes[1].read_buf, 4096, PATTERN3);
@ -684,7 +694,7 @@ void test13()
assert(stripes[3].read_buf == read_buf+2*128*1024); assert(stripes[3].read_buf == read_buf+2*128*1024);
memcpy(read_buf+128*1024, rmw_buf, 128*1024); memcpy(read_buf+128*1024, rmw_buf, 128*1024);
memcpy(read_buf+2*128*1024, rmw_buf+128*1024, 128*1024); memcpy(read_buf+2*128*1024, rmw_buf+128*1024, 128*1024);
reconstruct_stripes_jerasure(stripes, 4, 2); reconstruct_stripes_jerasure(stripes, 4, 2, 0);
check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1); check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3); check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
free(read_buf); free(read_buf);
@ -711,10 +721,12 @@ void test13()
void test14() void test14()
{ {
const int bmp = 4;
use_jerasure(3, 2, true); use_jerasure(3, 2, true);
osd_num_t osd_set[3] = { 1, 2, 0 }; osd_num_t osd_set[3] = { 1, 2, 0 };
osd_num_t write_osd_set[3] = { 1, 2, 3 }; osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 }; osd_rmw_stripe_t stripes[3] = { 0 };
unsigned bitmaps[3] = { 0 };
// Test 14.0 // Test 14.0
void *write_buf = malloc_or_die(8192); void *write_buf = malloc_or_die(8192);
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes); split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
@ -722,7 +734,9 @@ void test14()
assert(stripes[1].req_start == 0 && stripes[1].req_end == 4096); assert(stripes[1].req_start == 0 && stripes[1].req_end == 4096);
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0); assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 14.1 // Test 14.1
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, bmp);
for (int i = 0; i < 3; i++)
stripes[i].bmp_buf = bitmaps+i;
assert(rmw_buf); assert(rmw_buf);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096); assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
@ -740,7 +754,13 @@ void test14()
set_pattern(write_buf, 8192, PATTERN3); set_pattern(write_buf, 8192, PATTERN3);
set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1); set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN2); set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN2);
calc_rmw_parity_jerasure(stripes, 3, 2, osd_set, write_osd_set, 128*1024); memset(stripes[0].bmp_buf, 0, bmp);
memset(stripes[1].bmp_buf, 0, bmp);
memset(stripes[2].bmp_buf, 0, bmp);
calc_rmw_parity_jerasure(stripes, 3, 2, osd_set, write_osd_set, 128*1024, bmp);
assert(*(uint32_t*)stripes[0].bmp_buf == 0x80000000);
assert(*(uint32_t*)stripes[1].bmp_buf == 0x00000001);
assert(*(uint32_t*)stripes[2].bmp_buf == 0x80000001); // jerasure 2+1 is still just XOR
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024); assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096); assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -764,6 +784,8 @@ void test14()
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024); assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024); assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
void *read_buf = alloc_read_buffer(stripes, 3, 0); void *read_buf = alloc_read_buffer(stripes, 3, 0);
for (int i = 0; i < 3; i++)
stripes[i].bmp_buf = bitmaps+i;
assert(read_buf); assert(read_buf);
assert(stripes[0].read_buf == read_buf); assert(stripes[0].read_buf == read_buf);
assert(stripes[1].read_buf == read_buf+128*1024); assert(stripes[1].read_buf == read_buf+128*1024);
@ -771,7 +793,7 @@ void test14()
set_pattern(stripes[1].read_buf, 4096, PATTERN3); set_pattern(stripes[1].read_buf, 4096, PATTERN3);
set_pattern(stripes[1].read_buf+4096, 128*1024-4096, PATTERN2); set_pattern(stripes[1].read_buf+4096, 128*1024-4096, PATTERN2);
memcpy(stripes[2].read_buf, rmw_buf, 128*1024); memcpy(stripes[2].read_buf, rmw_buf, 128*1024);
reconstruct_stripes_jerasure(stripes, 3, 2); reconstruct_stripes_jerasure(stripes, 3, 2, bmp);
check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1); check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3); check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
free(read_buf); free(read_buf);
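
The bitmap values asserted in test4 and test14 (`0x80000000`, `0x00000001`, `0x80000001`) are consistent with a layout in which bit `i` covers the byte range `[i*granularity, (i+1)*granularity)` of a chunk. The sketch below reproduces those values under that assumption. `mark_written` is a hypothetical helper, not a function from the diff, and it works on a plain `uint32_t` rather than a byte buffer so that the result does not depend on host byte order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the diff): set one bit per granularity-sized
// block covered by the byte range [start, end) inside a single chunk.
// Assumes the whole bitmap fits in 32 bits and that start and end are
// multiples of granularity.
static uint32_t mark_written(uint32_t bitmap, uint32_t start, uint32_t end, uint32_t granularity)
{
    if (granularity == 0 || end <= start)
        return bitmap; // bitmaps disabled or empty range: nothing to mark
    for (uint32_t bit = start / granularity; bit < end / granularity; bit++)
        bitmap |= (uint32_t)1 << bit;
    return bitmap;
}
```

In the tests, the 8 KiB write at offset `128*1024-4096` touches the last 4 KiB block of stripe 0 (bit 31) and the first 4 KiB block of stripe 1 (bit 0), and the parity bitmap is the XOR of the two.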


@ -17,9 +17,13 @@ void osd_t::secondary_op_callback(osd_op_t *op)
{ {
op->reply.sec_del.version = op->bs_op->version; op->reply.sec_del.version = op->bs_op->version;
} }
if (op->req.hdr.opcode == OSD_OP_SEC_READ && if (op->req.hdr.opcode == OSD_OP_SEC_READ)
op->bs_op->retval > 0)
{ {
if (op->bs_op->retval >= 0)
op->reply.sec_rw.attr_len = clean_entry_bitmap_size;
else
op->reply.sec_rw.attr_len = 0;
if (op->bs_op->retval > 0)
op->iov.push_back(op->buf, op->bs_op->retval); op->iov.push_back(op->buf, op->bs_op->retval);
} }
else if (op->req.hdr.opcode == OSD_OP_SEC_LIST) else if (op->req.hdr.opcode == OSD_OP_SEC_LIST)
@ -55,11 +59,22 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{ {
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ)
{
// Allocate memory for the read operation
if (clean_entry_bitmap_size > sizeof(unsigned))
cur_op->bitmap = cur_op->rmw_buf = malloc_or_die(clean_entry_bitmap_size);
else
cur_op->bitmap = &cur_op->bmp_data;
if (cur_op->req.sec_rw.len > 0)
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
}
cur_op->bs_op->oid = cur_op->req.sec_rw.oid; cur_op->bs_op->oid = cur_op->req.sec_rw.oid;
cur_op->bs_op->version = cur_op->req.sec_rw.version; cur_op->bs_op->version = cur_op->req.sec_rw.version;
cur_op->bs_op->offset = cur_op->req.sec_rw.offset; cur_op->bs_op->offset = cur_op->req.sec_rw.offset;
cur_op->bs_op->len = cur_op->req.sec_rw.len; cur_op->bs_op->len = cur_op->req.sec_rw.len;
cur_op->bs_op->buf = cur_op->buf; cur_op->bs_op->buf = cur_op->buf;
cur_op->bs_op->bitmap = cur_op->bitmap;
#ifdef OSD_STUB #ifdef OSD_STUB
cur_op->bs_op->retval = cur_op->bs_op->len; cur_op->bs_op->retval = cur_op->bs_op->len;
#endif #endif
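
The `OSD_OP_SEC_READ` branch above stores the bitmap inline in the operation when it fits into an `unsigned` and allocates it on the heap otherwise. A minimal sketch of that pattern, using a hypothetical `op_sketch_t` in place of the real `osd_op_t` (whose full definition is not part of this comparison):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical, simplified stand-in for the relevant osd_op_t fields.
struct op_sketch_t
{
    void *bitmap = nullptr;   // where the reader should store the bitmap
    void *heap_buf = nullptr; // owned heap allocation, used only for large bitmaps
    unsigned inline_bmp = 0;  // inline storage for small bitmaps
};

// Point op.bitmap at inline storage when the bitmap fits, else allocate.
static void attach_bitmap(op_sketch_t &op, size_t bitmap_size)
{
    if (bitmap_size > sizeof(op.inline_bmp))
    {
        op.heap_buf = std::calloc(1, bitmap_size);
        if (!op.heap_buf)
            std::abort(); // same "allocate or die" policy as malloc_or_die
        op.bitmap = op.heap_buf;
    }
    else
    {
        op.inline_bmp = 0;
        op.bitmap = &op.inline_bmp;
    }
}

// Release the heap buffer, if any; free(nullptr) is a no-op.
static void release_bitmap(op_sketch_t &op)
{
    std::free(op.heap_buf);
    op.heap_buf = nullptr;
    op.bitmap = nullptr;
}
```

One consequence of this design is that an object using inline storage must not be copied or moved after `attach_bitmap`, because `bitmap` would keep pointing into the original object.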


@ -39,12 +39,14 @@ void DSO_STAMP_FUN(void)
typedef struct VitastorClient typedef struct VitastorClient
{ {
void *proxy; void *proxy;
void *watch;
char *etcd_host; char *etcd_host;
char *etcd_prefix; char *etcd_prefix;
char *image;
uint64_t inode; uint64_t inode;
uint64_t pool; uint64_t pool;
uint64_t size; uint64_t size;
int readonly; long readonly;
QemuMutex mutex; QemuMutex mutex;
} VitastorClient; } VitastorClient;
@ -53,10 +55,14 @@ typedef struct VitastorRPC
BlockDriverState *bs; BlockDriverState *bs;
Coroutine *co; Coroutine *co;
QEMUIOVector *iov; QEMUIOVector *iov;
int ret; long ret;
int complete; int complete;
} VitastorRPC; } VitastorRPC;
static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task);
static void vitastor_co_generic_bh_cb(long retval, void *opaque);
static void vitastor_close(BlockDriverState *bs);
static char *qemu_rbd_next_tok(char *src, char delim, char **p) static char *qemu_rbd_next_tok(char *src, char delim, char **p)
{ {
char *end; char *end;
@ -132,22 +138,25 @@ static void vitastor_parse_filename(const char *filename, QDict *options, Error
qdict_put_str(options, name, value); qdict_put_str(options, name, value);
} }
} }
if (!qdict_get_try_str(options, "image"))
{
if (!qdict_get_try_int(options, "inode", 0)) if (!qdict_get_try_int(options, "inode", 0))
{ {
error_setg(errp, "inode is missing"); error_setg(errp, "one of image (name) and inode (number) must be specified");
goto out; goto out;
} }
if (!(qdict_get_try_int(options, "inode", 0) >> (64-POOL_ID_BITS)) && if (!(qdict_get_try_int(options, "inode", 0) >> (64-POOL_ID_BITS)) &&
!qdict_get_try_int(options, "pool", 0)) !qdict_get_try_int(options, "pool", 0))
{ {
error_setg(errp, "pool number is missing"); error_setg(errp, "pool number must be specified or included in the inode number");
goto out; goto out;
} }
if (!qdict_get_try_int(options, "size", 0)) if (!qdict_get_try_int(options, "size", 0))
{ {
error_setg(errp, "size is missing"); error_setg(errp, "size must be specified when inode number is used instead of image name");
goto out; goto out;
} }
}
if (!qdict_get_str(options, "etcd_host")) if (!qdict_get_str(options, "etcd_host"))
{ {
error_setg(errp, "etcd_host is missing"); error_setg(errp, "etcd_host is missing");
@ -159,27 +168,86 @@ out:
return; return;
} }
static void coroutine_fn vitastor_co_get_metadata(VitastorRPC *task)
{
BlockDriverState *bs = task->bs;
VitastorClient *client = bs->opaque;
task->co = qemu_coroutine_self();
qemu_mutex_lock(&client->mutex);
vitastor_proxy_watch_metadata(client->proxy, client->image, vitastor_co_generic_bh_cb, task);
qemu_mutex_unlock(&client->mutex);
while (!task->complete)
{
qemu_coroutine_yield();
}
}
static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, Error **errp) static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, Error **errp)
{ {
VitastorClient *client = bs->opaque; VitastorClient *client = bs->opaque;
int64_t ret = 0; int64_t ret = 0;
qemu_mutex_init(&client->mutex);
client->etcd_host = g_strdup(qdict_get_try_str(options, "etcd_host")); client->etcd_host = g_strdup(qdict_get_try_str(options, "etcd_host"));
client->etcd_prefix = g_strdup(qdict_get_try_str(options, "etcd_prefix")); client->etcd_prefix = g_strdup(qdict_get_try_str(options, "etcd_prefix"));
client->proxy = vitastor_proxy_create(bdrv_get_aio_context(bs), client->etcd_host, client->etcd_prefix);
client->image = g_strdup(qdict_get_try_str(options, "image"));
client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
if (client->image)
{
// Get image metadata (size and readonly flag)
VitastorRPC task;
task.complete = 0;
task.bs = bs;
if (qemu_in_coroutine())
{
vitastor_co_get_metadata(&task);
}
else
{
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
qemu_coroutine_enter(qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task));
}
BDRV_POLL_WHILE(bs, !task.complete);
client->watch = (void*)task.ret;
client->readonly = client->readonly || vitastor_proxy_get_readonly(client->watch);
client->size = vitastor_proxy_get_size(client->watch);
if (!vitastor_proxy_get_inode_num(client->watch))
{
error_setg(errp, "image does not exist");
vitastor_close(bs);
return -1;
}
if (!client->size)
{
client->size = qdict_get_int(options, "size");
}
}
else
{
client->watch = NULL;
client->inode = qdict_get_int(options, "inode"); client->inode = qdict_get_int(options, "inode");
client->pool = qdict_get_int(options, "pool"); client->pool = qdict_get_int(options, "pool");
if (client->pool) if (client->pool)
{
client->inode = (client->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS)); client->inode = (client->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
}
client->size = qdict_get_int(options, "size"); client->size = qdict_get_int(options, "size");
client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0; }
client->proxy = vitastor_proxy_create(bdrv_get_aio_context(bs), client->etcd_host, client->etcd_prefix); if (!client->size)
//client->aio_context = bdrv_get_aio_context(bs); {
error_setg(errp, "image size not specified");
vitastor_close(bs);
return -1;
}
bs->total_sectors = client->size / BDRV_SECTOR_SIZE; bs->total_sectors = client->size / BDRV_SECTOR_SIZE;
//client->aio_context = bdrv_get_aio_context(bs);
qdict_del(options, "etcd_host"); qdict_del(options, "etcd_host");
qdict_del(options, "etcd_prefix"); qdict_del(options, "etcd_prefix");
qdict_del(options, "image");
qdict_del(options, "inode"); qdict_del(options, "inode");
qdict_del(options, "pool"); qdict_del(options, "pool");
qdict_del(options, "size"); qdict_del(options, "size");
qemu_mutex_init(&client->mutex);
return ret; return ret;
} }
@ -191,6 +259,8 @@ static void vitastor_close(BlockDriverState *bs)
g_free(client->etcd_host); g_free(client->etcd_host);
if (client->etcd_prefix) if (client->etcd_prefix)
g_free(client->etcd_prefix); g_free(client->etcd_prefix);
if (client->image)
g_free(client->image);
} }
#if QEMU_VERSION_MAJOR >= 3 #if QEMU_VERSION_MAJOR >= 3
@ -296,7 +366,7 @@ static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task)
}; };
} }
static void vitastor_co_generic_bh_cb(int retval, void *opaque) static void vitastor_co_generic_bh_cb(long retval, void *opaque)
{ {
VitastorRPC *task = opaque; VitastorRPC *task = opaque;
task->ret = retval; task->ret = retval;
@ -319,8 +389,9 @@ static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset
vitastor_co_init_task(bs, &task); vitastor_co_init_task(bs, &task);
task.iov = iov; task.iov = iov;
uint64_t inode = client->watch ? vitastor_proxy_get_inode_num(client->watch) : client->inode;
qemu_mutex_lock(&client->mutex); qemu_mutex_lock(&client->mutex);
vitastor_proxy_rw(0, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task); vitastor_proxy_rw(0, client->proxy, inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
qemu_mutex_unlock(&client->mutex); qemu_mutex_unlock(&client->mutex);
while (!task.complete) while (!task.complete)
@ -338,8 +409,9 @@ static int coroutine_fn vitastor_co_pwritev(BlockDriverState *bs, uint64_t offse
vitastor_co_init_task(bs, &task); vitastor_co_init_task(bs, &task);
task.iov = iov; task.iov = iov;
uint64_t inode = client->watch ? vitastor_proxy_get_inode_num(client->watch) : client->inode;
qemu_mutex_lock(&client->mutex); qemu_mutex_lock(&client->mutex);
vitastor_proxy_rw(1, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task); vitastor_proxy_rw(1, client->proxy, inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
qemu_mutex_unlock(&client->mutex); qemu_mutex_unlock(&client->mutex);
while (!task.complete) while (!task.complete)
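
The non-image branch of `vitastor_open` above packs the pool ID into the top bits of the 64-bit inode number. The following is a minimal standalone sketch of that packing, not the driver's code. `POOL_ID_BITS` is set to 16 purely as an assumption for illustration (the real value is defined elsewhere in the project), and the helper names `pack_inode`, `inode_pool` and `inode_local` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration only: the real POOL_ID_BITS is defined elsewhere in the project. */
#define POOL_ID_BITS 16

/* The low (64 - POOL_ID_BITS) bits hold the pool-local inode number. */
#define INODE_LOCAL_MASK ((UINT64_C(1) << (64 - POOL_ID_BITS)) - 1)

/* Hypothetical helper: place the pool ID into the top POOL_ID_BITS bits. */
static inline uint64_t pack_inode(uint64_t pool, uint64_t inode)
{
    assert(pool < (UINT64_C(1) << POOL_ID_BITS));
    return (inode & INODE_LOCAL_MASK) | (pool << (64 - POOL_ID_BITS));
}

/* Hypothetical helper: recover the pool ID from a packed inode number. */
static inline uint64_t inode_pool(uint64_t packed)
{
    return packed >> (64 - POOL_ID_BITS);
}

/* Hypothetical helper: recover the pool-local inode number. */
static inline uint64_t inode_local(uint64_t packed)
{
    return packed & INODE_LOCAL_MASK;
}
```

Under these assumptions `pack_inode(1, 2)` yields `0x0001000000000002`. The sketch uses `UINT64_C(1)` rather than `1l` so that the shift does not depend on `long` being 64 bits wide.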

@ -47,7 +47,6 @@ public:
~QemuProxy() ~QemuProxy()
{ {
cli->stop();
delete cli; delete cli;
delete tfd; delete tfd;
} }
@ -127,4 +126,38 @@ void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque)
p->cli->execute(op); p->cli->execute(op);
} }
void vitastor_proxy_watch_metadata(void *client, char *image, VitastorIOHandler cb, void *opaque)
{
QemuProxy *p = (QemuProxy*)client;
p->cli->on_ready([=]()
{
auto watch = p->cli->st_cli.watch_inode(std::string(image));
cb((long)watch, opaque);
});
}
void vitastor_proxy_close_watch(void *client, void *watch)
{
QemuProxy *p = (QemuProxy*)client;
p->cli->st_cli.close_watch((inode_watch_t*)watch);
}
uint64_t vitastor_proxy_get_size(void *watch_ptr)
{
inode_watch_t *watch = (inode_watch_t*)watch_ptr;
return watch->cfg.size;
}
uint64_t vitastor_proxy_get_inode_num(void *watch_ptr)
{
inode_watch_t *watch = (inode_watch_t*)watch_ptr;
return watch->cfg.num;
}
int vitastor_proxy_get_readonly(void *watch_ptr)
{
inode_watch_t *watch = (inode_watch_t*)watch_ptr;
return watch->cfg.readonly;
}
} }

@ -15,12 +15,17 @@ extern "C" {
#endif #endif
// Our exports // Our exports
typedef void VitastorIOHandler(int retval, void *opaque); typedef void VitastorIOHandler(long retval, void *opaque);
void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix); void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix);
void vitastor_proxy_destroy(void *client); void vitastor_proxy_destroy(void *client);
void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len, void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque); struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque);
void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque); void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque);
void vitastor_proxy_watch_metadata(void *client, char *image, VitastorIOHandler cb, void *opaque);
void vitastor_proxy_close_watch(void *client, void *watch);
uint64_t vitastor_proxy_get_size(void *watch);
uint64_t vitastor_proxy_get_inode_num(void *watch);
int vitastor_proxy_get_readonly(void *watch);
#ifdef __cplusplus #ifdef __cplusplus
} }

@ -2,20 +2,39 @@
// License: VNPL-1.1 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <stdio.h> #include <stdio.h>
#include <stdlib.h>
#include "allocator.h" #include "allocator.h"
void alloc_all(int size)
{
allocator *a = new allocator(size);
for (int i = 0; i < size; i++)
{
uint64_t x = a->find_free();
if (x == UINT64_MAX)
{
printf("ran out of space %d allocated=%d\n", size, i);
exit(1);
}
if (x != i)
{
printf("incorrect block allocated: expected %d, got %lu\n", i, x);
}
a->set(x, true);
}
uint64_t x = a->find_free();
if (x != UINT64_MAX)
{
printf("extra free space found: %lx (%d)\n", x, size);
exit(1);
}
delete a;
}
int main(int narg, char *args[]) int main(int narg, char *args[])
{ {
allocator a(8192); alloc_all(8192);
for (int i = 0; i < 8192; i++) alloc_all(8062);
{ alloc_all(4096);
uint64_t x = a.find_free();
if (x == UINT64_MAX)
{
printf("ran out of space %d\n", i);
return 1;
}
a.set(x, true);
}
return 0; return 0;
} }
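
The new `alloc_all()` test fills an allocator of a given size, checks that blocks are handed out in order, and then checks that no extra free block is reported. The size 8062 is not a multiple of 64, so it presumably exercises a partially used last bitmap word. The project's `allocator` class is not shown in this diff, so the sketch below uses a hypothetical flat bitmap (`bitmap_alloc`, `bitmap_new`, `bitmap_find_free`, `bitmap_set`, `fill_and_check` are all illustrative names) only to show the boundary condition being tested.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flat bitmap allocator: one bit per block, 1 = in use. */
typedef struct {
    uint64_t size;    /* number of blocks tracked */
    uint64_t nwords;  /* number of 64-bit words in the bitmap */
    uint64_t *words;
} bitmap_alloc;

static bitmap_alloc *bitmap_new(uint64_t size)
{
    bitmap_alloc *a = malloc(sizeof(*a));
    assert(a != NULL && size > 0);
    a->size = size;
    a->nwords = (size + 63) / 64;
    a->words = calloc(a->nwords, sizeof(uint64_t));
    assert(a->words != NULL);
    /* Mark the padding bits past `size` as used so they are never handed out. */
    if (size % 64)
        a->words[a->nwords - 1] = ~UINT64_C(0) << (size % 64);
    return a;
}

/* Return the lowest free block number, or UINT64_MAX if none is left. */
static uint64_t bitmap_find_free(const bitmap_alloc *a)
{
    for (uint64_t w = 0; w < a->nwords; w++)
    {
        if (a->words[w] == ~UINT64_C(0))
            continue; /* this word is full */
        for (unsigned b = 0; b < 64; b++)
            if (!(a->words[w] & (UINT64_C(1) << b)))
                return w * 64 + b;
    }
    return UINT64_MAX;
}

static void bitmap_set(bitmap_alloc *a, uint64_t pos, int used)
{
    assert(pos < a->size);
    if (used)
        a->words[pos / 64] |= UINT64_C(1) << (pos % 64);
    else
        a->words[pos / 64] &= ~(UINT64_C(1) << (pos % 64));
}

/* Mirrors alloc_all(): returns 1 if blocks come out as 0..size-1 and then run out. */
static int fill_and_check(uint64_t size)
{
    bitmap_alloc *a = bitmap_new(size);
    int ok = 1;
    for (uint64_t i = 0; i < size && ok; i++)
    {
        uint64_t x = bitmap_find_free(a);
        ok = (x == i);
        if (ok)
            bitmap_set(a, x, 1);
    }
    ok = ok && bitmap_find_free(a) == UINT64_MAX;
    free(a->words);
    free(a);
    return ok;
}
```

Without the padding step in `bitmap_new`, a size such as 8062 would leave two phantom free bits in the last word, which is the kind of "extra free space found" failure the test reports.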

@ -28,8 +28,6 @@ cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log & node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log &
MON_PID=$! MON_PID=$!
$ETCDCTL put /vitastor/config/global '{"immediate_commit":"all"}'
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":2,"pg_count":16,"failure_domain":"osd"}}' $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":2,"pg_count":16,"failure_domain":"osd"}}'
sleep 2 sleep 2

@ -0,0 +1,109 @@
#!/bin/bash -ex
. `dirname $0`/common.sh
if [ "$IMMEDIATE_COMMIT" != "" ]; then
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 1"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":5,"immediate_commit":"all"}'
else
NO_SAME="--journal_sector_buffer_count 1024 --log_level 1"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":5}'
fi
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd4.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd5.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd6.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd7.bin bs=1024 count=1 seek=$((1024*1024-1))
build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) 2>&1 >>./testdata/osd1.log &
OSD1_PID=$!
build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) 2>&1 >>./testdata/osd2.log &
OSD2_PID=$!
build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) 2>&1 >>./testdata/osd3.log &
OSD3_PID=$!
build/src/vitastor-osd --osd_num 4 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd4.bin 2>/dev/null) 2>&1 >>./testdata/osd4.log &
OSD4_PID=$!
build/src/vitastor-osd --osd_num 5 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd5.bin 2>/dev/null) 2>&1 >>./testdata/osd5.log &
OSD5_PID=$!
build/src/vitastor-osd --osd_num 6 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd6.bin 2>/dev/null) 2>&1 >>./testdata/osd6.log &
OSD6_PID=$!
build/src/vitastor-osd --osd_num 7 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd7.bin 2>/dev/null) 2>&1 >>./testdata/osd7.log &
OSD7_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":32,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(.[0].items["1"] | map((.osd_set | select(. > 0)) | length == 2) | length) == 32'); then
format_error "FAILED: 32 PGS NOT CONFIGURED"
fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32'); then
format_error "FAILED: 32 PGS NOT UP"
fi
IMG_SIZE=960
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=16 -fsync=16 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=${IMG_SIZE}M -cluster_log_level=10
try_reweight()
{
osd=$1
w=$2
$ETCDCTL put /vitastor/config/osd/$osd '{"reweight":'$w'}'
sleep 3
}
try_reweight 1 0
try_reweight 2 0
try_reweight 3 0
try_reweight 4 0
try_reweight 5 0
try_reweight 1 1
try_reweight 2 1
try_reweight 3 1
try_reweight 4 1
try_reweight 5 1
# Wait for the rebalance to finish
for i in {1..60}; do
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32') && \
break
if [ $i -eq 60 ]; then
format_error "Rebalance couldn't finish in 60 seconds"
fi
sleep 1
done
# Check that PGs never had degraded objects
if grep has_degraded ./testdata/mon.log; then
format_error "Some copies of objects were lost during interrupted rebalancings"
fi
# Check that no objects were lost
nobj=`$ETCDCTL get --prefix '/vitastor/pg/stats' --print-value-only | jq -s '[ .[].object_count ] | reduce .[] as $num (0; .+$num)'`
if [ "$nobj" -ne $((IMG_SIZE*8)) ]; then
format_error "Data lost after multiple interrupted rebalancings"
fi
format_green OK
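
The last check in the script above compares the summed `object_count` with `IMG_SIZE*8`, that is, 8 objects per MiB written. The object size is not stated in this diff; a factor of 8 corresponds to 128 KiB objects, which is the assumption used in this small sketch of the arithmetic. `expected_objects` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of objects needed to hold image_bytes, rounding up to whole objects. */
static inline uint64_t expected_objects(uint64_t image_bytes, uint64_t object_bytes)
{
    assert(object_bytes > 0);
    return (image_bytes + object_bytes - 1) / object_bytes;
}
```

For the 960 MiB image written by fio, `expected_objects(960 MiB, 128 KiB)` gives 7680, which equals `960*8`.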

@ -34,7 +34,40 @@ fi
#LD_PRELOAD=libasan.so.5 \ #LD_PRELOAD=libasan.so.5 \
# fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M # fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M
# Test basic write and snapshot
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg","size":'$((32*1024*1024))'}'
LD_PRELOAD=libasan.so.5 \ LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=1G -cluster_log_level=10 fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg@0","size":'$((32*1024*1024))'}'
$ETCDCTL put /vitastor/config/inode/1/3 '{"parent_id":2,"name":"testimg","size":'$((32*1024*1024))'}'
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=32 -buffer_pattern=0xdeadface \
-rw=randwrite -etcd=$ETCD_URL -image=testimg -number_ios=1024
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -rw=read -etcd=$ETCD_URL -pool=1 -inode=3 -size=32M
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024))" \
-O raw ./testdata/merged.bin
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg@0" \
-O raw ./testdata/layer0.bin
$ETCDCTL put /vitastor/config/inode/1/3 '{"name":"testimg","size":'$((32*1024*1024))'}'
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \
-O raw ./testdata/layer1.bin
node mon/merge.js ./testdata/layer0.bin ./testdata/layer1.bin ./testdata/check.bin
cmp ./testdata/merged.bin ./testdata/check.bin
format_green OK format_green OK
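
The snapshot test above reads the child image through its parent (`merged.bin`), then reads each layer separately and recombines them with `mon/merge.js` before comparing the results. `merge.js` is not shown in this diff, so the following is only a sketch of one plausible overlay rule, assuming that a 4096-byte block of the child layer overrides the parent whenever it contains any non-zero byte. The names `MERGE_BLOCK`, `block_is_zero` and `merge_layers` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed merge granularity; matches the 4k writes and -S 4096 used in the test. */
#define MERGE_BLOCK 4096

/* Return 1 if all len bytes at p are zero. */
static int block_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (p[i])
            return 0;
    return 1;
}

/* Overlay child on parent block by block: child wins wherever it holds data. */
static void merge_layers(const unsigned char *parent, const unsigned char *child,
                         unsigned char *out, size_t len)
{
    assert(parent != NULL && child != NULL && out != NULL);
    for (size_t off = 0; off < len; off += MERGE_BLOCK)
    {
        size_t n = (len - off < MERGE_BLOCK) ? len - off : MERGE_BLOCK;
        const unsigned char *src =
            block_is_zero(child + off, n) ? parent + off : child + off;
        memcpy(out + off, src, n);
    }
}
```

This rule cannot distinguish a child block that was deliberately written with zeroes from one that was never written. That limitation does not matter for a test whose child writes use a non-zero `-buffer_pattern`, but it would for general data.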