Compare commits

1 Commits

Author SHA1 Message Date
Vitaliy Filippov eabfe4faac Test SQ poll threads. Unstable and in fact slower :( 2020-03-03 17:23:39 +03:00
366 changed files with 6215 additions and 60363 deletions

@@ -1,19 +0,0 @@
.git
build
packages
mon/node_modules
*.o
*.so
osd
stub_osd
stub_uring_osd
stub_bench
osd_test
dump_journal
nbd_proxy
rm_inode
fio
qemu
rpm/*.Dockerfile
debian/*.Dockerfile
Dockerfile

.gitignore vendored

@@ -1,18 +0,0 @@
*.o
*.so
package-lock.json
fio
qemu
osd
stub_osd
stub_uring_osd
stub_bench
osd_test
osd_peering_pg_test
dump_journal
nbd_proxy
rm_inode
test_allocator
test_blockstore
test_shit
osd_rmw_test

.gitmodules vendored

@@ -1,6 +0,0 @@
[submodule "cpp-btree"]
path = cpp-btree
url = ../cpp-btree.git
[submodule "json11"]
path = json11
url = ../json11.git

@@ -1,7 +0,0 @@
cmake_minimum_required(VERSION 2.8)
project(vitastor)
set(VERSION "0.6.17")
add_subdirectory(src)

@@ -1,339 +0,0 @@
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.

LICENSE

@@ -1,27 +0,0 @@
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.
Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.
Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.
Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.
You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

Makefile Normal file

@@ -0,0 +1,66 @@
BLOCKSTORE_OBJS := allocator.o blockstore.o blockstore_impl.o blockstore_init.o blockstore_open.o blockstore_journal.o blockstore_read.o \
	blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o timerfd_interval.o
# -fsanitize=address
CXXFLAGS := -g -O3 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fPIC -fdiagnostics-color=always
all: $(BLOCKSTORE_OBJS) libfio_blockstore.so osd libfio_sec_osd.so test_blockstore stub_osd stub_bench osd_test
clean:
	rm -f *.o
crc32c.o: crc32c.c
	g++ $(CXXFLAGS) -c -o $@ $<
json11.o: json11/json11.cpp
	g++ $(CXXFLAGS) -c -o json11.o json11/json11.cpp
allocator.o: allocator.cpp allocator.h
	g++ $(CXXFLAGS) -c -o $@ $<
ringloop.o: ringloop.cpp ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
timerfd_interval.o: timerfd_interval.cpp timerfd_interval.h ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
%.o: %.cpp allocator.h blockstore_flush.h blockstore.h blockstore_impl.h blockstore_init.h blockstore_journal.h crc32c.h ringloop.h timerfd_interval.h object_id.h
	g++ $(CXXFLAGS) -c -o $@ $<
libblockstore.so: $(BLOCKSTORE_OBJS)
	g++ $(CXXFLAGS) -o libblockstore.so -shared $(BLOCKSTORE_OBJS) -ltcmalloc_minimal -luring
libfio_blockstore.so: ./libblockstore.so fio_engine.cpp json11.o
	g++ $(CXXFLAGS) -shared -o libfio_blockstore.so fio_engine.cpp json11.o ./libblockstore.so -ltcmalloc_minimal -luring
OSD_OBJS := osd.o osd_secondary.o osd_receive.o osd_send.o osd_peering.o osd_peering_pg.o osd_primary.o osd_rmw.o json11.o timerfd_interval.o
osd_secondary.o: osd_secondary.cpp osd.h osd_ops.h ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd_receive.o: osd_receive.cpp osd.h osd_ops.h ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd_send.o: osd_send.cpp osd.h osd_ops.h ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd_peering.o: osd_peering.cpp osd.h osd_ops.h osd_peering_pg.h ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg.o: osd_peering_pg.cpp object_id.h osd_peering_pg.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw.o: osd_rmw.cpp osd_rmw.h xor.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw_test: osd_rmw_test.cpp osd_rmw.cpp osd_rmw.h xor.h
	g++ $(CXXFLAGS) -o $@ $<
osd_primary.o: osd_primary.cpp osd.h osd_ops.h osd_peering_pg.h xor.h ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd.o: osd.cpp osd.h osd_ops.h osd_peering_pg.h ringloop.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
	g++ $(CXXFLAGS) -o osd osd_main.cpp $(OSD_OBJS) ./libblockstore.so -ltcmalloc_minimal -luring
stub_osd: stub_osd.cpp osd_ops.h rw_blocking.o
	g++ $(CXXFLAGS) -o stub_osd stub_osd.cpp rw_blocking.o -ltcmalloc_minimal
stub_bench: stub_bench.cpp osd_ops.h rw_blocking.o
	g++ $(CXXFLAGS) -o stub_bench stub_bench.cpp rw_blocking.o -ltcmalloc_minimal
rw_blocking.o: rw_blocking.cpp rw_blocking.h
	g++ $(CXXFLAGS) -c -o $@ $<
osd_test: osd_test.cpp osd_ops.h rw_blocking.o
	g++ $(CXXFLAGS) -o osd_test osd_test.cpp rw_blocking.o -ltcmalloc_minimal
libfio_sec_osd.so: fio_sec_osd.cpp osd_ops.h rw_blocking.o
	g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o libfio_sec_osd.so fio_sec_osd.cpp rw_blocking.o -luring
test_blockstore: ./libblockstore.so test_blockstore.cpp
	g++ $(CXXFLAGS) -o test_blockstore test_blockstore.cpp ./libblockstore.so -ltcmalloc_minimal -luring
test: test.cpp osd_peering_pg.o
	g++ $(CXXFLAGS) -o test test.cpp osd_peering_pg.o -luring
test_allocator: test_allocator.cpp allocator.o
	g++ $(CXXFLAGS) -o test_allocator test_allocator.cpp allocator.o

@@ -1,100 +0,0 @@
## Vitastor
[Read English version](README.md)
## The Idea
Let's bring back the former speed of clustered block storage!
Vitastor is a distributed block SDS (software-defined storage), a direct analogue of Ceph RBD and
of the internal storage systems of popular cloud providers. However, unlike them, Vitastor is
fast and simple at the same time. It is just small so far :-).
Vitastor is architecturally similar to Ceph, which means atomicity and strict consistency,
replication through a primary OSD, symmetric clustering without a single point of failure
and automatic data distribution over any number of drives of any size with configurable
redundancy schemes: replication or arbitrary error correction codes.
Vitastor targets SSD and SSD+HDD clusters with at least a 10 Gbit/s network, supports
TCP and RDMA, and on good hardware can reach 4 KB read and write latency of about ~0.1 ms,
which is roughly 10 times faster than Ceph and other popular software-defined storage systems.
Vitastor supports a QEMU driver, the NBD and NFS protocols, and OpenStack, Proxmox and Kubernetes drivers.
Other drivers can also be implemented easily.
See the documentation linked below for details.
## Talks and presentations
- DevOpsConf'2021: presentation ([in Russian](https://vitastor.io/presentation/devopsconf/devopsconf.html),
[in English](https://vitastor.io/presentation/devopsconf/devopsconf_en.html)),
[video](https://vitastor.io/presentation/devopsconf/talk.webm)
- Highload'2022: presentation ([in Russian](https://vitastor.io/presentation/highload/highload.html)),
[video](https://vitastor.io/presentation/highload/talk.webm)
## Documentation
- Introduction
  - [Quick Start](docs/intro/quickstart.ru.md)
  - [Features](docs/intro/features.ru.md)
  - [Architecture](docs/intro/architecture.ru.md)
  - [Author and license](docs/intro/author.ru.md)
- Installation
  - [Packages](docs/installation/packages.ru.md)
  - [Proxmox](docs/installation/proxmox.ru.md)
  - [OpenStack](docs/installation/openstack.ru.md)
  - [Kubernetes CSI](docs/installation/kubernetes.ru.md)
  - [Building from source](docs/installation/source.ru.md)
- Configuration
  - [Overview](docs/config.ru.md)
  - Parameters
    - [Common](docs/config/common.ru.md)
    - [Network](docs/config/network.ru.md)
    - [Global disk parameters](docs/config/layout-cluster.ru.md)
    - [OSD disk parameters](docs/config/layout-osd.ru.md)
    - [Other OSD parameters](docs/config/osd.ru.md)
    - [Monitor parameters](docs/config/monitor.ru.md)
  - [Pool settings](docs/config/pool.ru.md)
  - [Image metadata in etcd](docs/config/inode.ru.md)
- Usage
  - [vitastor-cli](docs/usage/cli.ru.md) (command-line interface)
  - [fio](docs/usage/fio.ru.md) for performance tests
  - [NBD](docs/usage/nbd.ru.md) for kernel mounts
  - [QEMU and qemu-img](docs/usage/qemu.ru.md)
  - [NFS](docs/usage/nfs.ru.md) proxy for VMWare and similar
- Performance
  - [Understanding storage performance](docs/performance/understanding.ru.md)
  - [Theoretical maximum](docs/performance/theoretical.ru.md)
  - [Example comparison with Ceph](docs/performance/comparison1.ru.md)
## Author and License
Author: Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
Join the Vitastor Telegram chat: https://t.me/vitastor
License: VNPL 1.1 for server-side code and dual VNPL 1.1 + GPL 2.0+ for client code.
VNPL is a "network copyleft": Vitastor's own free copyleft license,
Vitastor Network Public License 1.1, based on GNU GPL 3.0 with an additional
"Network Interaction" clause that requires all programs
specially designed to be used with Vitastor and interacting
with it over a network to be distributed under VNPL or any other free license.
The idea of VNPL is to extend copyleft not only to modules explicitly
linked with the Vitastor code, but also to modules built as microservices
that interact with it over a network.
So, if you want to build a service based on Vitastor that contains
closed-source components interacting with Vitastor, you need a commercial
license from the author 😀.
No restrictions are imposed on Windows or any other software not developed *specifically*
for use with Vitastor.
Client libraries are dual-licensed under VNPL 1.1
and also under GNU GPL 2.0 or later. This is done for
compatibility with software such as QEMU and fio.
You can find the full text of VNPL 1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt),
and GPL 2.0 in the file [GPL-2.0.txt](GPL-2.0.txt).

README.md

@@ -1,100 +0,0 @@
# Vitastor
[Читать на русском](README-ru.md)
## The Idea
Make Clustered Block Storage Fast Again.
Vitastor is a distributed block SDS (software-defined storage), a direct replacement for Ceph RBD
and the internal SDSs of public clouds. Unlike them, however, Vitastor is both fast and simple.
Its only drawback is that it is still fairly young :-).
Vitastor is architecturally similar to Ceph, which means strong consistency,
primary replication, symmetric clustering and automatic data distribution over any
number of drives of any size with configurable redundancy (replication or erasure codes/XOR).
Vitastor targets SSD and SSD+HDD clusters with at least a 10 Gbit/s network, supports
TCP and RDMA, and with proper hardware may achieve 4 KB read and write latency as low as ~0.1 ms,
which is ~10 times faster than other popular SDSs like Ceph
or the internal systems of public clouds.
Vitastor supports the QEMU, NBD and NFS protocols and has OpenStack, Proxmox and Kubernetes drivers.
More drivers can be created easily.
Read more details below in the documentation.
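The XOR redundancy mentioned above can be illustrated with a minimal sketch. This is not Vitastor's own code: the `Chunk` alias and the `xor_chunks` helper are hypothetical names introduced only for this example, and a real OSD would work on aligned disk blocks, not small in-memory vectors. The idea is that the parity chunk is the byte-wise XOR of the data chunks, so any single lost chunk equals the XOR of all the remaining ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical type for this sketch: one chunk of an object as raw bytes.
using Chunk = std::vector<uint8_t>;

// XOR all chunks together byte by byte.
// Used both to compute the parity chunk from the data chunks
// and to rebuild one lost chunk from the surviving chunks plus parity.
Chunk xor_chunks(const std::vector<Chunk> & chunks)
{
    if (chunks.empty())
        return Chunk();
    Chunk out(chunks[0].size(), 0);
    for (const Chunk & c : chunks)
    {
        // All chunks must have the same length for XOR parity to work
        if (c.size() != out.size())
            throw std::invalid_argument("chunks must have equal size");
        for (size_t i = 0; i < out.size(); i++)
            out[i] ^= c[i];
    }
    return out;
}
```

With two data chunks `a` and `b`, `xor_chunks({a, b})` gives the parity chunk, and `xor_chunks({b, parity})` gives back `a` if the drive holding `a` is lost. This tolerates the loss of exactly one chunk, which is why more general erasure codes are needed to survive several simultaneous failures.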
## Talks and presentations
- DevOpsConf'2021: presentation ([in Russian](https://vitastor.io/presentation/devopsconf/devopsconf.html),
[in English](https://vitastor.io/presentation/devopsconf/devopsconf_en.html)),
[video](https://vitastor.io/presentation/devopsconf/talk.webm)
- Highload'2022: presentation ([in Russian](https://vitastor.io/presentation/highload/highload.html)),
[video](https://vitastor.io/presentation/highload/talk.webm)
## Documentation
- Introduction
  - [Quick Start](docs/intro/quickstart.en.md)
  - [Features](docs/intro/features.en.md)
  - [Architecture](docs/intro/architecture.en.md)
  - [Author and license](docs/intro/author.en.md)
- Installation
  - [Packages](docs/installation/packages.en.md)
  - [Proxmox](docs/installation/proxmox.en.md)
  - [OpenStack](docs/installation/openstack.en.md)
  - [Kubernetes CSI](docs/installation/kubernetes.en.md)
  - [Building from Source](docs/installation/source.en.md)
- Configuration
  - [Overview](docs/config.en.md)
  - Parameter Reference
    - [Common](docs/config/common.en.md)
    - [Network](docs/config/network.en.md)
    - [Global Disk Layout](docs/config/layout-cluster.en.md)
    - [OSD Disk Layout](docs/config/layout-osd.en.md)
    - [OSD Runtime Parameters](docs/config/osd.en.md)
    - [Monitor](docs/config/monitor.en.md)
  - [Pool configuration](docs/config/pool.en.md)
  - [Image metadata in etcd](docs/config/inode.en.md)
- Usage
  - [vitastor-cli](docs/usage/cli.en.md) (command-line interface)
  - [fio](docs/usage/fio.en.md) for benchmarks
  - [NBD](docs/usage/nbd.en.md) for kernel mounts
  - [QEMU and qemu-img](docs/usage/qemu.en.md)
  - [NFS](docs/usage/nfs.en.md) emulator for VMWare and similar
- Performance
  - [Understanding storage performance](docs/performance/understanding.en.md)
  - [Theoretical performance](docs/performance/theoretical.en.md)
  - [Example comparison with Ceph](docs/performance/comparison1.en.md)
## Author and License

Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+

Join Vitastor Telegram Chat: https://t.me/vitastor

All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
open-sourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.

Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.

Basically, you can't use the software in a proprietary environment to provide
its functionality to users without open-sourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.

Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with
GPL-licensed software like QEMU and fio.

You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

View File

@ -1,648 +0,0 @@
VITASTOR NETWORK PUBLIC LICENSE
Version 1.1, 6 February 2021
Copyright (C) 2021 Vitaliy Filippov <vitalif@yourcmc.ru>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The Vitastor Network Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
GNU General Public Licenses and Vitastor Network Public License are
intended to guarantee your freedom to share and change all versions
of a program--to make sure it remains free software for all its users.
When we speak of free software, we are referring to freedom, not
price. GNU General Public Licenses and Vitastor Network Public License
are designed to make sure that you have the freedom to distribute copies
of free software (and charge for them if you wish), that you receive
source code or can get it if you want it, that you can change the software
or use pieces of it in new free programs, and that you know you can do these
things.
Developers that use GNU General Public Licenses and Vitastor
Network Public License protect your rights with two steps:
(1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.
A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public. Even the GNU Affero General Public License
permits running a modified version in a closed environment where
public users only interact with it through a closed-source proxy, again,
without making the program and the proxy available to the public
for free.
The Vitastor Network Public License is designed specifically to
ensure that, in such cases, the modified program and the proxy stays
available to the community. It requires the operator of a network server to
provide the source code of the original program and all other programs
communicating with it running there to the users of that server.
Therefore, public use of a modified version, on a server accessible
directly or indirectly to the public, gives the public access to the source
code of the modified version.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 1 of the Vitastor Network Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction.
A "Proxy Program" means a separate program which is specially designed to
be used in conjunction with the covered work and interacts with it directly
or indirectly through any kind of API (application programming interfaces),
a computer network, an imitation of such network, or another Proxy Program
itself.
Notwithstanding any other provision of this License, if you provide any user
with an opportunity to interact with the covered work through a computer
network, an imitation of such network, or any number of "Proxy Programs",
you must prominently offer that user an opportunity to receive the
Corresponding Source of the covered work and all Proxy Programs from a
network server at no charge, through some standard or customary means of
facilitating copying of software. The Corresponding Source for the covered
work must be made available under the conditions of this License, and
the Corresponding Source for all Proxy Programs must be made available
under the conditions of either this License or any GPL-Compatible
Free Software License, as described by the Free Software Foundation
in their "GPL-Compatible License List".
14. Revised Versions of this License.
Vitastor Author may publish revised and/or new versions of
the Vitastor Network Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the Vitastor Network
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version. If the Program does not specify a version
number of the Vitastor Network Public License, you may choose any version
ever published.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the Vitastor Network Public License as published by
the Vitastor Author, either version 1 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
Vitastor Network Public License for more details.
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.


@ -1,6 +1,3 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <stdexcept> #include <stdexcept>
#include "allocator.h" #include "allocator.h"
@ -13,19 +10,19 @@ allocator::allocator(uint64_t blocks)
{ {
throw std::invalid_argument("blocks"); throw std::invalid_argument("blocks");
} }
uint64_t p2 = 1; uint64_t p2 = 1, total = 1;
total = 0;
while (p2 * 64 < blocks) while (p2 * 64 < blocks)
{ {
total += p2;
p2 = p2 * 64; p2 = p2 * 64;
total += p2;
} }
total -= p2;
total += (blocks+63) / 64; total += (blocks+63) / 64;
mask = new uint64_t[total]; mask = new uint64_t[2 + total];
size = free = blocks; size = free = blocks;
last_one_mask = (blocks % 64) == 0 last_one_mask = (blocks % 64) == 0
? UINT64_MAX ? UINT64_MAX
: (((uint64_t)1 << (blocks % 64)) - 1); : ~(UINT64_MAX << (64 - blocks % 64));
for (uint64_t i = 0; i < total; i++) for (uint64_t i = 0; i < total; i++)
{ {
mask[i] = 0; mask[i] = 0;
@ -37,21 +34,6 @@ allocator::~allocator()
delete[] mask; delete[] mask;
} }
bool allocator::get(uint64_t addr)
{
if (addr >= size)
{
return false;
}
uint64_t p2 = 1, offset = 0;
while (p2 * 64 < size)
{
offset += p2;
p2 = p2 * 64;
}
return ((mask[offset + addr/64] >> (addr % 64)) & 1);
}
void allocator::set(uint64_t addr, bool value) void allocator::set(uint64_t addr, bool value)
{ {
if (addr >= size) if (addr >= size)
@ -79,7 +61,7 @@ void allocator::set(uint64_t addr, bool value)
} }
if (value) if (value)
{ {
mask[last] = mask[last] | ((uint64_t)1 << bit); mask[last] = mask[last] | (1l << bit);
if (mask[last] != (!is_last || cur_addr/64 < size/64 if (mask[last] != (!is_last || cur_addr/64 < size/64
? UINT64_MAX : last_one_mask)) ? UINT64_MAX : last_one_mask))
{ {
@ -88,7 +70,7 @@ void allocator::set(uint64_t addr, bool value)
} }
else else
{ {
mask[last] = mask[last] & ~((uint64_t)1 << bit); mask[last] = mask[last] & ~(1l << bit);
} }
is_last = false; is_last = false;
if (p2 > 1) if (p2 > 1)
@ -114,10 +96,6 @@ uint64_t allocator::find_free()
uint64_t p2 = 1, offset = 0, addr = 0, f, i; uint64_t p2 = 1, offset = 0, addr = 0, f, i;
while (p2 < size) while (p2 < size)
{ {
if (offset+addr >= total)
{
return UINT64_MAX;
}
uint64_t m = mask[offset + addr]; uint64_t m = mask[offset + addr];
for (i = 0, f = 1; i < 64; i++, f <<= 1) for (i = 0, f = 1; i < 64; i++, f <<= 1)
{ {
@ -132,6 +110,11 @@ uint64_t allocator::find_free()
return UINT64_MAX; return UINT64_MAX;
} }
addr = (addr * 64) | i; addr = (addr * 64) | i;
if (addr >= size)
{
// No space
return UINT64_MAX;
}
offset += p2; offset += p2;
p2 = p2 * 64; p2 = p2 * 64;
} }
@ -142,35 +125,3 @@ uint64_t allocator::get_free_count()
{ {
return free; return free;
} }
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
if (start == 0)
{
if (len == 32*bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}
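The constructor hunk above sizes the hierarchical bitmap in 64-bit words and derives a mask for a partially filled last word. The following is a minimal sketch of that arithmetic, following the left-hand column of the diff. `mask_words` and `tail_mask` are hypothetical helper names that do not appear in the source, and the sketch assumes `blocks` is nonzero and small enough that `p2 * 64` does not overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the source): number of 64-bit words needed for
// all levels of the hierarchical bitmap, as computed on the left-hand side.
uint64_t mask_words(uint64_t blocks)
{
    assert(blocks > 0); // assumption: the real constructor rejects invalid sizes itself
    uint64_t p2 = 1, total = 0;
    // Every level above the leaves gets p2 words (1, 64, 4096, ...);
    // one bit there summarizes one 64-bit word of the level below.
    while (p2 * 64 < blocks)
    {
        total += p2;
        p2 *= 64;
    }
    // Leaf level: one bit per block, rounded up to whole words.
    return total + (blocks + 63) / 64;
}

// Hypothetical helper (not in the source): value of a "full" last leaf word,
// i.e. the low (blocks % 64) bits set, or all bits if blocks is a multiple of 64.
uint64_t tail_mask(uint64_t blocks)
{
    return (blocks % 64) == 0
        ? UINT64_MAX
        : (((uint64_t)1 << (blocks % 64)) - 1);
}
```

For `blocks % 64 == 3` the left-hand expression yields `0x7`, while the right-hand expression `~(UINT64_MAX << (64 - blocks % 64))` yields a mask of the low 61 bits, so the two columns are not equivalent. The `(uint64_t)1 << bit` form on the left also avoids shifting a signed `long`, which is undefined for `bit >= 32` on platforms where `long` is 32 bits wide.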


@ -1,6 +1,3 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
#include <stdint.h> #include <stdint.h>
@ -8,7 +5,6 @@
// Hierarchical bitmap allocator // Hierarchical bitmap allocator
class allocator class allocator
{ {
uint64_t total;
uint64_t size; uint64_t size;
uint64_t free; uint64_t free;
uint64_t last_one_mask; uint64_t last_one_mask;
@ -16,10 +12,7 @@ class allocator
public: public:
allocator(uint64_t blocks); allocator(uint64_t blocks);
~allocator(); ~allocator();
bool get(uint64_t addr);
void set(uint64_t addr, bool value); void set(uint64_t addr, bool value);
uint64_t find_free(); uint64_t find_free();
uint64_t get_free_count(); uint64_t get_free_count();
}; };
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);

61
blockstore.cpp Normal file

@ -0,0 +1,61 @@
#include "blockstore_impl.h"
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
{
impl = new blockstore_impl_t(config, ringloop);
}
blockstore_t::~blockstore_t()
{
delete impl;
}
void blockstore_t::loop()
{
impl->loop();
}
bool blockstore_t::is_started()
{
return impl->is_started();
}
bool blockstore_t::is_stalled()
{
return impl->is_stalled();
}
bool blockstore_t::is_safe_to_stop()
{
return impl->is_safe_to_stop();
}
void blockstore_t::enqueue_op(blockstore_op_t *op)
{
impl->enqueue_op(op, false);
}
void blockstore_t::enqueue_op_first(blockstore_op_t *op)
{
impl->enqueue_op(op, true);
}
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
{
return impl->unstable_writes;
}
uint32_t blockstore_t::get_block_size()
{
return impl->get_block_size();
}
uint64_t blockstore_t::get_block_count()
{
return impl->get_block_count();
}
uint32_t blockstore_t::get_disk_alignment()
{
return impl->get_disk_alignment();
}


@ -1,6 +1,3 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
#ifndef _LARGEFILE64_SOURCE #ifndef _LARGEFILE64_SOURCE
@ -9,38 +6,32 @@
#include <stdint.h> #include <stdint.h>
#include <string>
#include <map> #include <map>
#include <unordered_map> #include <unordered_map>
#include <functional> #include <functional>
#include "object_id.h" #include "object_id.h"
#include "ringloop.h" #include "ringloop.h"
#include "timerfd_manager.h"
// Memory alignment for direct I/O (usually 512 bytes) // Memory alignment for direct I/O (usually 512 bytes)
// All other alignments must be a multiple of this one // All other alignments must be a multiple of this one
#ifndef MEM_ALIGNMENT #define MEM_ALIGNMENT 512
#define MEM_ALIGNMENT 4096
#endif
// Default block size is 128 KB, current allowed range is 4K - 128M // Default block size is 128 KB, current allowed range is 4K - 128M
#define DEFAULT_ORDER 17 #define DEFAULT_ORDER 17
#define MIN_BLOCK_SIZE 4*1024 #define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024 #define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_BITMAP_GRANULARITY 4096
#define BS_OP_MIN 1 #define BS_OP_MIN 1
#define BS_OP_READ 1 #define BS_OP_READ 1
#define BS_OP_WRITE 2 #define BS_OP_WRITE 2
#define BS_OP_WRITE_STABLE 3 #define BS_OP_SYNC 3
#define BS_OP_SYNC 4 #define BS_OP_STABLE 4
#define BS_OP_STABLE 5 #define BS_OP_DELETE 5
#define BS_OP_DELETE 6 #define BS_OP_LIST 6
#define BS_OP_LIST 7 #define BS_OP_ROLLBACK 7
#define BS_OP_ROLLBACK 8 #define BS_OP_SYNC_STAB_ALL 8
#define BS_OP_SYNC_STAB_ALL 9 #define BS_OP_MAX 8
#define BS_OP_MAX 9
#define BS_OP_PRIVATE_DATA_SIZE 256 #define BS_OP_PRIVATE_DATA_SIZE 256
@ -48,9 +39,9 @@
Blockstore opcode documentation: Blockstore opcode documentation:
## BS_OP_READ / BS_OP_WRITE / BS_OP_WRITE_STABLE ## BS_OP_READ / BS_OP_WRITE
Read or write object data. WRITE_STABLE writes a version that doesn't require marking as stable. Read or write object data.
Input: Input:
- oid = requested object - oid = requested object
@ -59,15 +50,12 @@ Input:
- version == 0: read the last stable version, - version == 0: read the last stable version,
- version == UINT64_MAX: read the last version, - version == UINT64_MAX: read the last version,
- otherwise: read the newest version that is <= the specified version - otherwise: read the newest version that is <= the specified version
- reads aren't guaranteed to return data from previous unfinished writes
For writes: For writes:
- if version == 0, a new version is assigned automatically - if version == 0, a new version is assigned automatically
- if version != 0, it is assigned for the new write if possible, otherwise -EINVAL is returned - if version != 0, it is assigned for the new write if possible, otherwise -EINVAL is returned
- offset, len = offset and length within object. length may be zero, in that case - offset, len = offset and length within object. length may be zero, in that case
read operation only returns the version / write operation only bumps the version read operation only returns the version / write operation only bumps the version
- buf = pre-allocated buffer for data (read) / with data (write). may be NULL if len == 0. - buf = pre-allocated buffer for data (read) / with data (write). may be NULL if len == 0.
- bitmap = pointer to the new 'external' object bitmap data. Its part which is respective to the
write request is copied into the metadata area bitwise and stored there.
Output: Output:
- retval = number of bytes actually read/written or negative error number (-EINVAL or -ENOSPC) - retval = number of bytes actually read/written or negative error number (-EINVAL or -ENOSPC)
@ -104,7 +92,7 @@ Input:
- buf = pre-allocated obj_ver_id array <len> units long - buf = pre-allocated obj_ver_id array <len> units long
Output: Output:
- retval = 0 or negative error number (-EINVAL, -ENOENT if no such version or -EBUSY if not synced) - retval = 0 or negative error number (-EINVAL)
## BS_OP_SYNC_STAB_ALL ## BS_OP_SYNC_STAB_ALL
@ -122,8 +110,6 @@ Input:
- oid.stripe = PG alignment - oid.stripe = PG alignment
- len = PG count or 0 to list all objects - len = PG count or 0 to list all objects
- offset = PG number - offset = PG number
- oid.inode = min inode number or 0 to list all inodes
- version = max inode number or 0 to list all inodes
Output: Output:
- retval = total obj_ver_id count - retval = total obj_ver_id count
@ -145,7 +131,6 @@ struct blockstore_op_t
uint32_t offset; uint32_t offset;
uint32_t len; uint32_t len;
void *buf; void *buf;
void *bitmap;
int retval; int retval;
uint8_t private_data[BS_OP_PRIVATE_DATA_SIZE]; uint8_t private_data[BS_OP_PRIVATE_DATA_SIZE];
@ -159,7 +144,7 @@ class blockstore_t
{ {
blockstore_impl_t *impl; blockstore_impl_t *impl;
public: public:
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd); blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop);
~blockstore_t(); ~blockstore_t();
// Event loop // Event loop
@ -180,21 +165,16 @@ public:
// Submission // Submission
void enqueue_op(blockstore_op_t *op); void enqueue_op(blockstore_op_t *op);
// Simplified synchronous operation: get object bitmap & current version // Insert operation into the beginning of the queue
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL); // Intended for the OSD syncer "thread" to be able to stabilize something when the journal is full
void enqueue_op_first(blockstore_op_t *op);
// Get per-inode space usage statistics // Unstable writes are added here (map of object_id -> version)
std::map<uint64_t, uint64_t> & get_inode_space_stats(); std::unordered_map<object_id, uint64_t> & get_unstable_writes();
// Print diagnostics to stdout
void dump_diagnostics();
// FIXME rename to object_size // FIXME rename to object_size
uint32_t get_block_size(); uint32_t get_block_size();
uint64_t get_block_count(); uint64_t get_block_count();
uint64_t get_free_block_count();
uint64_t get_journal_size(); uint32_t get_disk_alignment();
uint32_t get_bitmap_granularity();
}; };
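The opcode documentation in this header describes three ways a read picks a version: `version == 0` reads the last stable version, `version == UINT64_MAX` reads the last version, and any other value reads the newest version that is `<=` the requested one. A minimal sketch of that rule follows, under loud assumptions: `select_read_version` and the `version -> stable flag` map are hypothetical and only model the documented behavior, not the actual blockstore data structures.

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical helper (not part of blockstore.h).
// `versions` maps each existing version of one object to a "stable" flag.
// Returns the version a read would target, or 0 if nothing matches.
uint64_t select_read_version(const std::map<uint64_t, bool> & versions, uint64_t requested)
{
    if (requested == 0)
    {
        // Last stable version: scan from newest to oldest
        for (auto it = versions.rbegin(); it != versions.rend(); it++)
        {
            if (it->second)
                return it->first;
        }
        return 0;
    }
    // upper_bound() returns the first version > requested, so the element
    // before it is the newest version <= requested. For UINT64_MAX this is
    // simply the last version in the map.
    auto it = versions.upper_bound(requested);
    if (it == versions.begin())
        return 0;
    return std::prev(it)->first;
}
```

An ordered map is used here only because it makes the "newest version not greater than X" lookup a single `upper_bound` call.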


@ -1,27 +1,16 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs) journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
{ {
this->bs = bs; this->bs = bs;
this->max_flusher_count = bs->max_flusher_count; this->flusher_count = flusher_count;
this->min_flusher_count = bs->min_flusher_count;
this->cur_flusher_count = bs->min_flusher_count;
this->target_flusher_count = bs->min_flusher_count;
dequeuing = false;
trimming = false;
active_flushers = 0; active_flushers = 0;
syncing_flushers = 0; sync_threshold = flusher_count == 1 ? 1 : flusher_count/2;
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval journal_trim_interval = sync_threshold;
flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable); journal_trim_counter = 0;
journal_trim_interval = 512; journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign(MEM_ALIGNMENT, bs->journal_block_size);
journal_trim_counter = bs->journal.flush_journal ? 1 : 0; co = new journal_flusher_co[flusher_count];
trim_wanted = bs->journal.flush_journal ? 1 : 0; for (int i = 0; i < flusher_count; i++)
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
co = new journal_flusher_co[max_flusher_count];
for (int i = 0; i < max_flusher_count; i++)
{ {
co[i].bs = bs; co[i].bs = bs;
co[i].flusher = this; co[i].flusher = this;
@ -66,36 +55,23 @@ journal_flusher_t::~journal_flusher_t()
bool journal_flusher_t::is_active() bool journal_flusher_t::is_active()
{ {
return active_flushers > 0 || dequeuing; return active_flushers > 0 || start_forced && flush_queue.size() > 0 || flush_queue.size() >= sync_threshold;
} }
void journal_flusher_t::loop() void journal_flusher_t::loop()
{ {
target_flusher_count = bs->write_iodepth*2; for (int i = 0; i < flusher_count; i++)
if (target_flusher_count < min_flusher_count)
target_flusher_count = min_flusher_count;
else if (target_flusher_count > max_flusher_count)
target_flusher_count = max_flusher_count;
if (target_flusher_count > cur_flusher_count)
cur_flusher_count = target_flusher_count;
else if (target_flusher_count < cur_flusher_count)
{ {
while (target_flusher_count < cur_flusher_count) if (!active_flushers && (start_forced ? !flush_queue.size() : (flush_queue.size() < sync_threshold)))
{ {
if (co[cur_flusher_count-1].wait_state) return;
break;
cur_flusher_count--;
} }
}
for (int i = 0; (active_flushers > 0 || dequeuing) && i < cur_flusher_count; i++)
co[i].loop(); co[i].loop();
}
} }
void journal_flusher_t::enqueue_flush(obj_ver_id ov) void journal_flusher_t::enqueue_flush(obj_ver_id ov)
{ {
#ifdef BLOCKSTORE_DEBUG
printf("enqueue_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid); auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end()) if (it != flush_versions.end())
{ {
@ -107,18 +83,10 @@ void journal_flusher_t::enqueue_flush(obj_ver_id ov)
flush_versions[ov.oid] = ov.version; flush_versions[ov.oid] = ov.version;
flush_queue.push_back(ov.oid); flush_queue.push_back(ov.oid);
} }
if (!dequeuing && (flush_queue.size() >= flusher_start_threshold || trim_wanted > 0))
{
dequeuing = true;
bs->ringloop->wakeup();
}
} }
void journal_flusher_t::unshift_flush(obj_ver_id ov, bool force) void journal_flusher_t::unshift_flush(obj_ver_id ov)
{ {
#ifdef BLOCKSTORE_DEBUG
printf("unshift_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid); auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end()) if (it != flush_versions.end())
{ {
@ -128,129 +96,16 @@ void journal_flusher_t::unshift_flush(obj_ver_id ov, bool force)
else else
{ {
flush_versions[ov.oid] = ov.version; flush_versions[ov.oid] = ov.version;
if (!force)
flush_queue.push_front(ov.oid);
}
if (force)
flush_queue.push_front(ov.oid); flush_queue.push_front(ov.oid);
if (force || !dequeuing && (flush_queue.size() >= flusher_start_threshold || trim_wanted > 0))
{
dequeuing = true;
bs->ringloop->wakeup();
} }
} }
void journal_flusher_t::remove_flush(object_id oid) void journal_flusher_t::force_start()
{ {
#ifdef BLOCKSTORE_DEBUG start_forced = true;
printf("undo_flush %lx:%lx\n", oid.inode, oid.stripe);
#endif
auto v_it = flush_versions.find(oid);
if (v_it != flush_versions.end())
{
flush_versions.erase(v_it);
for (auto q_it = flush_queue.begin(); q_it != flush_queue.end(); q_it++)
{
if (*q_it == oid)
{
flush_queue.erase(q_it);
break;
}
}
}
}
void journal_flusher_t::request_trim()
{
dequeuing = true;
trim_wanted++;
bs->ringloop->wakeup(); bs->ringloop->wakeup();
} }
void journal_flusher_t::mark_trim_possible()
{
if (trim_wanted > 0)
{
dequeuing = true;
journal_trim_counter++;
bs->ringloop->wakeup();
}
}
void journal_flusher_t::release_trim()
{
trim_wanted--;
}
void journal_flusher_t::dump_diagnostics()
{
const char *unflushable_type = "";
obj_ver_id unflushable = {};
// Try to find out if there is a flushable object for information
for (object_id cur_oid: flush_queue)
{
obj_ver_id cur = { .oid = cur_oid, .version = flush_versions[cur_oid] };
auto dirty_end = bs->dirty_db.find(cur);
if (dirty_end == bs->dirty_db.end())
{
// Already flushed
continue;
}
auto repeat_it = sync_to_repeat.find(cur.oid);
if (repeat_it != sync_to_repeat.end())
{
// Someone is already flushing it
unflushable_type = "locked,";
unflushable = cur;
break;
}
if (dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start))
{
// Object is more recent than possible to flush
bool found = try_find_older(dirty_end, cur);
if (!found)
{
unflushable_type = "dirty,";
unflushable = cur;
break;
}
}
unflushable_type = "ok,";
unflushable = cur;
break;
}
printf(
"Flusher: queued=%ld first=%s%lx:%lx trim_wanted=%d dequeuing=%d trimming=%d cur=%d target=%d active=%d syncing=%d\n",
flush_queue.size(), unflushable_type, unflushable.oid.inode, unflushable.oid.stripe,
trim_wanted, dequeuing, trimming, cur_flusher_count, target_flusher_count,
active_flushers, syncing_flushers
);
}
bool journal_flusher_t::try_find_older(std::map<obj_ver_id, dirty_entry>::iterator & dirty_end, obj_ver_id & cur)
{
bool found = false;
while (dirty_end != bs->dirty_db.begin())
{
dirty_end--;
if (dirty_end->first.oid != cur.oid)
{
break;
}
if (!(dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start)))
{
found = true;
cur.version = dirty_end->first.version;
break;
}
}
return found;
}
#define await_sqe(label) \ #define await_sqe(label) \
resume_##label:\ resume_##label:\
sqe = bs->get_sqe();\ sqe = bs->get_sqe();\
@ -261,7 +116,6 @@ bool journal_flusher_t::try_find_older(std::map<obj_ver_id, dirty_entry>::iterat
}\ }\
data = ((ring_data_t*)sqe->user_data); data = ((ring_data_t*)sqe->user_data);
// FIXME: Implement batch flushing
bool journal_flusher_co::loop() bool journal_flusher_co::loop()
{ {
// This is much better than implementing the whole function as an FSM // This is much better than implementing the whole function as an FSM
@ -300,24 +154,11 @@ bool journal_flusher_co::loop()
goto resume_17; goto resume_17;
else if (wait_state == 18) else if (wait_state == 18)
goto resume_18; goto resume_18;
else if (wait_state == 19)
goto resume_19;
else if (wait_state == 20)
goto resume_20;
else if (wait_state == 21)
goto resume_21;
resume_0: resume_0:
if (flusher->flush_queue.size() < flusher->min_flusher_count && !flusher->trim_wanted || if (!flusher->flush_queue.size() ||
!flusher->flush_queue.size() || !flusher->dequeuing) !flusher->start_forced && !flusher->active_flushers && flusher->flush_queue.size() < flusher->sync_threshold)
{ {
stop_flusher: flusher->start_forced = false;
if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
{
// Attempt forced trim
flusher->active_flushers++;
goto trim_journal;
}
flusher->dequeuing = false;
wait_state = 0; wait_state = 0;
return true; return true;
} }
@ -332,7 +173,7 @@ stop_flusher:
if (repeat_it != flusher->sync_to_repeat.end()) if (repeat_it != flusher->sync_to_repeat.end())
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Postpone %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version); printf("Postpone %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif #endif
// We don't flush different parts of history of the same object in parallel // We don't flush different parts of history of the same object in parallel
// So we check if someone is already flushing this object // So we check if someone is already flushing this object
@ -345,103 +186,42 @@ stop_flusher:
} }
else else
flusher->sync_to_repeat[cur.oid] = 0; flusher->sync_to_repeat[cur.oid] = 0;
if (dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start))
{
flusher->enqueue_flush(cur);
// We can't flush journal sectors that are still written to
// However, as we group flushes by oid, current oid may have older writes to flush!
// And it may even block writes if we don't flush the older version
// (if it's in the beginning of the journal)...
// So first try to find an older version of the same object to flush.
bool found = flusher->try_find_older(dirty_end, cur);
if (!found)
{
// Try other objects
flusher->sync_to_repeat.erase(cur.oid);
int search_left = flusher->flush_queue.size() - 1;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Flusher overran writers (%lx:%lx v%lu, dirty_start=%08lx) - searching for older flushes (%d left)\n", printf("Flushing %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
cur.oid.inode, cur.oid.stripe, cur.version, bs->journal.dirty_start, search_left);
#endif
while (search_left > 0)
{
cur.oid = flusher->flush_queue.front();
cur.version = flusher->flush_versions[cur.oid];
flusher->flush_queue.pop_front();
flusher->flush_versions.erase(cur.oid);
dirty_end = bs->dirty_db.find(cur);
if (dirty_end != bs->dirty_db.end())
{
if (dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start))
{
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu is too new: offset=%08lx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
#endif
flusher->enqueue_flush(cur);
}
else
{
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it != flusher->sync_to_repeat.end())
{
if (repeat_it->second < cur.version)
repeat_it->second = cur.version;
}
else
{
flusher->sync_to_repeat[cur.oid] = 0;
break;
}
}
}
search_left--;
}
if (search_left <= 0)
{
#ifdef BLOCKSTORE_DEBUG
printf("No older flushes, stopping\n");
#endif
goto stop_flusher;
}
}
}
#ifdef BLOCKSTORE_DEBUG
printf("Flushing %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif #endif
flusher->active_flushers++; flusher->active_flushers++;
resume_1: resume_1:
// Find it in clean_db
{
auto & clean_db = bs->clean_db_shard(cur.oid);
auto clean_it = clean_db.find(cur.oid);
old_clean_loc = (clean_it != clean_db.end() ? clean_it->second.location : UINT64_MAX);
}
// Scan dirty versions of the object // Scan dirty versions of the object
if (!scan_dirty(1)) if (!scan_dirty(1))
{ {
wait_state += 1; wait_state += 1;
return false; return false;
} }
// Writes and deletes shouldn't happen at the same time if (copy_count == 0 && clean_loc == UINT64_MAX && !has_delete && !has_empty)
assert(!has_writes || !has_delete);
if (!has_writes && !has_delete || has_delete && old_clean_loc == UINT64_MAX)
{ {
// Nothing to flush // Nothing to flush
bs->erase_dirty(dirty_start, std::next(dirty_end), clean_loc); flusher->active_flushers--;
goto release_oid; repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it != flusher->sync_to_repeat.end() && repeat_it->second > cur.version)
{
// Requeue version
flusher->unshift_flush({ .oid = cur.oid, .version = repeat_it->second });
}
flusher->sync_to_repeat.erase(repeat_it);
wait_state = 0;
goto resume_0;
} }
// Find it in clean_db
clean_it = bs->clean_db.find(cur.oid);
old_clean_loc = (clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX);
if (clean_loc == UINT64_MAX) if (clean_loc == UINT64_MAX)
{ {
if (old_clean_loc == UINT64_MAX) if (copy_count > 0 && has_delete || old_clean_loc == UINT64_MAX)
{ {
// Object not allocated. This is a bug. // Object not allocated. This is a bug.
char err[1024]; char err[1024];
snprintf( snprintf(
err, 1024, "BUG: Object %lx:%lx v%lu that we are trying to flush is not allocated on the data device", err, 1024, "BUG: Object %lu:%lu v%lu that we are trying to flush is not allocated on the data device",
cur.oid.inode, cur.oid.stripe, cur.version cur.oid.inode, cur.oid.stripe, cur.version
); );
throw std::runtime_error(err); throw std::runtime_error(err);
@ -489,26 +269,27 @@ resume_1:
if (bs->clean_entry_bitmap_size) if (bs->clean_entry_bitmap_size)
{ {
new_clean_bitmap = (bs->inmemory_meta new_clean_bitmap = (bs->inmemory_meta
? (uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry) ? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
: (uint8_t*)bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size)); : bs->clean_bitmap + (clean_loc >> bs->block_order)*bs->clean_entry_bitmap_size);
if (clean_init_bitmap) if (clean_init_bitmap)
{ {
memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size); memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size);
bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len, bs->bitmap_granularity); bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len);
} }
} }
for (it = v.begin(); it != v.end(); it++) for (it = v.begin(); it != v.end(); it++)
{ {
if (new_clean_bitmap) if (new_clean_bitmap)
{ {
bitmap_set(new_clean_bitmap, it->offset, it->len, bs->bitmap_granularity); bitmap_set(new_clean_bitmap, it->offset, it->len);
} }
await_sqe(4); await_sqe(4);
data->iov = (struct iovec){ it->buf, (size_t)it->len }; data->iov = (struct iovec){ it->buf, (size_t)it->len };
data->callback = simple_callback_w; data->callback = simple_callback_w;
my_uring_prep_writev( my_uring_prep_writev(
sqe, bs->data_fd, &data->iov, 1, bs->data_offset + clean_loc + it->offset sqe, bs->data_fd_index, &data->iov, 1, bs->data_offset + clean_loc + it->offset
); );
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
} }
// Sync data before writing metadata // Sync data before writing metadata
@ -535,58 +316,37 @@ resume_1:
wait_state = 5; wait_state = 5;
return false; return false;
} }
// zero out old metadata entry memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
memset((uint8_t*)meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
await_sqe(15); await_sqe(15);
data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size }; data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size };
data->callback = simple_callback_w; data->callback = simple_callback_w;
my_uring_prep_writev( my_uring_prep_writev(
sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + meta_old.sector sqe, bs->meta_fd_index, &data->iov, 1, bs->meta_offset + meta_old.sector
); );
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
} }
if (has_delete) if (has_delete)
{ {
clean_disk_entry *new_entry = (clean_disk_entry*)((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size); memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
{
printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx v%lu) while deleting %lx:%lx\n",
clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe,
new_entry->version, cur.oid.inode, cur.oid.stripe);
exit(1);
}
// zero out new metadata entry
memset((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
} }
else else
{ {
clean_disk_entry *new_entry = (clean_disk_entry*)((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size); clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
{
printf("Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %lu (%lx:%lx v%lu) with %lx:%lx v%lu\n",
clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, new_entry->version,
cur.oid.inode, cur.oid.stripe, cur.version);
exit(1);
}
new_entry->oid = cur.oid; new_entry->oid = cur.oid;
new_entry->version = cur.version; new_entry->version = cur.version;
if (!bs->inmemory_meta) if (!bs->inmemory_meta)
{ {
memcpy(&new_entry->bitmap, new_clean_bitmap, bs->clean_entry_bitmap_size); memcpy(&new_entry->bitmap, new_clean_bitmap, bs->clean_entry_bitmap_size);
} }
// copy latest external bitmap/attributes
if (bs->clean_entry_bitmap_size)
{
void *bmp_ptr = bs->clean_entry_bitmap_size > sizeof(void*) ? dirty_end->second.bitmap : &dirty_end->second.bitmap;
memcpy((uint8_t*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
}
} }
await_sqe(6); await_sqe(6);
data->iov = (struct iovec){ meta_new.buf, bs->meta_block_size }; data->iov = (struct iovec){ meta_new.buf, bs->meta_block_size };
data->callback = simple_callback_w; data->callback = simple_callback_w;
my_uring_prep_writev( my_uring_prep_writev(
sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + meta_new.sector sqe, bs->meta_fd_index, &data->iov, 1, bs->meta_offset + meta_new.sector
); );
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
resume_7: resume_7:
if (wait_count > 0) if (wait_count > 0)
@ -629,35 +389,13 @@ resume_1:
} }
// Update clean_db and dirty_db, free old data locations // Update clean_db and dirty_db, free old data locations
update_clean_db(); update_clean_db();
#ifdef BLOCKSTORE_DEBUG
printf("Flushed %lx:%lx v%lu (%d copies, wr:%d, del:%d), %ld left\n", cur.oid.inode, cur.oid.stripe, cur.version,
copy_count, has_writes, has_delete, flusher->flush_queue.size());
#endif
release_oid:
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it != flusher->sync_to_repeat.end() && repeat_it->second > cur.version)
{
// Requeue version
flusher->unshift_flush({ .oid = cur.oid, .version = repeat_it->second }, false);
}
flusher->sync_to_repeat.erase(repeat_it);
trim_journal:
// Clear unused part of the journal every <journal_trim_interval> flushes // Clear unused part of the journal every <journal_trim_interval> flushes
if (!((++flusher->journal_trim_counter) % flusher->journal_trim_interval) || flusher->trim_wanted > 0) if (!((++flusher->journal_trim_counter) % flusher->journal_trim_interval))
{ {
flusher->journal_trim_counter = 0; flusher->journal_trim_counter = 0;
new_trim_pos = bs->journal.get_trim_pos(); if (bs->journal.trim())
if (new_trim_pos != bs->journal.used_start)
{ {
resume_19: // Update journal "superblock"
// Wait for other coroutines trimming the journal, if any
if (flusher->trimming)
{
wait_state = 19;
return false;
}
flusher->trimming = true;
// First update journal "superblock" and only then update <used_start> in memory
await_sqe(12); await_sqe(12);
*((journal_entry_start*)flusher->journal_superblock) = { *((journal_entry_start*)flusher->journal_superblock) = {
.crc32 = 0, .crc32 = 0,
@ -665,13 +403,13 @@ resume_1:
.type = JE_START, .type = JE_START,
.size = sizeof(journal_entry_start), .size = sizeof(journal_entry_start),
.reserved = 0, .reserved = 0,
.journal_start = new_trim_pos, .journal_start = bs->journal.used_start,
.version = JOURNAL_VERSION,
}; };
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock); ((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size }; data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size };
data->callback = simple_callback_w; data->callback = simple_callback_w;
my_uring_prep_writev(sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset); my_uring_prep_writev(sqe, bs->journal_fd_index, &data->iov, 1, bs->journal.offset);
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
resume_13: resume_13:
if (wait_count > 0) if (wait_count > 0)
@ -679,34 +417,20 @@ resume_1:
wait_state = 13; wait_state = 13;
return false; return false;
} }
if (!bs->disable_journal_fsync)
{
await_sqe(20);
my_uring_prep_fsync(sqe, bs->journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = simple_callback_w;
resume_21:
if (wait_count > 0)
{
wait_state = 21;
return false;
}
}
bs->journal.used_start = new_trim_pos;
#ifdef BLOCKSTORE_DEBUG
printf("Journal trimmed to %08lx (next_free=%08lx)\n", bs->journal.used_start, bs->journal.next_free);
#endif
flusher->trimming = false;
}
if (bs->journal.flush_journal && !flusher->flush_queue.size())
{
assert(bs->journal.used_start == bs->journal.next_free);
printf("Journal flushed\n");
exit(0);
} }
} }
// All done // All done
#ifdef BLOCKSTORE_DEBUG
printf("Flushed %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
flusher->active_flushers--; flusher->active_flushers--;
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it != flusher->sync_to_repeat.end() && repeat_it->second > cur.version)
{
// Requeue version
flusher->unshift_flush({ .oid = cur.oid, .version = repeat_it->second });
}
flusher->sync_to_repeat.erase(repeat_it);
wait_state = 0; wait_state = 0;
goto resume_0; goto resume_0;
} }
@ -725,82 +449,82 @@ bool journal_flusher_co::scan_dirty(int wait_base)
copy_count = 0; copy_count = 0;
clean_loc = UINT64_MAX; clean_loc = UINT64_MAX;
has_delete = false; has_delete = false;
has_writes = false; has_empty = false;
skip_copy = false; skip_copy = false;
clean_init_bitmap = false; clean_init_bitmap = false;
while (1) while (1)
{ {
if (!IS_STABLE(dirty_it->second.state)) if (dirty_it->second.state == ST_J_STABLE && !skip_copy)
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: 0x%x",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
}
else if (IS_JOURNAL(dirty_it->second.state) && !skip_copy)
{ {
// First we submit all reads // First we submit all reads
has_writes = true; if (dirty_it->second.len == 0)
if (dirty_it->second.len != 0) {
has_empty = true;
}
else
{ {
offset = dirty_it->second.offset; offset = dirty_it->second.offset;
end_offset = dirty_it->second.offset + dirty_it->second.len; end_offset = dirty_it->second.offset + dirty_it->second.len;
it = v.begin(); it = v.begin();
while (end_offset > offset) while (1)
{ {
for (; it != v.end(); it++) for (; it != v.end(); it++)
if (it->offset+it->len > offset) if (it->offset >= offset)
break; break;
// If all items end before offset or if the found item starts after end_offset, just insert the buffer if (it == v.end() || it->offset > offset && it->len > 0)
// If (offset < it->offset < end_offset) insert (offset..it->offset) part
// If (it->offset <= offset <= it->offset+it->len) then just skip to it->offset+it->len
if (it == v.end() || it->offset > offset)
{ {
submit_offset = dirty_it->second.location + offset - dirty_it->second.offset; submit_offset = dirty_it->second.location + offset - dirty_it->second.offset;
submit_len = it == v.end() || it->offset >= end_offset ? end_offset-offset : it->offset-offset; submit_len = it == v.end() || it->offset >= end_offset ? end_offset-offset : it->offset-offset;
it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len, .buf = memalign_or_die(MEM_ALIGNMENT, submit_len) }); it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len, .buf = memalign(MEM_ALIGNMENT, submit_len) });
copy_count++; copy_count++;
if (bs->journal.inmemory) if (bs->journal.inmemory)
{ {
// Take it from memory // Take it from memory
memcpy(it->buf, (uint8_t*)bs->journal.buffer + submit_offset, submit_len); memcpy(v.back().buf, bs->journal.buffer + submit_offset, submit_len);
} }
else else
{ {
// Read it from disk // Read it from disk
await_sqe(0); await_sqe(0);
data->iov = (struct iovec){ it->buf, (size_t)submit_len }; data->iov = (struct iovec){ v.back().buf, (size_t)submit_len };
data->callback = simple_callback_r; data->callback = simple_callback_r;
my_uring_prep_readv( my_uring_prep_readv(
sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset + submit_offset sqe, bs->journal_fd_index, &data->iov, 1, bs->journal.offset + submit_offset
); );
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
} }
} }
offset = it->offset+it->len; offset = it->offset+it->len;
if (it == v.end()) if (it == v.end() || offset >= end_offset)
break; break;
} }
} }
} }
else if (IS_BIG_WRITE(dirty_it->second.state) && !skip_copy) else if (dirty_it->second.state == ST_D_STABLE && !skip_copy)
{ {
// There is an unflushed big write. Copy small writes in its position // There is an unflushed big write. Copy small writes in its position
has_writes = true;
clean_loc = dirty_it->second.location; clean_loc = dirty_it->second.location;
clean_init_bitmap = true; clean_init_bitmap = true;
clean_bitmap_offset = dirty_it->second.offset; clean_bitmap_offset = dirty_it->second.offset;
clean_bitmap_len = dirty_it->second.len; clean_bitmap_len = dirty_it->second.len;
skip_copy = true; skip_copy = true;
} }
else if (IS_DELETE(dirty_it->second.state) && !skip_copy) else if (dirty_it->second.state == ST_DEL_STABLE && !skip_copy)
{ {
// There is an unflushed delete // There is an unflushed delete
has_delete = true; has_delete = true;
skip_copy = true; skip_copy = true;
} }
else if (!IS_STABLE(dirty_it->second.state))
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lu:%lu v%lu state during flush: %d",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
}
dirty_start = dirty_it; dirty_start = dirty_it;
if (dirty_it == bs->dirty_db.begin()) if (dirty_it == bs->dirty_db.begin())
{ {
@ -829,14 +553,14 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
wr.pos = ((meta_loc >> bs->block_order) % (bs->meta_block_size / bs->clean_entry_size)); wr.pos = ((meta_loc >> bs->block_order) % (bs->meta_block_size / bs->clean_entry_size));
if (bs->inmemory_meta) if (bs->inmemory_meta)
{ {
wr.buf = (uint8_t*)bs->metadata_buffer + wr.sector; wr.buf = bs->metadata_buffer + wr.sector;
return true; return true;
} }
wr.it = flusher->meta_sectors.find(wr.sector); wr.it = flusher->meta_sectors.find(wr.sector);
if (wr.it == flusher->meta_sectors.end()) if (wr.it == flusher->meta_sectors.end())
{ {
// Not in memory yet, read it // Not in memory yet, read it
wr.buf = memalign_or_die(MEM_ALIGNMENT, bs->meta_block_size); wr.buf = memalign(MEM_ALIGNMENT, bs->meta_block_size);
wr.it = flusher->meta_sectors.emplace(wr.sector, (meta_sector_t){ wr.it = flusher->meta_sectors.emplace(wr.sector, (meta_sector_t){
.offset = wr.sector, .offset = wr.sector,
.len = bs->meta_block_size, .len = bs->meta_block_size,
@ -849,8 +573,9 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
data->callback = simple_callback_r; data->callback = simple_callback_r;
wr.submitted = true; wr.submitted = true;
my_uring_prep_readv( my_uring_prep_readv(
sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + wr.sector sqe, bs->meta_fd_index, &data->iov, 1, bs->meta_offset + wr.sector
); );
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
} }
else else
@ -866,29 +591,20 @@ void journal_flusher_co::update_clean_db()
if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc) if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc)
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n", printf("Free block %lu\n", old_clean_loc >> bs->block_order);
old_clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version,
clean_loc >> bs->block_order);
#endif #endif
bs->data_alloc->set(old_clean_loc >> bs->block_order, false); bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
} }
auto & clean_db = bs->clean_db_shard(cur.oid);
if (has_delete) if (has_delete)
{ {
auto clean_it = clean_db.find(cur.oid); auto clean_it = bs->clean_db.find(cur.oid);
clean_db.erase(clean_it); bs->clean_db.erase(clean_it);
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (delete)\n",
clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version);
#endif
bs->data_alloc->set(clean_loc >> bs->block_order, false); bs->data_alloc->set(clean_loc >> bs->block_order, false);
clean_loc = UINT64_MAX; clean_loc = UINT64_MAX;
} }
else else
{ {
clean_db[cur.oid] = { bs->clean_db[cur.oid] = {
.version = cur.version, .version = cur.version,
.location = clean_loc, .location = clean_loc,
}; };
@ -904,7 +620,7 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
goto resume_1; goto resume_1;
else if (wait_state == wait_base+2) else if (wait_state == wait_base+2)
goto resume_2; goto resume_2;
if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_data_fsync)) if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_journal_fsync))
{ {
cur_sync = flusher->syncs.end(); cur_sync = flusher->syncs.end();
while (cur_sync != flusher->syncs.begin()) while (cur_sync != flusher->syncs.begin())
@ -922,37 +638,33 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
}); });
sync_found: sync_found:
cur_sync->ready_count++; cur_sync->ready_count++;
flusher->syncing_flushers++; if (cur_sync->ready_count >= flusher->sync_threshold || !flusher->flush_queue.size())
resume_1:
if (!cur_sync->state)
{ {
if (flusher->syncing_flushers >= flusher->cur_flusher_count || !flusher->flush_queue.size()) // Sync batch is ready. Do it.
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd_index : bs->data_fd_index, IORING_FSYNC_DATASYNC);
sqe->flags |= IOSQE_FIXED_FILE;
cur_sync->state = 1;
wait_count++;
resume_1:
if (wait_count > 0)
{ {
// Sync batch is ready. Do it.
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1;
wait_count++;
resume_2:
if (wait_count > 0)
{
wait_state = 2;
return false;
}
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
else
{
// Wait until someone else sends and completes a sync.
wait_state = 1; wait_state = 1;
return false; return false;
} }
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
// Wait until someone else sends and completes a sync.
resume_2:
if (!cur_sync->state)
{
wait_state = 2;
return false;
} }
flusher->syncing_flushers--;
cur_sync->ready_count--; cur_sync->ready_count--;
if (cur_sync->ready_count == 0) if (cur_sync->ready_count == 0)
{ {
@ -961,3 +673,35 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
} }
return true; return true;
} }
void journal_flusher_co::bitmap_set(void *bitmap, uint64_t start, uint64_t len)
{
if (start == 0)
{
if (len == 32*bs->bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bs->bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bs->bitmap_granularity;
unsigned bit_end = ((start + len) + bs->bitmap_granularity - 1) / bs->bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}
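
Reviewer's sketch of the bit-range logic in `bitmap_set()` above, taken out of the class so it can be run on its own. The function name `bitmap_set_range`, the `std::vector<uint8_t>` storage and the explicit `granularity` parameter (standing in for `bs->bitmap_granularity`) are assumptions made for illustration, not part of the commit. The 32-bit and 64-bit fast paths are left out; only the general loop is shown.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical standalone version of the general loop in bitmap_set().
// Marks every granularity-sized unit touched by [start, start+len) as set.
void bitmap_set_range(std::vector<uint8_t> &bitmap, uint64_t start, uint64_t len, uint64_t granularity)
{
    assert(granularity > 0);
    uint64_t bit_start = start / granularity;
    // Round the end up so that a partially covered unit is still marked
    uint64_t bit_end = (start + len + granularity - 1) / granularity;
    assert(bit_end <= bitmap.size() * 8);
    while (bit_start < bit_end)
    {
        if (!(bit_start & 7) && bit_end >= bit_start + 8)
        {
            // Byte-aligned and at least 8 bits remain: fill a whole byte at once
            bitmap[bit_start / 8] = UINT8_MAX;
            bit_start += 8;
        }
        else
        {
            // Otherwise set a single bit
            bitmap[bit_start / 8] |= (uint8_t)(1u << (bit_start % 8));
            bit_start++;
        }
    }
}
```

With a granularity of 4096 bytes, a write covering units 3 through 12 sets the top five bits of byte 0 and the low five bits of byte 1.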


@ -1,6 +1,3 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
struct copy_buffer_t struct copy_buffer_t
{ {
uint64_t offset, len; uint64_t offset, len;
@ -48,7 +45,8 @@ class journal_flusher_co
std::map<object_id, uint64_t>::iterator repeat_it; std::map<object_id, uint64_t>::iterator repeat_it;
std::function<void(ring_data_t*)> simple_callback_r, simple_callback_w; std::function<void(ring_data_t*)> simple_callback_r, simple_callback_w;
bool skip_copy, has_delete, has_writes; bool skip_copy, has_delete, has_empty;
spp::sparse_hash_map<object_id, clean_entry>::iterator clean_it;
std::vector<copy_buffer_t> v; std::vector<copy_buffer_t> v;
std::vector<copy_buffer_t>::iterator it; std::vector<copy_buffer_t>::iterator it;
int copy_count; int copy_count;
@ -58,8 +56,6 @@ class journal_flusher_co
uint64_t clean_bitmap_offset, clean_bitmap_len; uint64_t clean_bitmap_offset, clean_bitmap_len;
void *new_clean_bitmap; void *new_clean_bitmap;
uint64_t new_trim_pos;
// local: scan_dirty() // local: scan_dirty()
uint64_t offset, end_offset, submit_offset, submit_len; uint64_t offset, end_offset, submit_offset, submit_len;
@ -68,6 +64,7 @@ class journal_flusher_co
bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base); bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base);
void update_clean_db(); void update_clean_db();
bool fsync_batch(bool fsync_meta, int wait_base); bool fsync_batch(bool fsync_meta, int wait_base);
void bitmap_set(void *bitmap, uint64_t start, uint64_t len);
public: public:
journal_flusher_co(); journal_flusher_co();
bool loop(); bool loop();
@ -76,39 +73,29 @@ public:
// Journal flusher itself // Journal flusher itself
class journal_flusher_t class journal_flusher_t
{ {
int trim_wanted = 0; bool start_forced = false;
bool dequeuing; int flusher_count;
int min_flusher_count, max_flusher_count, cur_flusher_count, target_flusher_count; int sync_threshold;
int flusher_start_threshold;
journal_flusher_co *co; journal_flusher_co *co;
blockstore_impl_t *bs; blockstore_impl_t *bs;
friend class journal_flusher_co; friend class journal_flusher_co;
int journal_trim_counter, journal_trim_interval; int journal_trim_counter, journal_trim_interval;
bool trimming;
void* journal_superblock; void* journal_superblock;
int active_flushers; int active_flushers;
int syncing_flushers;
std::list<flusher_sync_t> syncs; std::list<flusher_sync_t> syncs;
std::map<object_id, uint64_t> sync_to_repeat; std::map<object_id, uint64_t> sync_to_repeat;
std::map<uint64_t, meta_sector_t> meta_sectors; std::map<uint64_t, meta_sector_t> meta_sectors;
std::deque<object_id> flush_queue; std::deque<object_id> flush_queue;
std::map<object_id, uint64_t> flush_versions; std::map<object_id, uint64_t> flush_versions;
bool try_find_older(std::map<obj_ver_id, dirty_entry>::iterator & dirty_end, obj_ver_id & cur);
public: public:
journal_flusher_t(blockstore_impl_t *bs); journal_flusher_t(int flusher_count, blockstore_impl_t *bs);
~journal_flusher_t(); ~journal_flusher_t();
void loop(); void loop();
bool is_active(); bool is_active();
void mark_trim_possible(); void force_start();
void request_trim();
void release_trim();
void enqueue_flush(obj_ver_id oid); void enqueue_flush(obj_ver_id oid);
void unshift_flush(obj_ver_id oid, bool force); void unshift_flush(obj_ver_id oid);
void remove_flush(object_id oid);
void dump_diagnostics();
}; };
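
Reviewer's sketch of the suspend/resume pattern that `journal_flusher_co::loop()` relies on (the `wait_state = N; return false;` and `resume_N:` pairs in the hunks above). The type `tiny_co`, its `complete()` method and the `steps_done` counter are hypothetical and exist only to make the control flow observable; the real coroutine submits io_uring requests where this sketch just increments `wait_count`.

```cpp
#include <cassert>

// Hypothetical miniature of a hand-rolled coroutine:
// loop() returns false when it has to wait and true when it is finished.
struct tiny_co
{
    int wait_state = 0;  // which resume_N label to jump to on re-entry
    int wait_count = 0;  // number of pending completions
    int steps_done = 0;  // illustration only

    // Stand-in for a completion callback such as simple_callback_w
    void complete()
    {
        assert(wait_count > 0);
        wait_count--;
    }

    bool loop()
    {
        if (wait_state == 1)
            goto resume_1;
        else if (wait_state == 2)
            goto resume_2;
        // Step 1: "submit" an operation
        wait_count++;
    resume_1:
        if (wait_count > 0)
        {
            wait_state = 1;
            return false;
        }
        steps_done++;
        // Step 2: "submit" another operation
        wait_count++;
    resume_2:
        if (wait_count > 0)
        {
            wait_state = 2;
            return false;
        }
        steps_done++;
        wait_state = 0;
        return true;
    }
};
```

All state that must survive a suspension lives in members rather than locals, which is presumably why the header declares fields such as `offset`, `end_offset` and `it` under comments like `// local: scan_dirty()`.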

458
blockstore_impl.cpp Normal file

@ -0,0 +1,458 @@
#include "blockstore_impl.h"
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
{
assert(sizeof(blockstore_op_private_t) <= BS_OP_PRIVATE_DATA_SIZE);
this->ringloop = ringloop;
ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(ring_consumer);
initialized = 0;
zero_object = (uint8_t*)memalign(MEM_ALIGNMENT, block_size);
data_fd = meta_fd = journal.fd = -1;
parse_config(config);
try
{
open_data();
open_meta();
open_journal();
calc_lengths();
data_alloc = new allocator(block_count);
}
catch (std::exception & e)
{
if (data_fd >= 0)
close(data_fd);
if (meta_fd >= 0 && meta_fd != data_fd)
close(meta_fd);
if (journal.fd >= 0 && journal.fd != meta_fd)
close(journal.fd);
throw;
}
flusher = new journal_flusher_t(flusher_count, this);
}
blockstore_impl_t::~blockstore_impl_t()
{
delete data_alloc;
delete flusher;
free(zero_object);
ringloop->unregister_consumer(ring_consumer);
if (data_fd >= 0)
close(data_fd);
if (meta_fd >= 0 && meta_fd != data_fd)
close(meta_fd);
if (journal.fd >= 0 && journal.fd != meta_fd)
close(journal.fd);
if (metadata_buffer)
free(metadata_buffer);
if (clean_bitmap)
free(clean_bitmap);
}
bool blockstore_impl_t::is_started()
{
return initialized == 10;
}
bool blockstore_impl_t::is_stalled()
{
return queue_stall;
}
// main event loop - produce requests
void blockstore_impl_t::loop()
{
// FIXME: initialized == 10 is ugly
if (initialized != 10)
{
// read metadata, then journal
if (initialized == 0)
{
metadata_init_reader = new blockstore_init_meta(this);
initialized = 1;
}
if (initialized == 1)
{
int res = metadata_init_reader->loop();
if (!res)
{
delete metadata_init_reader;
metadata_init_reader = NULL;
journal_init_reader = new blockstore_init_journal(this);
initialized = 2;
}
}
if (initialized == 2)
{
int res = journal_init_reader->loop();
if (!res)
{
delete journal_init_reader;
journal_init_reader = NULL;
initialized = 10;
ringloop->wakeup();
}
}
}
else
{
// try to submit ops
unsigned initial_ring_space = ringloop->space_left();
auto cur_sync = in_progress_syncs.begin();
while (cur_sync != in_progress_syncs.end())
{
continue_sync(*cur_sync++);
}
auto cur = submit_queue.begin();
int has_writes = 0;
while (cur != submit_queue.end())
{
auto op_ptr = cur;
auto op = *(cur++);
// FIXME: This needs some simplification
// Writes should not block reads if the ring is not full and reads don't depend on them
// In all other cases we should stop submission
if (PRIV(op)->wait_for)
{
check_wait(op);
#ifdef BLOCKSTORE_DEBUG
if (PRIV(op)->wait_for)
{
printf("still waiting for %d\n", PRIV(op)->wait_for);
}
#endif
if (PRIV(op)->wait_for == WAIT_SQE)
{
break;
}
else if (PRIV(op)->wait_for)
{
if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_DELETE)
{
has_writes = 2;
}
continue;
}
}
unsigned ring_space = ringloop->space_left();
unsigned prev_sqe_pos = ringloop->save();
int dequeue_op = 0;
if (op->opcode == BS_OP_READ)
{
dequeue_op = dequeue_read(op);
}
else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_DELETE)
{
if (has_writes == 2)
{
// Some writes could not be submitted
break;
}
dequeue_op = dequeue_write(op);
has_writes = dequeue_op ? 1 : 2;
}
else if (op->opcode == BS_OP_SYNC)
{
// wait for all small writes to be submitted
// wait for all big writes to complete, submit data device fsync
// wait for the data device fsync to complete, then submit journal writes for big writes
// then submit an fsync operation
if (has_writes)
{
// Can't submit SYNC before previous writes
continue;
}
dequeue_op = dequeue_sync(op);
}
else if (op->opcode == BS_OP_STABLE)
{
dequeue_op = dequeue_stable(op);
}
else if (op->opcode == BS_OP_ROLLBACK)
{
dequeue_op = dequeue_rollback(op);
}
else if (op->opcode == BS_OP_LIST)
{
process_list(op);
dequeue_op = true;
}
if (dequeue_op)
{
submit_queue.erase(op_ptr);
}
else
{
ringloop->restore(prev_sqe_pos);
if (PRIV(op)->wait_for == WAIT_SQE)
{
PRIV(op)->wait_detail = 1 + ring_space;
// ring is full, stop submission
break;
}
}
}
if (!readonly)
{
flusher->loop();
}
int ret = ringloop->submit();
if (ret < 0)
{
throw std::runtime_error(std::string("io_uring_submit: ") + strerror(-ret));
}
if ((initial_ring_space - ringloop->space_left()) > 0)
{
live = true;
}
queue_stall = !live && !ringloop->get_loop_again();
live = false;
}
}
bool blockstore_impl_t::is_safe_to_stop()
{
// It's safe to stop blockstore when there are no in-flight operations,
// no in-progress syncs and flusher isn't doing anything
if (submit_queue.size() > 0 || in_progress_syncs.size() > 0 || !readonly && flusher->is_active())
{
return false;
}
if (unsynced_big_writes.size() > 0 || unsynced_small_writes.size() > 0)
{
if (!readonly && !stop_sync_submitted)
{
// We should sync the blockstore before unmounting
blockstore_op_t *op = new blockstore_op_t;
op->opcode = BS_OP_SYNC;
op->buf = NULL;
op->callback = [](blockstore_op_t *op)
{
delete op;
};
enqueue_op(op);
stop_sync_submitted = true;
}
return false;
}
return true;
}
void blockstore_impl_t::check_wait(blockstore_op_t *op)
{
if (PRIV(op)->wait_for == WAIT_SQE)
{
if (ringloop->space_left() < PRIV(op)->wait_detail)
{
// stop submission if there's still no free space
return;
}
PRIV(op)->wait_for = 0;
}
else if (PRIV(op)->wait_for == WAIT_IN_FLIGHT)
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = PRIV(op)->wait_detail,
});
if (dirty_it != dirty_db.end() && IS_IN_FLIGHT(dirty_it->second.state))
{
// do not submit
return;
}
PRIV(op)->wait_for = 0;
}
else if (PRIV(op)->wait_for == WAIT_JOURNAL)
{
if (journal.used_start == PRIV(op)->wait_detail)
{
// do not submit
return;
}
PRIV(op)->wait_for = 0;
}
else if (PRIV(op)->wait_for == WAIT_JOURNAL_BUFFER)
{
int next = ((journal.cur_sector + 1) % journal.sector_count);
if (journal.sector_info[next].usage_count > 0 ||
journal.sector_info[next].dirty)
{
// do not submit
return;
}
PRIV(op)->wait_for = 0;
}
else if (PRIV(op)->wait_for == WAIT_FREE)
{
if (!data_alloc->get_free_count() && !flusher->is_active())
{
return;
}
PRIV(op)->wait_for = 0;
}
else
{
throw std::runtime_error("BUG: op->wait_for value is unexpected");
}
}
void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
{
if (op->opcode < BS_OP_MIN || op->opcode > BS_OP_MAX ||
((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE) && (
op->offset >= block_size ||
op->len > block_size-op->offset ||
(op->len % disk_alignment)
)) ||
readonly && op->opcode != BS_OP_READ ||
first && op->opcode == BS_OP_WRITE)
{
// Basic verification not passed
op->retval = -EINVAL;
op->callback(op);
return;
}
if (op->opcode == BS_OP_SYNC_STAB_ALL)
{
std::function<void(blockstore_op_t*)> *old_callback = new std::function<void(blockstore_op_t*)>(op->callback);
op->opcode = BS_OP_SYNC;
op->callback = [this, old_callback](blockstore_op_t *op)
{
if (op->retval >= 0 && unstable_writes.size() > 0)
{
op->opcode = BS_OP_STABLE;
op->len = unstable_writes.size();
obj_ver_id *vers = new obj_ver_id[op->len];
op->buf = vers;
int i = 0;
for (auto it = unstable_writes.begin(); it != unstable_writes.end(); it++, i++)
{
vers[i] = {
.oid = it->first,
.version = it->second,
};
}
unstable_writes.clear();
op->callback = [this, old_callback](blockstore_op_t *op)
{
obj_ver_id *vers = (obj_ver_id*)op->buf;
delete[] vers;
op->buf = NULL;
(*old_callback)(op);
delete old_callback;
};
this->enqueue_op(op);
}
else
{
(*old_callback)(op);
delete old_callback;
}
};
}
if (op->opcode == BS_OP_WRITE && !enqueue_write(op))
{
op->callback(op);
return;
}
if (0 && op->opcode == BS_OP_SYNC && immediate_commit)
{
op->retval = 0;
op->callback(op);
return;
}
// Call constructor without allocating memory. We'll call destructor before returning op back
new ((void*)op->private_data) blockstore_op_private_t;
PRIV(op)->wait_for = 0;
PRIV(op)->sync_state = 0;
PRIV(op)->pending_ops = 0;
if (!first)
{
submit_queue.push_back(op);
}
else
{
submit_queue.push_front(op);
}
ringloop->wakeup();
}
void blockstore_impl_t::process_list(blockstore_op_t *op)
{
// Count objects
uint32_t list_pg = op->offset;
uint32_t pg_count = op->len;
uint64_t parity_block_size = op->oid.stripe;
if (pg_count != 0 && (parity_block_size < MIN_BLOCK_SIZE || list_pg >= pg_count))
{
op->retval = -EINVAL;
FINISH_OP(op);
return;
}
uint64_t stable_count = 0;
if (pg_count > 0)
{
for (auto it = clean_db.begin(); it != clean_db.end(); it++)
{
uint32_t pg = (it->first.inode + it->first.stripe / parity_block_size) % pg_count;
if (pg == list_pg)
{
stable_count++;
}
}
}
else
{
stable_count = clean_db.size();
}
uint64_t total_count = stable_count;
for (auto it = dirty_db.begin(); it != dirty_db.end(); it++)
{
if (!pg_count || ((it->first.oid.inode + it->first.oid.stripe / parity_block_size) % pg_count) == list_pg)
{
if (IS_STABLE(it->second.state))
{
stable_count++;
}
total_count++;
}
}
// Allocate memory
op->version = stable_count;
op->retval = total_count;
op->buf = malloc(sizeof(obj_ver_id) * total_count);
if (!op->buf)
{
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
obj_ver_id *vers = (obj_ver_id*)op->buf;
int i = 0;
for (auto it = clean_db.begin(); it != clean_db.end(); it++)
{
if (!pg_count || ((it->first.inode + it->first.stripe / parity_block_size) % pg_count) == list_pg)
{
vers[i++] = {
.oid = it->first,
.version = it->second.version,
};
}
}
int j = stable_count;
for (auto it = dirty_db.begin(); it != dirty_db.end(); it++)
{
if (!pg_count || ((it->first.oid.inode + it->first.oid.stripe / parity_block_size) % pg_count) == list_pg)
{
if (IS_STABLE(it->second.state))
{
vers[i++] = it->first;
}
else
{
vers[j++] = it->first;
}
}
}
FINISH_OP(op);
}
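
Reviewer's sketch of the output layout that `process_list()` above appears to produce: entries are filtered by placement group with `(inode + stripe / parity_block_size) % pg_count`, stable entries fill the front of the buffer, and unstable ones are written starting at index `stable_count`. The `listed_object` struct and both function names are hypothetical, and a single input vector replaces the separate `clean_db` and `dirty_db` containers.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flattened entry; the real code walks clean_db and dirty_db
struct listed_object
{
    uint64_t inode, stripe, version;
    bool stable;
};

// pg_count == 0 means "list everything"; the short-circuit also avoids
// dividing by parity_block_size when no filter is requested
bool in_pg(const listed_object &o, uint64_t parity_block_size, uint32_t pg_count, uint32_t list_pg)
{
    return !pg_count || ((o.inode + o.stripe / parity_block_size) % pg_count) == list_pg;
}

// Fills `out` with stable entries first, then unstable ones.
// Returns the number of stable entries (what process_list stores in op->version).
uint64_t list_stable_first(const std::vector<listed_object> &objects, uint64_t parity_block_size,
    uint32_t pg_count, uint32_t list_pg, std::vector<listed_object> &out)
{
    uint64_t stable_count = 0, total_count = 0;
    // First pass: count, so the position of the stable/unstable boundary is known
    for (const auto &o : objects)
    {
        if (in_pg(o, parity_block_size, pg_count, list_pg))
        {
            total_count++;
            if (o.stable)
                stable_count++;
        }
    }
    out.assign(total_count, listed_object{});
    // Second pass: i walks the stable region, j walks the unstable region
    uint64_t i = 0, j = stable_count;
    for (const auto &o : objects)
    {
        if (in_pg(o, parity_block_size, pg_count, list_pg))
            out[o.stable ? i++ : j++] = o;
    }
    return stable_count;
}
```

The two-pass approach lets a caller split the result at `stable_count` without a per-entry flag, at the cost of iterating the input twice.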


@ -1,16 +1,14 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
#include "blockstore.h" #include "blockstore.h"
#include "timerfd_interval.h"
#include <sys/types.h> #include <sys/types.h>
#include <sys/ioctl.h> #include <sys/ioctl.h>
#include <sys/stat.h> #include <sys/stat.h>
#include <fcntl.h> #include <fcntl.h>
#include <time.h>
#include <unistd.h> #include <unistd.h>
#include <malloc.h>
#include <linux/fs.h> #include <linux/fs.h>
#include <vector> #include <vector>
@ -18,50 +16,43 @@
#include <deque> #include <deque>
#include <new> #include <new>
#include "cpp-btree/btree_map.h" #include "sparsepp/sparsepp/spp.h"
#include "malloc_or_die.h"
#include "allocator.h" #include "allocator.h"
//#define BLOCKSTORE_DEBUG //#define BLOCKSTORE_DEBUG
// States are not stored on disk. Instead, they're deduced from the journal // States are not stored on disk. Instead, they're deduced from the journal
// FIXME: Rename to BS_ST_*
#define BS_ST_SMALL_WRITE 0x01 #define ST_J_IN_FLIGHT 1
#define BS_ST_BIG_WRITE 0x02 #define ST_J_SUBMITTED 2
#define BS_ST_DELETE 0x03 #define ST_J_WRITTEN 3
#define ST_J_SYNCED 4
#define ST_J_STABLE 5
#define BS_ST_WAIT_DEL 0x10 #define ST_D_IN_FLIGHT 15
#define BS_ST_WAIT_BIG 0x20 #define ST_D_SUBMITTED 16
#define BS_ST_IN_FLIGHT 0x30 #define ST_D_WRITTEN 17
#define BS_ST_SUBMITTED 0x40 #define ST_D_META_WRITTEN 19
#define BS_ST_WRITTEN 0x50 #define ST_D_META_SYNCED 20
#define BS_ST_SYNCED 0x60 #define ST_D_STABLE 21
#define BS_ST_STABLE 0x70
#define BS_ST_INSTANT 0x100 #define ST_DEL_IN_FLIGHT 31
#define ST_DEL_SUBMITTED 32
#define ST_DEL_WRITTEN 33
#define ST_DEL_SYNCED 34
#define ST_DEL_STABLE 35
#define IMMEDIATE_NONE 0 #define ST_CURRENT 48
#define IMMEDIATE_SMALL 1
#define IMMEDIATE_ALL 2
#define BS_ST_TYPE_MASK 0x0F #define IS_IN_FLIGHT(st) (st == ST_J_IN_FLIGHT || st == ST_D_IN_FLIGHT || st == ST_DEL_IN_FLIGHT || st == ST_J_SUBMITTED || st == ST_D_SUBMITTED || st == ST_DEL_SUBMITTED)
#define BS_ST_WORKFLOW_MASK 0xF0 #define IS_STABLE(st) (st == ST_J_STABLE || st == ST_D_STABLE || st == ST_DEL_STABLE || st == ST_CURRENT)
#define IS_IN_FLIGHT(st) (((st) & 0xF0) <= BS_ST_SUBMITTED) #define IS_SYNCED(st) (IS_STABLE(st) || st == ST_J_SYNCED || st == ST_D_META_SYNCED || st == ST_DEL_SYNCED)
#define IS_STABLE(st) (((st) & 0xF0) == BS_ST_STABLE) #define IS_JOURNAL(st) (st >= ST_J_SUBMITTED && st <= ST_J_STABLE)
#define IS_SYNCED(st) (((st) & 0xF0) >= BS_ST_SYNCED) #define IS_BIG_WRITE(st) (st >= ST_D_SUBMITTED && st <= ST_D_STABLE)
#define IS_JOURNAL(st) (((st) & 0x0F) == BS_ST_SMALL_WRITE) #define IS_DELETE(st) (st >= ST_DEL_SUBMITTED && st <= ST_DEL_STABLE)
#define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE) #define IS_UNSYNCED(st) (st >= ST_J_SUBMITTED && st <= ST_J_WRITTEN || st >= ST_D_SUBMITTED && st <= ST_D_META_WRITTEN || st >= ST_DEL_SUBMITTED && st <= ST_DEL_WRITTEN)
#define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)
#define BS_SUBMIT_CHECK_SQES(n) \
if (ringloop->sqes_left() < (n))\
{\
/* Pause until there are more requests available */\
PRIV(op)->wait_detail = (n);\
PRIV(op)->wait_for = WAIT_SQE;\
return 0;\
}
#define BS_SUBMIT_GET_SQE(sqe, data) \ #define BS_SUBMIT_GET_SQE(sqe, data) \
BS_SUBMIT_GET_ONLY_SQE(sqe); \ BS_SUBMIT_GET_ONLY_SQE(sqe); \
@ -72,7 +63,6 @@
if (!sqe)\ if (!sqe)\
{\ {\
/* Pause until there are more requests available */\ /* Pause until there are more requests available */\
PRIV(op)->wait_detail = 1;\
PRIV(op)->wait_for = WAIT_SQE;\ PRIV(op)->wait_for = WAIT_SQE;\
return 0;\ return 0;\
} }
@ -82,32 +72,13 @@
if (!sqe)\ if (!sqe)\
{\ {\
/* Pause until there are more requests available */\ /* Pause until there are more requests available */\
PRIV(op)->wait_detail = 1;\
PRIV(op)->wait_for = WAIT_SQE;\ PRIV(op)->wait_for = WAIT_SQE;\
return 0;\ return 0;\
} }
#include "blockstore_journal.h" #include "blockstore_journal.h"
// "VITAstor" // 24 bytes + block bitmap per "clean" entry on disk with fixed metadata tables
#define BLOCKSTORE_META_MAGIC 0x726F747341544956l
#define BLOCKSTORE_META_VERSION 1
// metadata header (superblock)
// FIXME: After adding the OSD superblock, add a key to metadata
// and journal headers to check if they belong to the same OSD
struct __attribute__((__packed__)) blockstore_meta_header_t
{
uint64_t zero;
uint64_t magic;
uint64_t version;
uint32_t meta_block_size;
uint32_t data_block_size;
uint32_t bitmap_granularity;
};
// 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
// per "clean" entry on disk with fixed metadata tables
// FIXME: maybe add crc32's to metadata // FIXME: maybe add crc32's to metadata
struct __attribute__((__packed__)) clean_disk_entry struct __attribute__((__packed__)) clean_disk_entry
{ {
@ -123,7 +94,7 @@ struct __attribute__((__packed__)) clean_entry
uint64_t location; uint64_t location;
}; };
// 64 = 24 + 40 bytes per dirty entry in memory (obj_ver_id => dirty_entry) // 56 = 24 + 32 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
struct __attribute__((__packed__)) dirty_entry struct __attribute__((__packed__)) dirty_entry
{ {
uint32_t state; uint32_t state;
@ -132,7 +103,6 @@ struct __attribute__((__packed__)) dirty_entry
uint32_t offset; // data offset within object (stripe) uint32_t offset; // data offset within object (stripe)
uint32_t len; // data length uint32_t len; // data length
uint64_t journal_sector; // journal sector used for this entry uint64_t journal_sector; // journal sector used for this entry
void* bitmap; // either external bitmap itself when it fits, or a pointer to it when it doesn't
}; };
// - Sync must be submitted after previous writes/deletes (not before!) // - Sync must be submitted after previous writes/deletes (not before!)
@ -154,6 +124,8 @@ struct __attribute__((__packed__)) dirty_entry
// Suspend operation until there are more free SQEs // Suspend operation until there are more free SQEs
#define WAIT_SQE 1 #define WAIT_SQE 1
// Suspend operation until version <wait_detail> of object <oid> is written
#define WAIT_IN_FLIGHT 2
// Suspend operation until there are <wait_detail> bytes of free space in the journal on disk // Suspend operation until there are <wait_detail> bytes of free space in the journal on disk
#define WAIT_JOURNAL 3 #define WAIT_JOURNAL 3
// Suspend operation until the next journal sector buffer is free // Suspend operation until the next journal sector buffer is free
@ -167,7 +139,7 @@ struct fulfill_read_t
}; };
#define PRIV(op) ((blockstore_op_private_t*)(op)->private_data) #define PRIV(op) ((blockstore_op_private_t*)(op)->private_data)
#define FINISH_OP(op) PRIV(op)->~blockstore_op_private_t(); std::function<void (blockstore_op_t*)>(op->callback)(op) #define FINISH_OP(op) PRIV(op)->~blockstore_op_private_t(); op->callback(op)
struct blockstore_op_private_t struct blockstore_op_private_t
{ {
@ -175,46 +147,29 @@ struct blockstore_op_private_t
int wait_for; int wait_for;
uint64_t wait_detail; uint64_t wait_detail;
int pending_ops; int pending_ops;
int op_state;
// Read // Read
std::vector<fulfill_read_t> read_vec; std::vector<fulfill_read_t> read_vec;
// Sync, write // Sync, write
int min_flushed_journal_sector, max_flushed_journal_sector; uint64_t min_used_journal_sector, max_used_journal_sector;
// Write // Write
struct iovec iov_zerofill[3]; struct iovec iov_zerofill[3];
// Warning: must not have a default value here because it's written to before calling constructor in blockstore_write.cpp O_o
uint64_t real_version;
timespec tv_begin;
// Sync // Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes; std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
int sync_small_checked, sync_big_checked; int sync_small_checked, sync_big_checked;
std::list<blockstore_op_t*>::iterator in_progress_ptr;
int sync_state, prev_sync_count;
}; };
// https://github.com/algorithm-ninja/cpp-btree
// https://github.com/greg7mdp/sparsepp/ was used previously, but it was TERRIBLY slow after resizing
// with sparsepp, random reads dropped to ~700 iops very fast with just as much as ~32k objects in the DB
typedef btree::btree_map<object_id, clean_entry> blockstore_clean_db_t;
typedef std::map<obj_ver_id, dirty_entry> blockstore_dirty_db_t; typedef std::map<obj_ver_id, dirty_entry> blockstore_dirty_db_t;
#include "blockstore_init.h" #include "blockstore_init.h"
#include "blockstore_flush.h" #include "blockstore_flush.h"
typedef uint32_t pool_id_t;
typedef uint64_t pool_pg_id_t;
#define POOL_ID_BITS 16
struct pool_shard_settings_t
{
uint32_t pg_count;
uint32_t pg_stripe_size;
};
class blockstore_impl_t class blockstore_impl_t
{ {
/******* OPTIONS *******/ /******* OPTIONS *******/
@ -222,49 +177,34 @@ class blockstore_impl_t
uint32_t block_size; uint32_t block_size;
uint64_t meta_offset; uint64_t meta_offset;
uint64_t data_offset; uint64_t data_offset;
uint64_t cfg_journal_size, cfg_data_size; uint64_t cfg_journal_size;
// Required write alignment and journal/metadata/data areas' location alignment // Required write alignment and journal/metadata/data areas' location alignment
uint32_t disk_alignment = 4096; uint32_t disk_alignment = 512;
// Journal block size - minimum_io_size of the journal device is the best choice // Journal block size - minimum_io_size of the journal device is the best choice
uint64_t journal_block_size = 4096; uint64_t journal_block_size = 512;
// Metadata block size - minimum_io_size of the metadata device is the best choice // Metadata block size - minimum_io_size of the metadata device is the best choice
uint64_t meta_block_size = 4096; uint64_t meta_block_size = 512;
// Sparse write tracking granularity. 4 KB is a good choice. Must be a multiple of disk_alignment // Sparse write tracking granularity. 4 KB is a good choice. Must be a multiple of disk_alignment
uint64_t bitmap_granularity = 4096; uint64_t bitmap_granularity = 4096;
bool readonly = false; bool readonly = false;
// By default, Blockstore locks all opened devices exclusively. This option can be used to disable locking
bool disable_flock = false;
// It is safe to disable fsync() if drive write cache is writethrough // It is safe to disable fsync() if drive write cache is writethrough
bool disable_data_fsync = false, disable_meta_fsync = false, disable_journal_fsync = false; bool disable_data_fsync = false, disable_meta_fsync = false, disable_journal_fsync = false;
// Enable if you want every operation to be executed with an "implicit fsync" // Enable if you want every operation to be executed with an "implicit fsync"
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs // FIXME Not implemented yet
int immediate_commit = IMMEDIATE_NONE; bool immediate_commit = false;
bool inmemory_meta = false; bool inmemory_meta = false;
// Maximum and minimum flusher count int flusher_count;
unsigned max_flusher_count, min_flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
// Enable small (journaled) write throttling, useful for the SSD+HDD case
bool throttle_small_writes = false;
// Target data device iops, bandwidth and parallelism for throttling (100/100/1 is the default for HDD)
int throttle_target_iops = 100;
int throttle_target_mbs = 100;
int throttle_target_parallelism = 1;
// Minimum difference in microseconds between target and real execution times to throttle the response
int throttle_threshold_us = 50;
// Maximum number of LIST operations to be processed between
int single_tick_list_limit = 1;
/******* END OF OPTIONS *******/ /******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer; struct ring_consumer_t ring_consumer;
std::map<pool_id_t, pool_shard_settings_t> clean_db_settings; // Another option is https://github.com/algorithm-ninja/cpp-btree
std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards; spp::sparse_hash_map<object_id, clean_entry> clean_db;
uint8_t *clean_bitmap = NULL; uint8_t *clean_bitmap = NULL;
blockstore_dirty_db_t dirty_db; blockstore_dirty_db_t dirty_db;
std::vector<blockstore_op_t*> submit_queue; std::list<blockstore_op_t*> submit_queue; // FIXME: funny thing is that vector is better here
std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes; std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
int unsynced_big_write_count = 0; std::list<blockstore_op_t*> in_progress_syncs; // ...and probably here, too
allocator *data_alloc = NULL; allocator *data_alloc = NULL;
uint8_t *zero_object; uint8_t *zero_object;
@ -276,17 +216,15 @@ class blockstore_impl_t
int data_fd; int data_fd;
uint64_t meta_size, meta_area, meta_len; uint64_t meta_size, meta_area, meta_len;
uint64_t data_size, data_len; uint64_t data_size, data_len;
uint64_t data_device_sect, meta_device_sect, journal_device_sect; int meta_fd_index, data_fd_index, journal_fd_index;
void *metadata_buffer = NULL; void *metadata_buffer = NULL;
struct journal_t journal; struct journal_t journal;
journal_flusher_t *flusher; journal_flusher_t *flusher;
int write_iodepth = 0;
bool live = false, queue_stall = false; bool live = false, queue_stall = false;
ring_loop_t *ringloop; ring_loop_t *ringloop;
timerfd_manager_t *tfd;
bool stop_sync_submitted; bool stop_sync_submitted;
@ -297,7 +235,7 @@ class blockstore_impl_t
friend class blockstore_init_meta; friend class blockstore_init_meta;
friend class blockstore_init_journal; friend class blockstore_init_journal;
friend struct blockstore_journal_check_t; friend class blockstore_journal_check_t;
friend class journal_flusher_t; friend class journal_flusher_t;
friend class journal_flusher_co; friend class journal_flusher_co;
@ -306,14 +244,6 @@ class blockstore_impl_t
void open_data(); void open_data();
void open_meta(); void open_meta();
void open_journal(); void open_journal();
uint8_t* get_clean_entry_bitmap(uint64_t block_loc, int offset);
blockstore_clean_db_t& clean_db_shard(object_id oid);
void reshard_clean_db(pool_id_t pool_id, uint32_t pg_count, uint32_t pg_stripe_size);
// Journaling
void prepare_journal_sector_write(int sector, blockstore_op_t *op);
void handle_journal_write(ring_data_t *data, uint64_t flush_id);
// Asynchronous init // Asynchronous init
int initialized; int initialized;
@ -333,27 +263,27 @@ class blockstore_impl_t
// Write // Write
bool enqueue_write(blockstore_op_t *op); bool enqueue_write(blockstore_op_t *op);
void cancel_all_writes(blockstore_op_t *op, blockstore_dirty_db_t::iterator dirty_it, int retval);
int dequeue_write(blockstore_op_t *op); int dequeue_write(blockstore_op_t *op);
int dequeue_del(blockstore_op_t *op); int dequeue_del(blockstore_op_t *op);
int continue_write(blockstore_op_t *op); void ack_write(blockstore_op_t *op);
void release_journal_sectors(blockstore_op_t *op); void release_journal_sectors(blockstore_op_t *op);
void handle_write_event(ring_data_t *data, blockstore_op_t *op); void handle_write_event(ring_data_t *data, blockstore_op_t *op);
// Sync // Sync
int continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync); int dequeue_sync(blockstore_op_t *op);
void ack_sync(blockstore_op_t *op); void handle_sync_event(ring_data_t *data, blockstore_op_t *op);
int continue_sync(blockstore_op_t *op);
void ack_one_sync(blockstore_op_t *op);
int ack_sync(blockstore_op_t *op);
// Stabilize // Stabilize
int dequeue_stable(blockstore_op_t *op); int dequeue_stable(blockstore_op_t *op);
int continue_stable(blockstore_op_t *op); void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
void stabilize_object(object_id oid, uint64_t max_ver); void stabilize_object(object_id oid, uint64_t max_ver);
// Rollback // Rollback
int dequeue_rollback(blockstore_op_t *op); int dequeue_rollback(blockstore_op_t *op);
int continue_rollback(blockstore_op_t *op); void handle_rollback_event(ring_data_t *data, blockstore_op_t *op);
void mark_rolled_back(const obj_ver_id & ov);
void erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc); void erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc);
// List // List
@ -361,7 +291,7 @@ class blockstore_impl_t
public: public:
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd); blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop);
~blockstore_impl_t(); ~blockstore_impl_t();
// Event loop // Event loop
@ -380,23 +310,12 @@ public:
bool is_stalled(); bool is_stalled();
// Submission // Submission
void enqueue_op(blockstore_op_t *op); void enqueue_op(blockstore_op_t *op, bool first = false);
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version) // Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> unstable_writes; std::unordered_map<object_id, uint64_t> unstable_writes;
// Space usage statistics
std::map<uint64_t, uint64_t> inode_space_stats;
// Print diagnostics to stdout
void dump_diagnostics();
inline uint32_t get_block_size() { return block_size; } inline uint32_t get_block_size() { return block_size; }
inline uint64_t get_block_count() { return block_count; } inline uint64_t get_block_count() { return block_count; }
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); } inline uint32_t get_disk_alignment() { return disk_alignment; }
inline uint32_t get_bitmap_granularity() { return disk_alignment; }
inline uint64_t get_journal_size() { return journal.len; }
}; };
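Reviewer note: the left-hand side of this header adds a metadata superblock (`blockstore_meta_header_t`), and the `.cpp` hunks further down check it at startup. The sketch below restates that check as a pure function so the three outcomes are easier to see: an all-zero block is treated as an empty metadata area, a wrong `zero`/`magic`/`version` as corruption, and differing sizes as a configuration mismatch. The names `meta_header_sketch_t`, `meta_check_t`, `check_meta_header` and `magic_to_string` are hypothetical and do not appear in the repository; the real code prints a message and calls `exit(1)` instead of returning a status.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Same field layout as the header in the diff: 3*8 + 3*4 = 36 bytes when packed.
struct __attribute__((__packed__)) meta_header_sketch_t
{
    uint64_t zero;
    uint64_t magic;
    uint64_t version;
    uint32_t meta_block_size;
    uint32_t data_block_size;
    uint32_t bitmap_granularity;
};

// Values copied from the #defines shown in the diff.
static const uint64_t META_MAGIC_SKETCH = 0x726F747341544956ULL;
static const uint64_t META_VERSION_SKETCH = 1;

// Decode the magic into ASCII, lowest byte first (the on-disk order on little-endian hosts).
std::string magic_to_string(uint64_t magic)
{
    std::string s;
    for (int i = 0; i < 8; i++)
        s += (char)((magic >> (8*i)) & 0xFF);
    return s;
}

enum meta_check_t { META_OK, META_EMPTY, META_CORRUPT, META_MISMATCH };

// Classify the first metadata block against the configured sizes (hypothetical helper).
meta_check_t check_meta_header(const void *block, size_t block_size,
    uint32_t meta_block_size, uint32_t data_block_size, uint32_t bitmap_granularity)
{
    assert(block_size >= sizeof(meta_header_sketch_t));
    // An all-zero block means the metadata area has not been initialized yet
    const uint8_t *p = (const uint8_t*)block;
    bool all_zero = true;
    for (size_t i = 0; i < block_size; i++)
    {
        if (p[i] != 0)
        {
            all_zero = false;
            break;
        }
    }
    if (all_zero)
        return META_EMPTY;
    meta_header_sketch_t hdr;
    std::memcpy(&hdr, block, sizeof(hdr));
    if (hdr.zero != 0 || hdr.magic != META_MAGIC_SKETCH || hdr.version != META_VERSION_SKETCH)
        return META_CORRUPT;
    if (hdr.meta_block_size != meta_block_size ||
        hdr.data_block_size != data_block_size ||
        hdr.bitmap_granularity != bitmap_granularity)
        return META_MISMATCH;
    return META_OK;
}
```

Returning a status instead of exiting is only a convenience for illustration; it makes the check testable without touching a device.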


@ -1,22 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full during initialization");\
data = ((ring_data_t*)sqe->user_data)
static bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs) blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
{ {
this->bs = bs; this->bs = bs;
@ -24,7 +7,7 @@ blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
void blockstore_init_meta::handle_event(ring_data_t *data) void blockstore_init_meta::handle_event(ring_data_t *data)
{ {
if (data->res < 0) if (data->res <= 0)
{ {
throw std::runtime_error( throw std::runtime_error(
std::string("read metadata failed at offset ") + std::to_string(metadata_read) + std::string("read metadata failed at offset ") + std::to_string(metadata_read) +
@ -42,12 +25,6 @@ int blockstore_init_meta::loop()
{ {
if (wait_state == 1) if (wait_state == 1)
goto resume_1; goto resume_1;
else if (wait_state == 2)
goto resume_2;
else if (wait_state == 3)
goto resume_3;
else if (wait_state == 4)
goto resume_4;
printf("Reading blockstore metadata\n"); printf("Reading blockstore metadata\n");
if (bs->inmemory_meta) if (bs->inmemory_meta)
metadata_buffer = bs->metadata_buffer; metadata_buffer = bs->metadata_buffer;
@ -55,114 +32,31 @@ int blockstore_init_meta::loop()
metadata_buffer = memalign(MEM_ALIGNMENT, 2*bs->metadata_buf_size); metadata_buffer = memalign(MEM_ALIGNMENT, 2*bs->metadata_buf_size);
if (!metadata_buffer) if (!metadata_buffer)
throw std::runtime_error("Failed to allocate metadata read buffer"); throw std::runtime_error("Failed to allocate metadata read buffer");
// Read superblock
GET_SQE();
data->iov = { metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_1:
if (submitted)
{
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)metadata_buffer, bs->meta_block_size / sizeof(uint64_t)))
{
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
hdr->zero = 0;
hdr->magic = BLOCKSTORE_META_MAGIC;
hdr->version = BLOCKSTORE_META_VERSION;
hdr->meta_block_size = bs->meta_block_size;
hdr->data_block_size = bs->block_size;
hdr->bitmap_granularity = bs->bitmap_granularity;
}
if (bs->readonly)
{
printf("Skipping metadata initialization because blockstore is readonly\n");
}
else
{
printf("Initializing metadata area\n");
GET_SQE();
data->iov = (struct iovec){ metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_3:
if (submitted > 0)
{
wait_state = 3;
return 1;
}
zero_on_init = true;
}
}
else
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
if (hdr->zero != 0 ||
hdr->magic != BLOCKSTORE_META_MAGIC ||
hdr->version != BLOCKSTORE_META_VERSION)
{
printf(
"Metadata is corrupt or old version.\n"
" If this is a new OSD please zero out the metadata area before starting it.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n"
);
exit(1);
}
if (hdr->meta_block_size != bs->meta_block_size ||
hdr->data_block_size != bs->block_size ||
hdr->bitmap_granularity != bs->bitmap_granularity)
{
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u)"
" differs from OSD configuration (%lu/%u/%lu).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
bs->meta_block_size, bs->block_size, bs->bitmap_granularity
);
exit(1);
}
}
// Skip superblock
bs->meta_offset += bs->meta_block_size;
bs->meta_len -= bs->meta_block_size;
prev_done = 0;
done_len = 0;
done_pos = 0;
metadata_read = 0;
// Read the rest of the metadata
while (1) while (1)
{ {
resume_2: resume_1:
if (submitted) if (submitted)
{ {
wait_state = 2; wait_state = 1;
return 1; return 1;
} }
if (metadata_read < bs->meta_len) if (metadata_read < bs->meta_len)
{ {
GET_SQE(); sqe = bs->get_sqe();
if (!sqe)
{
throw std::runtime_error("io_uring is full while trying to read metadata");
}
data = ((ring_data_t*)sqe->user_data);
data->iov = { data->iov = {
(uint8_t*)metadata_buffer + (bs->inmemory_meta metadata_buffer + (bs->inmemory_meta
? metadata_read ? metadata_read
: (prev == 1 ? bs->metadata_buf_size : 0)), : (prev == 1 ? bs->metadata_buf_size : 0)),
bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read, bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
}; };
data->callback = [this](ring_data_t *data) { handle_event(data); }; data->callback = [this](ring_data_t *data) { handle_event(data); };
if (!zero_on_init) my_uring_prep_readv(sqe, bs->meta_fd_index, &data->iov, 1, bs->meta_offset + metadata_read);
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read); sqe->flags |= IOSQE_FIXED_FILE;
else
{
// Fill metadata with zeroes
memset(data->iov.iov_base, 0, data->iov.iov_len);
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
}
bs->ringloop->submit(); bs->ringloop->submit();
submitted = (prev == 1 ? 2 : 1); submitted = (prev == 1 ? 2 : 1);
prev = submitted; prev = submitted;
@ -170,13 +64,13 @@ resume_1:
if (prev_done) if (prev_done)
{ {
void *done_buf = bs->inmemory_meta void *done_buf = bs->inmemory_meta
? ((uint8_t*)metadata_buffer + done_pos) ? (metadata_buffer + done_pos)
: ((uint8_t*)metadata_buffer + (prev_done == 2 ? bs->metadata_buf_size : 0)); : (metadata_buffer + (prev_done == 2 ? bs->metadata_buf_size : 0));
unsigned count = bs->meta_block_size / bs->clean_entry_size; unsigned count = bs->meta_block_size / bs->clean_entry_size;
for (int sector = 0; sector < done_len; sector += bs->meta_block_size) for (int sector = 0; sector < done_len; sector += bs->meta_block_size)
{ {
// handle <count> entries // handle <count> entries
handle_entries((uint8_t*)done_buf + sector, count, bs->block_order); handle_entries(done_buf + sector, count, bs->block_order);
done_cnt += count; done_cnt += count;
} }
prev_done = 0; prev_done = 0;
@ -194,21 +88,6 @@ resume_1:
free(metadata_buffer); free(metadata_buffer);
metadata_buffer = NULL; metadata_buffer = NULL;
} }
if (zero_on_init && !bs->disable_meta_fsync)
{
GET_SQE();
my_uring_prep_fsync(sqe, bs->meta_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this](ring_data_t *data) { handle_event(data); };
submitted = 1;
bs->ringloop->submit();
resume_4:
if (submitted > 0)
{
wait_state = 4;
return 1;
}
}
return 0; return 0;
} }
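Reviewer note: both versions of `blockstore_init_meta::loop()` use the same resumable pattern: `loop()` returns 1 while an I/O is outstanding, remembers where it stopped in `wait_state`, and jumps back to the matching `resume_N` label on the next call. The sketch below isolates that control flow with no io_uring involved. The type `resumable_reader_sketch` and its members are hypothetical; "submitting" is simulated by setting a flag, and `handle_event()` stands in for the completion callback.

```cpp
#include <cassert>

// Minimal stand-in for the goto-based coroutine used by the init code (assumed names).
struct resumable_reader_sketch
{
    int wait_state = 0;   // which resume label to jump to on re-entry
    int submitted = 0;    // non-zero while a simulated request is in flight
    int chunks_total;
    int chunks_done = 0;

    explicit resumable_reader_sketch(int total): chunks_total(total) {}

    // Returns 1 if the caller must wait for a completion, 0 when all chunks are processed.
    int loop()
    {
        if (wait_state == 1)
            goto resume_1;
        while (chunks_done < chunks_total)
        {
            // Pretend to submit a read for the next chunk
            submitted = 1;
        resume_1:
            if (submitted)
            {
                wait_state = 1;
                return 1;
            }
            // The completion arrived, so account for the chunk
            chunks_done++;
        }
        wait_state = 0;
        return 0;
    }

    // Simulated completion callback: clears the in-flight flag.
    void handle_event()
    {
        assert(submitted);
        submitted = 0;
    }
};
```

Calling `loop()` again before `handle_event()` simply returns 1 again without advancing, which is the behaviour the event loop relies on when it polls the init object repeatedly.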
@ -216,38 +95,30 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
{ {
for (unsigned i = 0; i < count; i++) for (unsigned i = 0; i < count; i++)
{ {
clean_disk_entry *entry = (clean_disk_entry*)((uint8_t*)entries + i*bs->clean_entry_size); clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size);
if (!bs->inmemory_meta && bs->clean_entry_bitmap_size) if (!bs->inmemory_meta && bs->clean_entry_bitmap_size)
{ {
memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->clean_entry_bitmap_size, &entry->bitmap, 2*bs->clean_entry_bitmap_size); memcpy(bs->clean_bitmap + (done_cnt+i)*bs->clean_entry_bitmap_size, &entry->bitmap, bs->clean_entry_bitmap_size);
} }
if (entry->oid.inode > 0) if (entry->oid.inode > 0)
{ {
auto & clean_db = bs->clean_db_shard(entry->oid); auto clean_it = bs->clean_db.find(entry->oid);
auto clean_it = clean_db.find(entry->oid); if (clean_it == bs->clean_db.end() || clean_it->second.version < entry->version)
if (clean_it == clean_db.end() || clean_it->second.version < entry->version)
{ {
if (clean_it != clean_db.end()) if (clean_it != bs->clean_db.end())
{ {
// free the previous block // free the previous block
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n", printf("Free block %lu\n", clean_it->second.location >> bs->block_order);
clean_it->second.location >> block_order,
clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
done_cnt+i);
#endif #endif
bs->data_alloc->set(clean_it->second.location >> block_order, false); bs->data_alloc->set(clean_it->second.location >> block_order, false);
} }
else
{
bs->inode_space_stats[entry->oid.inode] += bs->block_size;
}
entries_loaded++; entries_loaded++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version); printf("Allocate block (clean entry) %lu: %lu:%lu v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif #endif
bs->data_alloc->set(done_cnt+i, true); bs->data_alloc->set(done_cnt+i, true);
clean_db[entry->oid] = (struct clean_entry){ bs->clean_db[entry->oid] = (struct clean_entry){
.version = entry->version, .version = entry->version,
.location = (done_cnt+i) << block_order, .location = (done_cnt+i) << block_order,
}; };
@ -255,7 +126,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
else else
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Old clean entry %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version); printf("Old clean entry %lu: %lu:%lu v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif #endif
} }
} }
@ -276,6 +147,14 @@ blockstore_init_journal::blockstore_init_journal(blockstore_impl_t *bs)
}; };
} }
bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
void blockstore_init_journal::handle_event(ring_data_t *data1) void blockstore_init_journal::handle_event(ring_data_t *data1)
{ {
if (data1->res <= 0) if (data1->res <= 0)
@ -300,6 +179,12 @@ void blockstore_init_journal::handle_event(ring_data_t *data1)
submitted_buf = NULL; submitted_buf = NULL;
} }
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full while trying to read journal");\
data = ((ring_data_t*)sqe->user_data)
int blockstore_init_journal::loop() int blockstore_init_journal::loop()
{ {
if (wait_state == 1) if (wait_state == 1)
@ -318,7 +203,11 @@ int blockstore_init_journal::loop()
goto resume_7; goto resume_7;
printf("Reading blockstore journal\n"); printf("Reading blockstore journal\n");
if (!bs->journal.inmemory) if (!bs->journal.inmemory)
submitted_buf = memalign_or_die(MEM_ALIGNMENT, 2*bs->journal.block_size); {
submitted_buf = memalign(MEM_ALIGNMENT, 2*bs->journal.block_size);
if (!submitted_buf)
throw std::bad_alloc();
}
else else
submitted_buf = bs->journal.buffer; submitted_buf = bs->journal.buffer;
// Read first block of the journal // Read first block of the journal
@ -328,7 +217,8 @@ int blockstore_init_journal::loop()
data = ((ring_data_t*)sqe->user_data); data = ((ring_data_t*)sqe->user_data);
data->iov = { submitted_buf, bs->journal.block_size }; data->iov = { submitted_buf, bs->journal.block_size };
data->callback = simple_callback; data->callback = simple_callback;
my_uring_prep_readv(sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset); my_uring_prep_readv(sqe, bs->journal_fd_index, &data->iov, 1, bs->journal.offset);
sqe->flags |= IOSQE_FIXED_FILE;
bs->ringloop->submit(); bs->ringloop->submit();
wait_count = 1; wait_count = 1;
resume_1: resume_1:
@ -337,7 +227,7 @@ resume_1:
wait_state = 1; wait_state = 1;
return 1; return 1;
} }
if (iszero((uint64_t*)submitted_buf, bs->journal.block_size / sizeof(uint64_t))) if (iszero((uint64_t*)submitted_buf, 3))
{ {
// Journal is empty // Journal is empty
// FIXME handle this wrapping to journal_block_size better (maybe) // FIXME handle this wrapping to journal_block_size better (maybe)
@ -352,7 +242,6 @@ resume_1:
.size = sizeof(journal_entry_start), .size = sizeof(journal_entry_start),
.reserved = 0, .reserved = 0,
.journal_start = bs->journal.block_size, .journal_start = bs->journal.block_size,
.version = JOURNAL_VERSION,
}; };
((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf); ((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf);
if (bs->readonly) if (bs->readonly)
@ -367,7 +256,8 @@ resume_1:
GET_SQE(); GET_SQE();
data->iov = (struct iovec){ submitted_buf, 2*bs->journal.block_size }; data->iov = (struct iovec){ submitted_buf, 2*bs->journal.block_size };
data->callback = simple_callback; data->callback = simple_callback;
my_uring_prep_writev(sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset); my_uring_prep_writev(sqe, bs->journal_fd_index, &data->iov, 1, bs->journal.offset);
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
bs->ringloop->submit(); bs->ringloop->submit();
resume_6: resume_6:
@ -379,7 +269,8 @@ resume_1:
if (!bs->disable_journal_fsync) if (!bs->disable_journal_fsync)
{ {
GET_SQE(); GET_SQE();
my_uring_prep_fsync(sqe, bs->journal.fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, bs->journal_fd_index, IORING_FSYNC_DATASYNC);
sqe->flags |= IOSQE_FIXED_FILE;
data->iov = { 0 }; data->iov = { 0 };
data->callback = simple_callback; data->callback = simple_callback;
wait_count++; wait_count++;
@ -403,21 +294,11 @@ resume_1:
je_start = (journal_entry_start*)submitted_buf; je_start = (journal_entry_start*)submitted_buf;
if (je_start->magic != JOURNAL_MAGIC || if (je_start->magic != JOURNAL_MAGIC ||
je_start->type != JE_START || je_start->type != JE_START ||
je_crc32((journal_entry*)je_start) != je_start->crc32 || je_start->size != sizeof(journal_entry_start) ||
je_start->size != sizeof(journal_entry_start) && je_start->size != JE_START_LEGACY_SIZE) je_crc32((journal_entry*)je_start) != je_start->crc32)
{ {
// Entry is corrupt // Entry is corrupt
fprintf(stderr, "First entry of the journal is corrupt\n"); throw std::runtime_error("first entry of the journal is corrupt");
exit(1);
}
if (je_start->size == JE_START_LEGACY_SIZE || je_start->version != JOURNAL_VERSION)
{
fprintf(
stderr, "The code only supports journal version %d, but it is %lu on disk."
" Please use the previous version to flush the journal before upgrading OSD\n",
JOURNAL_VERSION, je_start->size == JE_START_LEGACY_SIZE ? 0 : je_start->version
);
exit(1);
} }
next_free = journal_pos = bs->journal.used_start = je_start->journal_start; next_free = journal_pos = bs->journal.used_start = je_start->journal_start;
if (!bs->journal.inmemory) if (!bs->journal.inmemory)
@ -440,15 +321,16 @@ resume_1:
if (journal_pos < bs->journal.used_start) if (journal_pos < bs->journal.used_start)
end = bs->journal.used_start; end = bs->journal.used_start;
if (!bs->journal.inmemory) if (!bs->journal.inmemory)
submitted_buf = memalign_or_die(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE); submitted_buf = memalign(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE);
else else
submitted_buf = (uint8_t*)bs->journal.buffer + journal_pos; submitted_buf = bs->journal.buffer + journal_pos;
data->iov = { data->iov = {
submitted_buf, submitted_buf,
end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE, end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE,
}; };
data->callback = [this](ring_data_t *data1) { handle_event(data1); }; data->callback = [this](ring_data_t *data1) { handle_event(data1); };
my_uring_prep_readv(sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset + journal_pos); my_uring_prep_readv(sqe, bs->journal_fd_index, &data->iov, 1, bs->journal.offset + journal_pos);
sqe->flags |= IOSQE_FIXED_FILE;
bs->ringloop->submit(); bs->ringloop->submit();
} }
while (done.size() > 0) while (done.size() > 0)
@ -463,7 +345,8 @@ resume_1:
GET_SQE(); GET_SQE();
data->iov = { init_write_buf, bs->journal.block_size }; data->iov = { init_write_buf, bs->journal.block_size };
data->callback = simple_callback; data->callback = simple_callback;
my_uring_prep_writev(sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset + init_write_sector); my_uring_prep_writev(sqe, bs->journal_fd_index, &data->iov, 1, bs->journal.offset + init_write_sector);
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
bs->ringloop->submit(); bs->ringloop->submit();
resume_7: resume_7:
@ -477,7 +360,8 @@ resume_1:
GET_SQE(); GET_SQE();
data->iov = { 0 }; data->iov = { 0 };
data->callback = simple_callback; data->callback = simple_callback;
my_uring_prep_fsync(sqe, bs->journal.fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, bs->journal_fd_index, IORING_FSYNC_DATASYNC);
sqe->flags |= IOSQE_FIXED_FILE;
wait_count++; wait_count++;
bs->ringloop->submit(); bs->ringloop->submit();
} }
@ -523,22 +407,10 @@ resume_1:
} }
} }
} }
for (auto ov: double_allocs) // Trim journal on start so we don't stall when all entries are older
{ bs->journal.trim();
auto dirty_it = bs->dirty_db.find(ov);
if (dirty_it != bs->dirty_db.end() &&
IS_BIG_WRITE(dirty_it->second.state) &&
dirty_it->second.location == UINT64_MAX)
{
printf("Fatal error (bug): %lx:%lx v%lu big_write journal_entry was allocated over another object\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
exit(1);
}
}
bs->flusher->mark_trim_possible();
bs->journal.dirty_start = bs->journal.next_free;
printf( printf(
"Journal entries loaded: %lu, free journal space: %lu bytes (%08lx..%08lx is used), free blocks: %lu / %lu\n", "Journal entries loaded: %lu, free journal space: %lu bytes (%lu..%lu is used), free blocks: %lu / %lu\n",
entries_loaded, entries_loaded,
(bs->journal.next_free >= bs->journal.used_start (bs->journal.next_free >= bs->journal.used_start
? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start) ? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start)
@ -572,9 +444,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
resume: resume:
while (pos < bs->journal.block_size) while (pos < bs->journal.block_size)
{ {
journal_entry *je = (journal_entry*)((uint8_t*)buf + proc_pos - done_pos + pos); journal_entry *je = (journal_entry*)(buf + proc_pos - done_pos + pos);
if (je->magic != JOURNAL_MAGIC || je_crc32(je) != je->crc32 || if (je->magic != JOURNAL_MAGIC || je_crc32(je) != je->crc32 ||
je->type < JE_MIN || je->type > JE_MAX || started && je->crc32_prev != crc32_last) je->type < JE_SMALL_WRITE || je->type > JE_DELETE || started && je->crc32_prev != crc32_last)
{ {
if (pos == 0) if (pos == 0)
{ {
@ -588,15 +460,10 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
break; break;
} }
} }
if (je->type == JE_SMALL_WRITE || je->type == JE_SMALL_WRITE_INSTANT) if (je->type == JE_SMALL_WRITE)
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf( printf("je_small_write oid=%lu:%lu ver=%lu offset=%u len=%u\n", je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version, je->small_write.offset, je->small_write.len);
"je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len
);
#endif #endif
// oid, version, offset, len // oid, version, offset, len
uint64_t prev_free = next_free; uint64_t prev_free = next_free;
@ -614,14 +481,14 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (location != je->small_write.data_offset) if (location != je->small_write.data_offset)
{ {
char err[1024]; char err[1024];
snprintf(err, 1024, "BUG: calculated journal data offset (%08lx) != stored journal data offset (%08lx)", location, je->small_write.data_offset); snprintf(err, 1024, "BUG: calculated journal data offset (%lu) != stored journal data offset (%lu)", location, je->small_write.data_offset);
throw std::runtime_error(err); throw std::runtime_error(err);
} }
uint32_t data_crc32 = 0; uint32_t data_crc32 = 0;
if (location >= done_pos && location+je->small_write.len <= done_pos+len) if (location >= done_pos && location+je->small_write.len <= done_pos+len)
{ {
// data is within this buffer // data is within this buffer
data_crc32 = crc32c(0, (uint8_t*)buf + location - done_pos, je->small_write.len); data_crc32 = crc32c(0, buf + location - done_pos, je->small_write.len);
} }
else else
{ {
@ -636,7 +503,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
? location+je->small_write.len : done[i].pos+done[i].len); ? location+je->small_write.len : done[i].pos+done[i].len);
uint64_t part_begin = (location < done[i].pos ? done[i].pos : location); uint64_t part_begin = (location < done[i].pos ? done[i].pos : location);
covered += part_end - part_begin; covered += part_end - part_begin;
data_crc32 = crc32c(data_crc32, (uint8_t*)done[i].buf + part_begin - done[i].pos, part_end - part_begin); data_crc32 = crc32c(data_crc32, done[i].buf + part_begin - done[i].pos, part_end - part_begin);
} }
} }
if (covered < je->small_write.len) if (covered < je->small_write.len)
@ -649,98 +516,44 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (data_crc32 != je->small_write.crc32_data) if (data_crc32 != je->small_write.crc32_data)
{ {
// journal entry is corrupt, stop here // journal entry is corrupt, stop here
// interesting thing is that we must clear the corrupt entry if we're not readonly, // interesting thing is that we must clear the corrupt entry if we're not readonly
// because we don't write next entries in the same journal block memset(buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
printf("Journal entry data is corrupt (data crc32 %x != %x)\n", data_crc32, je->small_write.crc32_data);
memset((uint8_t*)buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
bs->journal.next_free = prev_free; bs->journal.next_free = prev_free;
init_write_buf = (uint8_t*)buf + proc_pos - done_pos; init_write_buf = buf + proc_pos - done_pos;
init_write_sector = proc_pos; init_write_sector = proc_pos;
return 0; return 0;
} }
auto & clean_db = bs->clean_db_shard(je->small_write.oid); auto clean_it = bs->clean_db.find(je->small_write.oid);
auto clean_it = clean_db.find(je->small_write.oid); if (clean_it == bs->clean_db.end() ||
if (clean_it == clean_db.end() || clean_it->second.version < je->big_write.version)
clean_it->second.version < je->small_write.version)
{ {
obj_ver_id ov = { obj_ver_id ov = {
.oid = je->small_write.oid, .oid = je->small_write.oid,
.version = je->small_write.version, .version = je->small_write.version,
}; };
void *bmp = NULL;
void *bmp_from = (uint8_t*)je + sizeof(journal_entry_small_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
bs->dirty_db.emplace(ov, (dirty_entry){ bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED), .state = ST_J_SYNCED,
.flags = 0, .flags = 0,
.location = location, .location = location,
.offset = je->small_write.offset, .offset = je->small_write.offset,
.len = je->small_write.len, .len = je->small_write.len,
.journal_sector = proc_pos, .journal_sector = proc_pos,
.bitmap = bmp,
}); });
bs->journal.used_sectors[proc_pos]++; bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf( printf("journal offset %lu is used by %lu:%lu v%lu\n", proc_pos, ov.oid.inode, ov.oid.stripe, ov.version);
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif #endif
auto & unstab = bs->unstable_writes[ov.oid]; auto & unstab = bs->unstable_writes[ov.oid];
unstab = unstab < ov.version ? ov.version : unstab; unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_SMALL_WRITE_INSTANT)
{
bs->mark_stable(ov, true);
}
} }
} }
else if (je->type == JE_BIG_WRITE || je->type == JE_BIG_WRITE_INSTANT) else if (je->type == JE_BIG_WRITE)
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf( printf("je_big_write oid=%lu:%lu ver=%lu loc=%lu\n", je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location);
"je_big_write%s oid=%lx:%lx ver=%lu loc=%lu\n",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location >> bs->block_order
);
#endif #endif
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){ auto clean_it = bs->clean_db.find(je->big_write.oid);
.oid = je->big_write.oid, if (clean_it == bs->clean_db.end() ||
.version = UINT64_MAX,
});
if (dirty_it != bs->dirty_db.begin() && bs->dirty_db.size() > 0)
{
dirty_it--;
if (dirty_it->first.oid == je->big_write.oid &&
dirty_it->first.version >= je->big_write.version &&
(dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE)
{
// It is allowed to overwrite a deleted object with a
// version number smaller than deletion version number,
// because the presence of a BIG_WRITE entry means that
// its data and metadata are already flushed.
// We don't know if newer versions are flushed, but
// the previous delete definitely is.
// So we forget previous dirty entries, but retain the clean one.
// This feature is required for writes happening shortly
// after deletes.
erase_dirty_object(dirty_it);
}
}
auto & clean_db = bs->clean_db_shard(je->big_write.oid);
auto clean_it = clean_db.find(je->big_write.oid);
if (clean_it == clean_db.end() ||
clean_it->second.version < je->big_write.version) clean_it->second.version < je->big_write.version)
{ {
// oid, version, block // oid, version, block
@ -748,134 +561,131 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->big_write.oid, .oid = je->big_write.oid,
.version = je->big_write.version, .version = je->big_write.version,
}; };
void *bmp = NULL; bs->dirty_db.emplace(ov, (dirty_entry){
void *bmp_from = (uint8_t*)je + sizeof(journal_entry_big_write); .state = ST_D_META_SYNCED,
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
auto dirty_it = bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
.flags = 0, .flags = 0,
.location = je->big_write.location, .location = je->big_write.location,
.offset = je->big_write.offset, .offset = je->big_write.offset,
.len = je->big_write.len, .len = je->big_write.len,
.journal_sector = proc_pos, .journal_sector = proc_pos,
.bitmap = bmp, });
}).first;
if (bs->data_alloc->get(je->big_write.location >> bs->block_order))
{
// This is probably a big_write that's already flushed and freed, but it may
// also indicate a bug. So we remember such entries and recheck them afterwards.
// If it's not a bug they won't be present after reading the whole journal.
dirty_it->second.location = UINT64_MAX;
double_allocs.push_back(ov);
}
else
{
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf( printf("Allocate block %lu\n", je->big_write.location >> bs->block_order);
"Allocate block (journal) %lu: %lx:%lx v%lu\n",
je->big_write.location >> bs->block_order,
ov.oid.inode, ov.oid.stripe, ov.version
);
#endif #endif
bs->data_alloc->set(je->big_write.location >> bs->block_order, true); bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
}
bs->journal.used_sectors[proc_pos]++; bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif
auto & unstab = bs->unstable_writes[ov.oid]; auto & unstab = bs->unstable_writes[ov.oid];
unstab = unstab < ov.version ? ov.version : unstab; unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_BIG_WRITE_INSTANT)
{
bs->mark_stable(ov, true);
}
} }
} }
else if (je->type == JE_STABLE) else if (je->type == JE_STABLE)
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("je_stable oid=%lx:%lx ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version); printf("je_stable oid=%lu:%lu ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
#endif #endif
// oid, version // oid, version
obj_ver_id ov = { obj_ver_id ov = {
.oid = je->stable.oid, .oid = je->stable.oid,
.version = je->stable.version, .version = je->stable.version,
}; };
bs->mark_stable(ov, true); auto it = bs->dirty_db.find(ov);
if (it == bs->dirty_db.end())
{
// journal contains a legitimate STABLE entry for a non-existing dirty write
// this probably means that journal was trimmed between WRITE and STABLE entries
// skip it
}
else
{
while (1)
{
it->second.state = (it->second.state == ST_D_META_SYNCED
? ST_D_STABLE
: (it->second.state == ST_DEL_SYNCED ? ST_DEL_STABLE : ST_J_STABLE));
if (it == bs->dirty_db.begin())
break;
it--;
if (it->first.oid != ov.oid || IS_STABLE(it->second.state))
break;
}
bs->flusher->enqueue_flush(ov);
}
auto unstab_it = bs->unstable_writes.find(ov.oid);
if (unstab_it != bs->unstable_writes.end() && unstab_it->second <= ov.version)
{
bs->unstable_writes.erase(unstab_it);
}
} }
else if (je->type == JE_ROLLBACK) else if (je->type == JE_ROLLBACK)
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("je_rollback oid=%lx:%lx ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version); printf("je_rollback oid=%lu:%lu ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
#endif #endif
// rollback dirty writes of <oid> up to <version> // rollback dirty writes of <oid> up to <version>
obj_ver_id ov = { auto it = bs->dirty_db.lower_bound((obj_ver_id){
.oid = je->rollback.oid, .oid = je->rollback.oid,
.version = je->rollback.version, .version = UINT64_MAX,
}; });
bs->mark_rolled_back(ov); if (it != bs->dirty_db.begin())
{
uint64_t max_unstable = 0;
auto rm_start = it;
auto rm_end = it;
it--;
while (it->first.oid == je->rollback.oid &&
it->first.version > je->rollback.version &&
!IS_IN_FLIGHT(it->second.state) &&
!IS_STABLE(it->second.state))
{
if (it->first.oid != je->rollback.oid)
break;
else if (it->first.version <= je->rollback.version)
{
if (!IS_STABLE(it->second.state))
max_unstable = it->first.version;
break;
}
else if (IS_STABLE(it->second.state))
break;
// Remove entry
rm_start = it;
if (it == bs->dirty_db.begin())
break;
it--;
}
if (rm_start != rm_end)
{
bs->erase_dirty(rm_start, rm_end, UINT64_MAX);
}
auto unstab_it = bs->unstable_writes.find(je->rollback.oid);
if (unstab_it != bs->unstable_writes.end())
{
if (max_unstable == 0)
bs->unstable_writes.erase(unstab_it);
else
unstab_it->second = max_unstable;
}
}
} }
else if (je->type == JE_DELETE) else if (je->type == JE_DELETE)
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version); printf("je_delete oid=%lu:%lu ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif #endif
bool dirty_exists = false; // oid, version
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){ obj_ver_id ov = {
.oid = je->del.oid, .oid = je->del.oid,
.version = UINT64_MAX, .version = je->del.version,
};
bs->dirty_db.emplace(ov, (dirty_entry){
.state = ST_DEL_SYNCED,
.flags = 0,
.location = 0,
.offset = 0,
.len = 0,
.journal_sector = proc_pos,
}); });
if (dirty_it != bs->dirty_db.begin()) bs->journal.used_sectors[proc_pos]++;
{
dirty_it--;
dirty_exists = dirty_it->first.oid == je->del.oid;
}
auto & clean_db = bs->clean_db_shard(je->del.oid);
auto clean_it = clean_db.find(je->del.oid);
bool clean_exists = (clean_it != clean_db.end() &&
clean_it->second.version < je->del.version);
if (!clean_exists && dirty_exists)
{
// Clean entry doesn't exist. This means that the delete is already flushed.
// So we must not flush this object anymore.
erase_dirty_object(dirty_it);
}
else if (clean_exists || dirty_exists)
{
// oid, version
obj_ver_id ov = {
.oid = je->del.oid,
.version = je->del.version,
};
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_DELETE | BS_ST_SYNCED),
.flags = 0,
.location = 0,
.offset = 0,
.len = 0,
.journal_sector = proc_pos,
});
bs->journal.used_sectors[proc_pos]++;
// Deletions are treated as immediately stable, because
// "2-phase commit" (write->stabilize) isn't sufficient for them anyway
bs->mark_stable(ov, true);
}
// Ignore delete if neither preceding dirty entries nor the clean one are present
} }
started = true; started = true;
pos += je->size; pos += je->size;
@ -886,36 +696,3 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.next_free = next_free; bs->journal.next_free = next_free;
return 1; return 1;
} }
void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it)
{
auto oid = dirty_it->first.oid;
bool exists = !IS_DELETE(dirty_it->second.state);
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != oid)
{
dirty_it++;
break;
}
}
auto & clean_db = bs->clean_db_shard(oid);
auto clean_it = clean_db.find(oid);
uint64_t clean_loc = clean_it != clean_db.end()
? clean_it->second.location : UINT64_MAX;
if (exists && clean_loc == UINT64_MAX)
{
bs->inode_space_stats[oid.inode] -= bs->block_size;
}
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(oid);
}
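
Note on the JE_STABLE branch in the right-hand side of the hunk above: it marks the named version as stable, then walks backwards through `dirty_db` and does the same for every older, not yet stable version of the same object. A minimal sketch of that walk follows. It assumes a plain `std::map` keyed by (object id, version) and a two-value state enum; `ObjVer`, `State`, `DirtyDb` and `replay_stable` are hypothetical names for illustration, not types from the repository.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical, simplified stand-ins for obj_ver_id and the dirty entry state.
using ObjVer = std::pair<uint64_t, uint64_t>; // (object id, version)
enum class State { Synced, Stable };
using DirtyDb = std::map<ObjVer, State>;

// Mark <oid, version> and every older, not yet stable version of the same
// object as stable. Returns the number of entries that were changed.
int replay_stable(DirtyDb & db, uint64_t oid, uint64_t version)
{
    auto it = db.find(ObjVer(oid, version));
    if (it == db.end())
    {
        // No matching dirty write: the journal was probably trimmed
        // between the WRITE and STABLE entries, so skip it.
        return 0;
    }
    int changed = 0;
    while (true)
    {
        it->second = State::Stable;
        changed++;
        if (it == db.begin())
            break;
        --it;
        // Stop at another object or at a version that is already stable.
        if (it->first.first != oid || it->second == State::Stable)
            break;
    }
    return changed;
}
```

The backward walk relies on the map ordering: all versions of one object are adjacent and sorted by version, so decrementing the iterator visits older versions first and reaches a different object only after the last one.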

View File

@ -1,13 +1,9 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
class blockstore_init_meta class blockstore_init_meta
{ {
blockstore_impl_t *bs; blockstore_impl_t *bs;
int wait_state = 0; int wait_state = 0, wait_count = 0;
bool zero_on_init = false;
void *metadata_buffer = NULL; void *metadata_buffer = NULL;
uint64_t metadata_read = 0; uint64_t metadata_read = 0;
int prev = 0, prev_done = 0, done_len = 0, submitted = 0; int prev = 0, prev_done = 0, done_len = 0, submitted = 0;
@ -37,7 +33,6 @@ class blockstore_init_journal
bool started = false; bool started = false;
uint64_t next_free; uint64_t next_free;
std::vector<bs_init_journal_done> done; std::vector<bs_init_journal_done> done;
std::vector<obj_ver_id> double_allocs;
uint64_t journal_pos = 0; uint64_t journal_pos = 0;
uint64_t continue_pos = 0; uint64_t continue_pos = 0;
void *init_write_buf = NULL; void *init_write_buf = NULL;
@ -50,7 +45,6 @@ class blockstore_init_journal
std::function<void(ring_data_t*)> simple_callback; std::function<void(ring_data_t*)> simple_callback;
int handle_journal_part(void *buf, uint64_t done_pos, uint64_t len); int handle_journal_part(void *buf, uint64_t done_pos, uint64_t len);
void handle_event(ring_data_t *data); void handle_event(ring_data_t *data);
void erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it);
public: public:
blockstore_init_journal(blockstore_impl_t* bs); blockstore_init_journal(blockstore_impl_t* bs);
int loop(); int loop();

187
blockstore_journal.cpp Normal file

@ -0,0 +1,187 @@
#include "blockstore_impl.h"
blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs)
{
this->bs = bs;
sectors_required = 0;
next_pos = bs->journal.next_free;
next_sector = bs->journal.cur_sector;
next_in_pos = bs->journal.in_sector_pos;
right_dir = next_pos >= bs->journal.used_start;
}
// Check if we can write <required> entries of <size> bytes and <data_after> data bytes after them to the journal
int blockstore_journal_check_t::check_available(blockstore_op_t *op, int required, int size, int data_after)
{
while (1)
{
int fits = (bs->journal.block_size - next_in_pos) / size;
if (fits > 0)
{
required -= fits;
next_in_pos += fits * size;
sectors_required++;
}
else if (bs->journal.sector_info[next_sector].dirty)
{
// sectors_required is more like "sectors to write"
sectors_required++;
}
if (required <= 0)
{
break;
}
next_pos = next_pos + bs->journal.block_size;
if (next_pos >= bs->journal.len)
{
next_pos = bs->journal.block_size;
right_dir = false;
}
next_in_pos = 0;
if (bs->journal.sector_info[next_sector].usage_count > 0 ||
bs->journal.sector_info[next_sector].dirty)
{
next_sector = ((next_sector + 1) % bs->journal.sector_count);
}
if (bs->journal.sector_info[next_sector].usage_count > 0 ||
bs->journal.sector_info[next_sector].dirty)
{
// No memory buffer available. Wait for it.
#ifdef BLOCKSTORE_DEBUG
printf("next journal buffer %d is still dirty=%d used=%d\n", next_sector,
bs->journal.sector_info[next_sector].dirty, bs->journal.sector_info[next_sector].usage_count);
#endif
PRIV(op)->wait_for = WAIT_JOURNAL_BUFFER;
return 0;
}
}
if (data_after > 0)
{
next_pos = next_pos + data_after;
if (next_pos > bs->journal.len)
{
next_pos = bs->journal.block_size + data_after;
right_dir = false;
}
}
if (!right_dir && next_pos >= bs->journal.used_start-bs->journal.block_size)
{
// No space in the journal. Wait until used_start changes.
printf(
"Ran out of journal space (free space: %lu bytes)\n",
(bs->journal.next_free >= bs->journal.used_start
? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start)
: bs->journal.used_start - bs->journal.next_free)
);
PRIV(op)->wait_for = WAIT_JOURNAL;
bs->flusher->force_start();
PRIV(op)->wait_detail = bs->journal.used_start;
return 0;
}
return 1;
}
journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size)
{
if (journal.block_size - journal.in_sector_pos < size)
{
assert(!journal.sector_info[journal.cur_sector].dirty);
// Move to the next journal sector
if (journal.sector_info[journal.cur_sector].usage_count > 0)
{
// Also select next sector buffer in memory
journal.cur_sector = ((journal.cur_sector + 1) % journal.sector_count);
}
journal.sector_info[journal.cur_sector].offset = journal.next_free;
journal.in_sector_pos = 0;
journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
memset(journal.inmemory
? journal.buffer + journal.sector_info[journal.cur_sector].offset
: journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
}
journal_entry *je = (struct journal_entry*)(
(journal.inmemory
? journal.buffer + journal.sector_info[journal.cur_sector].offset
: journal.sector_buf + journal.block_size*journal.cur_sector) + journal.in_sector_pos
);
journal.in_sector_pos += size;
je->magic = JOURNAL_MAGIC;
je->type = type;
je->size = size;
je->crc32_prev = journal.crc32_last;
journal.sector_info[journal.cur_sector].dirty = true;
return je;
}
void prepare_journal_sector_write(journal_t & journal, int cur_sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb)
{
journal.sector_info[cur_sector].dirty = false;
journal.sector_info[cur_sector].usage_count++;
ring_data_t *data = ((ring_data_t*)sqe->user_data);
data->iov = (struct iovec){
(journal.inmemory
? journal.buffer + journal.sector_info[cur_sector].offset
: journal.sector_buf + journal.block_size*cur_sector),
journal.block_size
};
data->callback = cb;
my_uring_prep_writev(
sqe, journal.fd_index, &data->iov, 1, journal.offset + journal.sector_info[cur_sector].offset
);
sqe->flags |= IOSQE_FIXED_FILE;
}
journal_t::~journal_t()
{
if (sector_buf)
free(sector_buf);
if (sector_info)
free(sector_info);
if (buffer)
free(buffer);
sector_buf = NULL;
sector_info = NULL;
buffer = NULL;
}
bool journal_t::trim()
{
auto journal_used_it = used_sectors.lower_bound(used_start);
#ifdef BLOCKSTORE_DEBUG
printf(
"Trimming journal (used_start=%lu, next_free=%lu, first_used=%lu, usage_count=%lu)\n",
used_start, next_free,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->first,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->second
);
#endif
if (journal_used_it == used_sectors.end())
{
// Journal is cleared to its end, restart from the beginning
journal_used_it = used_sectors.begin();
if (journal_used_it == used_sectors.end())
{
// Journal is empty
used_start = next_free;
}
else
{
used_start = journal_used_it->first;
// next_free does not need updating here
}
}
else if (journal_used_it->first > used_start)
{
// Journal is cleared up to <journal_used_it>
used_start = journal_used_it->first;
}
else
{
// Can't trim journal
return false;
}
#ifdef BLOCKSTORE_DEBUG
printf("Journal trimmed to %lu (next_free=%lu)\n", used_start, next_free);
#endif
return true;
}
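
Note on the ring-buffer arithmetic used in `check_available` and `prefill_single_journal_entry` above: the journal wraps around to `block_size` rather than to 0, and the "Ran out of journal space" message computes free space differently depending on whether `next_free` is ahead of or behind `used_start`. The sketch below isolates those two expressions. The function names are hypothetical, and the reason the first block is skipped on wrap-around is an assumption (the diff only shows the wrap target).

```cpp
#include <cstdint>

// Offset of the journal block that follows <pos>. On wrap-around the result
// is block_size, not 0, mirroring the next_free update in the diff
// (the first block is presumably reserved; that is an assumption here).
uint64_t next_block(uint64_t pos, uint64_t block_size, uint64_t len)
{
    return pos + block_size < len ? pos + block_size : block_size;
}

// Free bytes in the ring, using the same expression as the
// "Ran out of journal space" diagnostic.
uint64_t journal_free_space(uint64_t used_start, uint64_t next_free,
    uint64_t block_size, uint64_t len)
{
    return next_free >= used_start
        // Writer is ahead of the first used block: everything except the
        // occupied span and one block is free.
        ? len - block_size - (next_free - used_start)
        // Writer has wrapped and is behind used_start: only the gap is free.
        : used_start - next_free;
}
```

For example, with a 1 MiB journal and 512-byte blocks, `journal_free_space(4096, 8192, 512, 1048576)` gives 1043968 bytes, while the wrapped case `journal_free_space(8192, 4096, 512, 1048576)` gives 4096 bytes.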

View File

@ -1,34 +1,23 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
#include "crc32c.h" #include "crc32c.h"
#include <set>
#define MIN_JOURNAL_SIZE 4*1024*1024 #define MIN_JOURNAL_SIZE 4*1024*1024
#define JOURNAL_MAGIC 0x4A33 #define JOURNAL_MAGIC 0x4A33
#define JOURNAL_VERSION 1
#define JOURNAL_BUFFER_SIZE 4*1024*1024 #define JOURNAL_BUFFER_SIZE 4*1024*1024
// We reserve some extra space for future stabilize requests during writes // We reserve some extra space for future stabilize requests during writes
// FIXME: This value should be dynamic i.e. Blockstore ideally shouldn't allow
// writing more than can be stabilized afterwards
#define JOURNAL_STABILIZE_RESERVATION 65536 #define JOURNAL_STABILIZE_RESERVATION 65536
// Journal entries // Journal entries
// Journal entries are linked to each other by their crc32 value // Journal entries are linked to each other by their crc32 value
// The journal is almost a blockchain, because object versions constantly increase // The journal is almost a blockchain, because object versions constantly increase
#define JE_MIN 0x01
#define JE_START 0x01 #define JE_START 0x01
#define JE_SMALL_WRITE 0x02 #define JE_SMALL_WRITE 0x02
#define JE_BIG_WRITE 0x03 #define JE_BIG_WRITE 0x03
#define JE_STABLE 0x04 #define JE_STABLE 0x04
#define JE_DELETE 0x05 #define JE_DELETE 0x05
#define JE_ROLLBACK 0x06 #define JE_ROLLBACK 0x06
#define JE_SMALL_WRITE_INSTANT 0x07
#define JE_BIG_WRITE_INSTANT 0x08
#define JE_MAX 0x08
// crc32c comes first to ease calculation and is equal to crc32() // crc32c comes first to ease calculation and is equal to crc32()
struct __attribute__((__packed__)) journal_entry_start struct __attribute__((__packed__)) journal_entry_start
@ -39,9 +28,7 @@ struct __attribute__((__packed__)) journal_entry_start
uint32_t size; uint32_t size;
uint32_t reserved; uint32_t reserved;
uint64_t journal_start; uint64_t journal_start;
uint64_t version;
}; };
#define JE_START_LEGACY_SIZE 24
struct __attribute__((__packed__)) journal_entry_small_write struct __attribute__((__packed__)) journal_entry_small_write
{ {
@ -58,9 +45,6 @@ struct __attribute__((__packed__)) journal_entry_small_write
// data_offset is its offset within journal // data_offset is its offset within journal
uint64_t data_offset; uint64_t data_offset;
uint32_t crc32_data; uint32_t crc32_data;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
}; };
struct __attribute__((__packed__)) journal_entry_big_write struct __attribute__((__packed__)) journal_entry_big_write
@ -75,9 +59,6 @@ struct __attribute__((__packed__)) journal_entry_big_write
uint32_t offset; uint32_t offset;
uint32_t len; uint32_t len;
uint64_t location; uint64_t location;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
}; };
struct __attribute__((__packed__)) journal_entry_stable struct __attribute__((__packed__)) journal_entry_stable
@ -143,52 +124,29 @@ inline uint32_t je_crc32(journal_entry *je)
struct journal_sector_info_t struct journal_sector_info_t
{ {
uint64_t offset; uint64_t offset;
uint64_t flush_count; uint64_t usage_count;
bool written;
bool dirty; bool dirty;
uint64_t submit_id;
}; };
struct pending_journaling_t
{
uint64_t flush_id;
int sector;
blockstore_op_t *op;
};
inline bool operator < (const pending_journaling_t & a, const pending_journaling_t & b)
{
return a.flush_id < b.flush_id || a.flush_id == b.flush_id && a.op < b.op;
}
struct journal_t struct journal_t
{ {
int fd; int fd, fd_index;
uint64_t device_size; uint64_t device_size;
bool inmemory = false; bool inmemory = false;
bool flush_journal = false;
void *buffer = NULL; void *buffer = NULL;
uint64_t block_size; uint64_t block_size = 512;
uint64_t offset, len; uint64_t offset, len;
// Next free block offset
uint64_t next_free = 0; uint64_t next_free = 0;
// First occupied block offset
uint64_t used_start = 0; uint64_t used_start = 0;
// End of the last block not used for writing anymore
uint64_t dirty_start = 0;
uint32_t crc32_last = 0; uint32_t crc32_last = 0;
// Current sector(s) used for writing // Current sector(s) used for writing
void *sector_buf = NULL; void *sector_buf = NULL;
journal_sector_info_t *sector_info = NULL; journal_sector_info_t *sector_info = NULL;
uint64_t sector_count; uint64_t sector_count;
bool no_same_sector_overwrites = false;
int cur_sector = 0; int cur_sector = 0;
int in_sector_pos = 0; int in_sector_pos = 0;
std::vector<int> submitting_sectors;
std::set<pending_journaling_t> flushing_ops;
uint64_t submit_id = 0;
// Used sector map // Used sector map
// May use ~ 80 MB per 1 GB of used journal space in the worst case // May use ~ 80 MB per 1 GB of used journal space in the worst case
@ -196,20 +154,13 @@ struct journal_t
~journal_t(); ~journal_t();
bool trim(); bool trim();
uint64_t get_trim_pos();
void dump_diagnostics();
inline bool entry_fits(int size)
{
return !(block_size - in_sector_pos < size ||
no_same_sector_overwrites && sector_info[cur_sector].written);
}
}; };
struct blockstore_journal_check_t struct blockstore_journal_check_t
{ {
blockstore_impl_t *bs; blockstore_impl_t *bs;
uint64_t next_pos, next_sector, next_in_pos; uint64_t next_pos, next_sector, next_in_pos;
int sectors_to_write, first_sector; int sectors_required;
bool right_dir; // writing to the end or the beginning of the ring buffer bool right_dir; // writing to the end or the beginning of the ring buffer
blockstore_journal_check_t(blockstore_impl_t *bs); blockstore_journal_check_t(blockstore_impl_t *bs);
@ -217,3 +168,5 @@ struct blockstore_journal_check_t
}; };
journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size); journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size);
void prepare_journal_sector_write(journal_t & journal, int sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb);
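
Note on the header comment "Journal entries are linked to each other by their crc32 value": each entry stores its own checksum plus the checksum of the entry before it, and replay stops at the first entry whose checksum or back-link does not match. The sketch below illustrates that idea on a toy entry layout. `ToyEntry`, `toy_entry_crc`, `make_toy_entry` and `valid_prefix` are hypothetical, the field layout is not the on-disk format, and the bitwise CRC-32C here is a stand-in for the repository's `crc32c()` from `crc32c.h`, whose exact conventions are not shown in this diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
uint32_t crc32c_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Toy journal entry: NOT the real layout, just enough to show the chain.
struct ToyEntry
{
    uint32_t crc32;      // checksum of the fields below
    uint32_t crc32_prev; // checksum of the preceding entry
    uint64_t payload;
};

// Checksum over crc32_prev and payload, serialized without padding.
uint32_t toy_entry_crc(const ToyEntry & e)
{
    uint8_t raw[sizeof(e.crc32_prev) + sizeof(e.payload)];
    std::memcpy(raw, &e.crc32_prev, sizeof(e.crc32_prev));
    std::memcpy(raw + sizeof(e.crc32_prev), &e.payload, sizeof(e.payload));
    return crc32c_bitwise(raw, sizeof(raw));
}

ToyEntry make_toy_entry(uint32_t crc32_prev, uint64_t payload)
{
    ToyEntry e = { 0, crc32_prev, payload };
    e.crc32 = toy_entry_crc(e);
    return e;
}

// Number of leading entries that form a valid chain. As in
// handle_journal_part, the back-link is only checked once replay has started.
size_t valid_prefix(const std::vector<ToyEntry> & log)
{
    uint32_t crc32_last = 0;
    bool started = false;
    size_t n = 0;
    for (const ToyEntry & e : log)
    {
        if (toy_entry_crc(e) != e.crc32 || (started && e.crc32_prev != crc32_last))
            break;
        started = true;
        crc32_last = e.crc32;
        n++;
    }
    return n;
}
```

Because each checksum covers the previous one, a stale entry left over from an earlier pass over the ring fails the back-link check even when its own checksum is intact, which is what lets replay find the end of the journal.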

View File

@ -1,7 +1,3 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <sys/file.h>
#include "blockstore_impl.h" #include "blockstore_impl.h"
static uint32_t is_power_of_two(uint64_t value) static uint32_t is_power_of_two(uint64_t value)
@ -38,28 +34,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{ {
disable_journal_fsync = true; disable_journal_fsync = true;
} }
if (config["disable_device_lock"] == "true" || config["disable_device_lock"] == "1" || config["disable_device_lock"] == "yes")
{
disable_flock = true;
}
if (config["flush_journal"] == "true" || config["flush_journal"] == "1" || config["flush_journal"] == "yes")
{
// Only flush journal and exit
journal.flush_journal = true;
}
if (config["immediate_commit"] == "all")
{
immediate_commit = IMMEDIATE_ALL;
}
else if (config["immediate_commit"] == "small")
{
immediate_commit = IMMEDIATE_SMALL;
}
metadata_buf_size = strtoull(config["meta_buf_size"].c_str(), NULL, 10); metadata_buf_size = strtoull(config["meta_buf_size"].c_str(), NULL, 10);
cfg_journal_size = strtoull(config["journal_size"].c_str(), NULL, 10); cfg_journal_size = strtoull(config["journal_size"].c_str(), NULL, 10);
data_device = config["data_device"]; data_device = config["data_device"];
data_offset = strtoull(config["data_offset"].c_str(), NULL, 10); data_offset = strtoull(config["data_offset"].c_str(), NULL, 10);
cfg_data_size = strtoull(config["data_size"].c_str(), NULL, 10);
meta_device = config["meta_device"]; meta_device = config["meta_device"];
meta_offset = strtoull(config["meta_offset"].c_str(), NULL, 10); meta_offset = strtoull(config["meta_offset"].c_str(), NULL, 10);
block_size = strtoull(config["block_size"].c_str(), NULL, 10); block_size = strtoull(config["block_size"].c_str(), NULL, 10);
@ -67,23 +45,12 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
journal_device = config["journal_device"]; journal_device = config["journal_device"];
journal.offset = strtoull(config["journal_offset"].c_str(), NULL, 10); journal.offset = strtoull(config["journal_offset"].c_str(), NULL, 10);
journal.sector_count = strtoull(config["journal_sector_buffer_count"].c_str(), NULL, 10); journal.sector_count = strtoull(config["journal_sector_buffer_count"].c_str(), NULL, 10);
journal.no_same_sector_overwrites = config["journal_no_same_sector_overwrites"] == "true" ||
config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes";
journal.inmemory = config["inmemory_journal"] != "false"; journal.inmemory = config["inmemory_journal"] != "false";
disk_alignment = strtoull(config["disk_alignment"].c_str(), NULL, 10); disk_alignment = strtoull(config["disk_alignment"].c_str(), NULL, 10);
journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10); journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10); meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10); bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
max_flusher_count = strtoull(config["max_flusher_count"].c_str(), NULL, 10); flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
if (!max_flusher_count)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
// Validate // Validate
if (!block_size) if (!block_size)
{ {
@ -93,29 +60,21 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{ {
throw std::runtime_error("Bad block size"); throw std::runtime_error("Bad block size");
} }
if (!max_flusher_count) if (!flusher_count)
{ {
max_flusher_count = 256; flusher_count = 32;
}
if (!min_flusher_count || journal.flush_journal)
{
min_flusher_count = 1;
}
if (!max_write_iodepth)
{
max_write_iodepth = 128;
} }
if (!disk_alignment) if (!disk_alignment)
{ {
disk_alignment = 4096; disk_alignment = 512;
} }
else if (disk_alignment % MEM_ALIGNMENT) else if (disk_alignment % MEM_ALIGNMENT)
{ {
throw std::runtime_error("disk_alignment must be a multiple of "+std::to_string(MEM_ALIGNMENT)); throw std::runtime_error("disk_alingment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
} }
if (!journal_block_size) if (!journal_block_size)
{ {
journal_block_size = 4096; journal_block_size = 512;
} }
else if (journal_block_size % MEM_ALIGNMENT) else if (journal_block_size % MEM_ALIGNMENT)
{ {
@ -123,7 +82,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
} }
if (!meta_block_size) if (!meta_block_size)
{ {
meta_block_size = 4096; meta_block_size = 512;
} }
else if (meta_block_size % MEM_ALIGNMENT) else if (meta_block_size % MEM_ALIGNMENT)
{ {
@ -135,7 +94,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
} }
if (!bitmap_granularity) if (!bitmap_granularity)
{ {
bitmap_granularity = DEFAULT_BITMAP_GRANULARITY; bitmap_granularity = 4096;
} }
else if (bitmap_granularity % disk_alignment) else if (bitmap_granularity % disk_alignment)
{ {
@ -169,41 +128,9 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{ {
metadata_buf_size = 4*1024*1024; metadata_buf_size = 4*1024*1024;
} }
if (meta_device == "")
{
disable_meta_fsync = disable_data_fsync;
}
if (journal_device == "")
{
disable_journal_fsync = disable_meta_fsync;
}
if (immediate_commit != IMMEDIATE_NONE && !disable_journal_fsync)
{
throw std::runtime_error("immediate_commit requires disable_journal_fsync");
}
if (immediate_commit == IMMEDIATE_ALL && !disable_data_fsync)
{
throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
}
if (!throttle_target_iops)
{
throttle_target_iops = 100;
}
if (!throttle_target_mbs)
{
throttle_target_mbs = 100;
}
if (!throttle_target_parallelism)
{
throttle_target_parallelism = 1;
}
if (!throttle_threshold_us)
{
throttle_threshold_us = 50;
}
// init some fields // init some fields
clean_entry_bitmap_size = block_size / bitmap_granularity / 8; clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size; clean_entry_size = sizeof(clean_disk_entry) + clean_entry_bitmap_size;
journal.block_size = journal_block_size; journal.block_size = journal_block_size;
journal.next_free = journal_block_size; journal.next_free = journal_block_size;
journal.used_start = journal_block_size; journal.used_start = journal_block_size;
@ -213,6 +140,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
void blockstore_impl_t::calc_lengths() void blockstore_impl_t::calc_lengths()
{ {
// register fds
data_fd_index = ringloop->register_fd(data_fd);
meta_fd_index = meta_fd == data_fd ? data_fd_index : ringloop->register_fd(meta_fd);
journal.fd_index = journal_fd_index = journal.fd == meta_fd ? meta_fd_index : ringloop->register_fd(journal.fd);
// data // data
data_len = data_size - data_offset; data_len = data_size - data_offset;
if (data_fd == meta_fd && data_offset < meta_offset) if (data_fd == meta_fd && data_offset < meta_offset)
@ -224,15 +155,6 @@ void blockstore_impl_t::calc_lengths()
data_len = data_len < journal.offset-data_offset data_len = data_len < journal.offset-data_offset
? data_len : journal.offset-data_offset; ? data_len : journal.offset-data_offset;
} }
if (cfg_data_size != 0)
{
if (data_len < cfg_data_size)
{
throw std::runtime_error("Data area ("+std::to_string(data_len)+
" bytes) is less than configured size ("+std::to_string(cfg_data_size)+" bytes)");
}
data_len = cfg_data_size;
}
// meta // meta
meta_area = (meta_fd == data_fd ? data_size : meta_size) - meta_offset; meta_area = (meta_fd == data_fd ? data_size : meta_size) - meta_offset;
if (meta_fd == data_fd && meta_offset <= data_offset) if (meta_fd == data_fd && meta_offset <= data_offset)
@ -257,7 +179,7 @@ void blockstore_impl_t::calc_lengths()
} }
// required metadata size // required metadata size
block_count = data_len / block_size; block_count = data_len / block_size;
meta_len = (1 + (block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size; meta_len = ((block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
if (meta_area < meta_len) if (meta_area < meta_len)
{ {
throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes"); throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes");
@ -270,7 +192,7 @@ void blockstore_impl_t::calc_lengths()
} }
else if (clean_entry_bitmap_size) else if (clean_entry_bitmap_size)
{ {
clean_bitmap = (uint8_t*)malloc(block_count * 2*clean_entry_bitmap_size); clean_bitmap = (uint8_t*)malloc(block_count * clean_entry_bitmap_size);
if (!clean_bitmap) if (!clean_bitmap)
throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap"); throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap");
} }
@ -295,9 +217,9 @@ void blockstore_impl_t::calc_lengths()
} }
} }
static void check_size(int fd, uint64_t *size, uint64_t *sectsize, std::string name) void check_size(int fd, uint64_t *size, std::string name)
{ {
int sect; int sectsize;
struct stat st; struct stat st;
if (fstat(fd, &st) < 0) if (fstat(fd, &st) < 0)
{ {
@ -306,21 +228,14 @@ static void check_size(int fd, uint64_t *size, uint64_t *sectsize, std::string n
if (S_ISREG(st.st_mode)) if (S_ISREG(st.st_mode))
{ {
*size = st.st_size; *size = st.st_size;
if (sectsize)
{
*sectsize = st.st_blksize;
}
} }
else if (S_ISBLK(st.st_mode)) else if (S_ISBLK(st.st_mode))
{ {
if (ioctl(fd, BLKGETSIZE64, size) < 0 || if (ioctl(fd, BLKSSZGET, &sectsize) < 0 ||
ioctl(fd, BLKSSZGET, &sect) < 0) ioctl(fd, BLKGETSIZE64, size) < 0 ||
sectsize != 512)
{ {
throw std::runtime_error("failed to get "+name+" size or block size: "+strerror(errno)); throw std::runtime_error(name+" sector is not equal to 512 bytes");
}
if (sectsize)
{
*sectsize = sect;
} }
} }
else else
@ -336,22 +251,11 @@ void blockstore_impl_t::open_data()
{ {
throw std::runtime_error("Failed to open data device"); throw std::runtime_error("Failed to open data device");
} }
check_size(data_fd, &data_size, &data_device_sect, "data device"); check_size(data_fd, &data_size, "data device");
if (disk_alignment % data_device_sect)
{
throw std::runtime_error(
"disk_alignment ("+std::to_string(disk_alignment)+
") is not a multiple of data device sector size ("+std::to_string(data_device_sect)+")"
);
}
if (data_offset >= data_size) if (data_offset >= data_size)
{ {
throw std::runtime_error("data_offset exceeds device size = "+std::to_string(data_size)); throw std::runtime_error("data_offset exceeds device size = "+std::to_string(data_size));
} }
if (!disable_flock && flock(data_fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock data device: ") + strerror(errno));
}
} }
void blockstore_impl_t::open_meta() void blockstore_impl_t::open_meta()
@ -364,33 +268,22 @@ void blockstore_impl_t::open_meta()
{ {
throw std::runtime_error("Failed to open metadata device"); throw std::runtime_error("Failed to open metadata device");
} }
check_size(meta_fd, &meta_size, &meta_device_sect, "metadata device"); check_size(meta_fd, &meta_size, "metadata device");
if (meta_offset >= meta_size) if (meta_offset >= meta_size)
{ {
throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(meta_size)); throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(meta_size));
} }
if (!disable_flock && flock(meta_fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock metadata device: ") + strerror(errno));
}
} }
else else
{ {
meta_fd = data_fd; meta_fd = data_fd;
meta_device_sect = data_device_sect; disable_meta_fsync = disable_data_fsync;
meta_size = 0; meta_size = 0;
if (meta_offset >= data_size) if (meta_offset >= data_size)
{ {
throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(data_size)); throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(data_size));
} }
} }
if (meta_block_size % meta_device_sect)
{
throw std::runtime_error(
"meta_block_size ("+std::to_string(meta_block_size)+
") is not a multiple of data device sector size ("+std::to_string(meta_device_sect)+")"
);
}
} }
void blockstore_impl_t::open_journal() void blockstore_impl_t::open_journal()
@ -402,16 +295,12 @@ void blockstore_impl_t::open_journal()
{ {
throw std::runtime_error("Failed to open journal device"); throw std::runtime_error("Failed to open journal device");
} }
check_size(journal.fd, &journal.device_size, &journal_device_sect, "journal device"); check_size(journal.fd, &journal.device_size, "metadata device");
if (!disable_flock && flock(journal.fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock journal device: ") + strerror(errno));
}
} }
else else
{ {
journal.fd = meta_fd; journal.fd = meta_fd;
journal_device_sect = meta_device_sect; disable_journal_fsync = disable_meta_fsync;
journal.device_size = 0; journal.device_size = 0;
if (journal.offset >= data_size) if (journal.offset >= data_size)
{ {
@ -429,11 +318,4 @@ void blockstore_impl_t::open_journal()
if (!journal.sector_buf) if (!journal.sector_buf)
throw std::bad_alloc(); throw std::bad_alloc();
} }
if (journal_block_size % journal_device_sect)
{
throw std::runtime_error(
"journal_block_size ("+std::to_string(journal_block_size)+
") is not a multiple of journal device sector size ("+std::to_string(journal_device_sect)+")"
);
}
} }
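The metadata sizing arithmetic changed in `parse_config()` and `calc_lengths()` above can be sketched in isolation. This is only an illustration under stated assumptions: `MetaLayout`, `calc_meta_layout`, `entry_header_size` (standing in for `sizeof(clean_disk_entry)`, whose real value is not shown in this diff), `bitmaps_per_entry` and `extra_blocks` are hypothetical names, not identifiers from the repository. The two sides of the diff differ in whether each entry carries one or two bitmaps and whether one extra block is added to `meta_len`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the derived sizes (not a type from the source tree).
struct MetaLayout
{
    uint64_t block_count;       // number of data blocks
    uint64_t bitmap_size;       // bytes of bitmap per block
    uint64_t entry_size;        // bytes per metadata entry
    uint64_t entries_per_block; // entries that fit in one metadata block
    uint64_t meta_len;          // required metadata area in bytes
};

// entry_header_size is an assumed stand-in for sizeof(clean_disk_entry).
// bitmaps_per_entry = 2, extra_blocks = 1 mirrors the left side of the diff;
// bitmaps_per_entry = 1, extra_blocks = 0 mirrors the right side.
MetaLayout calc_meta_layout(uint64_t data_len, uint64_t block_size, uint64_t bitmap_granularity,
    uint64_t meta_block_size, uint64_t entry_header_size, uint64_t bitmaps_per_entry, uint64_t extra_blocks)
{
    assert(block_size && bitmap_granularity && meta_block_size);
    MetaLayout l;
    l.block_count = data_len / block_size;
    // One bit per bitmap_granularity bytes of a block, packed 8 bits per byte
    l.bitmap_size = block_size / bitmap_granularity / 8;
    l.entry_size = entry_header_size + bitmaps_per_entry * l.bitmap_size;
    assert(l.entry_size && l.entry_size <= meta_block_size);
    l.entries_per_block = meta_block_size / l.entry_size;
    // Round up: a partially filled metadata block still occupies a whole block
    uint64_t blocks = (l.block_count + l.entries_per_block - 1) / l.entries_per_block;
    l.meta_len = (extra_blocks + blocks) * meta_block_size;
    return l;
}
```

With a 1 GiB data area, 128 KiB blocks, 4 KiB granularity, 4 KiB metadata blocks and an assumed 24-byte entry header, the left-side variant gives 32-byte entries, 128 entries per block and a `meta_len` of 65 * 4096 = 266240 bytes.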


@ -1,6 +1,3 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_t offset, uint64_t len, int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_t offset, uint64_t len,
@ -11,10 +8,12 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
// Zero-length version - skip // Zero-length version - skip
return 1; return 1;
} }
else if (IS_IN_FLIGHT(item_state)) if (IS_IN_FLIGHT(item_state))
{ {
// Write not finished yet - skip // Pause until it's written somewhere
return 1; PRIV(op)->wait_for = WAIT_IN_FLIGHT;
PRIV(op)->wait_detail = item_version;
return 0;
} }
else if (IS_DELETE(item_state)) else if (IS_DELETE(item_state))
{ {
@ -24,7 +23,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
} }
if (journal.inmemory && IS_JOURNAL(item_state)) if (journal.inmemory && IS_JOURNAL(item_state))
{ {
memcpy(buf, (uint8_t*)journal.buffer + offset, len); memcpy(buf, journal.buffer + offset, len);
return 1; return 1;
} }
BS_SUBMIT_GET_SQE(sqe, data); BS_SUBMIT_GET_SQE(sqe, data);
@ -32,15 +31,15 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
PRIV(op)->pending_ops++; PRIV(op)->pending_ops++;
my_uring_prep_readv( my_uring_prep_readv(
sqe, sqe,
IS_JOURNAL(item_state) ? journal.fd : data_fd, IS_JOURNAL(item_state) ? journal_fd_index : data_fd_index,
&data->iov, 1, &data->iov, 1,
(IS_JOURNAL(item_state) ? journal.offset : data_offset) + offset (IS_JOURNAL(item_state) ? journal.offset : data_offset) + offset
); );
sqe->flags |= IOSQE_FIXED_FILE;
data->callback = [this, op](ring_data_t *data) { handle_read_event(data, op); }; data->callback = [this, op](ring_data_t *data) { handle_read_event(data, op); };
return 1; return 1;
} }
// FIXME I've seen a bug here so I want some tests
int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end, int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
uint32_t item_state, uint64_t item_version, uint64_t item_location) uint32_t item_state, uint64_t item_version, uint64_t item_location)
{ {
@ -53,20 +52,8 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
while (1) while (1)
{ {
for (; it != PRIV(read_op)->read_vec.end(); it++) for (; it != PRIV(read_op)->read_vec.end(); it++)
{
if (it->offset >= cur_start) if (it->offset >= cur_start)
{
break; break;
}
else if (it->offset + it->len > cur_start)
{
cur_start = it->offset + it->len;
if (cur_start >= item_end)
{
goto endwhile;
}
}
}
if (it == PRIV(read_op)->read_vec.end() || it->offset > cur_start) if (it == PRIV(read_op)->read_vec.end() || it->offset > cur_start)
{ {
fulfill_read_t el = { fulfill_read_t el = {
@ -75,7 +62,7 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
}; };
it = PRIV(read_op)->read_vec.insert(it, el); it = PRIV(read_op)->read_vec.insert(it, el);
if (!fulfill_read_push(read_op, if (!fulfill_read_push(read_op,
(uint8_t*)read_op->buf + el.offset - read_op->offset, read_op->buf + el.offset - read_op->offset,
item_location + el.offset - item_start, item_location + el.offset - item_start,
el.len, item_state, item_version)) el.len, item_state, item_version))
{ {
@ -85,33 +72,14 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
} }
cur_start = it->offset + it->len; cur_start = it->offset + it->len;
if (it == PRIV(read_op)->read_vec.end() || cur_start >= item_end) if (it == PRIV(read_op)->read_vec.end() || cur_start >= item_end)
{
break; break;
}
} }
} }
endwhile:
return 1; return 1;
} }
uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offset)
{
uint8_t *clean_entry_bitmap;
uint64_t meta_loc = block_loc >> block_order;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = ((uint8_t*)metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
}
else
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*clean_entry_bitmap_size + offset);
return clean_entry_bitmap;
}
int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op) int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{ {
auto & clean_db = clean_db_shard(read_op->oid);
auto clean_it = clean_db.find(read_op->oid); auto clean_it = clean_db.find(read_op->oid);
auto dirty_it = dirty_db.upper_bound((obj_ver_id){ auto dirty_it = dirty_db.upper_bound((obj_ver_id){
.oid = read_op->oid, .oid = read_op->oid,
@ -128,7 +96,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
read_op->version = 0; read_op->version = 0;
read_op->retval = read_op->len; read_op->retval = read_op->len;
FINISH_OP(read_op); FINISH_OP(read_op);
return 2; return 1;
} }
uint64_t fulfilled = 0; uint64_t fulfilled = 0;
PRIV(read_op)->pending_ops = 0; PRIV(read_op)->pending_ops = 0;
@ -150,11 +118,6 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (!result_version) if (!result_version)
{ {
result_version = dirty_it->first.version; result_version = dirty_it->first.version;
if (read_op->bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
}
} }
if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len, if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset))) dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset)))
@ -171,61 +134,63 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
dirty_it--; dirty_it--;
} }
} }
if (clean_it != clean_db.end()) if (clean_it != clean_db.end() && fulfilled < read_op->len)
{ {
if (!result_version) if (!result_version)
{ {
result_version = clean_it->second.version; result_version = clean_it->second.version;
if (read_op->bitmap) }
if (!clean_entry_bitmap_size)
{
if (!fulfill_read(read_op, fulfilled, 0, block_size, ST_CURRENT, 0, clean_it->second.location))
{ {
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size); // need to wait. undo added requests, don't dequeue op
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size); PRIV(read_op)->read_vec.clear();
return 0;
} }
} }
if (fulfilled < read_op->len) else
{ {
if (!clean_entry_bitmap_size) uint64_t meta_loc = clean_it->second.location >> block_order;
uint8_t *clean_entry_bitmap;
if (inmemory_meta)
{ {
if (!fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0, clean_it->second.location)) uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
{ uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
// need to wait. undo added requests, don't dequeue op clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry));
PRIV(read_op)->read_vec.clear();
return 0;
}
} }
else else
{ {
uint8_t *clean_entry_bitmap = get_clean_entry_bitmap(clean_it->second.location, 0); clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*clean_entry_bitmap_size);
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity; }
while (bmp_start < bmp_size) uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity;
while (bmp_start < bmp_size)
{
while (!(clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))) && bmp_end < bmp_size)
{ {
while (!(clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))) && bmp_end < bmp_size) bmp_end++;
}
if (bmp_end > bmp_start)
{
// fill with zeroes
fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, ST_DEL_STABLE, 0, 0);
}
bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
if (!fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, ST_CURRENT, 0, clean_it->second.location + bmp_start * bitmap_granularity))
{ {
bmp_end++; // need to wait. undo added requests, don't dequeue op
} PRIV(read_op)->read_vec.clear();
if (bmp_end > bmp_start) return 0;
{
// fill with zeroes
assert(fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
} }
bmp_start = bmp_end; bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
if (!fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0,
clean_it->second.location + bmp_start * bitmap_granularity))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
}
bmp_start = bmp_end;
}
} }
} }
} }
@ -233,7 +198,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
else if (fulfilled < read_op->len) else if (fulfilled < read_op->len)
{ {
// fill remaining parts with zeroes // fill remaining parts with zeroes
assert(fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0)); fulfill_read(read_op, fulfilled, 0, block_size, ST_DEL_STABLE, 0, 0);
} }
assert(fulfilled == read_op->len); assert(fulfilled == read_op->len);
read_op->version = result_version; read_op->version = result_version;
@ -247,10 +212,10 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
} }
read_op->retval = read_op->len; read_op->retval = read_op->len;
FINISH_OP(read_op); FINISH_OP(read_op);
return 2; return 1;
} }
read_op->retval = 0; read_op->retval = 0;
return 2; return 1;
} }
void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op) void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op)
@ -269,51 +234,3 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
FINISH_OP(op); FINISH_OP(op);
} }
} }
int blockstore_impl_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
.oid = oid,
.version = UINT64_MAX,
});
if (dirty_it != dirty_db.begin())
dirty_it--;
if (dirty_it != dirty_db.end())
{
while (dirty_it->first.oid == oid)
{
if (target_version >= dirty_it->first.version)
{
if (result_version)
*result_version = dirty_it->first.version;
if (bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (dirty_it == dirty_db.begin())
break;
dirty_it--;
}
}
auto & clean_db = clean_db_shard(oid);
auto clean_it = clean_db.find(oid);
if (clean_it != clean_db.end())
{
if (result_version)
*result_version = clean_it->second.version;
if (bitmap)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (result_version)
*result_version = 0;
if (bitmap)
memset(bitmap, 0, clean_entry_bitmap_size);
return -ENOENT;
}
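The bitmap walk in `dequeue_read()` above alternates between runs of clear bits (filled with zeroes) and runs of set bits (read from the data device). A minimal standalone sketch of that run splitting follows. `Run`, `test_bit` and `bitmap_runs` are hypothetical names introduced for illustration. Unlike the loops in the diff, the sketch checks the bound before touching the bitmap, so it never reads a byte past the end.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one run of equal bits: [start, end) in granules.
struct Run
{
    uint64_t start;
    uint64_t end;
    bool present; // true = data is written, false = hole to be zero-filled
};

// Same bit addressing as in the diff: byte index = pos >> 3, bit index = pos & 7
static bool test_bit(const std::vector<uint8_t> &bitmap, uint64_t pos)
{
    return (bitmap[pos >> 3] & (1 << (pos & 0x7))) != 0;
}

// Split the first bit_count bits of the bitmap into maximal runs of equal bits.
std::vector<Run> bitmap_runs(const std::vector<uint8_t> &bitmap, uint64_t bit_count)
{
    assert(bitmap.size() * 8 >= bit_count);
    std::vector<Run> runs;
    uint64_t pos = 0;
    while (pos < bit_count)
    {
        bool present = test_bit(bitmap, pos);
        uint64_t end = pos + 1;
        // Bound is checked first, so the bitmap is never indexed out of range
        while (end < bit_count && test_bit(bitmap, end) == present)
            end++;
        runs.push_back(Run{ pos, end, present });
        pos = end;
    }
    return runs;
}
```

Multiplying `start` and `end` by `bitmap_granularity` would give the byte ranges that the original code passes to `fulfill_read()`.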

187
blockstore_rollback.cpp Normal file

@ -0,0 +1,187 @@
#include "blockstore_impl.h"
int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
{
obj_ver_id* v;
int i, todo = op->len;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// Check that there are some versions greater than v->version (which may be zero),
// check that they're unstable, synced, and not currently written to
auto dirty_it = dirty_db.lower_bound((obj_ver_id){
.oid = v->oid,
.version = UINT64_MAX,
});
if (dirty_it == dirty_db.begin())
{
bad_op:
op->retval = -EINVAL;
FINISH_OP(op);
return 1;
}
else
{
dirty_it--;
if (dirty_it->first.oid != v->oid || dirty_it->first.version < v->version)
{
goto bad_op;
}
while (dirty_it->first.oid == v->oid && dirty_it->first.version > v->version)
{
if (!IS_SYNCED(dirty_it->second.state) ||
IS_STABLE(dirty_it->second.state))
{
goto bad_op;
}
if (dirty_it == dirty_db.begin())
{
break;
}
dirty_it--;
}
}
}
// Check journal space
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, todo, sizeof(journal_entry_rollback), 0))
{
return 0;
}
// There is sufficient space. Get SQEs
struct io_uring_sqe *sqe[space_check.sectors_required];
for (i = 0; i < space_check.sectors_required; i++)
{
BS_SUBMIT_GET_SQE_DECL(sqe[i]);
}
// Prepare and submit journal entries
auto cb = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_rollback) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// FIXME This is here only for the purpose of tracking unstable_writes. Remove if not required
// FIXME ...aaaand this is similar to blockstore_init.cpp - maybe dedup it?
auto dirty_it = dirty_db.lower_bound((obj_ver_id){
.oid = v->oid,
.version = UINT64_MAX,
});
uint64_t max_unstable = 0;
while (dirty_it != dirty_db.begin())
{
dirty_it--;
if (dirty_it->first.oid != v->oid)
break;
else if (dirty_it->first.version <= v->version)
{
if (!IS_STABLE(dirty_it->second.state))
max_unstable = dirty_it->first.version;
break;
}
}
auto unstab_it = unstable_writes.find(v->oid);
if (unstab_it != unstable_writes.end())
{
if (max_unstable == 0)
unstable_writes.erase(unstab_it);
else
unstab_it->second = max_unstable;
}
journal_entry_rollback *je = (journal_entry_rollback*)
prefill_single_journal_entry(journal, JE_ROLLBACK, sizeof(journal_entry_rollback));
journal.sector_info[journal.cur_sector].dirty = false;
je->oid = v->oid;
je->version = v->version;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
return 1;
}
void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t *op)
{
live = true;
if (data->res != data->iov.iov_len)
{
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
);
}
PRIV(op)->pending_ops--;
if (PRIV(op)->pending_ops == 0)
{
// Release used journal sectors
release_journal_sectors(op);
obj_ver_id* v;
int i;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// Erase dirty_db entries
auto rm_end = dirty_db.lower_bound((obj_ver_id){
.oid = v->oid,
.version = UINT64_MAX,
});
rm_end--;
auto rm_start = rm_end;
while (1)
{
if (rm_end->first.oid != v->oid)
break;
else if (rm_end->first.version <= v->version)
break;
rm_start = rm_end;
if (rm_end == dirty_db.begin())
break;
rm_end--;
}
if (rm_end != rm_start)
erase_dirty(rm_start, rm_end, UINT64_MAX);
}
journal.trim();
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
}
}
void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc)
{
auto dirty_it = dirty_end;
while (dirty_it != dirty_start)
{
dirty_it--;
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu\n", dirty_it->second.location >> block_order);
#endif
data_alloc->set(dirty_it->second.location >> block_order, false);
}
#ifdef BLOCKSTORE_DEBUG
printf("remove usage of journal offset %lu by %lu:%lu v%lu\n", dirty_it->second.journal_sector,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
int used = --journal.used_sectors[dirty_it->second.journal_sector];
if (used == 0)
{
journal.used_sectors.erase(dirty_it->second.journal_sector);
}
}
dirty_db.erase(dirty_start, dirty_end);
}
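Both `dequeue_rollback()` and `handle_rollback_event()` above locate the newest version of an object by calling `lower_bound()` with `version = UINT64_MAX` and then stepping backwards. The following sketch shows that lookup pattern on a plain `std::map`. `ObjVer`, `DirtyDb` and `versions_above` are hypothetical simplifications (a single integer `oid`, an `int` payload), and the sketch assumes, like the original appears to, that `UINT64_MAX` is never used as a real version.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Simplified stand-in for obj_ver_id: ordered by oid first, then by version
struct ObjVer
{
    uint64_t oid;
    uint64_t version;
};

inline bool operator < (const ObjVer &a, const ObjVer &b)
{
    return std::tie(a.oid, a.version) < std::tie(b.oid, b.version);
}

using DirtyDb = std::map<ObjVer, int>;

// Collect versions of `oid` strictly greater than `version`, newest first.
std::vector<uint64_t> versions_above(const DirtyDb &db, uint64_t oid, uint64_t version)
{
    std::vector<uint64_t> result;
    // Points just past the last entry of `oid` (assuming no UINT64_MAX version exists)
    auto it = db.lower_bound(ObjVer{ oid, UINT64_MAX });
    while (it != db.begin())
    {
        --it;
        if (it->first.oid != oid || it->first.version <= version)
            break;
        result.push_back(it->first.version);
    }
    return result;
}
```

Checking `it != db.begin()` before every decrement keeps the walk safe even when the map is empty or the object has no entries at all.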

195
blockstore_stable.cpp Normal file

@ -0,0 +1,195 @@
#include "blockstore_impl.h"
// Stabilize small write:
// 1) Copy data from the journal to the data device
// 2) Increase version on the metadata device and sync it
// 3) Advance clean_db entry's version, clear previous journal entries
//
// This makes 1 4K small write+sync look like:
// 512b+4K (journal) + sync + 512b (journal) + sync + 4K (data) [+ sync?] + 512b (metadata) + sync.
// WA = 2.375. It's not the best, SSD FTL-like redirect-write could probably be lower
// even with defragmentation. But it's fixed and it's still better than in Ceph :),
// except for HDD-only clusters, because each write results in 3 seeks.
// Stabilize big write:
// 1) Copy metadata from the journal to the metadata device
// 2) Move dirty_db entry to clean_db and clear previous journal entries
//
// This makes 1 128K big write+sync look like:
// 128K (data) + sync + 512b (journal) + sync + 512b (journal) + sync + 512b (metadata) + sync.
// WA = 1.012. Very good :)
// Stabilize delete:
// 1) Remove metadata entry and sync it
// 2) Remove dirty_db entry and clear previous journal entries
// We have 2 problems here:
// - In the cluster environment, we must store the "tombstones" of deleted objects until
// all replicas (not just a quorum) agree on their deletion. That is, "stabilize" is
// not possible for deletes in degraded placement groups
// - With simple "fixed" metadata tables we can't just clear the metadata entry of the latest
// object version. We must clear all previous entries, too.
// FIXME Fix both problems - probably, by switching from "fixed" metadata tables to "dynamic"
// AND We must do it in batches, for the sake of reduced fsync call count
// AND We must know what we stabilize. Basic workflow is like:
// 1) primary OSD receives sync request
// 2) it submits syncs to blockstore and peers
// 3) after everyone acks sync it acks sync to the client
// 4) after a while it takes its synced object list and sends stabilize requests
// to peers and to its own blockstore, thus freeing the old version
int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
{
obj_ver_id* v;
int i, todo = 0;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
auto dirty_it = dirty_db.find(*v);
if (dirty_it == dirty_db.end())
{
auto clean_it = clean_db.find(v->oid);
if (clean_it == clean_db.end() || clean_it->second.version < v->version)
{
// No such object version
op->retval = -EINVAL;
FINISH_OP(op);
return 1;
}
else
{
// Already stable
}
}
else if (IS_UNSYNCED(dirty_it->second.state))
{
// Object not synced yet. Caller must sync it first
op->retval = -EAGAIN;
FINISH_OP(op);
return 1;
}
else if (!IS_STABLE(dirty_it->second.state))
{
todo++;
}
}
if (!todo)
{
// Already stable
op->retval = 0;
FINISH_OP(op);
return 1;
}
// Check journal space
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, todo, sizeof(journal_entry_stable), 0))
{
return 0;
}
// There is sufficient space. Get SQEs
struct io_uring_sqe *sqe[space_check.sectors_required];
for (i = 0; i < space_check.sectors_required; i++)
{
BS_SUBMIT_GET_SQE_DECL(sqe[i]);
}
// Prepare and submit journal entries
auto cb = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_stable) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
auto unstab_it = unstable_writes.find(v->oid);
if (unstab_it != unstable_writes.end() &&
unstab_it->second <= v->version)
{
unstable_writes.erase(unstab_it);
}
journal_entry_stable *je = (journal_entry_stable*)
prefill_single_journal_entry(journal, JE_STABLE, sizeof(journal_entry_stable));
journal.sector_info[journal.cur_sector].dirty = false;
je->oid = v->oid;
je->version = v->version;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
return 1;
}
void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *op)
{
live = true;
if (data->res != data->iov.iov_len)
{
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
);
}
PRIV(op)->pending_ops--;
if (PRIV(op)->pending_ops == 0)
{
// Release used journal sectors
release_journal_sectors(op);
// Mark dirty_db entries as stable, acknowledge op completion
obj_ver_id* v;
int i;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// Mark all dirty_db entries up to v->version as stable
auto dirty_it = dirty_db.find(*v);
if (dirty_it != dirty_db.end())
{
while (1)
{
if (dirty_it->second.state == ST_J_SYNCED)
{
dirty_it->second.state = ST_J_STABLE;
}
else if (dirty_it->second.state == ST_D_META_SYNCED)
{
dirty_it->second.state = ST_D_STABLE;
}
else if (dirty_it->second.state == ST_DEL_SYNCED)
{
dirty_it->second.state = ST_DEL_STABLE;
}
else if (IS_STABLE(dirty_it->second.state))
{
break;
}
if (dirty_it == dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != v->oid)
{
break;
}
}
#ifdef BLOCKSTORE_DEBUG
printf("enqueue_flush %lu:%lu v%lu\n", v->oid.inode, v->oid.stripe, v->version);
#endif
flusher->enqueue_flush(*v);
}
}
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
}
}
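
The backward walk in handle_stable_event can be easier to follow in isolation. The following is only a minimal sketch under stated assumptions: it uses a plain std::map instead of dirty_db, a reduced hypothetical State enum instead of the ST_* constants, and a hypothetical helper name stabilize(); it is not the blockstore's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical, reduced set of states; the real blockstore has more.
enum class State { JSynced, JStable, DMetaSynced, DStable, DelSynced, DelStable };

// (object id, version); sorts by object first, then version, like obj_ver_id
using Key = std::pair<uint64_t, uint64_t>;

// Promote `version` of `oid` and every older synced version of the same
// object to its stable counterpart. Returns the number of entries changed.
int stabilize(std::map<Key, State> &db, uint64_t oid, uint64_t version)
{
    auto it = db.find({oid, version});
    if (it == db.end())
        return 0;
    int changed = 0;
    while (true)
    {
        if (it->second == State::JSynced) { it->second = State::JStable; changed++; }
        else if (it->second == State::DMetaSynced) { it->second = State::DStable; changed++; }
        else if (it->second == State::DelSynced) { it->second = State::DelStable; changed++; }
        else break; // already stable, so older versions were handled earlier
        if (it == db.begin())
            break; // never decrement begin()
        --it;
        if (it->first.first != oid)
            break; // reached an entry of another object
    }
    return changed;
}
```

Calling stabilize(db, oid, version) on a map holding several versions of one object changes only that object's entries at or below the given version; entries of other objects are left untouched.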

271
blockstore_sync.cpp Normal file

@ -0,0 +1,271 @@
#include "blockstore_impl.h"
#define SYNC_HAS_SMALL 1
#define SYNC_HAS_BIG 2
#define SYNC_DATA_SYNC_SENT 3
#define SYNC_DATA_SYNC_DONE 4
#define SYNC_JOURNAL_WRITE_SENT 5
#define SYNC_JOURNAL_WRITE_DONE 6
#define SYNC_JOURNAL_SYNC_SENT 7
#define SYNC_DONE 8
int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
{
if (PRIV(op)->sync_state == 0)
{
stop_sync_submitted = false;
PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
PRIV(op)->sync_small_writes.swap(unsynced_small_writes);
PRIV(op)->sync_small_checked = 0;
PRIV(op)->sync_big_checked = 0;
unsynced_big_writes.clear();
unsynced_small_writes.clear();
if (PRIV(op)->sync_big_writes.size() > 0)
PRIV(op)->sync_state = SYNC_HAS_BIG;
else if (PRIV(op)->sync_small_writes.size() > 0)
PRIV(op)->sync_state = SYNC_HAS_SMALL;
else
PRIV(op)->sync_state = SYNC_DONE;
// Always add sync to in_progress_syncs because we clear unsynced_big_writes and unsynced_small_writes
PRIV(op)->prev_sync_count = in_progress_syncs.size();
PRIV(op)->in_progress_ptr = in_progress_syncs.insert(in_progress_syncs.end(), op);
}
continue_sync(op);
// Always dequeue because we always add syncs to in_progress_syncs
return 1;
}
int blockstore_impl_t::continue_sync(blockstore_op_t *op)
{
auto cb = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
if (PRIV(op)->sync_state == SYNC_HAS_SMALL)
{
// No big writes, just fsync the journal
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
if (journal.sector_info[journal.cur_sector].dirty)
{
// Write out the last journal sector if it happens to be dirty
BS_SUBMIT_GET_ONLY_SQE(sqe);
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_SENT;
return 1;
}
else
{
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_DONE;
}
}
if (PRIV(op)->sync_state == SYNC_HAS_BIG)
{
for (; PRIV(op)->sync_big_checked < PRIV(op)->sync_big_writes.size(); PRIV(op)->sync_big_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_big_writes[PRIV(op)->sync_big_checked]].state))
{
// Wait for big inflight writes to complete
return 0;
}
}
// 1st step: fsync data
if (!disable_data_fsync)
{
BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, data_fd_index, IORING_FSYNC_DATASYNC);
sqe->flags |= IOSQE_FIXED_FILE;
data->iov = { 0 };
data->callback = cb;
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->sync_state = SYNC_DATA_SYNC_SENT;
return 1;
}
else
{
PRIV(op)->sync_state = SYNC_DATA_SYNC_DONE;
}
}
if (PRIV(op)->sync_state == SYNC_DATA_SYNC_DONE)
{
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
// 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), 0))
{
return 0;
}
// Get SQEs. Don't bother about merging, submit each journal sector as a separate request
struct io_uring_sqe *sqe[space_check.sectors_required];
for (int i = 0; i < space_check.sectors_required; i++)
{
BS_SUBMIT_GET_SQE_DECL(sqe[i]);
}
// Prepare and submit journal entries
auto it = PRIV(op)->sync_big_writes.begin();
int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_big_write) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
while (it != PRIV(op)->sync_big_writes.end())
{
journal_entry_big_write *je = (journal_entry_big_write*)
prefill_single_journal_entry(journal, JE_BIG_WRITE, sizeof(journal_entry_big_write));
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.sector_info[journal.cur_sector].dirty = false;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version);
#endif
je->oid = it->oid;
je->version = it->version;
je->offset = dirty_db[*it].offset;
je->len = dirty_db[*it].len;
je->location = dirty_db[*it].location;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
it++;
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_SENT;
return 1;
}
if (PRIV(op)->sync_state == SYNC_JOURNAL_WRITE_DONE)
{
if (!disable_journal_fsync)
{
BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, journal_fd_index, IORING_FSYNC_DATASYNC);
sqe->flags |= IOSQE_FIXED_FILE;
data->iov = { 0 };
data->callback = cb;
PRIV(op)->pending_ops = 1;
PRIV(op)->sync_state = SYNC_JOURNAL_SYNC_SENT;
return 1;
}
else
{
PRIV(op)->sync_state = SYNC_DONE;
}
}
if (PRIV(op)->sync_state == SYNC_DONE)
{
ack_sync(op);
}
return 1;
}
void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op)
{
live = true;
if (data->res != data->iov.iov_len)
{
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
);
}
PRIV(op)->pending_ops--;
if (PRIV(op)->pending_ops == 0)
{
// Release used journal sectors
release_journal_sectors(op);
// Handle states
if (PRIV(op)->sync_state == SYNC_DATA_SYNC_SENT)
{
PRIV(op)->sync_state = SYNC_DATA_SYNC_DONE;
}
else if (PRIV(op)->sync_state == SYNC_JOURNAL_WRITE_SENT)
{
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_DONE;
}
else if (PRIV(op)->sync_state == SYNC_JOURNAL_SYNC_SENT)
{
PRIV(op)->sync_state = SYNC_DONE;
ack_sync(op);
}
else
{
throw std::runtime_error("BUG: unexpected sync op state");
}
}
}
int blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
if (PRIV(op)->sync_state == SYNC_DONE && PRIV(op)->prev_sync_count == 0)
{
// Remove dependency of subsequent syncs
auto it = PRIV(op)->in_progress_ptr;
int done_syncs = 1;
++it;
// Acknowledge sync
ack_one_sync(op);
while (it != in_progress_syncs.end())
{
auto & next_sync = *it++;
PRIV(next_sync)->prev_sync_count -= done_syncs;
if (PRIV(next_sync)->prev_sync_count == 0 && PRIV(next_sync)->sync_state == SYNC_DONE)
{
done_syncs++;
// Acknowledge next_sync
ack_one_sync(next_sync);
}
}
return 1;
}
return 0;
}
void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
{
// Handle states
for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync big %lu:%lu v%lu\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;
dirty_db[*it].state = ST_D_META_SYNCED;
}
for (auto it = PRIV(op)->sync_small_writes.begin(); it != PRIV(op)->sync_small_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync small %lu:%lu v%lu\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;
dirty_db[*it].state = dirty_db[*it].state == ST_DEL_WRITTEN ? ST_DEL_SYNCED : ST_J_SYNCED;
}
in_progress_syncs.erase(PRIV(op)->in_progress_ptr);
op->retval = 0;
FINISH_OP(op);
}
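
The ordering rule in ack_sync (a sync is acknowledged only after all syncs queued before it) can be sketched on its own. This is a simplified model under assumptions: SyncOp is a hypothetical stand-in for blockstore_op_t with only the fields needed here, and the function returns the acknowledged ids instead of calling FINISH_OP.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Hypothetical stand-in for a sync operation.
struct SyncOp
{
    int id;
    bool done;           // corresponds to sync_state == SYNC_DONE
    int prev_sync_count; // number of earlier syncs still in progress
};

// Try to acknowledge the sync at `pos`. If it is finished and has no
// predecessors, acknowledge it and every directly unblocked finished
// successor. Returns acknowledged ids in acknowledgement order.
std::vector<int> ack_sync(std::list<SyncOp> &in_progress, std::list<SyncOp>::iterator pos)
{
    std::vector<int> acked;
    if (!pos->done || pos->prev_sync_count != 0)
        return acked; // still has to wait
    int done_syncs = 1;
    auto it = std::next(pos);
    acked.push_back(pos->id);
    in_progress.erase(pos);
    while (it != in_progress.end())
    {
        auto cur = it++; // advance first, because `cur` may be erased
        cur->prev_sync_count -= done_syncs;
        if (cur->prev_sync_count == 0 && cur->done)
        {
            done_syncs++;
            acked.push_back(cur->id);
            in_progress.erase(cur);
        }
    }
    return acked;
}
```

Using std::list keeps iterators to the remaining elements valid across erase(), which is the property the original relies on when it stores in_progress_ptr inside each operation.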

330
blockstore_write.cpp Normal file

@ -0,0 +1,330 @@
#include "blockstore_impl.h"
bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{
// Check or assign version number
bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
uint64_t version = 1;
if (dirty_db.size() > 0)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
.oid = op->oid,
.version = UINT64_MAX,
});
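// upper_bound() returns begin() when every entry sorts after op->oid;
// decrementing begin() is undefined, so step back only when it is safe
if (dirty_it != dirty_db.begin())
{
dirty_it--;
if (dirty_it->first.oid == op->oid)
{
found = true;
version = dirty_it->first.version + 1;
deleted = IS_DELETE(dirty_it->second.state);
}
}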
}
if (!found)
{
auto clean_it = clean_db.find(op->oid);
if (clean_it != clean_db.end())
{
version = clean_it->second.version + 1;
}
else
{
deleted = true;
}
}
if (op->version == 0)
{
op->version = version;
}
else if (op->version < version)
{
// Invalid version requested
op->retval = -EINVAL;
return false;
}
if (deleted && is_del)
{
// Already deleted
op->retval = 0;
return false;
}
// Immediately add the operation into dirty_db, so subsequent reads could see it
#ifdef BLOCKSTORE_DEBUG
printf("%s %lu:%lu v%lu\n", is_del ? "Delete" : "Write", op->oid.inode, op->oid.stripe, op->version);
#endif
dirty_db.emplace((obj_ver_id){
.oid = op->oid,
.version = op->version,
}, (dirty_entry){
.state = (uint32_t)(
is_del
? ST_DEL_IN_FLIGHT
: (op->len == block_size || deleted ? ST_D_IN_FLIGHT : ST_J_IN_FLIGHT)
),
.flags = 0,
.location = 0,
.offset = is_del ? 0 : op->offset,
.len = is_del ? 0 : op->len,
.journal_sector = 0,
});
return true;
}
// First step of the write algorithm: dequeue operation and submit initial write(s)
int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
if (dirty_it->second.state == ST_D_IN_FLIGHT)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
// Big (redirect) write
uint64_t loc = data_alloc->find_free();
if (loc == UINT64_MAX)
{
// no space
if (flusher->is_active())
{
// hope that some space will be available after flush
PRIV(op)->wait_for = WAIT_FREE;
return 0;
}
op->retval = -ENOSPC;
FINISH_OP(op);
return 1;
}
BS_SUBMIT_GET_SQE(sqe, data);
dirty_it->second.location = loc << block_order;
dirty_it->second.state = ST_D_SUBMITTED;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block %lu\n", loc);
#endif
data_alloc->set(loc, true);
uint64_t stripe_offset = (op->offset % bitmap_granularity);
uint64_t stripe_end = (op->offset + op->len) % bitmap_granularity;
// Zero fill up to bitmap_granularity
int vcnt = 0;
if (stripe_offset)
{
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_offset };
}
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ op->buf, op->len };
if (stripe_end)
{
stripe_end = bitmap_granularity - stripe_end;
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_end };
}
data->iov.iov_len = op->len + stripe_offset + stripe_end; // to check it in the callback
data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
my_uring_prep_writev(
sqe, data_fd_index, PRIV(op)->iov_zerofill, vcnt, data_offset + (loc << block_order) + op->offset - stripe_offset
);
sqe->flags |= IOSQE_FIXED_FILE;
PRIV(op)->pending_ops = 1;
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
else
{
// Small (journaled) write
// First check if the journal has sufficient space
blockstore_journal_check_t space_check(this);
if ((unsynced_big_writes.size() && !space_check.check_available(op, unsynced_big_writes.size(), sizeof(journal_entry_big_write), 0))
|| !space_check.check_available(op, 1, sizeof(journal_entry_small_write), op->len + JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
// There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
journal.sector_info[journal.cur_sector].dirty)
{
// Write current journal sector only if it's dirty and full
BS_SUBMIT_GET_SQE_DECL(sqe1);
}
struct io_uring_sqe *sqe2 = NULL;
if (op->len > 0)
{
BS_SUBMIT_GET_SQE_DECL(sqe2);
}
// Got SQEs. Prepare previous journal sector write if required
auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
if (sqe1)
{
prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
// FIXME rename to min/max _flushing
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
else
{
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
}
// Then pre-fill journal entry
journal_entry_small_write *je = (journal_entry_small_write*)
prefill_single_journal_entry(journal, JE_SMALL_WRITE, sizeof(journal_entry_small_write));
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
// Figure out where data will be
journal.next_free = (journal.next_free + op->len) <= journal.len ? journal.next_free : journal_block_size;
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->data_offset = journal.next_free;
je->crc32_data = crc32c(0, op->buf, op->len);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (op->len > 0)
{
// Prepare journal data write
if (journal.inmemory)
{
// Copy data
memcpy(journal.buffer + journal.next_free, op->buf, op->len);
}
ring_data_t *data2 = ((ring_data_t*)sqe2->user_data);
data2->iov = (struct iovec){ op->buf, op->len };
data2->callback = cb;
my_uring_prep_writev(
sqe2, journal_fd_index, &data2->iov, 1, journal.offset + journal.next_free
);
sqe2->flags |= IOSQE_FIXED_FILE;
PRIV(op)->pending_ops++;
}
else
{
// Zero-length overwrite. Allowed to bump object version in EC placement groups without actually writing data
}
dirty_it->second.location = journal.next_free;
dirty_it->second.state = ST_J_SUBMITTED;
journal.next_free += op->len;
if (journal.next_free >= journal.len)
{
journal.next_free = journal_block_size;
}
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
if (!PRIV(op)->pending_ops)
{
ack_write(op);
}
}
return 1;
}
void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *op)
{
live = true;
if (data->res != data->iov.iov_len)
{
// FIXME: our state becomes corrupted after a write error. maybe do something better than just die
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
);
}
PRIV(op)->pending_ops--;
if (PRIV(op)->pending_ops == 0)
{
release_journal_sectors(op);
ack_write(op);
}
}
void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
{
// Release used journal sectors
if (PRIV(op)->min_used_journal_sector > 0 &&
PRIV(op)->max_used_journal_sector > 0)
{
uint64_t s = PRIV(op)->min_used_journal_sector;
while (1)
{
journal.sector_info[s-1].usage_count--;
if (s == PRIV(op)->max_used_journal_sector)
break;
s = 1 + s % journal.sector_count;
}
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
}
}
void blockstore_impl_t::ack_write(blockstore_op_t *op)
{
// Switch object state
auto & dirty_entry = dirty_db[(obj_ver_id){
.oid = op->oid,
.version = op->version,
}];
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lu:%lu v%lu = %d\n", op->oid.inode, op->oid.stripe, op->version, dirty_entry.state);
#endif
if (dirty_entry.state == ST_J_SUBMITTED)
{
dirty_entry.state = ST_J_WRITTEN;
}
else if (dirty_entry.state == ST_D_SUBMITTED)
{
dirty_entry.state = ST_D_WRITTEN;
}
else if (dirty_entry.state == ST_DEL_SUBMITTED)
{
dirty_entry.state = ST_DEL_WRITTEN;
}
// Acknowledge write without sync
op->retval = op->len;
FINISH_OP(op);
}
int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, 1, sizeof(journal_entry_del), 0))
{
return 0;
}
BS_SUBMIT_GET_ONLY_SQE(sqe);
// Prepare journal sector write
journal_entry_del *je = (journal_entry_del*)
prefill_single_journal_entry(journal, JE_DELETE, sizeof(struct journal_entry_del));
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
je->oid = op->oid;
je->version = op->version;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
dirty_it->second.state = ST_DEL_SUBMITTED;
// Remember the delete as unsynced (deletes are tracked together with small writes)
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
return 1;
}
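
The zero-fill arithmetic used by the big-write path of dequeue_write above (padding a write out to bitmap_granularity on both sides) can be checked with a small standalone sketch. The struct Padding and the function zerofill_padding() are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: how many zero bytes go before and after the
// payload, and the total length submitted to the device.
struct Padding
{
    uint64_t head;
    uint64_t tail;
    uint64_t total;
};

// Compute zero padding so that [offset - head, offset + len + tail) is
// aligned to `granularity` on both ends.
Padding zerofill_padding(uint64_t offset, uint64_t len, uint64_t granularity)
{
    uint64_t head = offset % granularity;
    uint64_t tail = (offset + len) % granularity;
    if (tail)
        tail = granularity - tail; // distance to the next boundary
    return Padding{ head, tail, head + len + tail };
}
```

For example, with a granularity of 4096 a 1024-byte write at offset 512 gets 512 zero bytes in front and 2560 behind, so exactly one 4096-byte unit is written; the total is the value the completion callback compares against the result of the write.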

@ -1,13 +0,0 @@
#!/bin/bash
gcc -I. -E -o fio_headers.i src/fio_headers.h
rm -rf fio-copy
for i in `grep -Po 'fio/[^"]+' fio_headers.i | sort | uniq`; do
j=${i##fio/}
p=$(dirname $j)
mkdir -p fio-copy/$p
cp $i fio-copy/$j
done
rm fio_headers.i

@ -1,18 +0,0 @@
#!/bin/bash
#cd qemu
#debian/rules b/configure-stamp
#cd b/qemu; make qapi
gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-I qemu/include -E -o qemu_driver.i src/qemu_driver.c
rm -rf qemu-copy
for i in `grep -Po 'qemu/[^"]+' qemu_driver.i | sort | uniq`; do
j=${i##qemu/}
p=$(dirname $j)
mkdir -p qemu-copy/$p
cp $i qemu-copy/$j
done
rm qemu_driver.i

@ -1 +0,0 @@
Subproject commit 45e6d1f13196a0824e2089a586c53b9de0283f17

@ -8,10 +8,4 @@
// unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
// unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
#ifdef __cplusplus
extern "C" {
#endif
uint32_t crc32c(uint32_t crc, const void *buf, size_t len);
#ifdef __cplusplus
};
#endif

@ -1,2 +0,0 @@
vitastor-csi
Dockerfile

@ -1,32 +0,0 @@
# Compile stage
FROM golang:buster AS build
ADD go.sum go.mod /app/
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
ADD . /app
RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
# Final stage
FROM debian:buster
LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
LABEL description="Vitastor CSI Driver"
ENV NODE_ID=""
ENV CSI_ENDPOINT=""
RUN apt-get update && \
apt-get install -y wget && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
(echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update && \
apt-get install -y e2fsprogs xfsprogs vitastor kmod && \
apt-get clean && \
(echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
COPY --from=build /app/vitastor-csi /bin/
ENTRYPOINT ["/bin/vitastor-csi"]

@ -1,9 +0,0 @@
VERSION ?= v0.6.17
all: build push
build:
@docker build --rm -t vitalif/vitastor-csi:$(VERSION) .
push:
@docker push vitalif/vitastor-csi:$(VERSION)

@ -1,5 +0,0 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: vitastor-system

@ -1,9 +0,0 @@
---
apiVersion: v1
kind: ConfigMap
data:
vitastor.conf: |-
{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
metadata:
namespace: vitastor-system
name: vitastor-config

@ -1,37 +0,0 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get"]
# allow reading the Vault token and connection options from the tenant's namespace
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-csi-nodeplugin
apiGroup: rbac.authorization.k8s.io

@ -1,72 +0,0 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
hostNetwork: true
hostPID: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/run/mount'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
- pathPrefix: '/var/lib/kubelet/pods'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins/csi.vitastor.io'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins_registry'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins'
readOnly: false
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-nodeplugin-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-nodeplugin-psp
apiGroup: rbac.authorization.k8s.io

@ -1,140 +0,0 @@
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor
spec:
selector:
matchLabels:
app: csi-vitastor
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor
spec:
serviceAccountName: vitastor-csi-nodeplugin
hostNetwork: true
hostPID: true
priorityClassName: system-node-critical
# when using e.g. a Rook-orchestrated cluster where the monitors' FQDNs are
# resolved through a k8s service, set the DNS policy to cluster-first
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: driver-registrar
# This is necessary only for systems with SELinux, where
# non-privileged sidecar containers cannot access unix domain socket
# created by privileged CSI driver container.
securityContext:
privileged: true
image: k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.2.0
args:
- "--v=5"
- "--csi-address=/csi/csi.sock"
- "--kubelet-registration-path=/var/lib/kubelet/plugins/csi.vitastor.io/csi.sock"
env:
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: registration-dir
mountPath: /registration
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.6.17
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi.sock
imagePullPolicy: "IfNotPresent"
ports:
- containerPort: 9898
name: healthz
protocol: TCP
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: healthz
initialDelaySeconds: 10
timeoutSeconds: 3
periodSeconds: 2
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /run/mount
name: host-mount
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
- name: plugin-dir
mountPath: /var/lib/kubelet/plugins
mountPropagation: "Bidirectional"
- name: mountpoint-dir
mountPath: /var/lib/kubelet/pods
mountPropagation: "Bidirectional"
- name: liveness-probe
securityContext:
privileged: true
image: quay.io/k8scsi/livenessprobe:v1.1.0
args:
- "--csi-address=$(CSI_ENDPOINT)"
- "--health-port=9898"
env:
- name: CSI_ENDPOINT
value: unix:///csi/csi.sock
volumeMounts:
- mountPath: /csi
name: socket-dir
volumes:
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi.vitastor.io
type: DirectoryOrCreate
- name: plugin-dir
hostPath:
path: /var/lib/kubelet/plugins
type: Directory
- name: mountpoint-dir
hostPath:
path: /var/lib/kubelet/pods
type: DirectoryOrCreate
- name: registration-dir
hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: host-mount
hostPath:
path: /run/mount
- name: lib-modules
hostPath:
path: /lib/modules
- name: vitastor-config
configMap:
name: vitastor-config

@ -1,102 +0,0 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-runner
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["persistentvolumeclaims/status"]
verbs: ["update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshots"]
verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents"]
verbs: ["create", "get", "list", "watch", "update", "delete"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments/status"]
verbs: ["patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["csinodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-role
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-cfg
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "watch", "list", "delete", "update", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-role-cfg
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-external-provisioner-cfg
apiGroup: rbac.authorization.k8s.io

@ -1,60 +0,0 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-provisioner-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-psp
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-provisioner-psp
apiGroup: rbac.authorization.k8s.io

@ -1,159 +0,0 @@
---
kind: Service
apiVersion: v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
labels:
app: csi-metrics
spec:
selector:
app: csi-vitastor-provisioner
ports:
- name: http-metrics
port: 8080
protocol: TCP
targetPort: 8680
---
kind: Deployment
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
spec:
replicas: 3
selector:
matchLabels:
app: csi-vitastor-provisioner
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor-provisioner
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- csi-vitastor-provisioner
topologyKey: "kubernetes.io/hostname"
serviceAccountName: vitastor-csi-provisioner
priorityClassName: system-cluster-critical
containers:
- name: csi-provisioner
image: k8s.gcr.io/sig-storage/csi-provisioner:v2.2.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--retry-interval-start=500ms"
- "--leader-election=true"
# set it to true to use topology based provisioning
- "--feature-gates=Topology=false"
# if fstype is not specified in storageclass, ext4 is default
- "--default-fstype=ext4"
- "--extra-create-metadata=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-snapshotter
image: k8s.gcr.io/sig-storage/csi-snapshotter:v4.0.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
securityContext:
privileged: true
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-attacher
image: k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
args:
- "--v=5"
- "--csi-address=$(ADDRESS)"
- "--leader-election=true"
- "--retry-interval-start=500ms"
env:
- name: ADDRESS
value: /csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-resizer
image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election"
- "--retry-interval-start=500ms"
- "--handle-volume-inuse-error=false"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.6.17
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
volumes:
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: lib-modules
hostPath:
path: /lib/modules
- name: socket-dir
emptyDir: {
medium: "Memory"
}
- name: vitastor-config
configMap:
name: vitastor-config
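
The sidecar containers above reach the driver through the shared `socket-dir` volume, and the socket address is written both as `unix:///csi/csi-provisioner.sock` and as a bare `/csi/csi-provisioner.sock`. A minimal Go sketch of how such values could be normalized before dialing or listening, assuming a hypothetical `parseEndpoint` helper that is not taken from the Vitastor CSI source:

```go
package main

import (
	"fmt"
	"strings"
)

// parseEndpoint is a hypothetical helper (an assumption for illustration).
// It splits a CSI endpoint into a network type and an address and treats
// a bare absolute path as a Unix socket path.
func parseEndpoint(endpoint string) (network, address string, err error) {
	switch {
	case strings.HasPrefix(endpoint, "unix://"):
		return "unix", strings.TrimPrefix(endpoint, "unix://"), nil
	case strings.HasPrefix(endpoint, "tcp://"):
		return "tcp", strings.TrimPrefix(endpoint, "tcp://"), nil
	case strings.HasPrefix(endpoint, "/"):
		return "unix", endpoint, nil
	}
	return "", "", fmt.Errorf("unsupported endpoint: %q", endpoint)
}

func main() {
	// Both spellings used in the manifest resolve to the same socket path.
	for _, ep := range []string{
		"unix:///csi/csi-provisioner.sock",
		"/csi/csi-provisioner.sock",
	} {
		network, address, err := parseEndpoint(ep)
		fmt.Println(network, address, err)
	}
}
```

With this normalization the URL form and the bare-path form are interchangeable, which is one plausible reason the manifest can mix them.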

@ -1,11 +0,0 @@
---
# if Kubernetes version is less than 1.18 change
# apiVersion to storage.k8s.io/v1beta1
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
namespace: vitastor-system
name: csi.vitastor.io
spec:
attachRequired: true
podInfoOnMount: false
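
The version note at the top of this manifest can be expressed as a small selection function. This is only a sketch: `csiDriverAPIVersion` is a hypothetical name, and the 1.18 threshold is taken directly from the comment in the manifest.

```go
package main

import "fmt"

// csiDriverAPIVersion is a hypothetical helper (an assumption for illustration).
// It returns the apiVersion to use for the CSIDriver object, following the
// manifest comment: storage.k8s.io/v1 from Kubernetes 1.18 on, v1beta1 before.
func csiDriverAPIVersion(major, minor int) string {
	if major > 1 || (major == 1 && minor >= 18) {
		return "storage.k8s.io/v1"
	}
	return "storage.k8s.io/v1beta1"
}

func main() {
	fmt.Println(csiDriverAPIVersion(1, 17))
	fmt.Println(csiDriverAPIVersion(1, 18))
}
```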

@ -1,19 +0,0 @@
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
namespace: vitastor-system
name: vitastor
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vitastor.io
volumeBindingMode: Immediate
parameters:
etcdVolumePrefix: ""
poolId: "1"
# you can choose another configuration file if it is present in the config map
#configPath: "/etc/vitastor/vitastor.conf"
# you can also specify etcdUrl here, e.g. to connect to another Vitastor cluster
# multiple etcdUrls may be specified, separated by commas
#etcdUrl: "http://192.168.7.2:2379"
#etcdPrefix: "/vitastor"
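
The comments above state that `etcdUrl` may hold several comma-delimited addresses. A minimal Go sketch of splitting such a value follows. `splitEtcdURLs` is a hypothetical helper, not the driver's actual parsing code, and the second address in the usage example is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEtcdURLs is a hypothetical helper (an assumption for illustration).
// It splits a comma-delimited etcdUrl parameter into individual addresses,
// trimming surrounding whitespace and dropping empty entries.
func splitEtcdURLs(value string) []string {
	var urls []string
	for _, part := range strings.Split(value, ",") {
		part = strings.TrimSpace(part)
		if part != "" {
			urls = append(urls, part)
		}
	}
	return urls
}

func main() {
	// The second address is an invented example value.
	fmt.Println(splitEtcdURLs("http://192.168.7.2:2379, http://192.168.7.3:2379"))
}
```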

@ -1,13 +0,0 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc-block
spec:
storageClassName: vitastor
volumeMode: Block
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi

@ -1,12 +0,0 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc
spec:
storageClassName: vitastor
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi

@ -1,17 +0,0 @@
apiVersion: v1
kind: Pod
metadata:
name: vitastor-test-block-pvc
namespace: default
spec:
containers:
- name: vitastor-test-block-pvc
image: nginx
volumeDevices:
- name: data
devicePath: /dev/xvda
volumes:
- name: data
persistentVolumeClaim:
claimName: test-vitastor-pvc-block
readOnly: false

@ -1,17 +0,0 @@
apiVersion: v1
kind: Pod
metadata:
name: vitastor-test-nginx
namespace: default
spec:
containers:
- name: vitastor-test-nginx
image: nginx
volumeMounts:
- mountPath: /usr/share/nginx/html/s3
name: data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-vitastor-pvc
readOnly: false

@ -1,35 +0,0 @@
module vitastor.io/csi
go 1.15
require (
github.com/container-storage-interface/spec v1.4.0
github.com/coreos/bbolt v0.0.0-00010101000000-000000000000 // indirect
github.com/coreos/etcd v3.3.25+incompatible // indirect
github.com/coreos/go-semver v0.3.0 // indirect
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf // indirect
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f // indirect
github.com/dustin/go-humanize v1.0.0 // indirect
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
github.com/gorilla/websocket v1.4.2 // indirect
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0 // indirect
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 // indirect
github.com/grpc-ecosystem/grpc-gateway v1.16.0 // indirect
github.com/jonboulle/clockwork v0.2.2 // indirect
github.com/kubernetes-csi/csi-lib-utils v0.9.1
github.com/soheilhy/cmux v0.1.5 // indirect
github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802 // indirect
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2 // indirect
go.etcd.io/bbolt v0.0.0-00010101000000-000000000000 // indirect
go.etcd.io/etcd v3.3.25+incompatible
golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb
google.golang.org/grpc v1.33.1
k8s.io/klog v1.0.0
k8s.io/utils v0.0.0-20210305010621-2afb4311ab10
)
replace github.com/coreos/bbolt => go.etcd.io/bbolt v1.3.5
replace go.etcd.io/bbolt => github.com/coreos/bbolt v1.3.5
replace google.golang.org/grpc => google.golang.org/grpc v1.25.1

@ -1,448 +0,0 @@
cloud.google.com/go v0.34.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMTw=
cloud.google.com/go v0.38.0/go.mod h1:990N+gfupTy94rShfmMCWGDn0LpTmnzTp2qbd1dvSRU=
cloud.google.com/go v0.44.1/go.mod h1:iSa0KzasP4Uvy3f1mN/7PiObzGgflwredwwASm/v6AU=
cloud.google.com/go v0.44.2/go.mod h1:60680Gw3Yr4ikxnPRS/oxxkBccT6SA1yMk63TGekxKY=
cloud.google.com/go v0.45.1/go.mod h1:RpBamKRgapWJb87xiFSdk4g1CME7QZg3uwTez+TSTjc=
cloud.google.com/go v0.46.3/go.mod h1:a6bKKbmY7er1mI7TEI4lsAkts/mkhTSZK8w33B4RAg0=
cloud.google.com/go v0.51.0/go.mod h1:hWtGJ6gnXH+KgDv+V0zFGDvpi07n3z8ZNj3T1RW0Gcw=
cloud.google.com/go/bigquery v1.0.1/go.mod h1:i/xbL2UlR5RvWAURpBYZTtm/cXjCha9lbfbpx4poX+o=
cloud.google.com/go/datastore v1.0.0/go.mod h1:LXYbyblFSglQ5pkeyhO+Qmw7ukd3C+pD7TKLgZqpHYE=
cloud.google.com/go/pubsub v1.0.1/go.mod h1:R0Gpsv3s54REJCy4fxDixWD93lHJMoZTyQ2kNxGRt3I=
cloud.google.com/go/storage v1.0.0/go.mod h1:IhtSnM/ZTZV8YYJWCY8RULGVqBDmpoyjwiyrjsg+URw=
dmitri.shuralyov.com/gpu/mtl v0.0.0-20190408044501-666a987793e9/go.mod h1:H6x//7gZCb22OMCxBHrMx7a5I7Hp++hsVxbQ4BYO7hU=
github.com/Azure/go-ansiterm v0.0.0-20170929234023-d6e3b3328b78/go.mod h1:LmzpDX56iTiv29bbRTIsUNlaFfuhWRQBWjQdVyAevI8=
github.com/Azure/go-autorest/autorest v0.9.0/go.mod h1:xyHB1BMZT0cuDHU7I0+g046+BFDTQ8rEZB0s4Yfa6bI=
github.com/Azure/go-autorest/autorest v0.9.6/go.mod h1:/FALq9T/kS7b5J5qsQ+RSTUdAmGFqi0vUdVNNx8q630=
github.com/Azure/go-autorest/autorest/adal v0.5.0/go.mod h1:8Z9fGy2MpX0PvDjB1pEgQTmVqjGhiHBW7RJJEciWzS0=
github.com/Azure/go-autorest/autorest/adal v0.8.2/go.mod h1:ZjhuQClTqx435SRJ2iMlOxPYt3d2C/T/7TiQCVZSn3Q=
github.com/Azure/go-autorest/autorest/date v0.1.0/go.mod h1:plvfp3oPSKwf2DNjlBjWF/7vwR+cUD/ELuzDCXwHUVA=
github.com/Azure/go-autorest/autorest/date v0.2.0/go.mod h1:vcORJHLJEh643/Ioh9+vPmf1Ij9AEBM5FuBIXLmIy0g=
github.com/Azure/go-autorest/autorest/mocks v0.1.0/go.mod h1:OTyCOPRA2IgIlWxVYxBee2F5Gr4kF2zd2J5cFRaIDN0=
github.com/Azure/go-autorest/autorest/mocks v0.2.0/go.mod h1:OTyCOPRA2IgIlWxVYxBee2F5Gr4kF2zd2J5cFRaIDN0=
github.com/Azure/go-autorest/autorest/mocks v0.3.0/go.mod h1:a8FDP3DYzQ4RYfVAxAN3SVSiiO77gL2j2ronKKP0syM=
github.com/Azure/go-autorest/logger v0.1.0/go.mod h1:oExouG+K6PryycPJfVSxi/koC6LSNgds39diKLz7Vrc=
github.com/Azure/go-autorest/tracing v0.5.0/go.mod h1:r/s2XiOKccPW3HrqB+W0TQzfbtp2fGCgRFtBroKn4Dk=
github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
github.com/BurntSushi/xgb v0.0.0-20160522181843-27f122750802/go.mod h1:IVnqGOEym/WlBOVXweHU+Q+/VP0lqqI8lqeDx9IjBqo=
github.com/NYTimes/gziphandler v0.0.0-20170623195520-56545f4a5d46/go.mod h1:3wb06e3pkSAbeQ52E9H9iFoQsEEwGN64994WTCIhntQ=
github.com/PuerkitoBio/purell v1.0.0/go.mod h1:c11w/QuzBsJSee3cPx9rAFu61PvFxuPbtSwDGJws/X0=
github.com/PuerkitoBio/urlesc v0.0.0-20160726150825-5bd2802263f2/go.mod h1:uGdkoq3SwY9Y+13GIhn11/XLaGBb4BfwItxLd5jeuXE=
github.com/alecthomas/template v0.0.0-20160405071501-a0175ee3bccc/go.mod h1:LOuyumcjzFXgccqObfd/Ljyb9UuFJ6TxHnclSeseNhc=
github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751/go.mod h1:LOuyumcjzFXgccqObfd/Ljyb9UuFJ6TxHnclSeseNhc=
github.com/alecthomas/units v0.0.0-20151022065526-2efee857e7cf/go.mod h1:ybxpYRFXyAe+OPACYpWeL0wqObRcbAqCMya13uyzqw0=
github.com/alecthomas/units v0.0.0-20190717042225-c3de453c63f4/go.mod h1:ybxpYRFXyAe+OPACYpWeL0wqObRcbAqCMya13uyzqw0=
github.com/antihax/optional v1.0.0/go.mod h1:uupD/76wgC+ih3iEmQUL+0Ugr19nfwCT1kdvxnR2qWY=
github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973/go.mod h1:Dwedo/Wpr24TaqPxmxbtue+5NUziq4I4S80YR8gNf3Q=
github.com/beorn7/perks v1.0.0/go.mod h1:KWe93zE9D1o94FZ5RNwFwVgaQK1VOXiVxmqh+CedLV8=
github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
github.com/blang/semver v3.5.0+incompatible/go.mod h1:kRBLl5iJ+tD4TcOOxsy/0fnwebNt5EWlYSAyrTnjyyk=
github.com/census-instrumentation/opencensus-proto v0.2.1/go.mod h1:f6KPmirojxKA12rnyqOA5BBL4O983OfeGPqjHWSTneU=
github.com/cespare/xxhash/v2 v2.1.1 h1:6MnRN8NT7+YBpUIWxHtefFZOKTAPgGjpQSxqLNn0+qY=
github.com/cespare/xxhash/v2 v2.1.1/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
github.com/chzyer/logex v1.1.10/go.mod h1:+Ywpsq7O8HXn0nuIou7OrIPyXbp3wmkHB+jjWRnGsAI=
github.com/chzyer/readline v0.0.0-20180603132655-2972be24d48e/go.mod h1:nSuG5e5PlCu98SY8svDHJxuZscDgtXS6KTTbou5AhLI=
github.com/chzyer/test v0.0.0-20180213035817-a1ea475d72b1/go.mod h1:Q3SI9o4m/ZMnBNeIyt5eFwwo7qiLfzFZmjNmxjkiQlU=
github.com/container-storage-interface/spec v1.2.0/go.mod h1:6URME8mwIBbpVyZV93Ce5St17xBiQJQY67NDsuohiy4=
github.com/container-storage-interface/spec v1.4.0 h1:ozAshSKxpJnYUfmkpZCTYyF/4MYeYlhdXbAvPvfGmkg=
github.com/container-storage-interface/spec v1.4.0/go.mod h1:6URME8mwIBbpVyZV93Ce5St17xBiQJQY67NDsuohiy4=
github.com/coreos/bbolt v1.3.5 h1:XFv7xaq7701j8ZSEzR28VohFYSlyakMyqNMU5FQH6Ac=
github.com/coreos/bbolt v1.3.5/go.mod h1:G5EMThwa9y8QZGBClrRx5EY+Yw9kAhnjy3bSjsnlVTQ=
github.com/coreos/etcd v3.3.25+incompatible h1:0GQEw6h3YnuOVdtwygkIfJ+Omx0tZ8/QkVyXI4LkbeY=
github.com/coreos/etcd v3.3.25+incompatible/go.mod h1:uF7uidLiAD3TWHmW31ZFd/JWoc32PjwdhPthX9715RE=
github.com/coreos/go-semver v0.3.0 h1:wkHLiw0WNATZnSG7epLsujiMCgPAc9xhjJ4tgnAxmfM=
github.com/coreos/go-semver v0.3.0/go.mod h1:nnelYz7RCh+5ahJtPPxZlU+153eP4D4r3EedlOD2RNk=
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf h1:iW4rZ826su+pqaw19uhpSCzhj44qo35pNgKFGqzDKkU=
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf/go.mod h1:F5haX7vjVVG0kc13fIWeqUViNPyEJxv/OmvnBo0Yme4=
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f h1:lBNOc5arjvs8E5mO2tbpBpLoyyu8B6e44T7hJy6potg=
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f/go.mod h1:E3G3o1h8I7cfcXa63jLwjI0eiQQMgzzUDFVpN/nH/eA=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/dgrijalva/jwt-go v3.2.0+incompatible h1:7qlOGliEKZXTDg6OTjfoBKDXWrumCAMpl/TFQ4/5kLM=
github.com/dgrijalva/jwt-go v3.2.0+incompatible/go.mod h1:E3ru+11k8xSBh+hMPgOLZmtrrCbhqsmaPHjLKYnJCaQ=
github.com/docker/spdystream v0.0.0-20160310174837-449fdfce4d96/go.mod h1:Qh8CwZgvJUkLughtfhJv5dyTYa91l1fOUCrgjqmcifM=
github.com/docopt/docopt-go v0.0.0-20180111231733-ee0de3bc6815/go.mod h1:WwZ+bS3ebgob9U8Nd0kOddGdZWjyMGR8Wziv+TBNwSE=
github.com/dustin/go-humanize v1.0.0 h1:VSnTsYCnlFHaM2/igO1h6X3HA71jcobQuxemgkq4zYo=
github.com/dustin/go-humanize v1.0.0/go.mod h1:HtrtbFcZ19U5GC7JDqmcUSB87Iq5E25KnS6fMYU6eOk=
github.com/elazarl/goproxy v0.0.0-20180725130230-947c36da3153/go.mod h1:/Zj4wYkgs4iZTTu3o/KG3Itv/qCCa8VVMlb3i9OVuzc=
github.com/emicklei/go-restful v0.0.0-20170410110728-ff4f55a20633/go.mod h1:otzb+WCGbkyDHkqmQmT5YD2WR4BBwUdeQoFo8l/7tVs=
github.com/envoyproxy/go-control-plane v0.9.0/go.mod h1:YTl/9mNaCwkRvm6d1a2C3ymFceY/DCBVvsKhRF0iEA4=
github.com/envoyproxy/protoc-gen-validate v0.1.0/go.mod h1:iSmxcyjqTsJpI2R4NaDN7+kN2VEUnK/pcBlmesArF7c=
github.com/evanphx/json-patch v4.9.0+incompatible/go.mod h1:50XU6AFN0ol/bzJsmQLiYLvXMP4fmwYFNcr97nuDLSk=
github.com/fsnotify/fsnotify v1.4.7/go.mod h1:jwhsz4b93w/PPRr/qN1Yymfu8t87LnFCMoQvtojpjFo=
github.com/fsnotify/fsnotify v1.4.9/go.mod h1:znqG4EE+3YCdAaPaxE2ZRY/06pZUdp0tY4IgpuI1SZQ=
github.com/ghodss/yaml v0.0.0-20150909031657-73d445a93680/go.mod h1:4dBDuWmgqj2HViK6kFavaiC9ZROes6MMH2rRYeMEF04=
github.com/ghodss/yaml v1.0.0/go.mod h1:4dBDuWmgqj2HViK6kFavaiC9ZROes6MMH2rRYeMEF04=
github.com/go-gl/glfw/v3.3/glfw v0.0.0-20191125211704-12ad95a8df72/go.mod h1:tQ2UAYgL5IevRw8kRxooKSPJfGvJ9fJQFa0TUsXzTg8=
github.com/go-kit/kit v0.8.0/go.mod h1:xBxKIO96dXMWWy0MnWVtmwkA9/13aqxPnvrjFYMA2as=
github.com/go-kit/kit v0.9.0/go.mod h1:xBxKIO96dXMWWy0MnWVtmwkA9/13aqxPnvrjFYMA2as=
github.com/go-logfmt/logfmt v0.3.0/go.mod h1:Qt1PoO58o5twSAckw1HlFXLmHsOX5/0LbT9GBnD5lWE=
github.com/go-logfmt/logfmt v0.4.0/go.mod h1:3RMwSq7FuexP4Kalkev3ejPJsZTpXXBr9+V4qmtdjCk=
github.com/go-logr/logr v0.1.0/go.mod h1:ixOQHD9gLJUVQQ2ZOR7zLEifBX6tGkNJF4QyIY7sIas=
github.com/go-logr/logr v0.2.0 h1:QvGt2nLcHH0WK9orKa+ppBPAxREcH364nPUedEpK0TY=
github.com/go-logr/logr v0.2.0/go.mod h1:z6/tIYblkpsD+a4lm/fGIIU9mZ+XfAiaFtq7xTgseGU=
github.com/go-openapi/jsonpointer v0.0.0-20160704185906-46af16f9f7b1/go.mod h1:+35s3my2LFTysnkMfxsJBAMHj/DoqoB9knIWoYG/Vk0=
github.com/go-openapi/jsonreference v0.0.0-20160704190145-13c6e3589ad9/go.mod h1:W3Z9FmVs9qj+KR4zFKmDPGiLdk1D9Rlm7cyMvf57TTg=
github.com/go-openapi/spec v0.0.0-20160808142527-6aced65f8501/go.mod h1:J8+jY1nAiCcj+friV/PDoE1/3eeccG9LYBs0tYvLOWc=
github.com/go-openapi/swag v0.0.0-20160704191624-1d0bd113de87/go.mod h1:DXUve3Dpr1UfpPtxFw+EFuQ41HhCWZfha5jSVRG7C7I=
github.com/go-stack/stack v1.8.0/go.mod h1:v0f6uXyyMGvRgIKkXu+yp6POWl0qKG85gN/melR3HDY=
github.com/gogo/protobuf v1.1.1/go.mod h1:r8qH/GZQm5c6nD/R0oafs1akxWv10x8SbQlK7atdtwQ=
github.com/gogo/protobuf v1.3.1 h1:DqDEcV5aeaTmdFBePNpYsp3FlcVH/2ISVVM9Qf8PSls=
github.com/gogo/protobuf v1.3.1/go.mod h1:SlYgWuQ5SjCEi6WLHjHCa1yvBfUnHcTbrrZtXPKa29o=
github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q=
github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q=
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b h1:VKtxabqXZkF25pY9ekfRL6a582T4P37/31XEstQ5p58=
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b/go.mod h1:SBH7ygxi8pfUlaOkMMuAQtPIUF8ecWP5IEl/CR7VP2Q=
github.com/golang/groupcache v0.0.0-20190702054246-869f871628b6/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
github.com/golang/groupcache v0.0.0-20191227052852-215e87163ea7 h1:5ZkaAPbicIKTF2I64qf5Fh8Aa83Q/dnOafMYV0OMwjA=
github.com/golang/groupcache v0.0.0-20191227052852-215e87163ea7/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
github.com/golang/mock v1.1.1/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A=
github.com/golang/mock v1.2.0/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A=
github.com/golang/mock v1.3.1/go.mod h1:sBzyDLLjw3U8JLTeZvSv8jJB+tU5PVekmnlKIyFUx0Y=
github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
github.com/golang/protobuf v1.3.1/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
github.com/golang/protobuf v1.3.2/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
github.com/golang/protobuf v1.3.3/go.mod h1:vzj43D7+SQXF/4pzW/hwtAqwc6iTitCiVSaWz5lYuqw=
github.com/golang/protobuf v1.4.0-rc.1/go.mod h1:ceaxUfeHdC40wWswd/P6IGgMaK3YpKi5j83Wpe3EHw8=
github.com/golang/protobuf v1.4.0-rc.1.0.20200221234624-67d41d38c208/go.mod h1:xKAWHe0F5eneWXFV3EuXVDTCmh+JuBKY0li0aMyXATA=
github.com/golang/protobuf v1.4.0-rc.2/go.mod h1:LlEzMj4AhA7rCAGe4KMBDvJI+AwstrUpVNzEA03Pprs=
github.com/golang/protobuf v1.4.0-rc.4.0.20200313231945-b860323f09d0/go.mod h1:WU3c8KckQ9AFe+yFwt9sWVRKCVIyN9cPHBJSNnbL67w=
github.com/golang/protobuf v1.4.0/go.mod h1:jodUvKwWbYaEsadDk5Fwe5c77LiNKVO9IDvqG2KuDX0=
github.com/golang/protobuf v1.4.1/go.mod h1:U8fpvMrcmy5pZrNK1lt4xCsGvpyWQ/VVv6QDs8UjoX8=
github.com/golang/protobuf v1.4.2 h1:+Z5KGCizgyZCbGh1KZqA0fcLLkwbsjIzS4aV2v7wJX0=
github.com/golang/protobuf v1.4.2/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI=
github.com/google/btree v0.0.0-20180813153112-4030bb1f1f0c/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
github.com/google/btree v1.0.0 h1:0udJVsspx3VBr5FwtLhQQtuAsVc79tTq0ocGIPAU6qo=
github.com/google/btree v1.0.0/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
github.com/google/go-cmp v0.2.0/go.mod h1:oXzfMopK8JAjlY9xF4vHSVASa0yLyX7SntLO5aqRK0M=
github.com/google/go-cmp v0.3.0/go.mod h1:8QqcDgzrUqlUb/G2PQTWiueGozuR1884gddMywk6iLU=
github.com/google/go-cmp v0.3.1/go.mod h1:8QqcDgzrUqlUb/G2PQTWiueGozuR1884gddMywk6iLU=
github.com/google/go-cmp v0.4.0 h1:xsAVV57WRhGj6kEIi8ReJzQlHHqcBYCElAvkovg3B/4=
github.com/google/go-cmp v0.4.0/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
github.com/google/gofuzz v1.1.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
github.com/google/martian v2.1.0+incompatible/go.mod h1:9I4somxYTbIHy5NJKHRl3wXiIaQGbYVAs8BPL6v8lEs=
github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57/go.mod h1:zfwlbNMJ+OItoe0UupaVj+oy1omPYYDuagoSzA8v9mc=
github.com/google/pprof v0.0.0-20190515194954-54271f7e092f/go.mod h1:zfwlbNMJ+OItoe0UupaVj+oy1omPYYDuagoSzA8v9mc=
github.com/google/pprof v0.0.0-20191218002539-d4f498aebedc/go.mod h1:ZgVRPoUq/hfqzAqh7sHMqb3I9Rq5C59dIz2SbBwJ4eM=
github.com/google/renameio v0.1.0/go.mod h1:KWCgfxg9yswjAJkECMjeO8J8rahYeXnNhOm40UhjYkI=
github.com/google/uuid v1.1.1 h1:Gkbcsh/GbpXz7lPftLA3P6TYMwjCLYm83jiFQZF/3gY=
github.com/google/uuid v1.1.1/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/googleapis/gax-go/v2 v2.0.4/go.mod h1:0Wqv26UfaUD9n4G6kQubkQ+KchISgw+vpHVxEJEs9eg=
github.com/googleapis/gax-go/v2 v2.0.5/go.mod h1:DWXyrwAJ9X0FpwwEdw+IPEYBICEFu5mhpdKc/us6bOk=
github.com/googleapis/gnostic v0.4.1/go.mod h1:LRhVm6pbyptWbWbuZ38d1eyptfvIytN3ir6b65WBswg=
github.com/gorilla/websocket v1.4.2 h1:+/TMaTYc4QFitKJxsQ7Yye35DkWvkdLcvGKqM+x0Ufc=
github.com/gorilla/websocket v1.4.2/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE=
github.com/gregjones/httpcache v0.0.0-20180305231024-9cad4c3443a7/go.mod h1:FecbI9+v66THATjSRHfNgh1IVFe/9kFxbXtjV0ctIMA=
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0 h1:+9834+KizmvFV7pXQGSXQTsaWhq2GjuNUt0aUU0YBYw=
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0/go.mod h1:z0ButlSOZa5vEBq9m2m2hlwIgKw+rp3sdCBRoJY+30Y=
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 h1:Ovs26xHkKqVztRpIrF/92BcuyuQ/YW4NSIpoGtfXNho=
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0/go.mod h1:8NvIoxWQoOIhqOTXgfV/d3M/q6VIi02HzZEHgUlZvzk=
github.com/grpc-ecosystem/grpc-gateway v1.16.0 h1:gmcG1KaJ57LophUzW0Hy8NmPhnMZb4M0+kPpLofRdBo=
github.com/grpc-ecosystem/grpc-gateway v1.16.0/go.mod h1:BDjrQk3hbvj6Nolgz8mAMFbcEtjT1g+wF4CSlocrBnw=
github.com/hashicorp/golang-lru v0.5.0/go.mod h1:/m3WP610KZHVQ1SGc6re/UDhFvYD7pJ4Ao+sR/qLZy8=
github.com/hashicorp/golang-lru v0.5.1/go.mod h1:/m3WP610KZHVQ1SGc6re/UDhFvYD7pJ4Ao+sR/qLZy8=
github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU=
github.com/ianlancetaylor/demangle v0.0.0-20181102032728-5e5cf60278f6/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc=
github.com/imdario/mergo v0.3.5/go.mod h1:2EnlNZ0deacrJVfApfmtdGgDfMuh/nq6Ok1EcJh5FfA=
github.com/jonboulle/clockwork v0.2.2 h1:UOGuzwb1PwsrDAObMuhUnj0p5ULPj8V/xJ7Kx9qUBdQ=
github.com/jonboulle/clockwork v0.2.2/go.mod h1:Pkfl5aHPm1nk2H9h0bjmnJD/BcgbGXUBGnn1kMkgxc8=
github.com/json-iterator/go v1.1.6/go.mod h1:+SdeFBvtyEkXs7REEP0seUULqWtbJapLOCVDaaPEHmU=
github.com/json-iterator/go v1.1.10 h1:Kz6Cvnvv2wGdaG/V8yMvfkmNiXq9Ya2KUv4rouJJr68=
github.com/json-iterator/go v1.1.10/go.mod h1:KdQUCv79m/52Kvf8AW2vK1V8akMuk1QjK/uOdHXbAo4=
github.com/jstemmer/go-junit-report v0.0.0-20190106144839-af01ea7f8024/go.mod h1:6v2b51hI/fHJwM22ozAgKL4VKDeJcHhJFhtBdhmNjmU=
github.com/jstemmer/go-junit-report v0.9.1/go.mod h1:Brl9GWCQeLvo8nXZwPNNblvFj/XSXhF0NWZEnDohbsk=
github.com/julienschmidt/httprouter v1.2.0/go.mod h1:SYymIcj16QtmaHHD7aYtjjsJG7VTCxuUUipMqKk8s4w=
github.com/kisielk/errcheck v1.2.0/go.mod h1:/BMXB+zMLi60iA8Vv6Ksmxu/1UDYcXs4uQLJ+jE2L00=
github.com/kisielk/errcheck v1.5.0/go.mod h1:pFxgyoBC7bSaBwPgfKdkLd5X25qrDl4LWUI2bnpBCr8=
github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck=
github.com/konsorten/go-windows-terminal-sequences v1.0.1/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
github.com/konsorten/go-windows-terminal-sequences v1.0.3 h1:CE8S1cTafDpPvMhIxNJKvHsGVBgn1xWYf1NbHQhywc8=
github.com/konsorten/go-windows-terminal-sequences v1.0.3/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
github.com/kr/logfmt v0.0.0-20140226030751-b84e30acd515/go.mod h1:+0opPa2QZZtGFBFZlji/RkVcI2GknAs/DXo4wKdlNEc=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
github.com/kr/pretty v0.2.0 h1:s5hAObm+yFO5uHYt5dYjxi2rXrsnmRpJx4OYvIWUaQs=
github.com/kr/pretty v0.2.0/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/text v0.1.0 h1:45sCR5RtlFHMR4UwH9sdQ5TC8v0qDQCHnXt+kaKSTVE=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
github.com/kubernetes-csi/csi-lib-utils v0.9.1 h1:sGq6ifVujfMSkfTsMZip44Ttv8SDXvsBlFk9GdYl/b8=
github.com/kubernetes-csi/csi-lib-utils v0.9.1/go.mod h1:8E2jVUX9j3QgspwHXa6LwyN7IHQDjW9jX3kwoWnSC+M=
github.com/mailru/easyjson v0.0.0-20160728113105-d5b7844b561a/go.mod h1:C1wdFJiN94OJF2b5HbByQZoLdCWB1Yqtg26g4irojpc=
github.com/matttproud/golang_protobuf_extensions v1.0.1/go.mod h1:D8He9yQNgCq6Z5Ld7szi9bcBfOoFv/3dc6xSMkL2PC0=
github.com/matttproud/golang_protobuf_extensions v1.0.2-0.20181231171920-c182affec369 h1:I0XW9+e1XWDxdcEniV4rQAIOPUGDq67JSCiRCgGCZLI=
github.com/matttproud/golang_protobuf_extensions v1.0.2-0.20181231171920-c182affec369/go.mod h1:BSXmuO+STAnVfrANrmjBb36TMTDstsz7MSK+HVaYKv4=
github.com/moby/term v0.0.0-20200312100748-672ec06f55cd/go.mod h1:DdlQx2hp0Ss5/fLikoLlEeIYiATotOjgB//nb973jeo=
github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd h1:TRLaZ9cD/w8PVh93nsPXa1VrQ6jlwL5oN8l14QlcNfg=
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
github.com/modern-go/reflect2 v0.0.0-20180701023420-4b7aa43c6742/go.mod h1:bx2lNnkwVCuqBIxFjflWJWanXIb3RllmbCylyMrvgv0=
github.com/modern-go/reflect2 v1.0.1 h1:9f412s+6RmYXLWZSEzVVgPGK7C2PphHj5RJrvfx9AWI=
github.com/modern-go/reflect2 v1.0.1/go.mod h1:bx2lNnkwVCuqBIxFjflWJWanXIb3RllmbCylyMrvgv0=
github.com/munnerz/goautoneg v0.0.0-20120707110453-a547fc61f48d/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ=
github.com/mwitkow/go-conntrack v0.0.0-20161129095857-cc309e4a2223/go.mod h1:qRWi+5nqEBWmkhHvq77mSJWrCKwh8bxhgT7d/eI7P4U=
github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f/go.mod h1:ZdcZmHo+o7JKHSa8/e818NopupXU1YMK5fe1lsApnBw=
github.com/onsi/ginkgo v0.0.0-20170829012221-11459a886d9c/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/ginkgo v1.11.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/gomega v0.0.0-20170829124025-dcabb60a477c/go.mod h1:C1qb7wdrVGGVU+Z6iS04AVkA3Q65CEZX59MT0QO5uiA=
github.com/onsi/gomega v1.7.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY=
github.com/opentracing/opentracing-go v1.1.0/go.mod h1:UkNAQd3GIcIGf0SeVgPpRdFStlNbqXla1AfSYxPUl2o=
github.com/peterbourgon/diskv v2.0.1+incompatible/go.mod h1:uqqh8zWWbv1HBMNONnaR/tNboyR3/BZd58JJSHlUSCU=
github.com/pkg/errors v0.8.0/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/prometheus/client_golang v0.9.1/go.mod h1:7SWBe2y4D6OKWSNQJUaRYU/AaXPKyh/dDVn+NZz0KFw=
github.com/prometheus/client_golang v1.0.0/go.mod h1:db9x61etRT2tGnBNRi70OPL5FsnadC4Ky3P0J6CfImo=
github.com/prometheus/client_golang v1.7.1 h1:NTGy1Ja9pByO+xAeH/qiWnLrKtr3hJPNjaVUwnjpdpA=
github.com/prometheus/client_golang v1.7.1/go.mod h1:PY5Wy2awLA44sXw4AOSfFBetzPP4j5+D6mVACh+pe2M=
github.com/prometheus/client_model v0.0.0-20180712105110-5c3871d89910/go.mod h1:MbSGuTsp3dbXC40dX6PRTWyKYBIrTGTE9sqQNg2J8bo=
github.com/prometheus/client_model v0.0.0-20190129233127-fd36f4220a90/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
github.com/prometheus/client_model v0.2.0 h1:uq5h0d+GuxiXLJLNABMgp2qUWDPiLvgCzz2dUR+/W/M=
github.com/prometheus/client_model v0.2.0/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
github.com/prometheus/common v0.4.1/go.mod h1:TNfzLD0ON7rHzMJeJkieUDPYmFC7Snx/y86RQel1bk4=
github.com/prometheus/common v0.10.0 h1:RyRA7RzGXQZiW+tGMr7sxa85G1z0yOpM1qq5c8lNawc=
github.com/prometheus/common v0.10.0/go.mod h1:Tlit/dnDKsSWFlCLTWaA1cyBgKHSMdTB80sz/V91rCo=
github.com/prometheus/procfs v0.0.0-20181005140218-185b4288413d/go.mod h1:c3At6R/oaqEKCNdg8wHV1ftS6bRYblBhIjjI8uT2IGk=
github.com/prometheus/procfs v0.0.2/go.mod h1:TjEm7ze935MbeOT/UhFTIMYKhuLP4wbCsTZCD3I8kEA=
github.com/prometheus/procfs v0.1.3 h1:F0+tqvhOksq22sc6iCHF5WGlWjdwj92p0udFh1VFBS8=
github.com/prometheus/procfs v0.1.3/go.mod h1:lV6e/gmhEcM9IjHGsFOCxxuZ+z1YqCvr4OA4YeYWdaU=
github.com/rogpeppe/fastuuid v1.2.0/go.mod h1:jVj6XXZzXRy/MSR5jhDC/2q6DgLz+nrA6LYCDYWNEvQ=
github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFRclV5y23lUDJ4=
github.com/sirupsen/logrus v1.2.0/go.mod h1:LxeOpSwHxABJmUn/MG1IvRgCAasNZTLOkJPxbbu5VWo=
github.com/sirupsen/logrus v1.4.2/go.mod h1:tLMulIdttU9McNUspp0xgXVQah82FyeX6MwdIuYE2rE=
github.com/sirupsen/logrus v1.6.0 h1:UBcNElsrwanuuMsnGSlYmtmgbb23qDR5dG+6X6Oo89I=
github.com/sirupsen/logrus v1.6.0/go.mod h1:7uNnSEd1DgxDLC74fIahvMZmmYsHGZGEOFrfsX/uA88=
github.com/soheilhy/cmux v0.1.5 h1:jjzc5WVemNEDTLwv9tlmemhC73tI08BNOIGwBOo10Js=
github.com/soheilhy/cmux v0.1.5/go.mod h1:T7TcVDs9LWfQgPlPsdngu6I6QIoyIFZDDC6sNE1GqG0=
github.com/spf13/afero v1.2.2/go.mod h1:9ZxEEn6pIJ8Rxe320qSDBk6AsU0r9pR7Q4OcevTdifk=
github.com/spf13/pflag v0.0.0-20170130214245-9ff6c6923cff/go.mod h1:DYY7MBk1bdzusC3SYhjObp+wFpr4gzcvqqNjLnInEg4=
github.com/spf13/pflag v1.0.3/go.mod h1:DYY7MBk1bdzusC3SYhjObp+wFpr4gzcvqqNjLnInEg4=
github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=
github.com/stretchr/testify v1.5.1 h1:nOGnQDM7FYENwehXlg/kFVnos3rEvtKTjRvOWSzb6H4=
github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA=
github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802 h1:uruHq4dN7GR16kFc5fp3d1RIYzJW5onx8Ybykw2YQFA=
github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802/go.mod h1:ncp9v5uamzpCO7NfCPTXjqaC+bZgJeR0sMTm6dMHP7U=
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2 h1:eY9dn8+vbi4tKz5Qo6v2eYzo7kUS51QINcR5jNpbZS8=
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2/go.mod h1:UETIi67q53MR2AWcXfiuqkDkRtnGDLqkBTpCHuJHxtU=
github.com/yuin/goldmark v1.1.27/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74=
github.com/yuin/goldmark v1.2.1/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74=
go.etcd.io/bbolt v1.3.5 h1:XAzx9gjCb0Rxj7EoqcClPD1d5ZBxZJk0jbuoPHenBt0=
go.etcd.io/bbolt v1.3.5/go.mod h1:G5EMThwa9y8QZGBClrRx5EY+Yw9kAhnjy3bSjsnlVTQ=
go.etcd.io/etcd v3.3.25+incompatible h1:V1RzkZJj9LqsJRy+TUBgpWSbZXITLB819lstuTFoZOY=
go.etcd.io/etcd v3.3.25+incompatible/go.mod h1:yaeTdrJi5lOmYerz05bd8+V7KubZs8YSFZfzsF9A6aI=
go.opencensus.io v0.21.0/go.mod h1:mSImk1erAIZhrmZN+AvHh14ztQfjbGwt4TtuofqLduU=
go.opencensus.io v0.22.0/go.mod h1:+kGneAE2xo2IficOXnaByMWTGM9T73dGwxeWcUqIpI8=
go.opencensus.io v0.22.2/go.mod h1:yxeiOL68Rb0Xd1ddK5vPZ/oVn4vY4Ynel7k9FzqtOIw=
go.uber.org/atomic v1.4.0 h1:cxzIVoETapQEqDhQu3QfnvXAV4AlzcvUCxkVUFw3+EU=
go.uber.org/atomic v1.4.0/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/multierr v1.1.0 h1:HoEmRHQPVSqub6w2z2d2EOVs2fjyFRGyofhKuyDq0QI=
go.uber.org/multierr v1.1.0/go.mod h1:wR5kodmAFQ0UK8QlbwjlSNy0Z68gJhDJUG5sjR94q/0=
go.uber.org/zap v1.10.0 h1:ORx85nbTijNz8ljznvCMR1ZBIPKFn3jQrag10X2AsuM=
go.uber.org/zap v1.10.0/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
golang.org/x/crypto v0.0.0-20180904163835-0709b304e793/go.mod h1:6SG95UA2DQfeDnfUPMdvaQW0Q7yPrPDi9nlGo2tz2b4=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/crypto v0.0.0-20190510104115-cbcb75029529/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20190605123033-f99c8df09eb5/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20191206172530-e9b2fee46413/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9 h1:psW17arqaxU48Z5kZ0CQnkZWQJsqcURM6tKiBApRjXI=
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
golang.org/x/exp v0.0.0-20190121172915-509febef88a4/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/exp v0.0.0-20190510132918-efd6b22b2522/go.mod h1:ZjyILWgesfNpC6sMxTJOJm9Kp84zZh5NQWvqDGG3Qr8=
golang.org/x/exp v0.0.0-20190829153037-c13cbed26979/go.mod h1:86+5VVa7VpoJ4kLfm080zCjGlMRFzhUhsZKEZO7MGek=
golang.org/x/exp v0.0.0-20191227195350-da58074b4299/go.mod h1:2RIsYlXP63K8oxa1u096TMicItID8zy7Y6sNkU49FU4=
golang.org/x/image v0.0.0-20190227222117-0694c2d4d067/go.mod h1:kZ7UVZpmo3dzQBMxlp+ypCbDeSB+sBbTgSJuh5dn5js=
golang.org/x/image v0.0.0-20190802002840-cff245a6509b/go.mod h1:FeLwcggjj3mMvU+oOTbSwawSJRM1uh48EjtB4UJZlP0=
golang.org/x/lint v0.0.0-20190227174305-5b3e6a55c961/go.mod h1:wehouNa3lNwaWXcvxsM5YxQ5yQlVC4a0KAMCusXpPoU=
golang.org/x/lint v0.0.0-20190301231843-5614ed5bae6f/go.mod h1:UVdnD1Gm6xHRNCYTkRU2/jEulfH38KcIWyp/GAMgvoE=
golang.org/x/lint v0.0.0-20190313153728-d0100b6bd8b3/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc=
golang.org/x/lint v0.0.0-20190409202823-959b441ac422/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc=
golang.org/x/lint v0.0.0-20190909230951-414d861bb4ac/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc=
golang.org/x/lint v0.0.0-20191125180803-fdd1cda4f05f/go.mod h1:5qLYkcX4OjUUV8bRuDixDT3tpyyb+LUpUlRWLxfhWrs=
golang.org/x/mobile v0.0.0-20190312151609-d3739f865fa6/go.mod h1:z+o9i4GpDbdi3rU15maQ/Ox0txvL9dWGYEHz965HBQE=
golang.org/x/mobile v0.0.0-20190719004257-d2bd2a29d028/go.mod h1:E/iHnbuqvinMTCcRqshq8CkpyQDoeVncDDYHnLhea+o=
golang.org/x/mod v0.0.0-20190513183733-4bf6d317e70e/go.mod h1:mXi4GBBbnImb6dmsKGUJ2LatrhH/nqhxcFungHvyanc=
golang.org/x/mod v0.1.0/go.mod h1:0QHyrYULN0/3qlju5TqG8bIK38QM8yzMo5ekMj3DlcY=
golang.org/x/mod v0.1.1-0.20191105210325-c90efee705ee/go.mod h1:QqPTAvyqsEbceGzBzNggFXnrqF1CaUcvgkdR5Ot7KZg=
golang.org/x/mod v0.2.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
golang.org/x/mod v0.3.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
golang.org/x/net v0.0.0-20180724234803-3673e40ba225/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20180906233101-161cd47e91fd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20181114220301-adae6a3d119a/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20190108225652-1e06a53dbb7e/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20190213061140-3a22650c66bd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190501004415-9ce7a6920f09/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190503192946-f4e77d36d62c/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190603091049-60506f45cf65/go.mod h1:HSz+uSET+XFnRR8LxR5pz3Of3rY3CfYBVs4xY44aLks=
golang.org/x/net v0.0.0-20190613194153-d28f0bde5980/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20191209160850-c0dbc17a3553/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20200324143707-d3edc9973b7e/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A=
golang.org/x/net v0.0.0-20200707034311-ab3426394381 h1:VXak5I6aEWmAXeQjA+QSZzlgNrpq9mjcfDemuexIKsU=
golang.org/x/net v0.0.0-20200707034311-ab3426394381/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA=
golang.org/x/net v0.0.0-20200822124328-c89045814202/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA=
golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU=
golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb h1:eBmm0M9fYhWpKZLjQUUKka/LtIxf46G4fxeEz5KJr9U=
golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU=
golang.org/x/oauth2 v0.0.0-20180821212333-d2e6202438be/go.mod h1:N/0e6XlmueqKjAGxoOufVs8QHGRruUQn6yWY3a++T0U=
golang.org/x/oauth2 v0.0.0-20190226205417-e64efc72b421/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
golang.org/x/oauth2 v0.0.0-20190604053449-0f29369cfe45/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
golang.org/x/oauth2 v0.0.0-20191202225959-858c2ad4c8b6/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
golang.org/x/oauth2 v0.0.0-20200107190931-bf48bf16ab8d/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
golang.org/x/sync v0.0.0-20180314180146-1d60e4601c6f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20181108010431-42b317875d0f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20181221193216-37e7f081c4d4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20190227155943-e225da77a7e6/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20201020160332-67f06af15bc9/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sys v0.0.0-20180905080454-ebe1bf3edb33/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20180909124046-d0be0721c37e/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20181116152217-5ac8a444bdc5/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190312061237-fead79001313/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190422165155-953cdadca894/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190502145724-3ef323f4f1fd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190507160741-ecd444e8653b/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190606165138-5da285871e9c/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190624142023-c5567b49c5d0/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20191005200804-aed5e4c7ecf9/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20191204072324-ce4227a45e2e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20191228213918-04cbcbbfeed8/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200106162015-b016eb3dc98e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200202164722-d101bd2416d5/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200302150141-5c8b2ff67527/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200615200032-f1bc736245b1/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200622214017-ed371f2e16b4 h1:5/PjkGUjvEU5Gl6BxmvKRPpqo2uNMv4rcHBMwzk/st8=
golang.org/x/sys v0.0.0-20200622214017-ed371f2e16b4/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f h1:+Nyd8tzPX9R7BWHguqsrbFdRx3WQ/1ib8I44HXV5yTA=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.1-0.20180807135948-17ff2d5776d2/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk=
golang.org/x/text v0.3.3 h1:cokOdA+Jmi5PJGXLlLllQSgYigAEfHXJAERHVMaCc2k=
golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/time v0.0.0-20181108054448-85acf8d2951c/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ=
golang.org/x/time v0.0.0-20190308202827-9d24e82272b4/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ=
golang.org/x/time v0.0.0-20191024005414-555d28b269f0 h1:/5xXl8Y5W96D+TtHSlonuFqGHIWVuyCkGJLwGh9JJFs=
golang.org/x/time v0.0.0-20191024005414-555d28b269f0/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ=
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
golang.org/x/tools v0.0.0-20181011042414-1f849cf54d09/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
golang.org/x/tools v0.0.0-20181030221726-6c7e314b6563/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
golang.org/x/tools v0.0.0-20190226205152-f727befe758c/go.mod h1:9Yl7xja0Znq3iFh3HoIrodX9oNMXvdceNzlUR8zjMvY=
golang.org/x/tools v0.0.0-20190311212946-11955173bddd/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs=
golang.org/x/tools v0.0.0-20190312151545-0bb0c0a6e846/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs=
golang.org/x/tools v0.0.0-20190312170243-e65039ee4138/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs=
golang.org/x/tools v0.0.0-20190425150028-36563e24a262/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q=
golang.org/x/tools v0.0.0-20190506145303-2d16b83fe98c/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q=
golang.org/x/tools v0.0.0-20190524140312-2c0ae7006135/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q=
golang.org/x/tools v0.0.0-20190606124116-d0a3d012864b/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc=
golang.org/x/tools v0.0.0-20190621195816-6e04913cbbac/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc=
golang.org/x/tools v0.0.0-20190624222133-a101b041ded4/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc=
golang.org/x/tools v0.0.0-20190628153133-6cdbf07be9d0/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc=
golang.org/x/tools v0.0.0-20190816200558-6889da9d5479/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20190911174233-4f2ddba30aff/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20191012152004-8de300cfc20a/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20191125144606-a911d9008d1f/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20191227053925-7b8e75db28f4/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28=
golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE=
golang.org/x/tools v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543 h1:E7g+9GITq07hpfrRu66IVDexMakfv52eLZ2CXBWiKr4=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 h1:go1bK/D/BFZV2I8cIQd1NKEZ+0owSTG1fDTci4IqFcE=
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
google.golang.org/api v0.4.0/go.mod h1:8k5glujaEP+g9n7WNsDg8QP6cUVNI86fCNMcbazEtwE=
google.golang.org/api v0.7.0/go.mod h1:WtwebWUNSVBH/HAw79HIFXZNqEvBhG+Ra+ax0hx3E3M=
google.golang.org/api v0.8.0/go.mod h1:o4eAsZoiT+ibD93RtjEohWalFOjRDx6CVaqeizhEnKg=
google.golang.org/api v0.9.0/go.mod h1:o4eAsZoiT+ibD93RtjEohWalFOjRDx6CVaqeizhEnKg=
google.golang.org/api v0.15.0/go.mod h1:iLdEw5Ide6rF15KTC1Kkl0iskquN2gFfn9o9XIsbkAI=
google.golang.org/appengine v1.4.0/go.mod h1:xpcJRLb0r/rnEns0DIKYYv+WjYCduHsrkT7/EB5XEv4=
google.golang.org/appengine v1.5.0/go.mod h1:xpcJRLb0r/rnEns0DIKYYv+WjYCduHsrkT7/EB5XEv4=
google.golang.org/appengine v1.6.1/go.mod h1:i06prIuMbXzDqacNJfV5OdTW448YApPu5ww/cMBSeb0=
google.golang.org/appengine v1.6.5/go.mod h1:8WjMMxjGQR8xUklV/ARdw2HLXBOI7O7uCIDZVag1xfc=
google.golang.org/genproto v0.0.0-20190307195333-5fe7a883aa19/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE=
google.golang.org/genproto v0.0.0-20190418145605-e7d98fc518a7/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE=
google.golang.org/genproto v0.0.0-20190425155659-357c62f0e4bb/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE=
google.golang.org/genproto v0.0.0-20190502173448-54afdca5d873/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE=
google.golang.org/genproto v0.0.0-20190801165951-fa694d86fc64/go.mod h1:DMBHOl98Agz4BDEuKkezgsaosCRResVns1a3J2ZsMNc=
google.golang.org/genproto v0.0.0-20190819201941-24fa4b261c55/go.mod h1:DMBHOl98Agz4BDEuKkezgsaosCRResVns1a3J2ZsMNc=
google.golang.org/genproto v0.0.0-20190911173649-1774047e7e51/go.mod h1:IbNlFCBrqXvoKpeg0TB2l7cyZUmoaFKYIwrEpbDKLA8=
google.golang.org/genproto v0.0.0-20191230161307-f3c370f40bfb/go.mod h1:n3cpQtvxv34hfy77yVDNjmbRyujviMdxYliBSkLhpCc=
google.golang.org/genproto v0.0.0-20200423170343-7949de9c1215/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c=
google.golang.org/genproto v0.0.0-20200513103714-09dca8ec2884/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c=
google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013 h1:+kGHl1aib/qcwaRi1CbqBZ1rk19r85MNUf8HaBghugY=
google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013/go.mod h1:NbSheEEYHJ7i3ixzK3sjbqSGDJWnxyFXZblF3eUsNvo=
google.golang.org/grpc v1.25.1 h1:wdKvqQk7IttEw92GoRyKG2IDrUIpgpj6H6m81yfeMW0=
google.golang.org/grpc v1.25.1/go.mod h1:c3i+UQWmh7LiEpx4sFZnkU36qjEYZ0imhYfXVyQciAY=
google.golang.org/protobuf v0.0.0-20200109180630-ec00e32a8dfd/go.mod h1:DFci5gLYBciE7Vtevhsrf46CRTquxDuWsQurQQe4oz8=
google.golang.org/protobuf v0.0.0-20200221191635-4d8936d0db64/go.mod h1:kwYJMbMJ01Woi6D6+Kah6886xMZcty6N08ah7+eCXa0=
google.golang.org/protobuf v0.0.0-20200228230310-ab0ca4ff8a60/go.mod h1:cfTl7dwQJ+fmap5saPgwCLgHXTUD7jkjRqWcaiX5VyM=
google.golang.org/protobuf v1.20.1-0.20200309200217-e05f789c0967/go.mod h1:A+miEFZTKqfCUM6K7xSMQL9OKL/b6hQv+e19PK+JZNE=
google.golang.org/protobuf v1.21.0/go.mod h1:47Nbq4nVaFHyn7ilMalzfO3qCViNmqZ2kzikPIcrTAo=
google.golang.org/protobuf v1.22.0/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU=
google.golang.org/protobuf v1.23.0/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU=
google.golang.org/protobuf v1.23.1-0.20200526195155-81db48ad09cc/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU=
google.golang.org/protobuf v1.24.0 h1:UhZDfRO8JRQru4/+LlLE0BRKGF8L+PICnvYZmx/fEGA=
google.golang.org/protobuf v1.24.0/go.mod h1:r/3tXBNzIEhYS9I1OUVjXDlt8tc493IdKGjtUeSXeh4=
gopkg.in/alecthomas/kingpin.v2 v2.2.6/go.mod h1:FMv+mEhP44yOT+4EoQTLFTRgOQ1FBLkstjWtayDeSgw=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15 h1:YR8cESwS4TdDjEe65xsg0ogRM/Nc3DYOhEAlW+xobZo=
gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys=
gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw=
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw=
gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.2.3/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.2.5/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.2.8 h1:obN1ZagJSUGI0Ek/LBmuj4SNLPfIny3KsKFopxRdj10=
gopkg.in/yaml.v2 v2.2.8/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gotest.tools v2.2.0+incompatible/go.mod h1:DsYFclhRJ6vuDpmuTbkuFWG+y2sxOXAzmJt81HFBacw=
gotest.tools/v3 v3.0.2/go.mod h1:3SzNCllyD9/Y+b5r9JIKQ474KzkZyqLqEfYqMsX94Bk=
honnef.co/go/tools v0.0.0-20190102054323-c2f93a96b099/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.0-20190106161140-3f1c8253044a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.0-20190418001031-e561f6794a2a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.0-20190523083050-ea95bdfd59fc/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.1-2019.2.3/go.mod h1:a3bituU0lyd329TUQxRnasdCoJDkEUEAqEt0JzvZhAg=
k8s.io/api v0.19.0/go.mod h1:I1K45XlvTrDjmj5LoM5LuP/KYrhWbjUKT/SoPG0qTjw=
k8s.io/apimachinery v0.19.0/go.mod h1:DnPGDnARWFvYa3pMHgSxtbZb7gpzzAZ1pTfaUNDVlmA=
k8s.io/client-go v0.19.0/go.mod h1:H9E/VT95blcFQnlyShFgnFT9ZnJOAceiUHM3MlRC+mU=
k8s.io/component-base v0.19.0/go.mod h1:dKsY8BxkA+9dZIAh2aWJLL/UdASFDNtGYTCItL4LM7Y=
k8s.io/gengo v0.0.0-20200413195148-3a45101e95ac/go.mod h1:ezvh/TsK7cY6rbqRK0oQQ8IAqLxYwwyPxAX1Pzy0ii0=
k8s.io/klog v1.0.0 h1:Pt+yjF5aB1xDSVbau4VsWe+dQNzA0qv1LlXdC2dF6Q8=
k8s.io/klog v1.0.0/go.mod h1:4Bi6QPql/J/LkTDqv7R/cd3hPo4k2DG6Ptcz060Ez5I=
k8s.io/klog/v2 v2.0.0/go.mod h1:PBfzABfn139FHAV07az/IF9Wp1bkk3vpT2XSJ76fSDE=
k8s.io/klog/v2 v2.2.0 h1:XRvcwJozkgZ1UQJmfMGpvRthQHOvihEhYtDfAaxMz/A=
k8s.io/klog/v2 v2.2.0/go.mod h1:Od+F08eJP+W3HUb4pSrPpgp9DGU4GzlpG/TmITuYh/Y=
k8s.io/kube-openapi v0.0.0-20200805222855-6aeccd4b50c6/go.mod h1:UuqjUnNftUyPE5H64/qeyjQoUZhGpeFDVdxjTeEVN2o=
k8s.io/utils v0.0.0-20200729134348-d5654de09c73/go.mod h1:jPW/WVKK9YHAvNhRxK0md/EJ228hCsBRufyofKtW8HA=
k8s.io/utils v0.0.0-20210305010621-2afb4311ab10 h1:u5rPykqiCpL+LBfjRkXvnK71gOgIdmq3eHUEkPrbeTI=
k8s.io/utils v0.0.0-20210305010621-2afb4311ab10/go.mod h1:jPW/WVKK9YHAvNhRxK0md/EJ228hCsBRufyofKtW8HA=
rsc.io/binaryregexp v0.2.0/go.mod h1:qTv7/COck+e2FymRvadv62gMdZztPaShugOCi3I+8D8=
sigs.k8s.io/structured-merge-diff/v4 v4.0.1/go.mod h1:bJZC9H9iH24zzfZ/41RGcq60oK1F7G282QMXDPYydCw=
sigs.k8s.io/yaml v1.1.0/go.mod h1:UJmg0vDUVViEyp3mgSv9WPwZCDxu4rQW1olrI1uml+o=
sigs.k8s.io/yaml v1.2.0 h1:kr/MCeFWJWTwyaHoR9c8EjH9OumOmoF9YGiZd7lFm/Q=
sigs.k8s.io/yaml v1.2.0/go.mod h1:yfXDCHCao9+ENCvLSE62v9VSji2MKu5jeNfTrofGhJc=

View File

@ -1,22 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.6.17"
)
// Config holds the driver parameters supplied by the request or by user input
type Config struct
{
Endpoint string
NodeID string
}
// NewConfig returns config struct to initialize new driver
func NewConfig() *Config
{
return &Config{}
}

View File

@ -1,530 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"encoding/json"
"strings"
"bytes"
"strconv"
"time"
"fmt"
"os"
"os/exec"
"io/ioutil"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"go.etcd.io/etcd/clientv3"
"github.com/container-storage-interface/spec/lib/go/csi"
)
const (
KB int64 = 1024
MB int64 = 1024 * KB
GB int64 = 1024 * MB
TB int64 = 1024 * GB
ETCD_TIMEOUT time.Duration = 15*time.Second
)
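// InodeIndex is the value stored under <etcd_prefix>/index/image/<name>.
// It maps an image name to its pool ID and inode ID.
type InodeIndex struct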
{
Id uint64 `json:"id"`
PoolId uint64 `json:"pool_id"`
}
// InodeConfig is the image metadata stored under <etcd_prefix>/config/inode/<pool_id>/<id>
type InodeConfig struct
{
Name string `json:"name"`
Size uint64 `json:"size,omitempty"`
ParentPool uint64 `json:"parent_pool,omitempty"`
ParentId uint64 `json:"parent_id,omitempty"`
Readonly bool `json:"readonly,omitempty"`
}
// ControllerServer implements the CSI controller service on top of Driver
type ControllerServer struct
{
*Driver
}
// NewControllerServer creates a new controller server instance
func NewControllerServer(driver *Driver) *ControllerServer
{
return &ControllerServer{
Driver: driver,
}
}
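// GetConnectionParams reads etcd connection settings from the given parameters,
// falling back to the Vitastor config file for anything that is not set.
// It returns the parameters that must be kept in the volume ID,
// the list of etcd URLs and the etcd key prefix.
func GetConnectionParams(params map[string]string) (map[string]string, []string, string)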
{
ctxVars := make(map[string]string)
configPath := params["configPath"]
if (configPath == "")
{
configPath = "/etc/vitastor/vitastor.conf"
}
else
{
ctxVars["configPath"] = configPath
}
config := make(map[string]interface{})
if configFD, err := os.Open(configPath); err == nil
{
defer configFD.Close()
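data, _ := ioutil.ReadAll(configFD)
// A malformed config file is not fatal, because the parameters may still supply every setting
if err := json.Unmarshal(data, &config); err != nil
{
klog.Warningf("failed to parse %s: %v", configPath, err)
}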
}
// Try to load prefix & etcd URL from the config
var etcdUrl []string
if (params["etcdUrl"] != "")
{
ctxVars["etcdUrl"] = params["etcdUrl"]
etcdUrl = strings.Split(params["etcdUrl"], ",")
}
if (len(etcdUrl) == 0)
{
switch config["etcd_address"].(type)
{
case string:
etcdUrl = strings.Split(config["etcd_address"].(string), ",")
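case []interface{}:
// encoding/json decodes JSON arrays into []interface{}, never into []string
for _, url := range config["etcd_address"].([]interface{})
{
if s, ok := url.(string); ok
{
etcdUrl = append(etcdUrl, s)
}
}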
}
}
etcdPrefix := params["etcdPrefix"]
if (etcdPrefix == "")
{
etcdPrefix, _ = config["etcd_prefix"].(string)
if (etcdPrefix == "")
{
etcdPrefix = "/vitastor"
}
}
else
{
ctxVars["etcdPrefix"] = etcdPrefix
}
return ctxVars, etcdUrl, etcdPrefix
}
// CreateVolume creates the volume metadata in etcd, or validates an existing image with the same name
func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
{
klog.Infof("received controller create volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Error(codes.InvalidArgument, "request cannot be empty")
}
if (req.GetName() == "")
{
return nil, status.Error(codes.InvalidArgument, "name is a required field")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volume capabilities is a required field")
}
etcdVolumePrefix := req.Parameters["etcdVolumePrefix"]
poolId, _ := strconv.ParseUint(req.Parameters["poolId"], 10, 64)
if (poolId == 0)
{
return nil, status.Error(codes.InvalidArgument, "poolId is missing or invalid in storage class configuration")
}
volName := etcdVolumePrefix + req.GetName()
volSize := 1 * GB
if capRange := req.GetCapacityRange(); capRange != nil
{
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
}
// FIXME: The following should PROBABLY be implemented externally in a management tool
ctxVars, etcdUrl, etcdPrefix := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Connect to etcd
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
var imageId uint64 = 0
for
{
// Check if the image exists
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) > 0)
{
kv := resp.Kvs[0]
var v InodeIndex
err := json.Unmarshal(kv.Value, &v)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
poolId = v.PoolId
imageId = v.Id
inodeCfgKey := fmt.Sprintf("/config/inode/%d/%d", poolId, imageId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.Internal, "missing "+inodeCfgKey+" key in etcd")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
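if (inodeCfg.Size < uint64(volSize))
{
return nil, status.Error(codes.AlreadyExists, "image "+volName+" is already created, but size is less than expected")
}
// The image already exists and is large enough, so stop here instead of looping forever
break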
}
else
{
// Find a free ID
// Create image metadata in a transaction verifying that the image doesn't exist yet AND ID is still free
maxIdKey := fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, maxIdKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
var modRev int64
var nextId uint64
if (len(resp.Kvs) > 0)
{
var err error
nextId, err = strconv.ParseUint(string(resp.Kvs[0].Value), 10, 64)
if (err != nil)
{
return nil, status.Error(codes.Internal, maxIdKey+" contains invalid ID")
}
modRev = resp.Kvs[0].ModRevision
nextId++
}
else
{
nextId = 1
}
inodeIdxJson, _ := json.Marshal(InodeIndex{
Id: nextId,
PoolId: poolId,
})
inodeCfgJson, _ := json.Marshal(InodeConfig{
Name: volName,
Size: uint64(volSize),
})
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).If(
clientv3.Compare(clientv3.ModRevision(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)), "=", modRev),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)), "=", 0),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId)), "=", 0),
).Then(
clientv3.OpPut(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId), fmt.Sprintf("%d", nextId)),
clientv3.OpPut(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName), string(inodeIdxJson)),
clientv3.OpPut(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId), string(inodeCfgJson)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to commit transaction in etcd: "+err.Error())
}
if (txnResp.Succeeded)
{
imageId = nextId
break
}
// Start over if the transaction fails
}
}
ctxVars["name"] = volName
volumeIdJson, _ := json.Marshal(ctxVars)
return &csi.CreateVolumeResponse{
Volume: &csi.Volume{
// Ugly, but VolumeContext isn't passed to DeleteVolume :-(
VolumeId: string(volumeIdJson),
CapacityBytes: volSize,
},
}, nil
}
// DeleteVolume deletes the given volume
func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error)
{
klog.Infof("received controller delete volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Error(codes.InvalidArgument, "request cannot be empty")
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
// Find inode by name
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
// DeleteVolume must be idempotent: the CSI spec requires success when the volume is already gone
return &csi.DeleteVolumeResponse{}, nil
}
var idx InodeIndex
err = json.Unmarshal(resp.Kvs[0].Value, &idx)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
// Get inode config
inodeCfgKey := fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err = cli.Get(ctx, inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
// Delete inode data by invoking vitastor-cli
args := []string{
"rm-data", "--etcd_address", strings.Join(etcdUrl, ","),
"--pool", fmt.Sprintf("%d", idx.PoolId),
"--inode", fmt.Sprintf("%d", idx.Id),
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-cli", args...)
var stderr bytes.Buffer
c.Stdout = nil
c.Stderr = &stderr
err = c.Run()
stderrStr := stderr.String()
if (err != nil)
{
klog.Errorf("vitastor-cli rm-data failed: %s, status %v", stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
// Delete inode config in etcd
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).Then(
clientv3.OpDelete(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)),
clientv3.OpDelete(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: "+err.Error())
}
if (!txnResp.Succeeded)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: transaction failed")
}
return &csi.DeleteVolumeResponse{}, nil
}
// ControllerPublishVolume returns an Unimplemented error
func (cs *ControllerServer) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerUnpublishVolume returns an Unimplemented error
func (cs *ControllerServer) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ValidateVolumeCapabilities checks whether the volume capabilities requested are supported.
func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error)
{
klog.Infof("received controller validate volume capability request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Error(codes.InvalidArgument, "request is nil")
}
volumeID := req.GetVolumeId()
if (volumeID == "")
{
return nil, status.Error(codes.InvalidArgument, "volumeId is empty")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volumeCapabilities is nil")
}
var volumeCapabilityAccessModes []*csi.VolumeCapability_AccessMode
for _, mode := range []csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
} {
volumeCapabilityAccessModes = append(volumeCapabilityAccessModes, &csi.VolumeCapability_AccessMode{Mode: mode})
}
capabilitySupport := false
for _, capability := range volumeCapabilities
{
for _, volumeCapabilityAccessMode := range volumeCapabilityAccessModes
{
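// The generated getters are nil-safe, so a capability without an access mode cannot cause a panic
if (volumeCapabilityAccessMode.GetMode() == capability.GetAccessMode().GetMode())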
{
capabilitySupport = true
}
}
}
if (!capabilitySupport)
{
return nil, status.Errorf(codes.NotFound, "%v not supported", req.GetVolumeCapabilities())
}
return &csi.ValidateVolumeCapabilitiesResponse{
Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
VolumeCapabilities: req.VolumeCapabilities,
},
}, nil
}
// ListVolumes would return a list of volumes; it currently returns an Unimplemented error
func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// GetCapacity would return the capacity of the storage pool; it currently returns an Unimplemented error
func (cs *ControllerServer) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetCapabilities returns the capabilities of the controller service.
func (cs *ControllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error)
{
functionControllerServerCapabilities := func(cap csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability
{
return &csi.ControllerServiceCapability{
Type: &csi.ControllerServiceCapability_Rpc{
Rpc: &csi.ControllerServiceCapability_RPC{
Type: cap,
},
},
}
}
var controllerServerCapabilities []*csi.ControllerServiceCapability
for _, capability := range []csi.ControllerServiceCapability_RPC_Type{
csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
// FIXME: the RPCs behind the three capabilities below still return Unimplemented in this file
csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT,
} {
controllerServerCapabilities = append(controllerServerCapabilities, functionControllerServerCapabilities(capability))
}
return &csi.ControllerGetCapabilitiesResponse{
Capabilities: controllerServerCapabilities,
}, nil
}
// CreateSnapshot creates a snapshot of an existing PV
func (cs *ControllerServer) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// DeleteSnapshot deletes the provided snapshot of a PV
func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteSnapshotRequest) (*csi.DeleteSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ListSnapshots lists the snapshots of a PV
func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnapshotsRequest) (*csi.ListSnapshotsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerExpandVolume resizes a volume
func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetVolume gets volume info
func (cs *ControllerServer) ControllerGetVolume(ctx context.Context, req *csi.ControllerGetVolumeRequest) (*csi.ControllerGetVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}

View File

@ -1,137 +0,0 @@
/*
Copyright 2017 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package vitastor
import (
"fmt"
"net"
"os"
"strings"
"sync"
"github.com/golang/glog"
"golang.org/x/net/context"
"google.golang.org/grpc"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
)
// Defines Non blocking GRPC server interfaces
type NonBlockingGRPCServer interface {
// Start services at the endpoint
Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer)
// Waits for the service to stop
Wait()
// Stops the service gracefully
Stop()
// Stops the service forcefully
ForceStop()
}
func NewNonBlockingGRPCServer() NonBlockingGRPCServer {
return &nonBlockingGRPCServer{}
}
// NonBlocking server
type nonBlockingGRPCServer struct {
wg sync.WaitGroup
server *grpc.Server
}
func (s *nonBlockingGRPCServer) Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
s.wg.Add(1)
go s.serve(endpoint, ids, cs, ns)
return
}
func (s *nonBlockingGRPCServer) Wait() {
s.wg.Wait()
}
func (s *nonBlockingGRPCServer) Stop() {
s.server.GracefulStop()
}
func (s *nonBlockingGRPCServer) ForceStop() {
s.server.Stop()
}
func (s *nonBlockingGRPCServer) serve(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
proto, addr, err := ParseEndpoint(endpoint)
if err != nil {
glog.Fatal(err.Error())
}
if proto == "unix" {
addr = "/" + addr
if err := os.Remove(addr); err != nil && !os.IsNotExist(err) {
glog.Fatalf("Failed to remove %s, error: %s", addr, err.Error())
}
}
listener, err := net.Listen(proto, addr)
if err != nil {
glog.Fatalf("Failed to listen: %v", err)
}
opts := []grpc.ServerOption{
grpc.UnaryInterceptor(logGRPC),
}
server := grpc.NewServer(opts...)
s.server = server
if ids != nil {
csi.RegisterIdentityServer(server, ids)
}
if cs != nil {
csi.RegisterControllerServer(server, cs)
}
if ns != nil {
csi.RegisterNodeServer(server, ns)
}
glog.Infof("Listening for connections on address: %#v", listener.Addr())
server.Serve(listener)
}
func ParseEndpoint(ep string) (string, string, error) {
if strings.HasPrefix(strings.ToLower(ep), "unix://") || strings.HasPrefix(strings.ToLower(ep), "tcp://") {
s := strings.SplitN(ep, "://", 2)
if s[1] != "" {
return s[0], s[1], nil
}
}
return "", "", fmt.Errorf("Invalid endpoint: %v", ep)
}
func logGRPC(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
glog.V(3).Infof("GRPC call: %s", info.FullMethod)
glog.V(5).Infof("GRPC request: %s", protosanitizer.StripSecrets(req))
resp, err := handler(ctx, req)
if err != nil {
glog.Errorf("GRPC error: %v", err)
} else {
glog.V(5).Infof("GRPC response: %s", protosanitizer.StripSecrets(resp))
}
return resp, err
}
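
The endpoint handling above can be tried in isolation. The following is a minimal standalone sketch that copies the logic of `ParseEndpoint` into a lowercase `parseEndpoint` helper; the socket path and TCP address are made-up example values, not ones taken from this repository.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEndpoint mirrors the ParseEndpoint helper shown above: it accepts only
// "unix://" and "tcp://" endpoints and splits them into scheme and address.
func parseEndpoint(ep string) (string, string, error) {
	lower := strings.ToLower(ep)
	if strings.HasPrefix(lower, "unix://") || strings.HasPrefix(lower, "tcp://") {
		s := strings.SplitN(ep, "://", 2)
		if s[1] != "" {
			return s[0], s[1], nil
		}
	}
	return "", "", fmt.Errorf("invalid endpoint: %v", ep)
}

func main() {
	// All endpoints below are hypothetical examples.
	endpoints := []string{
		"unix:///csi/csi.sock",  // unix socket
		"tcp://127.0.0.1:10000", // TCP address
		"http://example.com",    // unsupported scheme, rejected
		"unix://",               // empty address, rejected
	}
	for _, ep := range endpoints {
		proto, addr, err := parseEndpoint(ep)
		fmt.Printf("%q -> proto=%q addr=%q err=%v\n", ep, proto, addr, err)
	}
}
```

Note that the scheme is matched case-insensitively but returned in its original case, so a caller comparing the result against `"unix"` would need to lowercase it first.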

View File

@ -1,60 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"github.com/container-storage-interface/spec/lib/go/csi"
)
// IdentityServer struct of Vitastor CSI driver with supported methods of CSI identity server spec.
type IdentityServer struct
{
*Driver
}
// NewIdentityServer creates a new identity server instance
func NewIdentityServer(driver *Driver) *IdentityServer
{
return &IdentityServer{
Driver: driver,
}
}
// GetPluginInfo returns metadata of the plugin
func (is *IdentityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error)
{
klog.Infof("received identity plugin info request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginInfoResponse{
Name: vitastorCSIDriverName,
VendorVersion: vitastorCSIDriverVersion,
}, nil
}
// GetPluginCapabilities returns available capabilities of the plugin
func (is *IdentityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error)
{
klog.Infof("received identity plugin capabilities request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginCapabilitiesResponse{
Capabilities: []*csi.PluginCapability{
{
Type: &csi.PluginCapability_Service_{
Service: &csi.PluginCapability_Service{
Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
},
},
},
},
}, nil
}
// Probe returns the health and readiness of the plugin
func (is *IdentityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error)
{
return &csi.ProbeResponse{}, nil
}

View File

@ -1,293 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"os"
"os/exec"
"encoding/json"
"strings"
"bytes"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"k8s.io/utils/mount"
utilexec "k8s.io/utils/exec"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
)
// NodeServer struct of Vitastor CSI driver with supported methods of CSI node server spec.
type NodeServer struct
{
*Driver
mounter mount.Interface
}
// NewNodeServer creates a new node server instance
func NewNodeServer(driver *Driver) *NodeServer
{
return &NodeServer{
Driver: driver,
mounter: mount.New(""),
}
}
// NodeStageVolume mounts the volume to a staging path on the node.
func (ns *NodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error)
{
return &csi.NodeStageVolumeResponse{}, nil
}
// NodeUnstageVolume unstages the volume from the staging path
func (ns *NodeServer) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error)
{
return &csi.NodeUnstageVolumeResponse{}, nil
}
func Contains(list []string, s string) bool
{
for i := 0; i < len(list); i++
{
if (list[i] == s)
{
return true
}
}
return false
}
// NodePublishVolume mounts the volume mounted to the staging path to the target path
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
{
klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
isBlock := req.GetVolumeCapability().GetBlock() != nil
// Check that it's not already mounted
_, error := mount.IsNotMountPoint(ns.mounter, targetPath)
if (error != nil)
{
if (os.IsNotExist(error))
{
if (isBlock)
{
pathFile, err := os.OpenFile(targetPath, os.O_CREATE|os.O_RDWR, 0o600)
if (err != nil)
{
klog.Errorf("failed to create block device mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
err = pathFile.Close()
if (err != nil)
{
klog.Errorf("failed to close %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
}
else
{
err := os.MkdirAll(targetPath, 0777)
if (err != nil)
{
klog.Errorf("failed to create fs mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
}
}
else
{
return nil, status.Error(codes.Internal, error.Error())
}
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--etcd_address", strings.Join(etcdUrl, ","),
"--etcd_prefix", etcdPrefix,
"--image", volName,
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (req.GetReadonly())
{
args = append(args, "--readonly", "1")
}
c := exec.Command("/usr/bin/vitastor-nbd", args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err = c.Run()
stdoutStr, stderrStr := stdout.String(), stderr.String()
if (err != nil)
{
klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err)
return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
}
devicePath := strings.TrimSpace(stdoutStr)
// Check existing format
diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
existingFormat, err := diskMounter.GetDiskFormat(devicePath)
if (err != nil)
{
klog.Errorf("failed to get disk format for path %s, error: %v", devicePath, err)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, err.Error())
}
// Format the device (ext4 or xfs)
fsType := req.GetVolumeCapability().GetMount().GetFsType()
opt := req.GetVolumeCapability().GetMount().GetMountFlags()
opt = append(opt, "_netdev")
if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||
req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY) &&
!Contains(opt, "ro"))
{
opt = append(opt, "ro")
}
if (fsType == "xfs")
{
opt = append(opt, "nouuid")
}
readOnly := Contains(opt, "ro")
if (existingFormat == "" && !readOnly)
{
args := []string{}
switch fsType
{
case "ext4":
args = []string{"-m0", "-Enodiscard,lazy_itable_init=1,lazy_journal_init=1", devicePath}
case "xfs":
args = []string{"-K", devicePath}
}
if (len(args) > 0)
{
cmdOut, cmdErr := diskMounter.Exec.Command("mkfs."+fsType, args...).CombinedOutput()
if (cmdErr != nil)
{
klog.Errorf("failed to run mkfs error: %v, output: %v", cmdErr, string(cmdOut))
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, cmdErr.Error())
}
}
}
if (isBlock)
{
opt = append(opt, "bind")
err = diskMounter.Mount(devicePath, targetPath, fsType, opt)
}
else
{
err = diskMounter.FormatAndMount(devicePath, targetPath, fsType, opt)
}
if (err != nil)
{
klog.Errorf(
"failed to mount device path (%s) to path (%s) for volume (%s) error: %s",
devicePath, targetPath, volName, err,
)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, err.Error())
}
return &csi.NodePublishVolumeResponse{}, nil
}
// NodeUnpublishVolume unmounts the volume from the target path
func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error)
{
klog.Infof("received node unpublish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
devicePath, refCount, err := mount.GetDeviceNameFromMount(ns.mounter, targetPath)
if (err != nil)
{
if (os.IsNotExist(err))
{
return nil, status.Error(codes.NotFound, "Target path not found")
}
return nil, status.Error(codes.Internal, err.Error())
}
if (devicePath == "")
{
return nil, status.Error(codes.NotFound, "Volume not mounted")
}
// unmount
err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
if (err != nil)
{
return nil, status.Error(codes.Internal, err.Error())
}
// unmap NBD device
if (refCount == 1)
{
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
}
return &csi.NodeUnpublishVolumeResponse{}, nil
}
// NodeGetVolumeStats returns volume capacity statistics available for the volume
func (ns *NodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeExpandVolume expands the file system on the node
func (ns *NodeServer) NodeExpandVolume(ctx context.Context, req *csi.NodeExpandVolumeRequest) (*csi.NodeExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeGetCapabilities returns the supported capabilities of the node server
func (ns *NodeServer) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error)
{
return &csi.NodeGetCapabilitiesResponse{}, nil
}
// NodeGetInfo returns NodeGetInfoResponse for CO.
func (ns *NodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error)
{
klog.Infof("received node get info request %+v", protosanitizer.StripSecrets(req))
return &csi.NodeGetInfoResponse{
NodeId: ns.NodeID,
}, nil
}

View File

@ -1,36 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
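import (
"errors"
"k8s.io/klog"
)
type Driver struct
{
*Config
}
// NewDriver creates a new driver instance
func NewDriver(config *Config) (*Driver, error)
{
if (config == nil)
{
klog.Errorf("Vitastor CSI driver initialization failed")
// Return a real error so that callers do not dereference a nil driver
return nil, errors.New("nil config passed to NewDriver")
}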
driver := &Driver{
Config: config,
}
klog.Infof("Vitastor CSI driver initialized")
return driver, nil
}
// Run starts the gRPC server and blocks until it stops
func (driver *Driver) Run()
{
server := NewNonBlockingGRPCServer()
server.Start(driver.Endpoint, NewIdentityServer(driver), NewControllerServer(driver), NewNodeServer(driver))
server.Wait()
}

View File

@ -1,39 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package main
import (
"flag"
"fmt"
"os"
"k8s.io/klog"
"vitastor.io/csi/src"
)
func main()
{
var config = vitastor.NewConfig()
flag.StringVar(&config.Endpoint, "endpoint", "", "CSI endpoint")
flag.StringVar(&config.NodeID, "node", "", "Node ID")
flag.Parse()
if (config.Endpoint == "")
{
config.Endpoint = os.Getenv("CSI_ENDPOINT")
}
if (config.NodeID == "")
{
config.NodeID = os.Getenv("NODE_ID")
}
if (config.Endpoint == "" || config.NodeID == "")
{
fmt.Fprintf(os.Stderr, "Please set -endpoint and -node / CSI_ENDPOINT & NODE_ID env vars\n")
os.Exit(1)
}
drv, err := vitastor.NewDriver(config)
if (err != nil)
{
klog.Fatalln(err)
}
drv.Run()
}

View File

@ -1,7 +0,0 @@
#!/bin/bash
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

View File

@ -1,7 +0,0 @@
#!/bin/bash
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=buster -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

27
debian/changelog vendored
View File

@ -1,27 +0,0 @@
vitastor (0.6.17-1) unstable; urgency=medium
* RDMA support
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sat, 01 May 2021 18:46:10 +0300
vitastor (0.6.0-1) unstable; urgency=medium
* Snapshots and Copy-on-Write clones
* Image metadata in etcd (name, size)
* Image I/O and space statistics in etcd
* Write throttling for smoothing random write workloads in SSD+HDD configurations
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sun, 11 Apr 2021 00:49:18 +0300
vitastor (0.5.1-1) unstable; urgency=medium
* Add jerasure support
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sat, 05 Dec 2020 17:02:26 +0300
vitastor (0.5-1) unstable; urgency=medium
* First packaging for Debian
-- Vitaliy Filippov <vitalif@yourcmc.ru> Thu, 05 Nov 2020 02:20:59 +0300

1
debian/compat vendored
View File

@ -1 +0,0 @@
13

55
debian/control vendored
View File

@ -1,55 +0,0 @@
Source: vitastor
Section: admin
Priority: optional
Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru>
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev, libibverbs-dev
Standards-Version: 4.5.0
Homepage: https://vitastor.io/
Rules-Requires-Root: no
Package: vitastor
Architecture: amd64
Depends: vitastor-osd, vitastor-mon, vitastor-client, vitastor-client-dev, vitastor-fio
Description: Vitastor, a fast software-defined clustered block storage
Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
architecturally similar to Ceph which means strong consistency, primary-replication,
symmetric clustering and automatic data distribution over any number of drives of any
size with configurable redundancy (replication or erasure codes/XOR).
Package: vitastor-osd
Architecture: amd64
Depends: ${shlibs:Depends}, ${misc:Depends}, vitastor-client (= ${binary:Version})
Description: Vitastor, a fast software-defined clustered block storage - object storage daemon
Vitastor object storage daemon, i.e. server program that stores data.
Package: vitastor-mon
Architecture: amd64
Depends: ${misc:Depends}, nodejs (>= 10), node-sprintf-js, node-ws (>= 7), lp-solve
Description: Vitastor, a fast software-defined clustered block storage - monitor
Vitastor monitor, i.e. server program responsible for watching cluster state and
scheduling cluster-level operations.
Package: vitastor-client
Architecture: amd64
Depends: ${shlibs:Depends}, ${misc:Depends}
Description: Vitastor, a fast software-defined clustered block storage - client
Vitastor client library and command-line interface.
Package: vitastor-client-dev
Section: devel
Architecture: amd64
Depends: ${misc:Depends}, vitastor-client (= ${binary:Version})
Description: Vitastor, a fast software-defined clustered block storage - development files
Vitastor library headers for development.
Package: vitastor-fio
Architecture: amd64
Depends: ${shlibs:Depends}, ${misc:Depends}, vitastor-client (= ${binary:Version}), fio (= ${dep:fio})
Description: Vitastor, a fast software-defined clustered block storage - fio drivers
Vitastor fio drivers for benchmarking.
Package: pve-storage-vitastor
Architecture: amd64
Depends: ${shlibs:Depends}, ${misc:Depends}, vitastor-client (= ${binary:Version})
Description: Vitastor Proxmox Virtual Environment storage plugin
Vitastor storage plugin for Proxmox Virtual Environment.

21
debian/copyright vendored
View File

@ -1,21 +0,0 @@
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: vitastor
Upstream-Contact: Vitaliy Filippov <vitalif@yourcmc.ru>
Source: https://vitastor.io
Files: *
Copyright: 2019+ Vitaliy Filippov <vitalif@yourcmc.ru>
License: Multiple licenses VNPL-1.1 and/or GPL-2.0+
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.
.
Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.

1
debian/fio_version vendored
View File

@ -1 +0,0 @@
dep:fio=3.16-1

4
debian/install vendored
View File

@ -1,4 +0,0 @@
VNPL-1.1.txt usr/share/doc/vitastor
GPL-2.0.txt usr/share/doc/vitastor
README.md usr/share/doc/vitastor
README-ru.md usr/share/doc/vitastor

View File

@ -1,40 +0,0 @@
# Build patched libvirt for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/libvirt.Dockerfile .
ARG REL=
FROM debian:$REL
ARG REL=
WORKDIR /root
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" ]; then \
echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
echo "Pin: release a=$REL-backports" >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
echo 'APT::Install-Suggests false;' >> /etc/apt/apt.conf
RUN apt-get update; apt-get -y install devscripts
RUN apt-get -y build-dep libvirt0
RUN apt-get -y install libglusterfs-dev
RUN apt-get --download-only source libvirt
ADD patches/libvirt-5.0-vitastor.diff patches/libvirt-7.0-vitastor.diff patches/libvirt-7.5-vitastor.diff patches/libvirt-7.6-vitastor.diff /root
RUN set -e; \
mkdir -p /root/packages/libvirt-$REL; \
rm -rf /root/packages/libvirt-$REL/*; \
cd /root/packages/libvirt-$REL; \
dpkg-source -x /root/libvirt*.dsc; \
D=$(ls -d libvirt-*/); \
V=$(ls -d libvirt-*/ | perl -pe 's/libvirt-(\d+\.\d+).*/$1/'); \
cp /root/libvirt-$V-vitastor.diff $D/debian/patches; \
echo libvirt-$V-vitastor.diff >> $D/debian/patches/series; \
cd $D; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?(\+deb[u\d]+)?\).*$/$1/')+vitastor2; \
DEBEMAIL="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Add Vitastor support'; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
rm -rf /root/packages/libvirt-$REL/$D

View File

@ -1,61 +0,0 @@
# Build patched QEMU for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/patched-qemu.Dockerfile .
ARG REL=
FROM debian:$REL
ARG REL=
WORKDIR /root
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" ]; then \
echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
echo "Pin: release a=$REL-backports" >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
echo 'APT::Install-Suggests false;' >> /etc/apt/apt.conf
RUN apt-get update
RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep qemu
# To build a custom version
#RUN cp /root/packages/qemu-orig/* /root
RUN apt-get --download-only source qemu
ADD patches/qemu-5.0-vitastor.patch patches/qemu-5.1-vitastor.patch patches/qemu-6.1-vitastor.patch src/qemu_driver.c /root/vitastor/patches/
RUN set -e; \
apt-get install -y wget; \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg; \
(echo deb http://vitastor.io/debian $REL main > /etc/apt/sources.list.d/vitastor.list); \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update; \
apt-get install -y vitastor-client vitastor-client-dev quilt; \
mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \
if ls -d /root/packages/qemu-$REL/qemu-5.0*; then \
D=$(ls -d /root/packages/qemu-$REL/qemu-5.0*); \
cp /root/vitastor/patches/qemu-5.0-vitastor.patch $D/debian/patches; \
echo qemu-5.0-vitastor.patch >> $D/debian/patches/series; \
elif ls /root/packages/qemu-$REL/qemu-6.1*; then \
D=$(ls -d /root/packages/qemu-$REL/qemu-6.1*); \
cp /root/vitastor/patches/qemu-6.1-vitastor.patch $D/debian/patches; \
echo qemu-6.1-vitastor.patch >> $D/debian/patches/series; \
else \
cp /root/vitastor/patches/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
echo qemu-5.1-vitastor.patch >> $P/series; \
fi; \
cd /root/packages/qemu-$REL/qemu-*/; \
quilt push -a; \
quilt add block/vitastor.c; \
cp /root/vitastor/patches/qemu_driver.c block/vitastor.c; \
quilt refresh; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor1; \
DEBEMAIL="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Plug Vitastor block driver'; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
rm -rf /root/packages/qemu-$REL/qemu-*/

View File

@ -1 +0,0 @@
patches/PVE_VitastorPlugin.pm usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm

19
debian/raw.h vendored
View File

@ -1,19 +0,0 @@
/* Removed in Linux 5.14 */
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef __LINUX_RAW_H
#define __LINUX_RAW_H
#include <linux/types.h>
#define RAW_SETBIND _IO( 0xac, 0 )
#define RAW_GETBIND _IO( 0xac, 1 )
struct raw_config_request
{
int raw_minor;
__u64 block_major;
__u64 block_minor;
};
#endif /* __LINUX_RAW_H */

10
debian/rules vendored
View File

@ -1,10 +0,0 @@
#!/usr/bin/make -f
export DH_VERBOSE = 1
%:
dh $@
override_dh_installdeb:
cat debian/fio_version >> debian/vitastor-fio.substvars
[ -f debian/qemu_version ] && (cat debian/qemu_version >> debian/vitastor-qemu.substvars) || true
dh_installdeb

View File

@ -1 +0,0 @@
3.0 (quilt)

View File

@ -1,2 +0,0 @@
usr/include
usr/lib/*/pkgconfig

View File

@ -1,7 +0,0 @@
usr/bin/vita
usr/bin/vitastor-cli
usr/bin/vitastor-rm
usr/bin/vitastor-nbd
usr/bin/vitastor-nfs
usr/lib/*/libvitastor*.so*
mon/make-osd.sh /usr/lib/vitastor

View File

@ -1 +0,0 @@
usr/lib/*/libfio*.so*

View File

@ -1 +0,0 @@
mon usr/lib/vitastor

View File

@ -1,2 +0,0 @@
usr/bin/vitastor-osd
usr/bin/vitastor-dump-journal

View File

@ -1,55 +0,0 @@
# Build Vitastor packages for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
ARG REL=
FROM debian:$REL
ARG REL=
WORKDIR /root
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" ]; then \
echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
echo "Pin: release a=$REL-backports" >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
echo 'APT::Install-Suggests false;' >> /etc/apt/apt.conf
RUN apt-get update
RUN apt-get -y install fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep fio
RUN apt-get --download-only source fio
RUN apt-get update && apt-get -y install libjerasure-dev cmake libibverbs-dev
ADD . /root/vitastor
RUN set -e -x; \
mkdir -p /root/fio-build/; \
cd /root/fio-build/; \
rm -rf /root/fio-build/*; \
dpkg-source -x /root/fio*.dsc; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.6.17; \
cd vitastor-0.6.17; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
sh copy-fio-includes.sh; \
rm fio; \
mkdir -p a b debian/patches; \
mv fio-copy b/fio; \
diff -NaurpbB a b > debian/patches/fio-headers.patch || true; \
echo fio-headers.patch >> debian/patches/series; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.17.orig.tar.xz vitastor-0.6.17; \
cd vitastor-0.6.17; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBEMAIL="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
rm -rf /root/packages/vitastor-$REL/vitastor-*/

View File

@ -1,9 +0,0 @@
# Build Docker image with Vitastor packages
FROM debian:bullseye
ADD vitastor.list /etc/apt/sources.list.d
ADD vitastor.gpg /etc/apt/trusted.gpg.d
ADD vitastor.pref /etc/apt/preferences.d
ADD apt.conf /etc/apt/
RUN apt-get update && apt-get -y install vitastor qemu-system-x86 qemu-system-common && apt-get clean

View File

@ -1 +0,0 @@
APT::Install-Recommends false;

Binary file not shown.

View File

@ -1 +0,0 @@
deb http://vitastor.io/debian bullseye main

View File

@ -1,3 +0,0 @@
Package: *
Pin: origin "vitastor.io"
Pin-Priority: 1000

View File

@ -1,37 +0,0 @@
[Documentation](../README.md#documentation) → Configuration Reference
-----
[Read in Russian](config.ru.md)
# Configuration Reference
Vitastor configuration consists of:
- [Configuration parameters (key-value)](#parameter-reference)
- [Pool configuration](config/pool.en.md)
- [OSD placement tree configuration](config/pool.en.md#placement-tree)
- [Separate OSD settings](config/pool.en.md#osd-settings)
- [Inode configuration](config/inode.en.md) i.e. image metadata like name, size and parent reference
Configuration parameters can be set in 3 places:
- Configuration file (`/etc/vitastor/vitastor.conf` or other path)
- etcd key `/vitastor/config/global`. Most variables can be set there, but etcd
connection parameters should obviously be set in the configuration file.
- Command line of Vitastor components: OSD, mon, fio and QEMU options,
OpenStack/Proxmox/etc configuration. The latter don't allow setting all
variables directly, but they do allow overriding the configuration file path,
so you can set everything you need inside that file.
In the future, additional configuration methods may be added:
- OSD superblock which will, by design, contain parameters related to the disk
layout and to one specific OSD.
- OSD-specific keys in etcd like `/vitastor/config/osd/<number>`.
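
The three configuration sources listed earlier in this section can be thought of as layers of key-value pairs. The following is a minimal sketch of such layering, assuming that the command line overrides the etcd global key, which in turn overrides the configuration file. That precedence order is an assumption made for illustration; this page does not state it. The parameter names `etcd_address` and `log_level` come from the parameter reference, while the values and the `mergeConfig` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeConfig overlays later layers on top of earlier ones, so a key set in a
// later layer replaces the same key from an earlier layer.
func mergeConfig(layers ...map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, layer := range layers {
		for k, v := range layer {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	// Made-up values for illustration only.
	file := map[string]string{"etcd_address": "10.0.0.1:2379", "log_level": "0"}
	etcdGlobal := map[string]string{"log_level": "1"}
	cmdline := map[string]string{"log_level": "3"}

	// Assumed precedence: file < etcd global key < command line.
	merged := mergeConfig(file, etcdGlobal, cmdline)

	// Print keys in sorted order, since Go map iteration order is random.
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %s\n", k, merged[k])
	}
}
```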
## Parameter Reference
- [Common](config/common.en.md)
- [Network](config/network.en.md)
- [Global Disk Layout](config/layout-cluster.en.md)
- [OSD Disk Layout](config/layout-osd.en.md)
- [OSD Runtime Parameters](config/osd.en.md)
- [Monitor](config/monitor.en.md)

@ -1,39 +0,0 @@
[Документация](../README-ru.md#документация) → Конфигурация Vitastor
-----
[Read in English](config.en.md)
# Конфигурация Vitastor
Конфигурация Vitastor состоит из:
- [Параметров (ключ-значение)](#список-параметров)
- [Настроек пулов](config/pool.ru.md)
- [Настроек дерева OSD](config/pool.ru.md#дерево-размещения)
- [Настроек отдельных OSD](config/pool.ru.md#настройки-osd)
- [Настроек инодов](config/inode.ru.md), т.е. метаданных образов, таких, как имя, размер и ссылки на
родительский образ
Параметры конфигурации могут задаваться в 3 местах:
- Файле конфигурации (`/etc/vitastor/vitastor.conf` или по другому пути)
- Ключе в etcd `/vitastor/config/global`. Большая часть параметров может
задаваться там, кроме, естественно, самих параметров соединения с etcd,
которые должны задаваться в файле конфигурации
- В командной строке компонентов Vitastor: OSD, монитора, опциях fio и QEMU,
настроек OpenStack, Proxmox и т.п. Последние, как правило, не включают полный
набор параметров напрямую, но разрешают определить путь к файлу конфигурации
и задать любые параметры в нём.
В будущем также могут быть добавлены другие способы конфигурации:
- Суперблок OSD, в котором будут храниться параметры OSD, связанные с дисковым
форматом и с этим конкретным OSD.
- OSD-специфичные ключи в etcd типа `/vitastor/config/osd/<номер>`.
## Список параметров
- [Общие](config/common.ru.md)
- [Сеть](config/network.ru.md)
- [Глобальные дисковые параметры](config/layout-cluster.ru.md)
- [Дисковые параметры OSD](config/layout-osd.ru.md)
- [Прочие параметры OSD](config/osd.ru.md)
- [Параметры мониторов](config/monitor.ru.md)

@ -1,46 +0,0 @@
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Common Parameters
-----
[Читать на русском](common.ru.md)
# Common Parameters
These are the most common parameters which apply to all components of Vitastor.
- [config_path](#config_path)
- [etcd_address](#etcd_address)
- [etcd_prefix](#etcd_prefix)
- [log_level](#log_level)
## config_path
- Type: string
- Default: /etc/vitastor/vitastor.conf
Path to the JSON configuration file. The configuration file is optional:
a missing configuration file does not prevent Vitastor from
running if the required parameters are specified.
## etcd_address
- Type: string or array of strings
etcd connection endpoint(s). Multiple endpoints may be delimited by "," or
specified in a JSON array `["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]`.
Note that https is not supported for etcd connections yet.
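The two accepted forms of `etcd_address` can be illustrated with a small sketch. This is not Vitastor's own parser; the helper name `parse_etcd_address` is hypothetical and only shows how a comma-delimited string and a JSON array describe the same endpoint list.

```python
import json


def parse_etcd_address(value):
    """Normalize an etcd_address value into a list of endpoint strings.

    Hypothetical helper for illustration only, not part of Vitastor.
    Accepts a Python list, a JSON array string, or a comma-delimited string.
    """
    if isinstance(value, list):
        return [str(v).strip() for v in value]
    text = value.strip()
    if text.startswith("["):
        # JSON array form: ["host1:2379/v3","host2:2379/v3"]
        return [str(v).strip() for v in json.loads(text)]
    # Comma-delimited form: host1:2379/v3,host2:2379/v3
    return [part.strip() for part in text.split(",") if part.strip()]


if __name__ == "__main__":
    print(parse_etcd_address("10.0.115.10:2379/v3,10.0.115.11:2379/v3"))
    print(parse_etcd_address('["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]'))
```

Both calls in the usage section should yield the same two-element list.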
## etcd_prefix
- Type: string
- Default: /vitastor
Prefix for all keys in etcd used by Vitastor. You can change the prefix, for
example, to use a single etcd cluster for multiple Vitastor clusters.
## log_level
- Type: integer
- Default: 0
Log level. Raise if you want more verbose output.

@ -1,45 +0,0 @@
[Документация](../../README-ru.md#документация) → [Конфигурация](../config.ru.md) → Общие параметры
-----
[Read in English](common.en.md)
# Общие параметры
Это наиболее общие параметры, используемые всеми компонентами Vitastor.
- [config_path](#config_path)
- [etcd_address](#etcd_address)
- [etcd_prefix](#etcd_prefix)
- [log_level](#log_level)
## config_path
- Тип: строка
- Значение по умолчанию: /etc/vitastor/vitastor.conf
Путь к файлу конфигурации в формате JSON. Файл конфигурации необязателен,
без него Vitastor тоже будет работать, если переданы необходимые параметры.
## etcd_address
- Тип: строка или массив строк
Адрес(а) подключения к etcd. Несколько адресов могут разделяться запятой
или указываться в виде JSON-массива `["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]`.
## etcd_prefix
- Тип: строка
- Значение по умолчанию: /vitastor
Префикс для ключей etcd, которые использует Vitastor. Вы можете задать другой
префикс, например, чтобы запустить несколько кластеров Vitastor с одним
кластером etcd.
## log_level
- Тип: целое число
- Значение по умолчанию: 0
Уровень логирования. Повысьте, если хотите более подробный вывод.

@ -1,32 +0,0 @@
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Image metadata in etcd
-----
[Читать на русском](inode.ru.md)
# Image metadata in etcd
Image list is stored in etcd in `/vitastor/config/inode/<pool>/<inode>` keys.
You can even create images manually:
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id, the image becomes a CoW clone, i.e. all writes go to the new inode and reads check it first
and then its parent layers. For safety, you can then make the parent readonly by updating its entry with `"readonly":true` and
treat it as a snapshot.
So to create a snapshot you rename the previous top layer (for example from testimg to testimg@0), make it readonly
and create a new top layer with the original name (testimg) and the previous one as its parent.
vitastor-cli, K8s, OpenStack and other drivers also store the reverse mapping in `/vitastor/index/image/<name>` keys
in JSON format: `{"id":<inode>,"pool_id":<pool>}` and ID counters in `/vitastor/index/maxid/<pool>` as numbers
to simplify ID generation.
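The manual snapshot procedure described above can be sketched as two etcd writes. The helper below only builds the key/value pairs; `snapshot_puts` and its arguments are hypothetical names, the writes are not atomic, and the `/vitastor/index/...` keys are deliberately left out, so treat it as an illustration of the data layout rather than a replacement for vitastor-cli.

```python
import json

INODE_PREFIX = "/vitastor/config/inode"  # key prefix taken from the text above


def snapshot_puts(pool, old_inode, old_entry, new_inode, snap_name):
    """Return (key, json_value) pairs that turn an image into a snapshot + new top layer.

    Hypothetical helper: old_entry is the current JSON entry of the image
    (for example {"name": "testimg", "size": 2147483648}).
    """
    # 1. Rename the previous top layer and mark it readonly.
    snap = dict(old_entry, name=snap_name, readonly=True)
    # 2. Create a new top layer with the original name, pointing at the old one.
    top = {
        "name": old_entry["name"],
        "size": old_entry["size"],
        "parent_id": old_inode,
    }
    compact = {"separators": (",", ":")}
    return [
        (f"{INODE_PREFIX}/{pool}/{old_inode}", json.dumps(snap, **compact)),
        (f"{INODE_PREFIX}/{pool}/{new_inode}", json.dumps(top, **compact)),
    ]


if __name__ == "__main__":
    entry = {"name": "testimg", "size": 2147483648}
    for key, value in snapshot_puts(1, 1, entry, 2, "testimg@0"):
        # Each pair could be passed to `etcdctl put <key> '<value>'`.
        print(key, value)
```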

@ -1,34 +0,0 @@
[Документация](../../README-ru.md#документация) → [Конфигурация](../config.ru.md) → Метаданные образов в etcd
-----
[Read in English](inode.en.md)
# Метаданные образов в etcd
Список образов хранится в etcd в ключах `/vitastor/config/inode/<pool>/<inode>`.
Вы можете даже создавать образы вручную:
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
Например:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
Если вы зададите parent_id, то образ станет CoW-клоном, т.е. все новые запросы записи пойдут в новый инод, а запросы
чтения будут проверять сначала его, а потом родительские слои по цепочке вверх. Чтобы случайно не перезаписать данные
в родительском слое, вы можете переключить его в режим "только чтение", добавив флаг `"readonly":true` в его запись
метаданных. В таком случае родительский образ становится просто снапшотом.
Таким образом, для создания снапшота вам нужно просто переименовать предыдущий inode (например, из testimg в testimg@0),
сделать его readonly и создать новый слой с исходным именем образа (testimg), ссылающийся на только что переименованный
в качестве родительского.
vitastor-cli и драйвера K8s, OpenStack и т.п. также хранят обратный маппинг в ключах `/vitastor/index/image/<name>`
в JSON-формате: `{"id":<inode>,"pool_id":<pool>}` и счётчики ID `/vitastor/index/maxid/<pool>` в виде просто чисел
для упрощения генерации ID новых образов.

@ -1,124 +0,0 @@
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Cluster-Wide Disk Layout Parameters
-----
[Читать на русском](layout-cluster.ru.md)
# Cluster-Wide Disk Layout Parameters
These parameters apply to clients and OSDs, are fixed at the moment of OSD drive
initialization and can't be changed after it without losing data.
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
- [client_dirty_limit](#client_dirty_limit)
## block_size
- Type: integer
- Default: 131072
Size of objects (data blocks) into which all physical and virtual drives are
subdivided in Vitastor. Currently one of the main settings in Vitastor: it affects
memory usage, write amplification and I/O load distribution effectiveness.
Recommended default block size is 128 KB for SSD and 4 MB for HDD. In fact,
it's possible to use 4 MB for SSD too - it will lower memory usage, but
may increase average WA and reduce linear performance.
OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can
currently coexist in one etcd instance only within separate Vitastor
clusters with different etcd_prefix values.
Also, the block size can't be changed after OSD initialization without losing
data.
You must always specify block_size in etcd in /vitastor/config/global if
you change it so all clients can know about it.
OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
544 MB per 1 TB of used disk space with the default 128 KB block size.
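The memory estimate above is plain arithmetic and can be checked with a few lines. This sketch assumes "1 TB" and "MB" are meant in binary units (2^40 and 2^20 bytes), which is what makes the 544 figure come out exactly; the function name is hypothetical.

```python
def osd_memory_bytes(used_bytes, block_size=131072, bytes_per_object=68):
    """Rough OSD memory usage: SIZE / BLOCK * 68 bytes (formula from the text)."""
    return used_bytes // block_size * bytes_per_object


if __name__ == "__main__":
    TIB = 2 ** 40
    MIB = 2 ** 20
    # Default 128 KB blocks
    print(osd_memory_bytes(TIB) / MIB)
    # 4 MB blocks, as suggested for HDDs
    print(osd_memory_bytes(TIB, block_size=4 * MIB) / MIB)
```

With 4 MB blocks the same terabyte needs 32 times less index memory, which is the trade-off against write amplification mentioned above.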
## bitmap_granularity
- Type: integer
- Default: 4096
Required virtual disk write alignment ("sector size"). Must be a multiple
of disk_alignment. It's called bitmap granularity because Vitastor tracks
an allocation bitmap for each object containing 2 bits for every
bitmap_granularity bytes.
This parameter can't be changed after OSD initialization without losing
data. Also it's fixed for the whole Vitastor cluster i.e. two different
values can't be used in a single Vitastor cluster.
Clients MUST be aware of this parameter value, so put it into etcd key
/vitastor/config/global if you change it for any reason.
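To make the bitmap description concrete, here is a small sketch of the arithmetic. It only applies the "2 bits per bitmap_granularity bytes" rule from the text and does not claim to reflect how Vitastor lays the bitmap out internally; both function names are hypothetical.

```python
def bitmap_bytes_per_object(block_size=131072, bitmap_granularity=4096, bits_per_granule=2):
    """Size of the per-object allocation bitmap implied by the text, in bytes."""
    granules = block_size // bitmap_granularity
    return (granules * bits_per_granule + 7) // 8


def is_write_aligned(offset, length, bitmap_granularity=4096):
    """Check that a virtual disk write respects the required alignment."""
    return offset % bitmap_granularity == 0 and length % bitmap_granularity == 0


if __name__ == "__main__":
    print(bitmap_bytes_per_object())                     # default 128 KB block
    print(bitmap_bytes_per_object(block_size=4 << 20))   # 4 MB block
    print(is_write_aligned(8192, 4096), is_write_aligned(512, 4096))
```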
## immediate_commit
- Type: string
- Default: false
Another parameter which is really important for performance.
Desktop SSDs are very fast (100000+ iops) for simple random writes
without cache flush. However, they are really slow (only around 1000 iops)
if you try to fsync() each write, that is, when you want to guarantee that
each change gets immediately persisted to the physical media.
Server-grade SSDs with "Advanced/Enhanced Power Loss Protection" or with
"Supercapacitor-based Power Loss Protection", on the other hand, are equally
fast with and without fsync because their cache is protected from sudden
power loss by a built-in supercapacitor-based "UPS".
Some software-defined storage systems always fsync each write and thus are
really slow when used with desktop SSDs. Vitastor, however, can also
efficiently utilize desktop SSDs by postponing fsync until the client calls
it explicitly.
This is what this parameter regulates. When it's set to "all" the whole
Vitastor cluster commits each change to disks immediately and clients just
ignore fsyncs because they know for sure that they're unneeded. This reduces
the amount of network roundtrips performed by clients and improves
performance. So it's always better to use server grade SSDs with
supercapacitors even with Vitastor, especially given that they cost only
a bit more than desktop models.
There is also a common SATA SSD (and HDD too!) firmware bug (or feature)
that makes server SSDs which have supercapacitors slow with fsync. To check
if your SSDs are affected, compare benchmark results from `fio -name=test
-ioengine=libaio -direct=1 -bs=4k -rw=randwrite -iodepth=1` with and without
`-fsync=1`. Results should be the same. If fsync=1 result is worse you can
try to work around this bug by "disabling" drive write-back cache by running
`hdparm -W 0 /dev/sdXX` or `echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type`
(IMPORTANT: don't confuse it with `/sys/block/sdXX/queue/write_cache` - it's
unsafe to change by hand). The same may apply to newer HDDs with internal
SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
it (they have internal SSD cache even though it's not stated in datasheets).
This parameter must be set both in etcd in /vitastor/config/global and in
OSD command line or configuration. Setting it to "all" or "small" requires
enabling disable_journal_fsync and disable_meta_fsync, setting it to "all"
also requires enabling disable_data_fsync.
TLDR: For optimal performance, set immediate_commit to "all" if you only use
SSDs with supercapacitor-based power loss protection (nonvolatile
write-through cache) for both data and journals in the whole Vitastor
cluster. Set it to "small" if you only use such SSDs for journals. Leave
empty if your drives have write-back cache.
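The TLDR and the fsync requirements above can be summarized as a decision sketch. The two boolean inputs are hypothetical descriptions of your hardware (whether all data drives and all journal drives have a nonvolatile or write-through cache); nothing here is detected automatically, and the functions are not part of Vitastor.

```python
def recommend_immediate_commit(data_cache_safe, journal_cache_safe):
    """Pick an immediate_commit value following the TLDR above.

    Returns "all", "small" or "" (leave the parameter empty).
    """
    if data_cache_safe and journal_cache_safe:
        return "all"
    if journal_cache_safe:
        return "small"
    return ""


def required_fsync_flags(immediate_commit):
    """OSD options that the text says must be enabled for a given value."""
    flags = {}
    if immediate_commit in ("all", "small"):
        flags["disable_journal_fsync"] = True
        flags["disable_meta_fsync"] = True
    if immediate_commit == "all":
        flags["disable_data_fsync"] = True
    return flags


if __name__ == "__main__":
    value = recommend_immediate_commit(data_cache_safe=False, journal_cache_safe=True)
    print(repr(value), required_fsync_flags(value))
```

Remember that the chosen value has to be set both in etcd and in the OSD configuration, as stated above.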
## client_dirty_limit
- Type: integer
- Default: 33554432
Without immediate_commit=all this parameter sets the limit of "dirty"
(not committed by fsync) data allowed by the client before forcing an
additional fsync and committing the data. Also note that the client always
holds a copy of uncommitted data in memory so this setting also affects
RAM usage of clients.
This parameter doesn't affect OSDs themselves.

@ -1,134 +0,0 @@
[Документация](../../README-ru.md#документация) → [Конфигурация](../config.ru.md) → Дисковые параметры уровня кластера
-----
[Read in English](layout-cluster.en.md)
# Дисковые параметры уровня кластера
Данные параметры используются клиентами и OSD, задаются в момент инициализации
диска OSD и не могут быть изменены после этого без потери данных.
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
- [client_dirty_limit](#client_dirty_limit)
## block_size
- Тип: целое число
- Значение по умолчанию: 131072
Размер объектов (блоков данных), на которые делятся физические и виртуальные
диски в Vitastor. Одна из ключевых на данный момент настроек, влияет на
потребление памяти, объём избыточной записи (write amplification) и
эффективность распределения нагрузки по OSD.
Рекомендуемые по умолчанию размеры блока - 128 килобайт для SSD и 4
мегабайта для HDD. В принципе, для SSD можно тоже использовать 4 мегабайта,
это понизит использование памяти, но ухудшит распределение нагрузки и в
среднем увеличит WA.
OSD с разными размерами блока (например, SSD и SSD+HDD OSD) на данный
момент могут сосуществовать в рамках одного etcd только в виде двух независимых
кластеров Vitastor с разными etcd_prefix.
Также размер блока нельзя менять после инициализации OSD без потери данных.
Если вы меняете размер блока, обязательно прописывайте его в etcd в
/vitastor/config/global, дабы все клиенты его знали.
Потребление памяти OSD составляет примерно (РАЗМЕР / БЛОК * 68 байт),
т.е. примерно 544 МБ памяти на 1 ТБ занятого места на диске при
стандартном 128 КБ блоке.
## bitmap_granularity
- Тип: целое число
- Значение по умолчанию: 4096
Требуемое выравнивание записи на виртуальные диски (размер их "сектора").
Должен быть кратен disk_alignment. Называется гранулярностью битовой карты
потому, что Vitastor хранит битовую карту для каждого объекта, содержащую
по 2 бита на каждые (bitmap_granularity) байт.
Данный параметр нельзя менять после инициализации OSD без потери данных.
Также он фиксирован для всего кластера Vitastor, т.е. разные значения
не могут сосуществовать в одном кластере.
Клиенты ДОЛЖНЫ знать правильное значение этого параметра, так что если вы
его меняете, обязательно прописывайте изменённое значение в etcd в ключ
/vitastor/config/global.
## immediate_commit
- Тип: строка
- Значение по умолчанию: false
Ещё один важный для производительности параметр.
Модели SSD для настольных компьютеров очень быстрые (100000+ операций в
секунду) при простой случайной записи без сбросов кэша. Однако они очень
медленные (всего порядка 1000 iops), если вы пытаетесь сбрасывать кэш после
каждой записи, то есть, если вы пытаетесь гарантировать, что каждое
изменение физически записывается в энергонезависимую память.
С другой стороны, серверные SSD с конденсаторами - функцией, называемой
"Advanced/Enhanced Power Loss Protection" или просто "Supercapacitor-based
Power Loss Protection" - одинаково быстрые и со сбросом кэша, и без
него, потому что их кэш защищён от потери питания встроенным "источником
бесперебойного питания" на основе суперконденсаторов и на самом деле они
его никогда не сбрасывают.
Некоторые программные СХД всегда сбрасывают кэши дисков при каждой записи
и поэтому работают очень медленно с настольными SSD. Vitastor, однако, может
откладывать fsync до явного его вызова со стороны клиента и таким образом
эффективно утилизировать настольные SSD.
Данный параметр влияет как раз на это. Когда он установлен в значение "all",
весь кластер Vitastor мгновенно фиксирует каждое изменение на физические
носители и клиенты могут просто игнорировать запросы fsync, т.к. они точно
знают, что fsync-и не нужны. Это уменьшает число необходимых обращений к OSD
по сети и улучшает производительность. Поэтому даже с Vitastor лучше всегда
использовать только серверные модели SSD с суперконденсаторами, особенно
учитывая то, что стоят они ненамного дороже настольных.
Также в прошивках SATA SSD (и даже HDD!) очень часто встречается либо баг,
либо просто особенность логики, из-за которой серверные SSD, имеющие
конденсаторы и защиту от потери питания, всё равно медленно работают с
fsync. Чтобы понять, подвержены ли этой проблеме ваши SSD, сравните
результаты тестов `fio -name=test -ioengine=libaio -direct=1 -bs=4k
-rw=randwrite -iodepth=1` без и с опцией `-fsync=1`. Результаты должны
быть одинаковые. Если результат с `fsync=1` хуже, вы можете попробовать
обойти проблему, "отключив" кэш записи диска командой `hdparm -W 0 /dev/sdXX`
либо `echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type`
(ВАЖНО: не перепутайте с `/sys/block/sdXX/queue/write_cache` - этот параметр
менять руками небезопасно). Такая же проблема может встречаться и в новых
HDD-дисках с внутренним SSD или "медиа" кэшем - например, она встречается во
многих дисках Seagate EXOS (у них есть внутренний SSD-кэш, хотя это и не
указано в спецификациях).
Данный параметр нужно указывать и в etcd в /vitastor/config/global, и в
командной строке или конфигурации OSD. Значения "all" и "small" требуют
включения disable_journal_fsync и disable_meta_fsync, значение "all" также
требует включения disable_data_fsync.
Итого, вкратце: для оптимальной производительности установите
immediate_commit в значение "all", если вы используете в кластере только SSD
с суперконденсаторами и для данных, и для журналов. Если вы используете
такие SSD для всех журналов, но не для данных - можете установить параметр
в "small". Если и какие-то из дисков журналов имеют волатильный кэш записи -
оставьте параметр пустым.
## client_dirty_limit
- Тип: целое число
- Значение по умолчанию: 33554432
При работе без immediate_commit=all - это лимит объёма "грязных" (не
зафиксированных fsync-ом) данных, при достижении которого клиент будет
принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
что в этом случае до момента fsync клиент хранит копию незафиксированных
данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
Параметр не влияет на сами OSD.

@ -1,176 +0,0 @@
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → OSD Disk Layout Parameters
-----
[Читать на русском](layout-osd.ru.md)
# OSD Disk Layout Parameters
These parameters apply to OSDs, are fixed at the moment of OSD drive
initialization and can't be changed after it without losing data.
- [data_device](#data_device)
- [meta_device](#meta_device)
- [journal_device](#journal_device)
- [journal_offset](#journal_offset)
- [journal_size](#journal_size)
- [meta_offset](#meta_offset)
- [data_offset](#data_offset)
- [data_size](#data_size)
- [meta_block_size](#meta_block_size)
- [journal_block_size](#journal_block_size)
- [disable_data_fsync](#disable_data_fsync)
- [disable_meta_fsync](#disable_meta_fsync)
- [disable_journal_fsync](#disable_journal_fsync)
- [disable_device_lock](#disable_device_lock)
- [disk_alignment](#disk_alignment)
## data_device
- Type: string
Path to the block device to use for data. It's highly recommended to use
stable paths for all device names: `/dev/disk/by-partuuid/xxx...` instead
of just `/dev/sda` or `/dev/nvme0n1`, so that devices don't get mixed up after a server restart.
Files can also be used instead of block devices, but this is implemented
only for testing purposes and not for production.
## meta_device
- Type: string
Path to the block device to use for the metadata. Metadata must be on a fast
SSD or performance will suffer. If this option is skipped, `data_device` is
used for the metadata.
## journal_device
- Type: string
Path to the block device to use for the journal. Journal must be on a fast
SSD or performance will suffer. If this option is skipped, `meta_device` is
used for the journal, and if it's also empty, journal is put on
`data_device`. It's almost always fine to put metadata and journal on the
same device, in this case you only need to set `meta_device`.
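The fallback chain for the three device options can be sketched as follows. This is an illustration of the rules stated above, assuming the configuration is available as a plain dictionary; `resolve_devices` is a hypothetical name and the device paths in the usage example are placeholders.

```python
def resolve_devices(cfg):
    """Apply the documented fallbacks: meta -> data, journal -> meta -> data."""
    data = cfg["data_device"]
    meta = cfg.get("meta_device") or data
    journal = cfg.get("journal_device") or meta
    return {"data_device": data, "meta_device": meta, "journal_device": journal}


if __name__ == "__main__":
    # Only data and metadata devices are set: the journal follows meta_device.
    print(resolve_devices({
        "data_device": "/dev/disk/by-partuuid/data-example",
        "meta_device": "/dev/disk/by-partuuid/meta-example",
    }))
```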
## journal_offset
- Type: integer
- Default: 0
Offset on the device in bytes where the journal is stored.
## journal_size
- Type: integer
Journal size in bytes. By default, all available space between journal_offset
and data_offset, meta_offset or the end of the journal device is used.
Large journals aren't needed in SSD-only setups, 32 MB is always enough.
In SSD+HDD setups it is beneficial to use larger journals (for example, 1 GB)
and enable [throttle_small_writes](osd.en.md#throttle_small_writes).
## meta_offset
- Type: integer
- Default: 0
Offset on the device in bytes where the metadata area is stored.
Set this option if you colocate metadata with the journal or data.
## data_offset
- Type: integer
- Default: 0
Offset on the device in bytes where the data area is stored.
Set this option if you colocate data with the journal or metadata.
## data_size
- Type: integer
Data area size in bytes. By default, the whole data device up to the end
will be used for the data area, but you can restrict it if you want to use
a smaller part. Note that there is no option to set metadata area size -
it's derived from the data area size.
## meta_block_size
- Type: integer
- Default: 4096
Physical block size of the metadata device. 4096 for most current
HDDs and SSDs.
## journal_block_size
- Type: integer
- Default: 4096
Physical block size of the journal device. Must be a multiple of
`disk_alignment`. 4096 for most current HDDs and SSDs.
## disable_data_fsync
- Type: boolean
- Default: false
Do not issue fsyncs to the data device, i.e. do not flush its cache.
Safe ONLY if your data device has write-through cache. If you disable
the cache yourself using `hdparm` or `scsi_disk/cache_type` then make sure
that the cache disable command is run every time before starting Vitastor
OSD, for example, in the systemd unit. See also `immediate_commit` option
for the instructions to disable cache and how to benefit from it.
## disable_meta_fsync
- Type: boolean
- Default: false
Same as disable_data_fsync, but for the metadata device. If the metadata
device is not set or if the data device is used for the metadata the option
is ignored and disable_data_fsync value is used instead of it.
## disable_journal_fsync
- Type: boolean
- Default: false
Same as disable_data_fsync, but for the journal device. If the journal
device is not set or if the metadata device is used for the journal the
option is ignored and disable_meta_fsync value is used instead of it. If
the same device is used for data, metadata and journal the option is also
ignored and disable_data_fsync value is used instead of it.
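The way the three fsync options override each other can be sketched like this. It follows only the rules stated above and assumes a plain dictionary configuration; the case where the journal shares a device with data but not with metadata is not described in the text, so this sketch simply keeps the configured journal flag there. `effective_fsync_flags` is a hypothetical name.

```python
def effective_fsync_flags(cfg):
    """Resolve which disable_*_fsync values actually take effect."""
    data = cfg["data_device"]
    meta = cfg.get("meta_device") or data
    journal = cfg.get("journal_device") or meta

    d = bool(cfg.get("disable_data_fsync", False))
    # Metadata on the data device: disable_data_fsync is used instead.
    m = d if meta == data else bool(cfg.get("disable_meta_fsync", False))
    # Journal on the metadata device: the (resolved) metadata value is used instead.
    j = m if journal == meta else bool(cfg.get("disable_journal_fsync", False))
    return {
        "disable_data_fsync": d,
        "disable_meta_fsync": m,
        "disable_journal_fsync": j,
    }


if __name__ == "__main__":
    # Single device for everything: only disable_data_fsync matters.
    print(effective_fsync_flags({
        "data_device": "/dev/example",
        "disable_data_fsync": True,
        "disable_journal_fsync": False,
    }))
```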
## disable_device_lock
- Type: boolean
- Default: false
Do not lock data, metadata and journal block devices exclusively with
flock(). It's not recommended, but you can use it if you want to run
multiple OSDs on a single device with different offsets, without using
partitions.
## disk_alignment
- Type: integer
- Default: 4096
Required physical disk write alignment. Most current SSD and HDD drives
use 4 KB physical sectors even if they report 512 byte logical sector
size, so 4 KB is a good default setting.
Note, however, that physical sector size also affects WA, because with block
devices it's impossible to write anything smaller than a block. So, when
Vitastor has to write a single metadata entry that's only about 32 bytes in
size, it actually has to write the whole 4 KB sector.
Because of this it can actually be beneficial to use SSDs which work well
with 512 byte sectors and use 512 byte disk_alignment, journal_block_size
and meta_block_size. But the only SSD that may fit into this category is
Intel Optane (probably, not tested yet).
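The write amplification argument above comes down to a ratio between the sector size and the size of the entry being written. A minimal sketch, using the approximate 32-byte metadata entry size from the text and a hypothetical function name:

```python
def metadata_write_amplification(sector_bytes=4096, entry_bytes=32):
    """Bytes physically written per byte of a single metadata entry update."""
    return sector_bytes / entry_bytes


if __name__ == "__main__":
    print(metadata_write_amplification(4096))  # 4 KB sectors
    print(metadata_write_amplification(512))   # 512 byte sectors
```

This only covers the single-entry case described above; real workloads may batch several entries into one sector write, which lowers the effective ratio.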
Clients don't need to be aware of disk_alignment, so it's not required to
put a modified value into etcd key /vitastor/config/global.

@ -1,185 +0,0 @@
[Документация](../../README-ru.md#документация) → [Конфигурация](../config.ru.md) → Дисковые параметры OSD
-----
[Read in English](layout-osd.en.md)
# Дисковые параметры OSD
Данные параметры используются только OSD и, также как и общекластерные
дисковые параметры, задаются в момент инициализации дисков OSD и не могут быть
изменены после этого без потери данных.
- [data_device](#data_device)
- [meta_device](#meta_device)
- [journal_device](#journal_device)
- [journal_offset](#journal_offset)
- [journal_size](#journal_size)
- [meta_offset](#meta_offset)
- [data_offset](#data_offset)
- [data_size](#data_size)
- [meta_block_size](#meta_block_size)
- [journal_block_size](#journal_block_size)
- [disable_data_fsync](#disable_data_fsync)
- [disable_meta_fsync](#disable_meta_fsync)
- [disable_journal_fsync](#disable_journal_fsync)
- [disable_device_lock](#disable_device_lock)
- [disk_alignment](#disk_alignment)
## data_device
- Тип: строка
Путь к диску (блочному устройству) для хранения данных. Крайне рекомендуется
использовать стабильные пути: `/dev/disk/by-partuuid/xxx...` вместо простых
`/dev/sda` или `/dev/nvme0n1`, чтобы пути не могли спутаться после
перезагрузки сервера. Также вместо блочных устройств можно указывать файлы,
но это реализовано только для тестирования, а не для боевой среды.
## meta_device
- Тип: строка
Путь к диску метаданных. Метаданные должны располагаться на быстром
SSD-диске, иначе производительность пострадает. Если эта опция не указана,
для метаданных используется `data_device`.
## journal_device
- Тип: строка
Путь к диску журнала. Журнал должен располагаться на быстром SSD-диске,
иначе производительность пострадает. Если эта опция не указана,
для журнала используется `meta_device`, если же пуста и она, журнал
располагается на `data_device`. Нормально располагать журнал и метаданные
на одном устройстве, в этом случае достаточно указать только `meta_device`.
## journal_offset
- Тип: целое число
- Значение по умолчанию: 0
Смещение на устройстве в байтах, по которому располагается журнал.
## journal_size
- Тип: целое число
Размер журнала в байтах. По умолчанию для журнала используется всё доступное
место между journal_offset и data_offset, meta_offset или концом диска.
В SSD-кластерах большие журналы не нужны, достаточно 32 МБ. В гибридных
(SSD+HDD) кластерах осмысленно использовать больший размер журнала (например, 1 ГБ)
и включить [throttle_small_writes](osd.ru.md#throttle_small_writes).
## meta_offset
- Тип: целое число
- Значение по умолчанию: 0
Смещение на устройстве в байтах, по которому располагаются метаданные.
Эту опцию нужно задать, если метаданные у вас хранятся на том же
устройстве, что данные или журнал.
## data_offset
- Тип: целое число
- Значение по умолчанию: 0
Смещение на устройстве в байтах, по которому располагаются данные.
Эту опцию нужно задать, если данные у вас хранятся на том же
устройстве, что метаданные или журнал.
## data_size
- Тип: целое число
Размер области данных в байтах. По умолчанию под данные будет использована
вся доступная область устройства данных до конца устройства, но вы можете
использовать эту опцию, чтобы ограничить её меньшим размером. Заметьте, что
опции размера области метаданных нет - она вычисляется из размера области
данных автоматически.
## meta_block_size
- Тип: целое число
- Значение по умолчанию: 4096
Размер физического блока устройства метаданных. 4096 для большинства
современных SSD и HDD.
## journal_block_size
- Тип: целое число
- Значение по умолчанию: 4096
Размер физического блока устройства журнала. Должен быть кратен
`disk_alignment`. 4096 для большинства современных SSD и HDD.
## disable_data_fsync
- Тип: булево (да/нет)
- Значение по умолчанию: false
Не отправлять fsync-и устройству данных, т.е. не сбрасывать его кэш.
Безопасно, ТОЛЬКО если ваше устройство данных имеет кэш со сквозной
записью (write-through). Если вы отключаете кэш через `hdparm` или
`scsi_disk/cache_type`, то удостоверьтесь, что команда отключения кэша
выполняется перед каждым запуском Vitastor OSD, например, в systemd unit-е.
Смотрите также опцию `immediate_commit` для инструкций по отключению кэша
и о том, как из этого извлечь выгоду.
## disable_meta_fsync
- Тип: булево (да/нет)
- Значение по умолчанию: false
То же, что disable_data_fsync, но для устройства метаданных. Если устройство
метаданных не задано или если оно равно устройству данных, значение опции
игнорируется и вместо него используется значение опции disable_data_fsync.
## disable_journal_fsync
- Тип: булево (да/нет)
- Значение по умолчанию: false
То же, что disable_data_fsync, но для устройства журнала. Если устройство
журнала не задано или если оно равно устройству метаданных, значение опции
игнорируется и вместо него используется значение опции disable_meta_fsync.
Если одно и то же устройство используется и под данные, и под журнал, и под
метаданные - значение опции также игнорируется и вместо него используется
значение опции disable_data_fsync.
## disable_device_lock
- Тип: булево (да/нет)
- Значение по умолчанию: false
Не блокировать устройства данных, метаданных и журнала от открытия их
другими OSD с помощью flock(). Так делать не рекомендуется, но теоретически
вы можете это использовать, чтобы запускать несколько OSD на одном
устройстве с разными смещениями и без использования разделов.
## disk_alignment
- Тип: целое число
- Значение по умолчанию: 4096
Требуемое выравнивание записи на физические диски. Почти все современные
SSD и HDD диски используют 4 КБ физические секторы, даже если показывают
логический размер сектора 512 байт, поэтому 4 КБ - хорошее значение по
умолчанию.
Однако стоит понимать, что физический размер сектора тоже влияет на
избыточную запись (WA), потому что ничего меньше блока (сектора) на блочное
устройство записать невозможно. Таким образом, когда Vitastor-у нужно
записать на диск всего лишь одну 32-байтную запись метаданных, фактически
приходится перезаписывать 4 КБ сектор целиком.
Поэтому, на самом деле, может быть выгодно найти SSD, хорошо работающие с
меньшими, 512-байтными, блоками и использовать 512-байтные disk_alignment,
journal_block_size и meta_block_size. Однако единственные SSD, которые
теоретически могут попасть в эту категорию - это Intel Optane (но и это
пока не проверялось автором).
Клиентам не обязательно знать про disk_alignment, так что помещать значение
этого параметра в etcd в /vitastor/config/global не нужно.

@ -1,79 +0,0 @@
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Monitor Parameters
-----
[Читать на русском](monitor.ru.md)
# Monitor Parameters
These parameters only apply to Monitors.
- [etcd_mon_ttl](#etcd_mon_ttl)
- [etcd_mon_timeout](#etcd_mon_timeout)
- [etcd_mon_retries](#etcd_mon_retries)
- [mon_change_timeout](#mon_change_timeout)
- [mon_stats_timeout](#mon_stats_timeout)
- [osd_out_time](#osd_out_time)
- [placement_levels](#placement_levels)
## etcd_mon_ttl
- Type: seconds
- Default: 30
- Minimum: 10
Monitor etcd lease refresh interval in seconds
## etcd_mon_timeout
- Type: milliseconds
- Default: 1000
etcd request timeout used by monitor
## etcd_mon_retries
- Type: integer
- Default: 5
Maximum number of attempts for one monitor etcd request
## mon_change_timeout
- Type: milliseconds
- Default: 1000
- Minimum: 100
Optimistic retry interval for monitor etcd modification requests
## mon_stats_timeout
- Type: milliseconds
- Default: 1000
- Minimum: 100
Interval for monitor to wait before updating aggregated statistics in
etcd after receiving OSD statistics updates
## osd_out_time
- Type: seconds
- Default: 600
Time after which a failed OSD is removed from the data distribution.
I.e. time which the monitor waits before attempting to restore data
redundancy using other OSDs.
## placement_levels
- Type: json
- Default: `{"host":100,"osd":101}`
Levels for the placement tree. You can define arbitrary tree levels by
defining them in this parameter. The configuration parameter value should
contain a JSON object with level names as keys and integer priorities as
values. Smaller priority means higher level in tree. For example,
"datacenter" should have smaller priority than "osd". "host" and "osd"
levels are always predefined and can't be removed. If one of them is not
present in the configuration, then it is defined with the default priority
(100 for "host", 101 for "osd").
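The merging rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the monitor's actual code: the function name `effective_levels` and the level names "dc" and "rack" are made up for the example.

```python
import json

# Predefined levels and their default priorities, as stated above.
DEFAULT_LEVELS = {"host": 100, "osd": 101}

def effective_levels(placement_levels_json):
    """Return level names ordered from the top of the tree to the bottom."""
    levels = dict(DEFAULT_LEVELS)
    # User-defined levels are added; "host"/"osd" keep defaults unless overridden.
    levels.update(json.loads(placement_levels_json))
    # Smaller priority means a higher level in the tree.
    return sorted(levels, key=lambda name: levels[name])

# "dc" and "rack" are hypothetical level names used only for illustration.
print(effective_levels('{"dc": 10, "rack": 50}'))  # ['dc', 'rack', 'host', 'osd']
```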

View File

@ -1,80 +0,0 @@
[Documentation](../../README-ru.md#документация) → [Configuration](../config.ru.md) → Monitor Parameters
-----
[Read in English](monitor.en.md)
# Monitor Parameters
These parameters are used only by Vitastor monitors.
- [etcd_mon_ttl](#etcd_mon_ttl)
- [etcd_mon_timeout](#etcd_mon_timeout)
- [etcd_mon_retries](#etcd_mon_retries)
- [mon_change_timeout](#mon_change_timeout)
- [mon_stats_timeout](#mon_stats_timeout)
- [osd_out_time](#osd_out_time)
- [placement_levels](#placement_levels)
## etcd_mon_ttl
- Type: seconds
- Default: 30
- Minimum: 10
Interval at which the monitor refreshes its etcd lease
## etcd_mon_timeout
- Type: milliseconds
- Default: 1000
Timeout for etcd requests made by the monitor
## etcd_mon_retries
- Type: integer
- Default: 5
Maximum number of attempts for etcd requests made by the monitor
## mon_change_timeout
- Type: milliseconds
- Default: 1000
- Minimum: 100
Retry interval on collisions for etcd modification requests made by the monitor
## mon_stats_timeout
- Type: milliseconds
- Default: 1000
- Minimum: 100
Interval that the monitor waits after per-OSD statistics change before
updating aggregated statistics in etcd
## osd_out_time
- Type: seconds
- Default: 600
Time after which a disconnected OSD is excluded from the data distribution.
That is, the time the monitor waits before attempting to move data to
other OSDs and thus restore storage redundancy.
## placement_levels
- Type: json
- Default: `{"host":100,"osd":101}`
Level definitions for the OSD placement tree. You can define arbitrary
levels by putting them into this configuration parameter. The parameter
value must contain a JSON object whose keys are level names and whose
values are integer priorities. Smaller priorities correspond to higher
levels of the tree. For example, the "datacenter" level should have a
smaller priority than "osd". The levels named "host" and "osd" are
predefined and can't be removed. If one of them is missing from the
configuration, it is added with the default priority (100 for the "host"
level, 101 for "osd").

View File

@ -1,214 +0,0 @@
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Network Protocol Parameters
-----
[Читать на русском](network.ru.md)
# Network Protocol Parameters
These parameters apply to clients and OSDs and affect network connection logic
between clients, OSDs and etcd.
- [tcp_header_buffer_size](#tcp_header_buffer_size)
- [use_sync_send_recv](#use_sync_send_recv)
- [use_rdma](#use_rdma)
- [rdma_device](#rdma_device)
- [rdma_port_num](#rdma_port_num)
- [rdma_gid_index](#rdma_gid_index)
- [rdma_mtu](#rdma_mtu)
- [rdma_max_sge](#rdma_max_sge)
- [rdma_max_msg](#rdma_max_msg)
- [rdma_max_recv](#rdma_max_recv)
- [peer_connect_interval](#peer_connect_interval)
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [up_wait_retry_interval](#up_wait_retry_interval)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
- [etcd_keepalive_timeout](#etcd_keepalive_timeout)
- [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
## tcp_header_buffer_size
- Type: integer
- Default: 65536
Size of the buffer used to read data using an additional copy. Vitastor
packet headers are 128 bytes, payload is always at least 4 KB, so it is
usually beneficial to try to read multiple packets at once even though
it requires copying the data an additional time. The rest of each packet
is received without an additional copy. You can try to play with this
parameter and see how it affects random iops and linear bandwidth if you
want.
## use_sync_send_recv
- Type: boolean
- Default: false
If true, synchronous send/recv syscalls are used instead of io_uring for
socket communication. Useless for OSDs because they require io_uring anyway,
but may be required for clients with old kernel versions.
## use_rdma
- Type: boolean
- Default: true
Try to use RDMA for communication if it's available. Disable if you don't
want Vitastor to use RDMA. TCP-only clients can also talk to an RDMA-enabled
cluster, so disabling RDMA may be needed if clients have RDMA devices,
but they are not connected to the cluster.
## rdma_device
- Type: string
RDMA device name to use for Vitastor OSD communications (for example,
"rocep5s0f0"). Please note that Vitastor RDMA requires Implicit On-Demand
Paging (Implicit ODP) and Scatter/Gather (SG) support from the RDMA device
to work. For example, Mellanox ConnectX-3 and older adapters don't have
Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as
root to list available RDMA devices and their features.
## rdma_port_num
- Type: integer
- Default: 1
RDMA device port number to use. Only for devices that have more than 1 port.
See `phys_port_cnt` in `ibv_devinfo -v` output to determine how many ports
your device has.
## rdma_gid_index
- Type: integer
- Default: 0
Global address identifier index of the RDMA device to use. Different GID
indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
Search for "GID" in `ibv_devinfo -v` output to determine which GID index
you need.
**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
## rdma_mtu
- Type: integer
- Default: 4096
RDMA Path MTU to use. Must be 1024, 2048 or 4096. There is usually no
reason to change it from the default 4096.
## rdma_max_sge
- Type: integer
- Default: 128
Maximum number of scatter/gather entries to use for RDMA. OSDs negotiate
the actual value when establishing connection anyway, so it's usually not
required to change this parameter.
## rdma_max_msg
- Type: integer
- Default: 1048576
Maximum size of a single RDMA send or receive operation in bytes.
## rdma_max_recv
- Type: integer
- Default: 8
Maximum number of parallel RDMA receive operations. Note that this many
receive buffers, each `rdma_max_msg` in size, are allocated for each client,
so this setting actually affects memory usage. This is because RDMA receive
operations are (sadly) still not zero-copy in Vitastor. It may be fixed in
later versions.
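As a rough worked example of the memory cost described above, the receive buffer footprint is the product of the two parameters and the number of client connections. The helper name `rdma_recv_buffer_bytes` is hypothetical, and the sketch ignores any other per-connection allocations.

```python
def rdma_recv_buffer_bytes(connections, rdma_max_recv=8, rdma_max_msg=1048576):
    """Estimate memory used by RDMA receive buffers for a number of connections."""
    # Each connection gets rdma_max_recv buffers of rdma_max_msg bytes each.
    return connections * rdma_max_recv * rdma_max_msg

print(rdma_recv_buffer_bytes(1) // 2**20)    # 8 (MiB per connection with defaults)
print(rdma_recv_buffer_bytes(100) // 2**20)  # 800 (MiB for 100 connections)
```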
## peer_connect_interval
- Type: seconds
- Default: 5
- Minimum: 1
Interval before attempting to reconnect to an unavailable OSD.
## peer_connect_timeout
- Type: seconds
- Default: 5
- Minimum: 1
Timeout for OSD connection attempts.
## osd_idle_timeout
- Type: seconds
- Default: 5
- Minimum: 1
OSD connection inactivity time after which clients and other OSDs send
keepalive requests to check state of the connection.
## osd_ping_timeout
- Type: seconds
- Default: 5
- Minimum: 1
Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond
within this time, the connection to it is dropped and a reconnection attempt
is scheduled.
## up_wait_retry_interval
- Type: milliseconds
- Default: 500
- Minimum: 50
OSDs respond to clients with a special error code when they receive I/O
requests for a PG that's not synchronized and started. This parameter sets
the time for the clients to wait before re-attempting such I/O requests.
## max_etcd_attempts
- Type: integer
- Default: 5
Maximum number of attempts for etcd requests which can't be retried
indefinitely.
## etcd_quick_timeout
- Type: milliseconds
- Default: 1000
Timeout for etcd requests which should complete quickly, like lease refresh.
## etcd_slow_timeout
- Type: milliseconds
- Default: 5000
Timeout for etcd requests which are allowed to wait for some time.
## etcd_keepalive_timeout
- Type: seconds
- Default: max(30, etcd_report_interval*2)
Timeout for etcd connection HTTP Keep-Alive. Should be higher than
etcd_report_interval to guarantee that keepalive actually works.
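The default formula above is easy to evaluate by hand. The sketch below simply mirrors it; the function name is hypothetical.

```python
def default_etcd_keepalive_timeout(etcd_report_interval=5):
    """Default keepalive timeout in seconds: max(30, etcd_report_interval*2)."""
    return max(30, etcd_report_interval * 2)

print(default_etcd_keepalive_timeout())    # 30
print(default_etcd_keepalive_timeout(20))  # 40
```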
## etcd_ws_keepalive_timeout
- Type: seconds
- Default: 30
etcd websocket ping interval required to keep the connection alive and
detect disconnections quickly.

View File

@ -1,224 +0,0 @@
[Documentation](../../README-ru.md#документация) → [Configuration](../config.ru.md) → Network Protocol Parameters
-----
[Read in English](network.en.md)
# Network Protocol Parameters
These parameters are used by clients and OSDs and affect the logic of
network communication between clients, OSDs and etcd.
- [tcp_header_buffer_size](#tcp_header_buffer_size)
- [use_sync_send_recv](#use_sync_send_recv)
- [use_rdma](#use_rdma)
- [rdma_device](#rdma_device)
- [rdma_port_num](#rdma_port_num)
- [rdma_gid_index](#rdma_gid_index)
- [rdma_mtu](#rdma_mtu)
- [rdma_max_sge](#rdma_max_sge)
- [rdma_max_msg](#rdma_max_msg)
- [rdma_max_recv](#rdma_max_recv)
- [peer_connect_interval](#peer_connect_interval)
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [up_wait_retry_interval](#up_wait_retry_interval)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
- [etcd_keepalive_timeout](#etcd_keepalive_timeout)
- [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
## tcp_header_buffer_size
- Type: integer
- Default: 65536
Size of the buffer used to read data with an additional copy. Vitastor
packets contain 128-byte headers followed by at least 4 KB of data, so
for small I/O operations it's usually beneficial to read several packets
in one call, even though this requires copying the data one more time.
The part of each packet beyond the value of this parameter is read without
an additional copy. You can try changing this parameter and see how it
affects random and linear access performance.
## use_sync_send_recv
- Type: boolean
- Default: false
If set to true, regular synchronous send/recv system calls are used for
network data transfer instead of io_uring. This is pointless for OSDs
because an OSD needs io_uring anyway, but in principle it may be used for
clients with old kernel versions.
## use_rdma
- Type: boolean
- Default: true
Try to use RDMA for communication when suitable devices are available.
Disable it if you don't want Vitastor to use RDMA.
TCP clients can also work with an RDMA cluster, so disabling RDMA may
only be needed if clients have RDMA devices that are not connected to
the Vitastor cluster.
## rdma_device
- Type: string
Name of the RDMA device used to communicate with Vitastor OSDs (for example,
"rocep5s0f0"). Note that RDMA support in Vitastor requires the Implicit
On-Demand Paging (Implicit ODP) and Scatter/Gather (SG) device features.
For example, Mellanox ConnectX-3 and older adapters don't support Implicit
ODP and are therefore unsupported by Vitastor. Run `ibv_devinfo -v` as
root to list the available RDMA devices, their parameters and
capabilities.
## rdma_port_num
- Type: integer
- Default: 1
RDMA device port number to use. Only makes sense for devices with more
than 1 port. To find out how many ports your adapter has, look at
`phys_port_cnt` in the output of `ibv_devinfo -v`.
## rdma_gid_index
- Type: integer
- Default: 0
Index of the global address identifier of the RDMA device to use.
Different gid_index values may correspond to different protocols:
RoCEv1, RoCEv2, iWARP. To find out which one you need, look at the lines
containing "GID" in the output of `ibv_devinfo -v`.
**IMPORTANT:** If you want to use RoCEv2 (as we recommend), the correct
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
## rdma_mtu
- Type: integer
- Default: 4096
Maximum transmission unit (Path MTU) for RDMA. Must be 1024, 2048 or
4096. There is usually no reason to change the default value of 4096.
## rdma_max_sge
- Type: integer
- Default: 128
Maximum number of scatter/gather entries for RDMA. OSDs negotiate the
actual value when establishing a connection anyway, so it's usually not
necessary to change this parameter.
## rdma_max_msg
- Type: integer
- Default: 1048576
Maximum size of a single RDMA send or receive operation.
## rdma_max_recv
- Type: integer
- Default: 8
Maximum number of parallel RDMA receive operations. Keep in mind that
this many buffers, each `rdma_max_msg` in size, are allocated for every
connected client connection, so this setting affects memory usage. This
is because RDMA receive in Vitastor is, alas, still not zero-copy, i.e.
it still copies the data in memory once. This may be fixed in newer
versions of Vitastor.
## peer_connect_interval
- Type: seconds
- Default: 5
- Minimum: 1
Time to wait before another attempt to connect to an unavailable OSD.
## peer_connect_timeout
- Type: seconds
- Default: 5
- Minimum: 1
Maximum time to wait for an OSD connection attempt.
## osd_idle_timeout
- Type: seconds
- Default: 5
- Minimum: 1
OSD connection inactivity time after which clients or other OSDs send a
request to check the state of the connection.
## osd_ping_timeout
- Type: seconds
- Default: 5
- Minimum: 1
Maximum time to wait for a response to the connection state check request.
If an OSD doesn't respond within this time, the connection is dropped and
a reconnection attempt is made.
## up_wait_retry_interval
- Type: milliseconds
- Default: 500
- Minimum: 50
When OSDs receive I/O requests from clients for PGs that are not currently
started on them, or for PGs that are being synchronized, they respond to
clients with a special error code meaning that the client should wait for
some time before repeating the request. This parameter sets that wait
time.
## max_etcd_attempts
- Type: integer
- Default: 5
Maximum number of attempts for etcd requests that can't be retried
indefinitely.
## etcd_quick_timeout
- Type: milliseconds
- Default: 1000
Maximum execution time for etcd requests that should complete quickly,
such as lease refresh.
## etcd_slow_timeout
- Type: milliseconds
- Default: 5000
Maximum execution time for etcd requests that are not required to
complete quickly.
## etcd_keepalive_timeout
- Type: seconds
- Default: max(30, etcd_report_interval*2)
Timeout for HTTP Keep-Alive in etcd connections. Should be higher than
etcd_report_interval to guarantee that keepalive actually works.
## etcd_ws_keepalive_timeout
- Type: seconds
- Default: 30
Liveness check interval for websocket connections to etcd.

View File

@ -1,297 +0,0 @@
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Runtime OSD Parameters
-----
[Читать на русском](osd.ru.md)
# Runtime OSD Parameters
These parameters only apply to OSDs, are not fixed at the moment of OSD drive
initialization and can be changed with an OSD restart.
- [etcd_report_interval](#etcd_report_interval)
- [run_primary](#run_primary)
- [osd_network](#osd_network)
- [bind_address](#bind_address)
- [bind_port](#bind_port)
- [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly)
- [no_recovery](#no_recovery)
- [no_rebalance](#no_rebalance)
- [print_stats_interval](#print_stats_interval)
- [slow_log_interval](#slow_log_interval)
- [max_write_iodepth](#max_write_iodepth)
- [min_flusher_count](#min_flusher_count)
- [max_flusher_count](#max_flusher_count)
- [inmemory_metadata](#inmemory_metadata)
- [inmemory_journal](#inmemory_journal)
- [journal_sector_buffer_count](#journal_sector_buffer_count)
- [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
- [throttle_small_writes](#throttle_small_writes)
- [throttle_target_iops](#throttle_target_iops)
- [throttle_target_mbs](#throttle_target_mbs)
- [throttle_target_parallelism](#throttle_target_parallelism)
- [throttle_threshold_us](#throttle_threshold_us)
- [osd_memlock](#osd_memlock)
## etcd_report_interval
- Type: seconds
- Default: 5
Interval at which OSDs report their state to etcd. Affects OSD lease time
and thus the failover speed. Lease time is equal to this parameter value
plus max_etcd_attempts * etcd_quick_timeout because it should be guaranteed
that every OSD always refreshes its lease in time.
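The lease time formula above can be worked through with the default values. The sketch assumes that etcd_quick_timeout (documented in milliseconds) is converted to seconds before being added to etcd_report_interval; the exact rounding used by the OSD is not stated here, and the function name is hypothetical.

```python
def osd_lease_seconds(etcd_report_interval=5, max_etcd_attempts=5,
                      etcd_quick_timeout_ms=1000):
    """Lease time = report interval + max_etcd_attempts * etcd_quick_timeout."""
    # Assumption: the millisecond timeout is converted to seconds here.
    return etcd_report_interval + max_etcd_attempts * etcd_quick_timeout_ms / 1000

print(osd_lease_seconds())  # 10.0 seconds with all defaults
```

With the defaults, a failed OSD's lease would therefore expire roughly 10 seconds after its last successful refresh.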
## run_primary
- Type: boolean
- Default: true
Start primary OSD logic on this OSD. As of now, can be turned off only for
debugging purposes. It's possible to implement an additional monitor
feature that would allow separating primary and secondary OSDs, but it's
unclear why anyone would need it, so it's not implemented.
## osd_network
- Type: string or array of strings
Network mask of the network (IPv4 or IPv6) to use for OSDs. Note that
although it's possible to specify multiple networks here, this does not
mean that OSDs will create multiple listening sockets - they'll only
pick the first matching address of an UP + RUNNING interface. Separate
networks for cluster and client connections are also not implemented, but
they are mostly useless anyway, so it's not a big deal.
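The "first matching address" rule can be illustrated with Python's standard `ipaddress` module. This is a sketch under assumptions, not the OSD's code: the function name `pick_bind_address` is made up, and the list of addresses is assumed to already contain only addresses of UP + RUNNING interfaces in the order the system reports them.

```python
import ipaddress

def pick_bind_address(osd_network, interface_addresses):
    """Return the first address that falls into any configured network, or None."""
    # osd_network may be a single string or an array of strings.
    networks = [osd_network] if isinstance(osd_network, str) else list(osd_network)
    masks = [ipaddress.ip_network(n) for n in networks]
    for addr in interface_addresses:
        ip = ipaddress.ip_address(addr)
        # Only compare addresses with networks of the same IP version.
        if any(ip.version == m.version and ip in m for m in masks):
            return addr
    return None

print(pick_bind_address("10.0.0.0/8", ["192.168.1.5", "10.1.2.3", "10.9.9.9"]))  # 10.1.2.3
```

Only one address is returned even when several match, which mirrors the single listening socket described above.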
## bind_address
- Type: string
- Default: 0.0.0.0
Instead of the network mask, you can also set OSD listen address explicitly
using this parameter. May be useful if you want to start OSDs on interfaces
that are not UP + RUNNING.
## bind_port
- Type: integer
By default, OSDs pick random ports to use for incoming connections
automatically. With this option you can set a specific port for a specific
OSD by hand.
## autosync_interval
- Type: seconds
- Default: 5
Time interval at which automatic fsyncs/flushes are issued by each OSD when
the immediate_commit mode is disabled. fsyncs are required because without
them OSDs quickly fill their journals, become unable to clear them and
stall. This option also limits the amount of recent uncommitted changes
which OSDs may lose in case of a power outage when clients don't issue
fsyncs at all.
## autosync_writes
- Type: integer
- Default: 128
Same as autosync_interval, but sets the maximum number of uncommitted write
operations before issuing an fsync operation internally.
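The interplay of autosync_interval and autosync_writes can be pictured as two independent triggers. The class below is a simplified model written for illustration only: the class and method names are hypothetical, and the rule that no fsync is needed when nothing is uncommitted is an assumption.

```python
import time

class AutosyncTracker:
    """Illustrative model of the autosync_interval / autosync_writes triggers."""

    def __init__(self, autosync_interval=5, autosync_writes=128, clock=time.monotonic):
        self.autosync_interval = autosync_interval
        self.autosync_writes = autosync_writes
        self.clock = clock
        self.unsynced_writes = 0
        self.last_sync = clock()

    def on_write(self):
        # Count every write that has not been committed by an fsync yet.
        self.unsynced_writes += 1

    def should_sync(self):
        if self.unsynced_writes == 0:
            return False  # assumption: nothing to commit, so no fsync is needed
        too_many = self.unsynced_writes >= self.autosync_writes
        too_old = self.clock() - self.last_sync >= self.autosync_interval
        return too_many or too_old

    def on_sync(self):
        # An fsync commits everything written so far and restarts the timer.
        self.unsynced_writes = 0
        self.last_sync = self.clock()
```

Passing a custom `clock` makes the model easy to exercise without waiting for real time to pass.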
## recovery_queue_depth
- Type: integer
- Default: 4
Maximum recovery operations per one primary OSD at any given moment of time.
Currently it's the only parameter available to tune the speed of recovery
and rebalancing, but more are planned.
## recovery_sync_batch
- Type: integer
- Default: 16
Maximum number of recovery operations before issuing an additional fsync.
## readonly
- Type: boolean
- Default: false
Read-only mode. If this is enabled, an OSD will never issue any writes to
the underlying device. This may be useful for recovery purposes.
## no_recovery
- Type: boolean
- Default: false
Disable automatic background recovery of objects. Note that it doesn't
affect implicit recovery of objects happening during writes - a write is
always made to a full set of at least pg_minsize OSDs.
## no_rebalance
- Type: boolean
- Default: false
Disable background movement of data between different OSDs. Disabling it
means that PGs in the `has_misplaced` state will be left in it indefinitely.
## print_stats_interval
- Type: seconds
- Default: 3
Time interval at which OSDs print simple human-readable operation
statistics on stdout.
## slow_log_interval
- Type: seconds
- Default: 10
Time interval at which OSDs dump slow or stuck operations on stdout, if
there are any. It's also the time after which an operation is considered
"slow".
## max_write_iodepth
- Type: integer
- Default: 128
Parallel client write operation limit per one OSD. Operations that exceed
this limit are pushed to a temporary queue instead of being executed
immediately.
## min_flusher_count
- Type: integer
- Default: 1
Flusher is a micro-thread that moves data from the journal to the data
area of the device. Their number is auto-tuned between minimum and maximum.
Minimum number is set by this parameter.
## max_flusher_count
- Type: integer
- Default: 256
Maximum number of journal flushers (see above min_flusher_count).
## inmemory_metadata
- Type: boolean
- Default: true
This parameter makes Vitastor always keep the metadata area of the block
device in memory. It's required for good performance because it avoids
additional read-modify-write cycles during metadata modifications. Metadata
area size is currently roughly 224 MB per 1 TB of data. You can turn it off
to reduce memory usage by this value, but it will hurt performance. This
restriction is likely to be removed in the future along with the upgrade
of the metadata storage scheme.
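The "roughly 224 MB per 1 TB" figure above gives a quick way to estimate the memory needed per OSD. The helper below is a back-of-the-envelope sketch with a hypothetical name; real usage depends on the actual layout parameters.

```python
METADATA_MB_PER_TB = 224  # approximate figure quoted above

def metadata_memory_mb(data_tb):
    """Approximate memory (MB) needed to keep the metadata area in memory."""
    return data_tb * METADATA_MB_PER_TB

print(metadata_memory_mb(4))  # 896 MB for an OSD with 4 TB of data
```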
## inmemory_journal
- Type: boolean
- Default: true
This parameter makes Vitastor always keep the journal area of the block
device in memory. Turning it off will, again, reduce memory usage, but
hurt performance because flusher coroutines will have to read data back
from the disk before copying it into the main area. The memory usage benefit
is typically very small because it's sufficient to have a 16-32 MB journal
for SSD OSDs. However, in theory it's possible that you'll want to turn it
off for hybrid (HDD+SSD) OSDs with large journals on fast devices.
## journal_sector_buffer_count
- Type: integer
- Default: 32
Maximum number of buffers that can be used for writing journal metadata
blocks. The only situation when you should increase it to a larger value
is when you enable journal_no_same_sector_overwrites. In this case set
it to, for example, 1024.
## journal_no_same_sector_overwrites
- Type: boolean
- Default: false
Enable this option for SSDs like Intel D3-S4510 and D3-S4610 which REALLY
don't like when a program overwrites the same sector multiple times in a
row and slow down significantly (from 25000+ iops to ~3000 iops). When
this option is set, Vitastor will always move to the next sector of the
journal after writing it instead of possibly overwriting it the second time.
Most (99%) other SSDs don't need this option.
## throttle_small_writes
- Type: boolean
- Default: false
Enable soft throttling of small journaled writes. Useful for hybrid OSDs
with fast journal/metadata devices and slow data devices. The idea is that
small writes complete very quickly because they're first written to the
journal device, but moving them to the main device is slow. So if an OSD
allows clients to issue a lot of small writes, it will perform very well
for several seconds and then the journal will fill up and the performance
will drop to almost zero. Throttling is meant to prevent this problem by
artificially slowing quick writes down based on the amount of free space in
the journal. When throttling is used, the performance of small writes
decreases smoothly instead of dropping abruptly at the moment when the
journal fills up.
## throttle_target_iops
- Type: integer
- Default: 100
Target maximum number of throttled operations per second under the condition
of full journal. Set it to approximate random write iops of your data devices
(HDDs).
## throttle_target_mbs
- Type: integer
- Default: 100
Target maximum bandwidth of throttled operations in MB/s under the
condition of full journal. Set it to approximate linear write
performance of your data devices (HDDs).
## throttle_target_parallelism
- Type: integer
- Default: 1
Target maximum parallelism of throttled operations under the condition of
full journal. Set it to approximate internal parallelism of your data
devices (1 for HDDs, 4-8 for SSDs).
## throttle_threshold_us
- Type: microseconds
- Default: 50
Minimal computed delay to be applied to throttled operations. Usually
doesn't need to be changed.
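To show how the throttle_* parameters could fit together, here is a deliberately simplified model. It is NOT Vitastor's actual formula, which this document does not give: the function name, the linear scaling by journal fill level and the way parallelism is applied are all assumptions made for illustration.

```python
def throttle_delay_us(op_bytes, elapsed_us, journal_used_fraction,
                      throttle_target_iops=100, throttle_target_mbs=100,
                      throttle_target_parallelism=1, throttle_threshold_us=50):
    """Return an artificial delay in microseconds for one small write (toy model)."""
    # Time one operation "should" take on the slow device when the journal is full:
    # a fixed per-operation cost plus a size-dependent transfer cost.
    full_journal_us = throttle_target_parallelism * (
        1_000_000 / throttle_target_iops
        + op_bytes * 1_000_000 / (throttle_target_mbs * 1024 * 1024)
    )
    # Assumption: the target grows linearly as the journal fills up (0.0 to 1.0).
    target_us = full_journal_us * journal_used_fraction
    # Only the part not already spent executing the operation is added as delay.
    delay = target_us - elapsed_us
    # Delays below the threshold are not worth applying.
    return delay if delay >= throttle_threshold_us else 0

print(throttle_delay_us(4096, elapsed_us=0, journal_used_fraction=0.0))  # 0
```

In this model an empty journal adds no delay, while a full journal slows a 4 KB write down to roughly the 100 iops expected from an HDD.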
## osd_memlock
- Type: boolean
- Default: false
Lock all OSD memory with mlockall() to prevent it from being swapped out. Requires a sufficient ulimit -l (max locked memory).

View File

@ -1,310 +0,0 @@
[Documentation](../../README-ru.md#документация) → [Configuration](../config.ru.md) → Runtime OSD Parameters
-----
[Read in English](osd.en.md)
# Runtime OSD Parameters
These parameters are used only by OSDs but, unlike disk parameters, are not
fixed at the moment of OSD drive initialization and can be changed at any
time with an OSD restart.
- [etcd_report_interval](#etcd_report_interval)
- [run_primary](#run_primary)
- [osd_network](#osd_network)
- [bind_address](#bind_address)
- [bind_port](#bind_port)
- [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly)
- [no_recovery](#no_recovery)
- [no_rebalance](#no_rebalance)
- [print_stats_interval](#print_stats_interval)
- [slow_log_interval](#slow_log_interval)
- [max_write_iodepth](#max_write_iodepth)
- [min_flusher_count](#min_flusher_count)
- [max_flusher_count](#max_flusher_count)
- [inmemory_metadata](#inmemory_metadata)
- [inmemory_journal](#inmemory_journal)
- [journal_sector_buffer_count](#journal_sector_buffer_count)
- [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
- [throttle_small_writes](#throttle_small_writes)
- [throttle_target_iops](#throttle_target_iops)
- [throttle_target_mbs](#throttle_target_mbs)
- [throttle_target_parallelism](#throttle_target_parallelism)
- [throttle_threshold_us](#throttle_threshold_us)
- [osd_memlock](#osd_memlock)
## etcd_report_interval
- Type: seconds
- Default: 5
Interval at which an OSD updates its state in etcd. The parameter value
affects the OSD lease time and therefore the failover speed when an OSD
fails. The lease time equals this parameter value plus
max_etcd_attempts * etcd_quick_timeout.
## run_primary
- Type: boolean
- Default: true
Start primary OSD logic on this OSD. As of now, turning this option off
only makes sense for debugging purposes. In theory, it's possible to
implement an additional monitor mode that would allow separating primary
OSDs from secondary ones, but it's unclear why anyone would need it, so
it's not implemented.
## osd_network
- Type: string or array of strings
Network mask of the subnet (IPv4 or IPv6) to use for OSD connections.
Note that although you can currently pass several subnets in this
parameter, this does not mean that OSDs will create several listening
sockets: they'll only pick the address of the first interface that is up
(UP + RUNNING state) and matches the given mask. Separation of cluster
and public OSD networks is also not implemented. However, it's usually of
little use anyway, so this is not much of a problem.
## bind_address
- Type: string
- Default: 0.0.0.0
With this parameter you can explicitly set the address on which the OSD
listens for connections (instead of using a network mask). May be useful,
for example, to start OSDs on interfaces that are not up (not UP + RUNNING).
## bind_port
- Type: integer
By default, OSDs pick random ports for incoming connections themselves.
With this option you can set the port for a specific OSD by hand.
## autosync_interval
- Type: seconds
- Default: 5
Time interval at which automatic fsyncs (cache flush operations) are
issued by each OSD when the immediate_commit mode is disabled. OSDs need
fsyncs to clear the journal in time: without them OSDs quickly fill their
journals and stop processing write operations. This option also limits
the amount of recent uncommitted changes which OSDs may lose on a power
outage if clients don't issue fsyncs at all.
## autosync_writes
- Type: integer
- Default: 128
Same as autosync_interval, but instead of a time interval it sets the
maximum number of uncommitted write operations before an fsync is
forcibly issued.
## recovery_queue_depth
- Type: integer
- Default: 4
Maximum number of recovery operations on one primary OSD at any given
moment of time. Currently it's the only parameter that can be changed to
speed up or slow down data recovery and rebalancing, but more parameters
are planned.
## recovery_sync_batch
- Type: integer
- Default: 16
Maximum number of recovery operations before an additional fsync.
## readonly
- Type: boolean
- Default: false
Read-only mode. If this mode is enabled, the OSD will not write anything
to the disk. May be useful for recovery purposes.
## no_recovery
- Type: boolean
- Default: false
Disable automatic background recovery of objects. Note that this option
doesn't disable the recovery of objects that happens during writes - a
write is always made to a full set of at least pg_minsize
OSDs.
## no_rebalance
- Type: boolean
- Default: false
Disable background movement of objects between different OSDs. Disabling
it means that PGs in the `has_misplaced` state will be left in it
indefinitely.
## print_stats_interval
- Type: seconds
- Default: 3
Time interval at which OSDs print simple human-readable operation
statistics to standard output.
## slow_log_interval
- Type: seconds
- Default: 10
Time interval at which OSDs print the list of slow or stuck operations to
standard output, if there are any. It's also the time after which an
operation is considered "slow".
## max_write_iodepth
- Тип: целое число
- Значение по умолчанию: 128
Максимальное число одновременных клиентских операций записи на один OSD.
Операции, превышающие этот лимит, не исполняются сразу, а сохраняются во
временной очереди.
## min_flusher_count
- Тип: целое число
- Значение по умолчанию: 1
Flusher - это микро-поток (корутина), которая копирует данные из журнала в
основную область устройства данных. Их число настраивается динамически между
минимальным и максимальным значением. Этот параметр задаёт минимальное число.
## max_flusher_count
- Тип: целое число
- Значение по умолчанию: 256
Максимальное число микро-потоков очистки журнала (см. выше min_flusher_count).
## inmemory_metadata
- Тип: булево (да/нет)
- Значение по умолчанию: true
Данный параметр заставляет Vitastor всегда держать область метаданных диска
в памяти. Это нужно, чтобы избегать дополнительных операций чтения с диска
при записи. Размер области метаданных на данный момент составляет примерно
224 МБ на 1 ТБ данных. При включении потребление памяти снизится примерно
на эту величину, но при этом также снизится и производительность. В будущем,
после обновления схемы хранения метаданных, это ограничение, скорее всего,
будет ликвидировано.
## inmemory_journal
- Type: boolean (yes/no)
- Default: true
This parameter makes Vitastor always keep OSD journals in memory. Disabling
it, again, reduces memory usage but hurts performance, because OSDs then
have to read data back from disk in order to copy it from the journal to the
main data area. The memory saving is usually very small, since a 16 or 32 MB
journal is usually enough for SSD OSDs. In theory, however, disabling the
parameter may be useful for hybrid OSDs (HDD+SSD) with large journals placed
on a device that is fast compared to the HDD.
## journal_sector_buffer_count
- Type: integer
- Default: 32
Maximum number of buffers that may be used for metadata blocks written to
the journal. The only situation in which this parameter needs to be changed
is when you enable journal_no_same_sector_overwrites. In that case, set it
to, for example, 1024.
## journal_no_same_sector_overwrites
- Type: boolean (yes/no)
- Default: false
Enable this option for SSDs such as Intel D3-S4510 and D3-S4610, which
REALLY dislike it when software overwrites the same sector several times in
a row. Such SSDs slow down severely under repeated overwrites of the same
sector: roughly, from 25000 or more iops down to 3000 iops. When this option
is set, Vitastor always moves on to the next journal sector after a write
instead of potentially overwriting the same sector again.
Almost all other SSDs (99% of models) do not need this option.
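
The difference between the two modes can be sketched as follows. This is a simplified model that treats the journal as a ring of numbered sectors; the function name and the `sector_full` flag are assumptions made for illustration, not Vitastor internals.

```python
def next_journal_sector(current, sector_count, sector_full,
                        no_same_sector_overwrites):
    """Pick the journal sector for the next write (illustrative model only).

    Without the option, a sector that still has room may be rewritten in
    place. With the option, every write advances to the next sector.
    """
    if no_same_sector_overwrites or sector_full:
        return (current + 1) % sector_count
    return current
```

In this model, enabling the option means no sector is ever written twice in a row, which is the access pattern the SSDs named above handle poorly.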
## throttle_small_writes
- Type: boolean (yes/no)
- Default: false
Enable soft throttling of journaled writes. Useful for hybrid OSDs with fast
metadata devices and slow data devices. The idea is that small writes in
this setup can complete very quickly, because they are first written to the
fast journal device (SSD). Moving them to the slow main device afterwards,
however, takes a long time. So if an OSD quickly accepts a large number of
small writes from clients, it quickly fills its journal, after which write
performance drops sharply to almost zero. Throttling is meant to solve this
problem by artificially slowing down writes based on the amount of free
space in the journal. With this option enabled, small write performance
degrades smoothly rather than abruptly at the moment the journal becomes
completely full.
## throttle_target_iops
- Type: integer
- Default: 100
Target maximum number of throttled operations per second when the journal
has no free space. Set it approximately equal to the maximum random write
performance of your data devices (HDD), in operations per second.
## throttle_target_mbs
- Type: integer
- Default: 100
Target maximum bandwidth, in MB/s, of throttled operations when the journal
has no free space. Set it approximately equal to the maximum linear write
performance of your data devices (HDD).
## throttle_target_parallelism
- Type: integer
- Default: 1
Target maximum parallelism of throttled operations when the journal has no
free space. Set it approximately equal to the internal parallelism of your
data devices (1 for HDD, 4-8 for SSD).
## throttle_threshold_us
- Type: microseconds
- Default: 50
Minimum delay that is applied to throttled operations. Usually does not
need to be changed.
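
To show how the throttle parameters could fit together, here is a minimal sketch. The exact formula Vitastor uses is not given in this document, so the linear scaling with journal fill and the way the targets are combined are assumptions. The function name and its arguments are hypothetical.

```python
def throttle_delay_us(op_bytes, exec_us, journal_used_fraction,
                      target_iops=100, target_mbs=100,
                      target_parallelism=1, threshold_us=50):
    """Extra delay (microseconds) to add to a small write; assumed model.

    journal_used_fraction is 0.0 for an empty journal and 1.0 for a full one.
    """
    # Time one operation is expected to take on the slow data device
    # when the journal is completely full.
    full_us = 1_000_000 * target_parallelism * (
        1.0 / target_iops + op_bytes / (target_mbs * 1024 * 1024)
    )
    # Assumption: the target time grows linearly as the journal fills up.
    target_us = full_us * journal_used_fraction
    delay = target_us - exec_us
    # Delays below the threshold are not worth applying.
    return delay if delay > threshold_us else 0.0
```

In this model an empty journal adds no delay, while a full journal slows a 4 KB write down to about 10 ms, which corresponds to roughly 100 iops at the default settings.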
## osd_memlock
- Type: boolean (yes/no)
- Default: false
Lock all OSD memory with mlockall to prevent it from being swapped out.
Requires a sufficient ulimit -l value (locked memory limit).
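
Before enabling this option, it may help to check the locked memory limit of the environment the OSD runs in. The sketch below uses Python's standard `resource` module on Linux. The helper names are hypothetical, and the amount of memory an OSD actually needs is something you have to estimate yourself.

```python
import resource


def memlock_limit_allows(required_bytes, soft_limit):
    """Check whether a locked-memory soft limit (in bytes) covers required_bytes."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required_bytes


def current_memlock_soft_limit():
    """Return the current RLIMIT_MEMLOCK soft limit of this process, in bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft
```

Note that `getrlimit` reports bytes, while `ulimit -l` in bash reports the same limit in kilobytes.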
