Release 0.6.5

- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header
Fix centos builds (yum-builddep stopped working in el7, cmake in el8..)
2021-07-10 11:01:21 +03:00 · 2021-07-10 11:01:21 +03:00 · 2021-07-10 01:11:20 +03:00 · 2021-07-10 01:06:29 +03:00 · 2021-07-09 21:51:19 +03:00 · 2021-07-09 12:29:39 +03:00
216 changed files with 28478 additions and 7378 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -0,0 +1,19 @@
 .git
 build
 packages
 mon/node_modules
 *.o
 *.so
 osd
 stub_osd
 stub_uring_osd
 stub_bench
 osd_test
 dump_journal
 nbd_proxy
 rm_inode
 fio
 qemu
 rpm/*.Dockerfile
 debian/*.Dockerfile
 Dockerfile
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,18 @@
 *.o
 *.so
 package-lock.json
 fio
 qemu
 osd
 stub_osd
 stub_uring_osd
 stub_bench
 osd_test
 osd_peering_pg_test
 dump_journal
 nbd_proxy
 rm_inode
 test_allocator
 test_blockstore
 test_shit
 osd_rmw_test
--- a/.gitmodules
+++ b/.gitmodules
@@ -0,0 +1,6 @@
 [submodule "cpp-btree"]
 	path = cpp-btree
 	url = ../cpp-btree.git
 [submodule "json11"]
 	path = json11
 	url = ../json11.git
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -0,0 +1,7 @@
 cmake_minimum_required(VERSION 2.8)
 project(vitastor)
 set(VERSION "0.6.5")
 add_subdirectory(src)
--- a/GPL-2.0.txt
+++ b/GPL-2.0.txt
@@ -0,0 +1,339 @@
                    GNU GENERAL PUBLIC LICENSE
                       Version 2, June 1991
 Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
                            Preamble
  The licenses for most software are designed to take away your
 freedom to share and change it.  By contrast, the GNU General Public
 License is intended to guarantee your freedom to share and change free
 software--to make sure the software is free for all its users.  This
 General Public License applies to most of the Free Software
 Foundation's software and to any other program whose authors commit to
 using it.  (Some other Free Software Foundation software is covered by
 the GNU Lesser General Public License instead.)  You can apply it to
 your programs, too.
  When we speak of free software, we are referring to freedom, not
 price.  Our General Public Licenses are designed to make sure that you
 have the freedom to distribute copies of free software (and charge for
 this service if you wish), that you receive source code or can get it
 if you want it, that you can change the software or use pieces of it
 in new free programs; and that you know you can do these things.
  To protect your rights, we need to make restrictions that forbid
 anyone to deny you these rights or to ask you to surrender the rights.
 These restrictions translate to certain responsibilities for you if you
 distribute copies of the software, or if you modify it.
  For example, if you distribute copies of such a program, whether
 gratis or for a fee, you must give the recipients all the rights that
 you have.  You must make sure that they, too, receive or can get the
 source code.  And you must show them these terms so they know their
 rights.
  We protect your rights with two steps: (1) copyright the software, and
 (2) offer you this license which gives you legal permission to copy,
 distribute and/or modify the software.
  Also, for each author's protection and ours, we want to make certain
 that everyone understands that there is no warranty for this free
 software.  If the software is modified by someone else and passed on, we
 want its recipients to know that what they have is not the original, so
 that any problems introduced by others will not reflect on the original
 authors' reputations.
  Finally, any free program is threatened constantly by software
 patents.  We wish to avoid the danger that redistributors of a free
 program will individually obtain patent licenses, in effect making the
 program proprietary.  To prevent this, we have made it clear that any
 patent must be licensed for everyone's free use or not licensed at all.
  The precise terms and conditions for copying, distribution and
 modification follow.
                    GNU GENERAL PUBLIC LICENSE
   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
  0. This License applies to any program or other work which contains
 a notice placed by the copyright holder saying it may be distributed
 under the terms of this General Public License.  The "Program", below,
 refers to any such program or work, and a "work based on the Program"
 means either the Program or any derivative work under copyright law:
 that is to say, a work containing the Program or a portion of it,
 either verbatim or with modifications and/or translated into another
 language.  (Hereinafter, translation is included without limitation in
 the term "modification".)  Each licensee is addressed as "you".
 Activities other than copying, distribution and modification are not
 covered by this License; they are outside its scope.  The act of
 running the Program is not restricted, and the output from the Program
 is covered only if its contents constitute a work based on the
 Program (independent of having been made by running the Program).
 Whether that is true depends on what the Program does.
  1. You may copy and distribute verbatim copies of the Program's
 source code as you receive it, in any medium, provided that you
 conspicuously and appropriately publish on each copy an appropriate
 copyright notice and disclaimer of warranty; keep intact all the
 notices that refer to this License and to the absence of any warranty;
 and give any other recipients of the Program a copy of this License
 along with the Program.
 You may charge a fee for the physical act of transferring a copy, and
 you may at your option offer warranty protection in exchange for a fee.
  2. You may modify your copy or copies of the Program or any portion
 of it, thus forming a work based on the Program, and copy and
 distribute such modifications or work under the terms of Section 1
 above, provided that you also meet all of these conditions:
    a) You must cause the modified files to carry prominent notices
    stating that you changed the files and the date of any change.
    b) You must cause any work that you distribute or publish, that in
    whole or in part contains or is derived from the Program or any
    part thereof, to be licensed as a whole at no charge to all third
    parties under the terms of this License.
    c) If the modified program normally reads commands interactively
    when run, you must cause it, when started running for such
    interactive use in the most ordinary way, to print or display an
    announcement including an appropriate copyright notice and a
    notice that there is no warranty (or else, saying that you provide
    a warranty) and that users may redistribute the program under
    these conditions, and telling the user how to view a copy of this
    License.  (Exception: if the Program itself is interactive but
    does not normally print such an announcement, your work based on
    the Program is not required to print an announcement.)
 These requirements apply to the modified work as a whole.  If
 identifiable sections of that work are not derived from the Program,
 and can be reasonably considered independent and separate works in
 themselves, then this License, and its terms, do not apply to those
 sections when you distribute them as separate works.  But when you
 distribute the same sections as part of a whole which is a work based
 on the Program, the distribution of the whole must be on the terms of
 this License, whose permissions for other licensees extend to the
 entire whole, and thus to each and every part regardless of who wrote it.
 Thus, it is not the intent of this section to claim rights or contest
 your rights to work written entirely by you; rather, the intent is to
 exercise the right to control the distribution of derivative or
 collective works based on the Program.
 In addition, mere aggregation of another work not based on the Program
 with the Program (or with a work based on the Program) on a volume of
 a storage or distribution medium does not bring the other work under
 the scope of this License.
  3. You may copy and distribute the Program (or a work based on it,
 under Section 2) in object code or executable form under the terms of
 Sections 1 and 2 above provided that you also do one of the following:
    a) Accompany it with the complete corresponding machine-readable
    source code, which must be distributed under the terms of Sections
    1 and 2 above on a medium customarily used for software interchange; or,
    b) Accompany it with a written offer, valid for at least three
    years, to give any third party, for a charge no more than your
    cost of physically performing source distribution, a complete
    machine-readable copy of the corresponding source code, to be
    distributed under the terms of Sections 1 and 2 above on a medium
    customarily used for software interchange; or,
    c) Accompany it with the information you received as to the offer
    to distribute corresponding source code.  (This alternative is
    allowed only for noncommercial distribution and only if you
    received the program in object code or executable form with such
    an offer, in accord with Subsection b above.)
 The source code for a work means the preferred form of the work for
 making modifications to it.  For an executable work, complete source
 code means all the source code for all modules it contains, plus any
 associated interface definition files, plus the scripts used to
 control compilation and installation of the executable.  However, as a
 special exception, the source code distributed need not include
 anything that is normally distributed (in either source or binary
 form) with the major components (compiler, kernel, and so on) of the
 operating system on which the executable runs, unless that component
 itself accompanies the executable.
 If distribution of executable or object code is made by offering
 access to copy from a designated place, then offering equivalent
 access to copy the source code from the same place counts as
 distribution of the source code, even though third parties are not
 compelled to copy the source along with the object code.
  4. You may not copy, modify, sublicense, or distribute the Program
 except as expressly provided under this License.  Any attempt
 otherwise to copy, modify, sublicense or distribute the Program is
 void, and will automatically terminate your rights under this License.
 However, parties who have received copies, or rights, from you under
 this License will not have their licenses terminated so long as such
 parties remain in full compliance.
  5. You are not required to accept this License, since you have not
 signed it.  However, nothing else grants you permission to modify or
 distribute the Program or its derivative works.  These actions are
 prohibited by law if you do not accept this License.  Therefore, by
 modifying or distributing the Program (or any work based on the
 Program), you indicate your acceptance of this License to do so, and
 all its terms and conditions for copying, distributing or modifying
 the Program or works based on it.
  6. Each time you redistribute the Program (or any work based on the
 Program), the recipient automatically receives a license from the
 original licensor to copy, distribute or modify the Program subject to
 these terms and conditions.  You may not impose any further
 restrictions on the recipients' exercise of the rights granted herein.
 You are not responsible for enforcing compliance by third parties to
 this License.
  7. If, as a consequence of a court judgment or allegation of patent
 infringement or for any other reason (not limited to patent issues),
 conditions are imposed on you (whether by court order, agreement or
 otherwise) that contradict the conditions of this License, they do not
 excuse you from the conditions of this License.  If you cannot
 distribute so as to satisfy simultaneously your obligations under this
 License and any other pertinent obligations, then as a consequence you
 may not distribute the Program at all.  For example, if a patent
 license would not permit royalty-free redistribution of the Program by
 all those who receive copies directly or indirectly through you, then
 the only way you could satisfy both it and this License would be to
 refrain entirely from distribution of the Program.
 If any portion of this section is held invalid or unenforceable under
 any particular circumstance, the balance of the section is intended to
 apply and the section as a whole is intended to apply in other
 circumstances.
 It is not the purpose of this section to induce you to infringe any
 patents or other property right claims or to contest validity of any
 such claims; this section has the sole purpose of protecting the
 integrity of the free software distribution system, which is
 implemented by public license practices.  Many people have made
 generous contributions to the wide range of software distributed
 through that system in reliance on consistent application of that
 system; it is up to the author/donor to decide if he or she is willing
 to distribute software through any other system and a licensee cannot
 impose that choice.
 This section is intended to make thoroughly clear what is believed to
 be a consequence of the rest of this License.
  8. If the distribution and/or use of the Program is restricted in
 certain countries either by patents or by copyrighted interfaces, the
 original copyright holder who places the Program under this License
 may add an explicit geographical distribution limitation excluding
 those countries, so that distribution is permitted only in or among
 countries not thus excluded.  In such case, this License incorporates
 the limitation as if written in the body of this License.
  9. The Free Software Foundation may publish revised and/or new versions
 of the General Public License from time to time.  Such new versions will
 be similar in spirit to the present version, but may differ in detail to
 address new problems or concerns.
 Each version is given a distinguishing version number.  If the Program
 specifies a version number of this License which applies to it and "any
 later version", you have the option of following the terms and conditions
 either of that version or of any later version published by the Free
 Software Foundation.  If the Program does not specify a version number of
 this License, you may choose any version ever published by the Free Software
 Foundation.
  10. If you wish to incorporate parts of the Program into other free
 programs whose distribution conditions are different, write to the author
 to ask for permission.  For software which is copyrighted by the Free
 Software Foundation, write to the Free Software Foundation; we sometimes
 make exceptions for this.  Our decision will be guided by the two goals
 of preserving the free status of all derivatives of our free software and
 of promoting the sharing and reuse of software generally.
                            NO WARRANTY
  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
 FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
 OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
 PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
 OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
 MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
 TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
 PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
 REPAIR OR CORRECTION.
  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
 REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
 INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
 OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
 TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
 YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
 PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
 POSSIBILITY OF SUCH DAMAGES.
                     END OF TERMS AND CONDITIONS
            How to Apply These Terms to Your New Programs
  If you develop a new program, and you want it to be of the greatest
 possible use to the public, the best way to achieve this is to make it
 free software which everyone can redistribute and change under these terms.
  To do so, attach the following notices to the program.  It is safest
 to attach them to the start of each source file to most effectively
 convey the exclusion of warranty; and each file should have at least
 the "copyright" line and a pointer to where the full notice is found.
    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 Also add information on how to contact you by electronic and paper mail.
 If the program is interactive, make it output a short notice like this
 when it starts in an interactive mode:
    Gnomovision version 69, Copyright (C) year name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.
 The hypothetical commands `show w' and `show c' should show the appropriate
 parts of the General Public License.  Of course, the commands you use may
 be called something other than `show w' and `show c'; they could even be
 mouse-clicks or menu items--whatever suits your program.
 You should also get your employer (if you work as a programmer) or your
 school, if any, to sign a "copyright disclaimer" for the program, if
 necessary.  Here is a sample; alter the names:
  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
  `Gnomovision' (which makes passes at compilers) written by James Hacker.
  <signature of Ty Coon>, 1 April 1989
  Ty Coon, President of Vice
 This General Public License does not permit incorporating your program into
 proprietary programs.  If your program is a subroutine library, you may
 consider it more useful to permit linking proprietary applications with the
 library.  If this is what you want to do, use the GNU Lesser General
 Public License instead of this License.
--- a/27
+++ b/27
@@ -0,0 +1,27 @@
 Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
 All server-side code (OSD, Monitor and so on) is licensed under the terms of
 Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
 GNU GPLv3.0 with the additional "Network Interaction" clause which requires
 opensourcing all programs directly or indirectly interacting with Vitastor
 through a computer network and expressly designed to be used in conjunction
 with it ("Proxy Programs"). Proxy Programs may be made public not only under
 the terms of the same license, but also under the terms of any GPL-Compatible
 Free Software License, as listed by the Free Software Foundation.
 This is a stricter copyleft license than the Affero GPL.
 Please note that VNPL doesn't require you to open the code of proprietary
 software running inside a VM if it's not specially designed to be used with
 Vitastor.
 Basically, you can't use the software in a proprietary environment to provide
 its functionality to users without opensourcing all intermediary components
 standing between the user and Vitastor or purchasing a commercial license
 from the author 😀.
 Client libraries (cluster_client and so on) are dual-licensed under the same
 VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
 software like QEMU and fio.
 You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
 GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).
--- a/90
+++ b/90
@@ -1,90 +0,0 @@
 BLOCKSTORE_OBJS := allocator.o blockstore.o blockstore_impl.o blockstore_init.o blockstore_open.o blockstore_journal.o blockstore_read.o \
 	blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o
 # -fsanitize=address
 CXXFLAGS := -g -O3 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fPIC -fdiagnostics-color=always
 all: $(BLOCKSTORE_OBJS) libfio_blockstore.so osd libfio_sec_osd.so stub_osd stub_bench osd_test dump_journal
 clean:
 	rm -f *.o
 crc32c.o: crc32c.c
 	g++ $(CXXFLAGS) -c -o $@ $<
 json11.o: json11/json11.cpp
 	g++ $(CXXFLAGS) -c -o json11.o json11/json11.cpp
 allocator.o: allocator.cpp allocator.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 ringloop.o: ringloop.cpp ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 timerfd_interval.o: timerfd_interval.cpp timerfd_interval.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 timerfd_manager.o: timerfd_manager.cpp timerfd_manager.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 %.o: %.cpp allocator.h blockstore_flush.h blockstore.h blockstore_impl.h blockstore_init.h blockstore_journal.h crc32c.h ringloop.h object_id.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 dump_journal: dump_journal.cpp crc32c.o blockstore_journal.h
 	g++ $(CXXFLAGS) -o $@ $< crc32c.o
 libblockstore.so: $(BLOCKSTORE_OBJS)
 	g++ $(CXXFLAGS) -o libblockstore.so -shared $(BLOCKSTORE_OBJS) -ltcmalloc_minimal -luring
 libfio_blockstore.so: ./libblockstore.so fio_engine.cpp json11.o
 	g++ $(CXXFLAGS) -shared -o libfio_blockstore.so fio_engine.cpp json11.o ./libblockstore.so -ltcmalloc_minimal -luring
 OSD_OBJS := osd.o osd_secondary.o osd_receive.o osd_send.o osd_peering.o osd_flush.o osd_peering_pg.o \
 	osd_primary.o osd_primary_subops.o etcd_state_client.o cluster_client.o osd_cluster.o http_client.o pg_states.o \
 	osd_rmw.o json11.o base64.o timerfd_manager.o
 base64.o: base64.cpp base64.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_secondary.o: osd_secondary.cpp osd.h osd_ops.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_receive.o: osd_receive.cpp osd.h osd_ops.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_send.o: osd_send.cpp osd.h osd_ops.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_peering.o: osd_peering.cpp osd.h osd_ops.h osd_peering_pg.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_cluster.o: osd_cluster.cpp osd.h osd_ops.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 http_client.o: http_client.cpp http_client.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 etcd_state_client.o: etcd_state_client.cpp etcd_state_client.h http_client.h pg_states.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 cluster_client.o: cluster_client.cpp cluster_client.h osd_ops.h timerfd_manager.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_flush.o: osd_flush.cpp osd.h osd_ops.h osd_peering_pg.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_peering_pg.o: osd_peering_pg.cpp object_id.h osd_peering_pg.h pg_states.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 pg_states.o: pg_states.cpp pg_states.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_rmw.o: osd_rmw.cpp osd_rmw.h xor.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_rmw_test: osd_rmw_test.cpp osd_rmw.cpp osd_rmw.h xor.h
 	g++ $(CXXFLAGS) -o $@ $<
 osd_primary.o: osd_primary.cpp osd_primary.h osd_rmw.h osd.h osd_ops.h osd_peering_pg.h xor.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_primary_subops.o: osd_primary_subops.cpp osd_primary.h osd_rmw.h osd.h osd_ops.h osd_peering_pg.h xor.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd.o: osd.cpp osd.h http_client.h osd_ops.h osd_peering_pg.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
 	g++ $(CXXFLAGS) -o osd osd_main.cpp $(OSD_OBJS) ./libblockstore.so -ltcmalloc_minimal -luring
 stub_osd: stub_osd.cpp osd_ops.h rw_blocking.o
 	g++ $(CXXFLAGS) -o stub_osd stub_osd.cpp rw_blocking.o -ltcmalloc_minimal
 stub_bench: stub_bench.cpp osd_ops.h rw_blocking.o
 	g++ $(CXXFLAGS) -o stub_bench stub_bench.cpp rw_blocking.o -ltcmalloc_minimal
 rw_blocking.o: rw_blocking.cpp rw_blocking.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_test: osd_test.cpp osd_ops.h rw_blocking.o
 	g++ $(CXXFLAGS) -o osd_test osd_test.cpp rw_blocking.o -ltcmalloc_minimal
 osd_peering_pg_test: osd_peering_pg_test.cpp osd_peering_pg.o
 	g++ $(CXXFLAGS) -o $@ $< osd_peering_pg.o -ltcmalloc_minimal
 libfio_sec_osd.so: fio_sec_osd.cpp osd_ops.h rw_blocking.o
 	g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o libfio_sec_osd.so fio_sec_osd.cpp rw_blocking.o -luring
 test_blockstore: ./libblockstore.so test_blockstore.cpp timerfd_interval.o
 	g++ $(CXXFLAGS) -o test_blockstore test_blockstore.cpp timerfd_interval.o ./libblockstore.so -ltcmalloc_minimal -luring
 test: test.cpp osd_peering_pg.o
 	g++ $(CXXFLAGS) -o test test.cpp osd_peering_pg.o -luring -lm
 test_allocator: test_allocator.cpp allocator.o
 	g++ $(CXXFLAGS) -o test_allocator test_allocator.cpp allocator.o
--- a/README-ru.md
+++ b/README-ru.md
@@ -0,0 +1,579 @@
 ## Vitastor
 [Read English version](README.md)
 ## The Idea
 I just want to build a good-quality block SDS!
 Vitastor is a distributed block SDS, a direct counterpart of Ceph RBD and of the internal storage
 systems of popular cloud providers. Unlike them, however, Vitastor is both fast and simple.
 It is just small for now :-).
 Architectural similarity to Ceph means strong consistency built into the write algorithms,
 replication through a primary OSD, symmetric clustering without a single point of failure,
 and automatic data distribution over any number of drives of any size with configurable redundancy
 schemes: replication or arbitrary erasure codes.
 ## Features
 Vitastor is currently a pre-release: advanced features are still missing,
 and breaking changes are likely in future versions.
 However, the following is already implemented:
 - Basic part: reliable clustered block storage without a single point of failure
 - Performance ;-D
 - Multiple redundancy schemes: replication, XOR n+1 (1 parity drive), Reed-Solomon erasure codes
  based on the jerasure library with any number of data and parity drives in a group
 - Configuration via simple human-readable JSON structures in etcd
 - Automatic data distribution over OSDs, with support for:
  - Mathematical optimization for more uniform distribution and less data movement
  - Multiple pools with different redundancy schemes
  - Placement tree, OSD selection by tags / device classes (SSD only, HDD only) and by subtree
  - Configurable failure domains (drive/server/rack and so on)
 - Recovery of degraded blocks
 - Rebalancing, that is, data movement between OSDs (drives)
 - Lazy fsync support (fsync is not called on every operation)
 - I/O statistics reporting to etcd
 - Userspace client library for I/O
 - QEMU disk driver (built outside the QEMU source tree)
 - Disk driver for the fio benchmark tool (also built outside the fio source tree)
 - NBD proxy for mounting images through the kernel ("userspace block device")
 - Image/inode removal tool (vitastor-rm)
 - Packages for Debian and CentOS
 - Per-inode I/O and space usage statistics
 - Inode naming through storing their metadata in etcd
 - Snapshots and copy-on-write clones
 - Smoothing of random write performance in SSD+HDD configurations
 - RDMA/RoCEv2 support via libibverbs
 - CSI plugin for Kubernetes
 - Basic OpenStack support: Cinder driver, patches for Nova and libvirt
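
To illustrate the "XOR n+1" scheme from the list above, here is a minimal Python sketch of single-parity encoding and recovery. It is a generic illustration of the technique, not Vitastor's implementation; the function names `xor_blocks`, `make_parity` and `recover` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block for n data blocks (the '+1' in XOR n+1)."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data "drives" plus one parity "drive" (illustrative data only).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)

# Pretend the second block is lost and rebuild it.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
restored = recover(survivors, parity)
```

Because XOR is its own inverse, one parity block tolerates the loss of exactly one block in the group; tolerating more losses requires codes such as Reed-Solomon.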
 ## Roadmap
 - Snapshot deletion support (layer merging)
 - More correct scripts for disk partitioning and automatic OSD startup
 - Other administration tools
 - Plugins for OpenNebula, Proxmox and other cloud systems
 - iSCSI proxy
 - Faster failover
 - Background integrity checks without checksums (replica comparison)
 - Checksums
 - SSD caching support (tiered storage)
 - NVDIMM support
 - Web interface
 - Possibly compression
 - Possibly data caching through the system page cache
 ## Architecture
 Just like Ceph, Vitastor has:
 - Pools, PGs, OSDs, monitors, failure domains and a placement tree (an analogue of the CRUSH tree).
 - Images split into fixed-size blocks (objects), which are distributed over OSDs.
 - A journal and metadata in each OSD, both of which may also reside on separate fast drives.
 - Transactional writes. Vitastor does, however, have a delayed/lazy fsync (commit) mode
  in which fsync is not called on every write operation, which makes it more
  suitable for "bad" (desktop) SSDs. All write operations are atomic
  in any case.
 - A client library that also tries to wait for recovery after any cluster failure, so
  you can likewise reboot the whole cluster at once, and clients will only hang for a while
  but will not disconnect.
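
As a rough illustration of how an image is cut into fixed-size objects, the Python sketch below splits a byte range of an image into per-object pieces. It assumes the 128 KB default block size mentioned elsewhere in this README; the function name `split_request` and the tuple layout are hypothetical, and the real client library may organize this differently.

```python
BLOCK_SIZE = 128 * 1024  # assumed default block (object) size, in bytes


def split_request(offset, length, block_size=BLOCK_SIZE):
    """Split an image byte range into (object_index, offset_in_object, chunk_length)."""
    parts = []
    end = offset + length
    while offset < end:
        index = offset // block_size            # which object the byte falls into
        inner = offset % block_size             # position inside that object
        chunk = min(block_size - inner, end - offset)
        parts.append((index, inner, chunk))
        offset += chunk
    return parts


# A 64 KB request starting at 100 KB crosses the boundary between objects 0 and 1.
parts = split_request(offset=100 * 1024, length=64 * 1024)
```

Each resulting piece targets exactly one object, so it can be routed to whichever OSD set holds that object.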
 Some basic terms for those not familiar with Ceph:
 - OSD (Object Storage Daemon) - a process that stores data on one drive and handles
  read/write requests from clients.
 - Pool - a container for data that shares the same redundancy scheme and OSD placement rules.
 - PG (Placement Group) - a group of objects stored on the same set of replicas (OSDs).
  Several PGs may be stored on the same set of replicas, but objects of one PG
  are normally not stored on different sets of OSDs.
 - Monitor - a daemon that stores the cluster state.
 - Failure Domain - a group of OSDs that you allow to fail all together.
  In other words, it is a group of OSDs in which the storage system never places different copies
  of the same data block. For example, if the failure domain is a server, then two drives of one server
  never hold 2 or more copies of the same data block, which means that even
  if all drives in that server fail, it is equivalent to losing only 1 copy
  of any data block.
 - Placement Tree (CRUSH Tree) - a hierarchical grouping of OSDs
  into nodes that can then be used as failure domains. That is, a drive (OSD) belongs to
  a server, a server to a rack, a rack to a row, a row to a datacenter and so on.
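
The failure-domain rule above can be sketched in a few lines of Python. This is a deliberately naive greedy selection with the server as the failure domain; it is not the optimizer Vitastor actually uses, and `pick_osds` and the sample `osd_to_host` mapping are hypothetical.

```python
def pick_osds(osd_to_host, copies):
    """Pick `copies` OSDs so that no two of them share a host (failure domain)."""
    chosen, used_hosts = [], set()
    for osd in sorted(osd_to_host):
        if len(chosen) == copies:
            break
        host = osd_to_host[osd]
        if host not in used_hosts:   # skip OSDs in an already used failure domain
            chosen.append(osd)
            used_hosts.add(host)
    if len(chosen) < copies:
        raise ValueError("not enough failure domains for the requested copies")
    return chosen


# OSDs 1 and 2 live on the same server, so they can never hold two copies of one block.
osd_to_host = {1: "host1", 2: "host1", 3: "host2", 4: "host3"}
placement = pick_osds(osd_to_host, copies=3)
```

With this constraint, losing every drive in `host1` removes at most one copy of any block.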
 How Vitastor differs from Ceph:
 - Vitastor is focused primarily on SSDs. It should probably also work reasonably well
  with a combination of SSD and HDD via bcache, and native SSD+HDD
  optimizations may be added in the future. However, storage built on hard drives alone, with no SSDs at all,
  is not a priority, so optimizations for that case may never happen.
 - The Vitastor OSD is single-threaded and will always remain so, because this is the most efficient way to work.
  If 1 core per drive is not enough for you, just split the drive into partitions and run several OSDs on it.
  Most likely, though, 1 core will be enough - Vitastor is not as CPU-hungry as Ceph.
 - The journal and metadata are always kept in memory, so no extra time is ever spent
  reading metadata from disk. Metadata size depends linearly on the drive size and on the data block size,
  which is set in the cluster configuration and is 128 KB by default. With 128 KB blocks, metadata
  takes about 512 MB of memory per 1 TB of disk space (which is still less than Ceph needs).
  The journal does not need to be large at all; for example, the performance tests in this document were run
  with a journal of only 16 MB. A large journal is probably even harmful, because "dirty" writes (writes
  not yet flushed from the journal) also take up memory and may slow things down a little.
 - Vitastor has no internal copy-on-write. I believe that implementing a CoW storage is much harder,
  so it is harder to achieve consistently good results. Maybe one fine day
  I will come up with an elegant algorithm for a CoW storage, but until then there will be no internal CoW in Vitastor.
  None of this applies to "external" CoW (snapshots and clones).
 - The base layer of Vitastor is a simple block storage with fixed-size blocks, not a complex
  object storage with rich features as in Ceph (RADOS).
 - Vitastor has a "lazy fsync" mode in which the OSD batches write requests before flushing them
  to disk, which gives better performance with cheap desktop SSDs that lack capacitors
  ("Advanced Power Loss Protection" / "Capacitor-Based Power Loss Protection").
  Still, this mode is slower than proper server SSDs with immediate
  fsync, because it requires additional network round trips, so it is still recommended
  to use good server drives, especially since they cost almost the same as desktop ones.
 - PGs are ephemeral. This means that they are not stored on disks and exist only in the memory of running OSDs.
 - Recovery processes operate on individual objects, not on whole PGs.
 - There are no PGLOGs.
 - "Monitors" do not store data. Cluster configuration and state are stored in etcd in simple human-readable
  JSON structures. Vitastor monitors only watch the cluster state and manage data movement.
  In this sense the Vitastor monitor is not a critical component of the system and is more like the Ceph
  manager (MGR). The Vitastor monitor is written in node.js.
 - PG distribution is not based on consistent hashes. Instead, all PG mappings are stored directly in etcd
  (there is no problem keeping several hundred or thousand records in memory instead of computing hashes every time).
  PGs are redistributed over OSDs by mathematical optimization,
  namely by reducing the task to an LP (linear programming) problem and solving it with the
  lp_solve utility. This approach usually evens out space usage almost perfectly - uniformity
  is usually 96-99%, unlike Ceph, where bare CRUSH without the balancer usually yields 80-90%.
  It also minimizes the amount of data movement and the randomness of links between OSDs, and lets you change
  the distribution by hand without fear of breaking the rebalancing logic. The approach has a potential
  drawback - it is suspected that it may break down in a very large cluster - but up to
  several hundred OSDs it definitely works fine. And, if needed, consistent hashes are easy
  to implement as well.
 - There is no separate layer similar to "CRUSH rules". You configure redundancy schemes,
  failure domains and OSD selection rules directly in the pool configuration.
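The metadata memory figure above (about 512 MB per 1 TB with 128 KB blocks) can be turned into a rough estimator. This is only a sketch under stated assumptions: it assumes binary units and that metadata is proportional to the number of data blocks; the per-block size is derived from the figure quoted above, not taken from the Vitastor source code.

```python
KIB, MIB, TIB = 1024, 1024**2, 1024**4

# Assumption: derived from "512 MB per 1 TB with 128 KB blocks",
# not read from the Vitastor sources.
BYTES_PER_BLOCK = (512 * MIB) // (TIB // (128 * KIB))


def metadata_memory_bytes(disk_bytes, block_bytes=128 * KIB):
    """Estimate in-memory metadata size as (number of blocks) * (bytes per block)."""
    return (disk_bytes // block_bytes) * BYTES_PER_BLOCK


print(BYTES_PER_BLOCK)                                   # 64
print(metadata_memory_bytes(4 * TIB) // MIB)             # 2048
print(metadata_memory_bytes(1 * TIB, 64 * KIB) // MIB)   # 1024
```

Under these assumptions, halving the block size doubles the estimated metadata memory for the same drive.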
 ## Understanding storage performance
 In short: for a fast storage system, latency matters more than peak IOPS.
 The best possible latency is achieved when testing with 1 thread and a queue depth of 1,
 which roughly corresponds to a minimally loaded cluster. In this case
 IOPS = 1/latency. Latency does not scale with the number of servers, drives, or
 server processes/threads. It depends only on how fast a single
 server process (and the client) handles a single operation.
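The relation IOPS = 1/latency at queue depth 1 is easy to check numerically. The helper names below are illustrative only; latency is expressed in milliseconds, hence the factor of 1000.

```python
def iops_from_latency(latency_ms):
    """Queue depth 1: each operation waits for the previous one to finish."""
    return 1000.0 / latency_ms


def latency_ms_from_iops(iops):
    """Inverse of iops_from_latency()."""
    return 1000.0 / iops


print(round(iops_from_latency(0.1)))      # 10000
print(latency_ms_from_iops(1000))         # 1.0
```

So a storage system with 1 ms write latency cannot exceed 1000 write IOPS for a single-threaded client, no matter how many servers it has.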
 Why does latency matter? Because some applications *cannot* use a queue
 depth greater than 1, since their work cannot be parallelized. An important example is any DBMS
 with consistency support (ACID), because all of them provide it through
 journaling, and journals are written sequentially with an fsync() after every operation.
 fsync, by the way, is another very important thing that is almost always forgotten in benchmarks.
 The point is that all modern drives have write caches/buffers and do not guarantee that
 data is physically written to the media until you call fsync(),
 which the operating system translates into a cache flush command.
 Cheap SSDs for desktops and laptops are very fast without fsync - NVMe drives, for example,
 can handle about 80000 write operations per second at queue depth 1 without fsync.
 With fsync, however, when they really have to write every data block to flash memory,
 they deliver only 1000-2000 write operations per second (the number is practically the same
 for all SSD models).
 Server SSDs often have supercapacitors that act as a built-in uninterruptible
 power supply and give the drive time to flush its DRAM cache to persistent
 flash memory on power loss. Thanks to this, such drives can
 *ignore fsync* with a clear conscience, because they know for sure that cached data will reach
 persistent memory.
 All of the best-known software-defined storage systems, for example Ceph and the internal storage systems used by
 cloud providers such as Amazon, Google and Yandex, are slow in terms of latency.
 At best they deliver latencies from 0.3 ms for reads and 0.6 ms for writes with 4 KB blocks,
 even on the best possible hardware.
 And this is in the SSD era, when you can go to the market and buy an SSD with a read latency
 of 0.1 ms and a write latency of 0.04 ms for 100$ or even less.
 When I need to quickly test the performance of a disk subsystem, I
 use the following 6 commands, with small variations:
 - Linear write:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
 - Linear read:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
 - Single-threaded write (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
 - Single-threaded read (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
 - Parallel write (numjobs is used when 1 CPU core cannot saturate the drive):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
 - Parallel read (numjobs - same as above):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
 ## Theoretical maximum performance of Vitastor
 With replication:
 - Single-threaded (T1Q1) read latency: 1 network RTT + 1 disk read.
 - Single-threaded write+fsync:
  - With immediate commit: 2 RTT + 1 write.
  - With delayed ("lazy") commit: 4 RTT + 1 write + 1 fsync.
 - Parallel read: the sum of the IOPS of all drives, or network throughput if the network saturates first.
 - Parallel write: the sum of the IOPS of all drives / number of replicas / WA, or network throughput if the network saturates first.
 With erasure codes (EC):
 - Single-threaded (T1Q1) read latency: 1.5 RTT + 1 read.
 - Single-threaded write+fsync:
  - With immediate commit: 3.5 RTT + 1 read + 2 writes.
  - With delayed ("lazy") commit: 5.5 RTT + 1 read + 2 writes + 2 fsyncs.
 - The 0.5 actually stands for (k-1)/k, where k is the number of data drives,
  which reflects that no additional network round trip is needed when the read
  is served locally.
 - Parallel read: the sum of the IOPS of all drives, or network throughput if the network saturates first.
 - Parallel write: the sum of the IOPS of all drives / total number of data and parity drives / WA, or network throughput if the network saturates first.
  Note: drive IOPS here should be taken in a mixed read/write mode, in the same proportion as in the formulas above.
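The formulas above can be written down as a small latency model. This is a sketch, not a benchmark: the function names are illustrative, the RTT value in the example is a hypothetical number, and the disk latencies are the 0.1 ms read / 0.04 ms write figures quoted earlier for an inexpensive SSD. All times are in milliseconds.

```python
def replicated_read(rtt, disk_read):
    return rtt + disk_read


def replicated_write(rtt, disk_write, fsync=0.0, lazy=False):
    # Immediate commit: 2 RTT + 1 write; lazy commit: 4 RTT + 1 write + 1 fsync.
    return 4 * rtt + disk_write + fsync if lazy else 2 * rtt + disk_write


def ec_read(rtt, disk_read, k):
    # The "0.5" in the formulas is (k-1)/k, where k is the number of data drives.
    return (1 + (k - 1) / k) * rtt + disk_read


def ec_write(rtt, disk_read, disk_write, k, fsync=0.0, lazy=False):
    frac = (k - 1) / k
    if lazy:
        return (5 + frac) * rtt + disk_read + 2 * disk_write + 2 * fsync
    return (3 + frac) * rtt + disk_read + 2 * disk_write


def parallel_write_iops(total_disk_iops, divisor, wa, network_iops=None):
    # divisor: number of replicas, or data + parity drives for EC.
    iops = total_disk_iops / divisor / wa
    return iops if network_iops is None else min(iops, network_iops)


RTT = 0.05  # hypothetical network round-trip time, ms
print(round(replicated_read(RTT, 0.1), 3))       # 0.15
print(round(replicated_write(RTT, 0.04), 3))     # 0.14
print(round(ec_read(RTT, 0.1, k=2), 3))          # 0.175
print(round(ec_write(RTT, 0.1, 0.04, k=2), 3))   # 0.355
```

The model makes the cost of erasure coding visible: with the same hardware, an EC 2+1 write pays for an extra read, an extra write and 1.5 more round trips than a replicated write.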
 WA (write amplification) for 4 KB blocks in Vitastor is usually 3-5:
 1. Metadata write to the journal
 2. Data block write to the journal
 3. Metadata write to the DB
 4. One more metadata write to the journal when EC is used
 5. Data block write to the data drive
 If you find an SSD that handles 512-byte blocks well (Optane?),
 then items 1, 3 and 4 can be reduced to 512 bytes (1/8 of the data size), giving a WA of only 2.375.
 In addition, WA goes down with delayed/lazy commit under parallel
 load, because journal blocks are written to disk only when they fill up or when fsync
 is explicitly requested.
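The 2.375 figure follows directly from the five steps listed above: the two data block writes (steps 2 and 5) cost one full block each, and each metadata write (steps 1, 3 and, with EC, 4) costs a fraction of a block. A minimal sketch of that arithmetic, with an illustrative function name:

```python
from fractions import Fraction


def write_amplification(meta_fraction, ec=True):
    """meta_fraction: size of one metadata write relative to the 4 KB data block."""
    data_writes = 2                   # steps 2 and 5
    meta_writes = 3 if ec else 2      # steps 1 and 3, plus step 4 with EC
    return data_writes + meta_writes * meta_fraction


print(float(write_amplification(Fraction(1, 8))))  # 2.375 (512-byte metadata writes)
print(write_amplification(1))                      # 5 (4 KB metadata writes, EC)
print(write_amplification(1, ec=False))            # 4 (4 KB metadata writes, replication)
```

This simple model ignores the batching effect of lazy commit mentioned above, which is what can bring the real value below 4.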
 ## Example comparison with Ceph
 Hardware - 4 servers, each with:
 - 6x SATA SSD Intel D3-S4510 3.84 TB
 - 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
 - 384 GB RAM
 - 1x 25 GbE network card (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch
 CPU power saving was disabled. In the tests, both Vitastor and Ceph were deployed with 2 OSDs per SSD.
 All results below refer to random 4 KB workloads (unless explicitly stated otherwise).
 Raw drive performance:
 - T1Q1 write ~27000 iops (~0.037ms latency)
 - T1Q1 read ~9800 iops (~0.101ms latency)
 - T1Q32 write ~60000 iops
 - T1Q32 read ~81700 iops
 Ceph 15.2.4 (Bluestore):
 - T1Q1 write ~1000 iops (~1ms latency)
 - T1Q1 read ~1750 iops (~0.57ms latency)
 - T8Q64 write ~100000 iops, CPU usage by OSD processes about 40 cores on each server
 - T8Q64 read ~480000 iops, CPU usage by OSD processes about 40 cores on each server
 The 8-thread tests were run on 8 400GB RBD images from all hosts (2 fio processes were started on each host).
 This is necessary because in Ceph, several RBD clients writing to 1 image slow down very significantly.
 RocksDB and Bluestore settings in Ceph were left unchanged; the only change was disabling cephx_sign_messages.
 In fact, these results are not that bad for Ceph (it could have been worse).
 These servers are actually well balanced for Ceph - 6 SATA SSDs just about
 saturate the 25-gigabit network, and without 2 powerful CPUs Ceph would not have had enough cores
 to deliver a decent result, as the 40 cores consumed during the
 parallel test show.
 Vitastor:
 - T1Q1 write: 7087 iops (0.14ms latency)
 - T1Q1 read: 6838 iops (0.145ms latency)
 - T2Q64 write: 162000 iops, CPU usage - 3 cores on each server
 - T8Q64 read: 895000 iops, CPU usage - 4 cores on each server
 - Linear write (4M T1Q32): 2800 MB/s
 - Linear read (4M T1Q32): 1500 MB/s
 The 8-thread read test was run on 1 large image (3.2 TB) from all hosts (again, 2 fio processes on each).
 In Vitastor there is no difference between 1 image and 8. Naturally, about 1/4 of the read requests
 in this configuration, as in the Ceph tests above, were served from the local machine. When the
 test was run so that all operations always reached primary OSDs over the network, it was more
 network-bound and the result was about 689000 iops.
 Vitastor settings: `--disable_data_fsync true --immediate_commit all --flusher_count 8
  --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
  --journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
  --journal_size 16777216`.
 ### EC/XOR 2+1
 Vitastor:
 - T1Q1 write: 2808 iops (~0.355ms latency)
 - T1Q1 read: 6190 iops (~0.16ms latency)
 - T2Q64 write: 85500 iops, CPU usage - 3.4 cores on each server
 - T8Q64 read: 812000 iops, CPU usage - 4.7 cores on each server
 - Linear write (4M T1Q32): 3200 MB/s
 - Linear read (4M T1Q32): 1800 MB/s
 Ceph:
 - T1Q1 write: 730 iops (~1.37ms latency)
 - T1Q1 read: 1500 iops with a cold metadata cache (~0.66ms latency), 2300 iops after 2 minutes of warm-up (~0.435ms latency)
 - T4Q128 write (4 RBD images): 45300 iops, CPU usage - 30 cores on each server
 - T8Q64 read (4 RBD images): 278600 iops, CPU usage - 40 cores on each server
 - Linear write (4M T1Q32): 1950 MB/s to an empty image, 2500 MB/s to a filled image
 - Linear read (4M T1Q32): 2400 MB/s
 ### NBD
 NBD stands for "Network Block Device", but it actually also
 works simply as a FUSE analogue for block devices, i.e. it is a
 "block device in userspace".
 NBD is currently the only way to mount Vitastor via the Linux kernel.
 NBD slightly lowers performance because of additional data copying
 between the kernel and userspace. Nevertheless, it is reasonably efficient,
 and random access performance is barely affected.
 Vitastor with a single-threaded NBD proxy on the same test bench:
 - T1Q1 write: 6000 iops (0.166ms latency)
 - T1Q1 read: 5518 iops (0.18ms latency)
 - T1Q128 write: 94400 iops
 - T1Q128 read: 103000 iops
 - Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
 - Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)
 ## Installation
 ### Debian
 - Add the Vitastor repository key:
  `wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
 - Add the Vitastor repository to /etc/apt/sources.list:
  - Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
 - For Debian 10 (Buster), also enable the backports repository:
  `deb http://deb.debian.org/debian buster-backports main`
 - Install the packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
 ### CentOS
 - Add the Vitastor repository to the system:
  - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
  - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
 - Enable EPEL: `yum/dnf install epel-release`
 - Enable additional CentOS repositories:
  - CentOS 7: `yum install centos-release-scl`
  - CentOS 8: `dnf install centos-release-advanced-virtualization`
 - Enable elrepo-kernel:
  - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
  - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
 - Install the packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
 ### Building from source
 - Install kernel 5.4 or newer for io_uring support. 5.8 or later is preferable,
  because 5.4 has at least 1 known bug that causes a hang with io_uring and an HP SmartArray controller.
 - Install liburing 0.4 or newer and its headers.
 - Install lp_solve.
 - Install etcd, version 3.4.15 or newer. Earlier versions will not work because of various bugs,
  for example [#12402](https://github.com/etcd-io/etcd/pull/12402). Alternatively, you can take version 3.4.13 with
  this specific fix from the release-3.4 branch of https://github.com/vitalif/etcd/.
 - Install node.js 10 or newer.
 - Install gcc and g++ 8.x or newer.
 - Clone this repository with submodules: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.
 - It is advisable to rebuild QEMU with the patch that makes running it via LD_PRELOAD unnecessary.
  See `patches/qemu-*.*-vitastor.patch` - pick the version closest to your QEMU version.
 - Install QEMU 3.0 or newer, get the source code of the installed package, start rebuilding it,
  stop the build after a while and copy the following headers:
   - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
   - Debian:
      * Use qemu from the main repository
      * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
      * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
   - CentOS 8:
      * Use qemu from the Advanced-Virtualization repository. To enable it, run
        `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
      * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
      * For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
      * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
   - `config-host.h` and `qapi` are required because they contain auto-generated headers
 - Install fio 3.7 or newer, get the package sources and make a symlink to them at `<vitastor>/fio`.
 - Build and install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
  Pay attention to the cmake variable `QEMU_PLUGINDIR` - on RHEL it must be set to `qemu-kvm`.
 ## Running
 Note: the procedure is still fairly non-trivial; configuration and on-disk offsets
 have to be specified almost by hand. This will be fixed in the near future.
 - SATA SSDs or NVMe drives with capacitors (server SSDs) are preferable. You can also use
  desktop SSDs with lazy fsync mode enabled, but single-threaded write performance
  will suffer in that case.
 - A fast network, at least 10 Gbit/s.
 - For best performance, disable CPU power saving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
 - Set the values you need at the top of `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh`.
 - Create systemd units for etcd and the monitors: `/usr/lib/vitastor/mon/make-units.sh`
 - Create units for the OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
 - You can change OSD parameters in the systemd units. The meaning of some parameters:
  - `disable_data_fsync 1` - disables fsync; used with SSDs with capacitors.
  - `immediate_commit all` - used with SSDs with capacitors.
  - `disable_device_lock 1` - disables device file locking; needed only if you run
    several OSDs on one block device.
  - `flusher_count 256` - a "flusher" is a micro-thread that removes old data from the journal.
    Don't worry about this setting, 256 is now enough almost always.
  - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the
    internal block size of the SSD. It is almost always 4096.
  - `journal_no_same_sector_overwrites true` prevents the same journal sector from being overwritten many times
    in a row during writes. Most (99%) SSDs do not need this option. However, it turned out that
    the drives used on one of the test benches - Intel D3-S4510 - strongly dislike such
    overwrites, and this option was added for them. When this mode is enabled, you also need to raise
    `journal_sector_buffer_count`, because otherwise Vitastor will run out of buffers for journal writes.
 - Start all etcd instances: `systemctl start etcd`
 - Create the global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
  (if all your drives are server drives with capacitors).
 - Create pools: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
  For jerasure EC pools the configuration should look like this: `"2":{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
 - Start all OSDs: `systemctl start vitastor.target`
 - Your cluster should now be ready - one of the monitors should already have configured the PGs, and the OSDs should have started them.
 - You can check PG states directly in etcd: `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should be 'active'.
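Because the pool configuration is a single JSON document, it can be less error-prone to generate it than to type it by hand. The sketch below builds a document with both example pools from the steps above and prints the matching `etcdctl put` command. The `etcdctl_put` helper is hypothetical, and the endpoint is the example address used elsewhere in this README; replace it with your own.

```python
import json
import shlex

# Pool definitions copied from the examples above; keys are pool IDs as strings.
pools = {
    "1": {"name": "testpool", "scheme": "replicated", "pg_size": 2,
          "pg_minsize": 1, "pg_count": 256, "failure_domain": "host"},
    "2": {"name": "ecpool", "scheme": "jerasure", "pg_size": 4, "parity_chunks": 2,
          "pg_minsize": 2, "pg_count": 256, "failure_domain": "host"},
}


def etcdctl_put(endpoints, key, value):
    """Build a shell-safe 'etcdctl put' command line for a JSON value."""
    payload = json.dumps(value, separators=(",", ":"))
    return " ".join(["etcdctl", "--endpoints=" + endpoints, "put", key, shlex.quote(payload)])


cmd = etcdctl_put("http://10.115.0.10:2379/v3", "/vitastor/config/pools", pools)
print(cmd)
```

Serializing with `json.dumps` guarantees that pool IDs are quoted strings, which plain JSON requires for object keys.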
 ### Name an image
 ```
 etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
 ```
 For example:
 ```
 etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
 ```
 If you specify parent_id, the image becomes a CoW clone, i.e. all new write requests go to the new inode, and read
 requests check it first and then the parent layers up the chain. To avoid accidentally overwriting data
 in the parent layer, you can switch it to read-only mode by adding the `"readonly":true` flag to its metadata
 record. In that case the parent image simply becomes a snapshot.
 Thus, to create a snapshot you just need to rename the previous inode (for example, from testimg to testimg@0),
 make it readonly and create a new layer with the original image name (testimg) that refers to the just-renamed inode
 as its parent.
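The snapshot procedure above amounts to two metadata writes. The following sketch only builds the two key/value pairs; it does not talk to etcd. The `snapshot_puts` helper and the new inode number are hypothetical, and the sketch assumes the old inode has no parent of its own (otherwise its `parent_id` would have to be preserved in the renamed record).

```python
import json


def snapshot_puts(pool, old_inode, new_inode, name, size, suffix="@0"):
    """Return (etcd key, JSON value) pairs that turn old_inode into a read-only
    snapshot and create new_inode as a writable layer on top of it."""
    prefix = "/vitastor/config/inode/%d/" % pool
    snapshot = {"name": name + suffix, "size": size, "readonly": True}
    new_layer = {"name": name, "size": size, "parent_id": old_inode}
    return [
        (prefix + str(old_inode), json.dumps(snapshot, separators=(",", ":"))),
        (prefix + str(new_inode), json.dumps(new_layer, separators=(",", ":"))),
    ]


for key, value in snapshot_puts(1, 1, 2, "testimg", 2147483648):
    print("etcdctl --endpoints=<etcd> put %s '%s'" % (key, value))
```

Each printed line has the same shape as the `etcdctl put` example shown earlier in this section.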
 ### Running fio benchmarks
 Example command to run tests:
 ```
 fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
 ```
 If you don't want to access the image by name, you can specify the pool number, inode number and size instead of `-image=testimg`:
 `-pool=1 -inode=1 -size=400G`.
 ### Upload a VM disk image to/from Vitastor
 Use qemu-img and the string `vitastor:etcd_host=<HOST>:image=<IMAGE>` as the disk file name. For example:
 ```
 qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
 ```
 Note that if you use an unmodified QEMU, you need to set the environment variable
 `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so`.
 If you don't want to access the image by name, you can specify the pool number, inode number and size instead of `:image=<IMAGE>`:
 `:pool=<POOL>:inode=<INODE>:size=<SIZE>`.
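Note that in the example above the colon inside the etcd address is escaped with a backslash, because colons separate the options in the file name string. A small helper can build such strings; `vitastor_filename` is a hypothetical name, and escaping only colons is an assumption based on that single example.

```python
def vitastor_filename(etcd_host, image=None, pool=None, inode=None, size=None):
    """Build a 'vitastor:...' file name string for qemu-img or QEMU -drive."""
    def esc(value):
        # Assumption: only ':' needs escaping, as in the README example.
        return str(value).replace(":", "\\:")

    parts = ["vitastor", "etcd_host=" + esc(etcd_host)]
    if image is not None:
        parts.append("image=" + esc(image))
    else:
        parts += ["pool=" + esc(pool), "inode=" + esc(inode), "size=" + esc(size)]
    return ":".join(parts)


print(vitastor_filename("10.115.0.10:2379/v3", image="testimg"))
print(vitastor_filename("10.115.0.10:2379/v3", pool=1, inode=1, size=2147483648))
```

The first call reproduces the file name used in the `qemu-img convert` example; remember to quote the result in the shell so that the backslash is passed through unchanged.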
 ### Start a VM
 To start QEMU, use the option `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` (same as for qemu-img)
 and a physical block size of 4 KB.
 For example:
 ```
 qemu-system-x86_64 -enable-kvm -m 1024 \
  -drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 \
  -vnc 0.0.0.0:0
 ```
 Addressing by numbers (`:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`) works the same way as in qemu-img.
 ### Remove an image
 Use the vitastor-rm tool. For example:
 ```
 vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
 ```
 ### NBD
 To create a local block device, use NBD. For example:
 ```
 vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
 ```
 The command prints a device name like /dev/nbd0, which you can then format
 and use as a regular block device.
 To address the image by inode number, as with other commands, you can use the options
 `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image testimg`.
 ### Kubernetes
 Vitastor has a CSI plugin for Kubernetes that supports RWO volumes.
 To install it, take the manifests from the [csi/deploy/](csi/deploy/) directory, put
 your Vitastor connection configuration in [csi/deploy/001-csi-config-map.yaml](csi/deploy/001-csi-config-map.yaml),
 configure the StorageClass in [csi/deploy/009-storage-class.yaml](csi/deploy/009-storage-class.yaml)
 and apply all `NNN-*.yaml` files to your Kubernetes installation.
 ```
 for i in ./???-*.yaml; do kubectl apply -f $i; done
 ```
 After that you will be able to create PersistentVolumes. See [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml) for an example.
 ## Known problems
 - Object deletion requests may currently lead to "incomplete" objects in EC pools
  if OSDs or servers fail during the deletion, because correct handling of
  deletion requests in a cluster has to be "three-phase", and this is not implemented yet. If you
  run into this situation, just repeat the deletion request.
 ## Implementation principles
 - I like architecturally simple solutions. Vitastor is designed exactly this way, and I intend
  to keep following this principle.
 - If you came here looking for perfect C++ code, you are probably in the wrong place. I don't care much
  about "generally accepted" C++ coding practices, because they too often lead to
  unnecessary complexity, and the code ends up beautiful... but slow.
 - For the same reason, the code sometimes contains reinvented wheels such as its own simplified
  HTTP client for talking to etcd. But these wheels are small and compact and do not require
  a dozen external libraries.
 - node.js for the monitor is not an accidental choice. It is very fast, has a built-in event
  loop, a pleasant neutral C-like programming language and a well-developed ecosystem.
 ## Author and license
 Author: Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
 Join the Vitastor Telegram chat: https://t.me/vitastor
 License: VNPL 1.1 for the server-side code and dual VNPL 1.1 + GPL 2.0+ for the client-side code.
 VNPL is a "network copyleft" license, the project's own free copyleft license,
 Vitastor Network Public License 1.1. It is based on GNU GPL 3.0 with an additional
 "Network Interaction" clause that requires all programs
 specially designed for use with Vitastor and interacting
 with it over the network to be distributed under VNPL or any other free license.
 The idea of VNPL is to extend copyleft not only to modules that are explicitly
 linked with the Vitastor code, but also to modules implemented as microservices
 that interact with it over the network.
 Thus, if you want to build a service on top of Vitastor that contains
 closed-source components interacting with Vitastor, you need a commercial
 license from the author 😀.
 No restrictions are imposed on Windows or any other software not developed *specifically* for use
 with Vitastor.
 Client libraries are distributed under a dual license: VNPL 1.1
 and also GNU GPL 2.0 or later. This is done for
 compatibility with software such as QEMU and fio.
 You can find the full text of VNPL 1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt),
 and GPL 2.0 in the file [GPL-2.0.txt](GPL-2.0.txt).
--- a/README.md
+++ b/README.md
@ -0,0 +1,530 @@
 ## Vitastor
 [Читать на русском](README-ru.md)
 ## The Idea
 Make Software-Defined Block Storage Great Again.
 Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
 architecturally similar to Ceph, which means strong consistency, primary-replication, symmetric
 clustering and automatic data distribution over any number of drives of any size
 with configurable redundancy (replication or erasure codes/XOR).
 ## Features
 Vitastor is currently a pre-release: many features are missing and you should still expect
 breaking changes in the future. However, the following is implemented:
 - Basic part: highly-available block storage with symmetric clustering and no SPOF
 - Performance ;-D
 - Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
  based on jerasure library with any number of data and parity drives in a group
 - Configuration via simple JSON data structures in etcd
 - Automatic data distribution over OSDs, with support for:
  - Mathematical optimization for better uniformity and less data movement
  - Multiple pools
  - Placement tree, OSD selection by tags (device classes) and placement root
  - Configurable failure domains
 - Recovery of degraded blocks
 - Rebalancing (data movement between OSDs)
 - Lazy fsync support
 - I/O statistics reporting to etcd
 - Generic user-space client library
 - QEMU driver (built out-of-tree)
 - Loadable fio engine for benchmarks (also built out-of-tree)
 - NBD proxy for kernel mounts
 - Inode removal tool (vitastor-rm)
 - Packaging for Debian and CentOS
 - Per-inode I/O and space usage statistics
 - Inode metadata storage in etcd
 - Snapshots and copy-on-write image clones
 - Write throttling to smooth random write workloads in SSD+HDD configurations
 - RDMA/RoCEv2 support via libibverbs
 - CSI plugin for Kubernetes
 - Basic OpenStack support: Cinder driver, Nova and libvirt patches
 ## Roadmap
 - Snapshot deletion (layer merge) support
 - Better OSD creation and auto-start tools
 - Other administrative tools
 - Plugins for OpenNebula, Proxmox and other cloud systems
 - iSCSI proxy
 - Faster failover
 - Scrubbing without checksums (verification of replicas)
 - Checksums
 - Tiered storage
 - NVDIMM support
 - Web GUI
 - Compression (possibly)
 - Read caching using system page cache (possibly)
 ## Architecture
 Similarities:
 - Just like Ceph, Vitastor has Pools, PGs, OSDs, Monitors, Failure Domains, Placement Tree.
 - Just like Ceph, Vitastor is transactional (even though there's a "lazy fsync mode" which
  doesn't implicitly flush every operation to disks).
 - OSDs also have journal and metadata and they can also be put on separate drives.
 - Just like in Ceph, the client library attempts to recover from any cluster failure, so
  you can basically reboot the whole cluster and only pause, but not crash, your clients
  (I consider it a bug if the client crashes in that case).
 Some basic terms for people not familiar with Ceph:
 - OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
 - PG (Placement Group) is a container for data that (normally) shares the same replicas.
 - Pool is a container for data that has the same redundancy scheme and placement rules.
 - Monitor is a separate daemon that watches cluster state and handles failures.
 - Failure Domain is a group of OSDs that you allow to fail. It's "host" by default.
 - Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains.
 Architectural differences from Ceph:
 - Vitastor's primary focus is on SSDs. Proper SSD+HDD optimizations may be added in the future, though.
 - Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core
  per drive you should run multiple OSDs each on a different partition of the drive.
  Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
 - Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
  and data store block size which is 128 KB by default. With 128 KB blocks metadata should occupy
  around 512 MB per 1 TB (which is still less than Ceph wants). Journal doesn't have to be big,
  the example test below was conducted with only a 16 MB journal. A big journal is probably even
  harmful, as dirty write metadata also takes some memory.
 - Vitastor storage layer doesn't have internal copy-on-write or redirect-write. I know that maybe
  it's possible to create a good copy-on-write storage, but it's much harder and makes performance
  less deterministic, so CoW isn't used in Vitastor.
 - The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with
  rich semantics like in Ceph (RADOS).
 - There's a "lazy fsync" mode which allows batching writes before flushing them to the disk.
  This makes it possible to use Vitastor with desktop SSDs, but still lowers performance due to additional
  network roundtrips, so use server SSDs with capacitor-based power loss protection
  ("Advanced Power Loss Protection") for best performance.
 - PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory
  while OSDs are running.
 - Recovery process is per-object (per-block), not per-PG. Also there are no PGLOGs.
 - Monitors don't store data. Cluster configuration and state are stored in etcd in simple human-readable
  JSON structures. Monitors only watch cluster state and handle data movement.
  Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's Manager.
  Vitastor's Monitor is implemented in node.js.
 - PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd.
  Rebalancing PGs between OSDs is done by mathematical optimization - data distribution problem
  is reduced to a linear programming problem and solved by lp_solve. This allows for almost
  perfect (96-99% uniformity compared to Ceph's 80-90%) data distribution in most cases, ability
  to map PGs by hand without breaking rebalancing logic, reduced OSD peer-to-peer communication
  (on average, OSDs have fewer peers) and less data movement. It also probably has a drawback -
  this method may fail in very large clusters, but up to several hundreds of OSDs it's perfectly fine.
  It's also easy to add consistent hashes in the future if something proves their necessity.
 - There's no separate CRUSH layer. You select pool redundancy scheme, placement root, failure domain
  and so on directly in pool configuration.
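 The memory figure in the list above (about 512 MB of metadata per 1 TB with 128 KB blocks) can be turned into a rough sizing rule. The sketch below assumes a fixed-size metadata entry per data block; the 64-byte entry size is only inferred from those two README numbers, not taken from Vitastor's source, and binary units are used throughout.

```python
# Rough estimate of OSD metadata memory, derived only from the README's
# figure of ~512 MB per 1 TB at 128 KB blocks. The per-block entry size
# below is INFERRED from those two numbers (an assumption).

KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4

# 512 MB / (1 TB / 128 KB) = 64 bytes per data block (inferred).
BYTES_PER_BLOCK = 512 * MIB // (TIB // (128 * KIB))


def metadata_memory_bytes(capacity_bytes: int, block_size: int = 128 * KIB) -> int:
    """Metadata grows with the number of data blocks on the drive."""
    return capacity_bytes // block_size * BYTES_PER_BLOCK


if __name__ == "__main__":
    for tib in (1, 2, 4):
        mem = metadata_memory_bytes(tib * TIB)
        print(f"{tib} TiB drive, 128 KB blocks: ~{mem // MIB} MiB of metadata in RAM")
```

 Under this model, halving the block size doubles the metadata footprint, which is why the block size matters when sizing RAM per OSD.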
 ## Understanding Storage Performance
 The most important thing for fast storage is latency, not parallel iops.
 The best possible latency is achieved with one thread and queue depth of 1 which basically means
 "client load as low as possible". In this case IOPS = 1/latency, and this number doesn't
 scale with number of servers, drives, server processes or threads and so on.
 Single-threaded IOPS and latency numbers only depend on *how fast a single daemon is*.
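 As a quick illustration of IOPS = 1/latency at queue depth 1, here is a small conversion helper. The function names are made up for this example; the sample values are the raw drive and Vitastor T1Q1 figures quoted later in this README.

```python
def iops_from_latency(latency_ms: float) -> float:
    """Queue depth 1: one operation finishes before the next starts."""
    return 1000.0 / latency_ms


def latency_ms_from_iops(iops: float) -> float:
    """Inverse of the above; only meaningful for T1Q1 results."""
    return 1000.0 / iops


if __name__ == "__main__":
    # ~0.037 ms raw drive write latency corresponds to roughly 27000 iops.
    print(f"0.037 ms -> {iops_from_latency(0.037):.0f} iops")
    # 7087 iops at T1Q1 corresponds to roughly 0.14 ms per operation.
    print(f"7087 iops -> {latency_ms_from_iops(7087):.3f} ms")
```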
 Why is it important? Because some applications *can't* use
 queue depth greater than 1: their task isn't parallelizable. A notable example
 is any ACID DBMS because all of them write their WALs sequentially with fsync()s.
 fsync, by the way, is another important thing often missing in benchmarks. The point is
 that drives have cache buffers and don't guarantee that your data is actually persisted
 until you call fsync() which is translated to a FLUSH CACHE command by the OS.
 Desktop SSDs are very fast without fsync - NVMes, for example, can process ~80000 write
 operations per second with queue depth of 1 without fsync - but they're really slow with
 fsync because they have to actually write data to flash chips when you call fsync. Typical
 number is around 1000-2000 iops with fsync.
 Server SSDs often have supercapacitors that act as a built-in UPS and allow the drive
 to flush its DRAM cache to the persistent flash storage when a power loss occurs.
 This makes them perform equally well with and without fsync. This feature is called
 "Advanced Power Loss Protection" by Intel; other vendors either call it similarly
 or directly as "Full Capacitor-Based Power Loss Protection".
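 To get a feeling for the cost of fsync on a particular drive, a rough sketch like the following can be used. It is only an approximation: it writes through a filesystem and the page cache rather than using direct I/O on a raw device, so it does not replace the fio commands below, and on tmpfs fsync costs almost nothing. The function name and the default count are arbitrary choices for this example.

```python
import os
import tempfile
import time


def mean_write_latency_ms(path: str, count: int = 100, use_fsync: bool = True) -> float:
    """Write `count` 4 KB blocks one at a time and return the mean latency in ms."""
    block = os.urandom(4096)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if use_fsync:
                os.fsync(fd)  # ask the OS to persist the file to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count * 1000.0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "fsync_test.bin")
        for flag in (False, True):
            ms = mean_write_latency_ms(target, use_fsync=flag)
            print(f"fsync={flag}: {ms:.4f} ms per write")
```

 Put the temporary directory on the drive you want to examine; a large gap between the two numbers suggests a drive without power loss protection.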
 All software-defined storages that I currently know are slow in terms of latency.
 Notable examples are Ceph and internal SDSes used by cloud providers like Amazon, Google,
 Yandex and so on. They're all slow and can only reach ~0.3ms read and ~0.6ms 4 KB write latency
 with best-in-slot hardware.
 And that's in the SSD era when you can buy an SSD that has ~0.04ms latency for 100 $.
 I use the following 6 commands with small variations to benchmark any storage:
 - Linear write:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
 - Linear read:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
 - Random write latency (T1Q1, this hurts storages the most):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
 - Random read latency (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
 - Parallel write iops (use numjobs if a single CPU core is insufficient to saturate the load):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
 - Parallel read iops (use numjobs if a single CPU core is insufficient to saturate the load):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
 ## Vitastor's Theoretical Maximum Random Access Performance
 Replicated setups:
 - Single-threaded (T1Q1) read latency: 1 network roundtrip + 1 disk read.
 - Single-threaded write+fsync latency:
  - With immediate commit: 2 network roundtrips + 1 disk write.
  - With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
 - Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
 - Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
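 The replicated-setup formulas above can be written down directly. In this sketch the roundtrip time, the replica count and the write amplification are illustrative inputs rather than measured values, and the network limit is simplified to "bandwidth divided by block size"; the class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    rtt_ms: float            # one network roundtrip (illustrative)
    disk_read_ms: float      # one 4 KB disk read
    disk_write_ms: float     # one 4 KB disk write
    disk_flush_ms: float     # one disk cache flush
    total_read_iops: float   # sum of read iops over all drives
    total_write_iops: float  # sum of write iops over all drives
    network_iops: float      # network bandwidth expressed as 4 KB ops per second


def read_latency_ms(hw: Hardware) -> float:
    # 1 network roundtrip + 1 disk read
    return hw.rtt_ms + hw.disk_read_ms


def write_latency_ms(hw: Hardware, lazy_commit: bool = False) -> float:
    if lazy_commit:
        # 4 network roundtrips + 1 disk write + 1 disk flush
        return 4 * hw.rtt_ms + hw.disk_write_ms + hw.disk_flush_ms
    # 2 network roundtrips + 1 disk write
    return 2 * hw.rtt_ms + hw.disk_write_ms


def saturated_read_iops(hw: Hardware) -> float:
    return min(hw.network_iops, hw.total_read_iops)


def saturated_write_iops(hw: Hardware, replicas: int, write_amplification: float) -> float:
    return min(hw.network_iops, hw.total_write_iops / replicas / write_amplification)


if __name__ == "__main__":
    # 24 drives with the raw figures from the comparison section; rtt is a guess.
    hw = Hardware(
        rtt_ms=0.05, disk_read_ms=0.101, disk_write_ms=0.037, disk_flush_ms=0.0,
        total_read_iops=24 * 81700, total_write_iops=24 * 60000,
        network_iops=4 * 25e9 / 8 / 4096,
    )
    print(f"T1Q1 read:  {read_latency_ms(hw):.3f} ms")
    print(f"T1Q1 write: {write_latency_ms(hw):.3f} ms")
    print(f"parallel write (2 replicas, WA 4): {saturated_write_iops(hw, 2, 4):.0f} iops")
```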
 EC/XOR setups:
 - Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
 - Single-threaded write+fsync latency:
  - With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
  - With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
  - 0.5 is actually (k-1)/k which means that an additional roundtrip doesn't happen when
    the read sub-operation can be served locally.
 - Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
 - Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
  In fact, the disk write iops used in this formula should be measured under a mixed load of ~10% reads / ~90% writes.
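 The EC/XOR numbers above generalize from 2 data drives to k data drives if the fractional part of the roundtrip count is read as (k-1)/k. That reading is an interpretation of the note in the list, and the function names below are made up for this sketch.

```python
def ec_read_roundtrips(k: int) -> float:
    """1 roundtrip plus (k-1)/k for sub-reads that are not served locally."""
    return 1 + (k - 1) / k


def ec_write_roundtrips(k: int, lazy_commit: bool = False) -> float:
    """3 (immediate commit) or 5 (lazy commit) roundtrips plus (k-1)/k."""
    return (5 if lazy_commit else 3) + (k - 1) / k


def ec_write_latency_ms(k: int, rtt_ms: float, disk_read_ms: float,
                        disk_write_ms: float, disk_fsync_ms: float = 0.0,
                        lazy_commit: bool = False) -> float:
    # roundtrips + 1 disk read + 2 disk writes (+ 2 disk fsyncs with lazy commit)
    latency = ec_write_roundtrips(k, lazy_commit) * rtt_ms + disk_read_ms + 2 * disk_write_ms
    if lazy_commit:
        latency += 2 * disk_fsync_ms
    return latency


def ec_saturated_write_iops(network_iops: float, total_disk_write_iops: float,
                            data_drives: int, parity_drives: int,
                            write_amplification: float) -> float:
    efficiency = data_drives / (data_drives + parity_drives)
    return min(network_iops, total_disk_write_iops * efficiency / write_amplification)


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(f"k={k}: read {ec_read_roundtrips(k):.3f} RTT, "
              f"write {ec_write_roundtrips(k):.3f} RTT, "
              f"lazy write {ec_write_roundtrips(k, lazy_commit=True):.3f} RTT")
```

 For k=2 this reproduces the 1.5, 3.5 and 5.5 roundtrips listed above; for larger k the extra roundtrip approaches 1 because fewer sub-reads can be served locally.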
 Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
 1. Journal block write
 2. Journal data write
 3. Metadata block write
 4. Another journal block write for EC/XOR setups
 5. Data block write
 If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
 lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.
 Lazy fsync also reduces WA for parallel workloads because journal blocks are only
 written when they fill up or fsync is requested.
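 The write amplification arithmetic can be checked by summing the five writes listed above. This is a simplified model: it counts every listed write at full size and therefore yields 4 or 5 for 4 KB blocks, while the lower end of the 3-5 range (for example with lazy fsync batching journal blocks) is not modelled. The function name is made up for this sketch.

```python
DATA_SIZE = 4096  # size of the client write in bytes


def write_amplification(journal_block: int, meta_block: int, ec: bool = True,
                        data_size: int = DATA_SIZE) -> float:
    """Bytes written to disk divided by bytes written by the client."""
    written = [
        journal_block,  # 1. journal block write
        data_size,      # 2. journal data write
        meta_block,     # 3. metadata block write
        data_size,      # 5. data block write
    ]
    if ec:
        written.append(journal_block)  # 4. another journal block write (EC/XOR only)
    return sum(written) / data_size


if __name__ == "__main__":
    print("4096-byte journal/meta blocks, EC:", write_amplification(4096, 4096))
    print("4096-byte journal/meta blocks, replicated:", write_amplification(4096, 4096, ec=False))
    print("512-byte journal/meta blocks, EC:", write_amplification(512, 512))
```

 With 512-byte journal and metadata blocks, writes 1, 3 and 4 each cost 1/8 of the data size, so the total is 2 + 3/8 = 2.375.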
 ## Example Comparison with Ceph
 Hardware configuration: 4 nodes, each with:
 - 6x SATA SSD Intel D3-4510 3.84 TB
 - 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
 - 384 GB RAM
 - 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch
 CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.
 All of the results below apply to 4 KB blocks and random access (unless indicated otherwise).
 Raw drive performance:
 - T1Q1 write ~27000 iops (~0.037ms latency)
 - T1Q1 read ~9800 iops (~0.101ms latency)
 - T1Q32 write ~60000 iops
 - T1Q32 read ~81700 iops
 Ceph 15.2.4 (Bluestore):
 - T1Q1 write ~1000 iops (~1ms latency)
 - T1Q1 read ~1750 iops (~0.57ms latency)
 - T8Q64 write ~100000 iops, total CPU usage by OSDs about 40 virtual cores on each node
 - T8Q64 read ~480000 iops, total CPU usage by OSDs about 40 virtual cores on each node
 T8Q64 tests were conducted over 8 400GB RBD images from all hosts (every host was running 2 instances of fio).
 This is because Ceph has performance penalties related to running multiple clients over a single RBD image.
 cephx_sign_messages was set to false during tests, RocksDB and Bluestore settings were left at defaults.
 In fact, not that bad for Ceph. These servers are an example of well-balanced Ceph nodes.
 However, CPU usage and I/O latency were through the roof, as usual.
 Vitastor:
 - T1Q1 write: 7087 iops (0.14ms latency)
 - T1Q1 read: 6838 iops (0.145ms latency)
 - T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
 - T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
 - Linear write (4M T1Q32): 2800 MB/s
 - Linear read (4M T1Q32): 1500 MB/s
 T8Q64 read test was conducted over 1 larger inode (3.2T) from all hosts (every host was running 2 instances of fio).
 Vitastor has no performance penalties related to running multiple clients over a single inode.
 When conducted from one node with all primary OSDs moved to other nodes, the result was slightly lower (689000 iops)
 because all operations resulted in network roundtrips between the client and the primary OSD.
 When fio was colocated with OSDs (like in Ceph benchmarks above), 1/4 of the read workload actually
 used the loopback network.
 Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8
  --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
  --journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
  --journal_size 16777216`.
 ### EC/XOR 2+1
 Vitastor:
 - T1Q1 write: 2808 iops (~0.355ms latency)
 - T1Q1 read: 6190 iops (~0.16ms latency)
 - T2Q64 write: 85500 iops, total CPU usage by OSDs about 3.4 virtual cores on each node
 - T8Q64 read: 812000 iops, total CPU usage by OSDs about 4.7 virtual cores on each node
 - Linear write (4M T1Q32): 3200 MB/s
 - Linear read (4M T1Q32): 1800 MB/s
 Ceph:
 - T1Q1 write: 730 iops (~1.37ms latency)
 - T1Q1 read: 1500 iops with cold cache (~0.66ms latency), 2300 iops after 2 minute metadata cache warmup (~0.435ms latency)
 - T4Q128 write (4 RBD images): 45300 iops, total CPU usage by OSDs about 30 virtual cores on each node
 - T8Q64 read (4 RBD images): 278600 iops, total CPU usage by OSDs about 40 virtual cores on each node
 - Linear write (4M T1Q32): 1950 MB/s before preallocation, 2500 MB/s after preallocation
 - Linear read (4M T1Q32): 2400 MB/s
 ### NBD
 NBD is currently required to mount Vitastor via kernel, but it imposes additional overhead
 due to additional copying between the kernel and userspace. This mostly hurts linear
 bandwidth, not iops.
 Vitastor with single-thread NBD on the same hardware:
 - T1Q1 write: 6000 iops (0.166ms latency)
 - T1Q1 read: 5518 iops (0.18ms latency)
 - T1Q128 write: 94400 iops
 - T1Q128 read: 103000 iops
 - Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
 - Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)
 ## Installation
 ### Debian
 - Trust Vitastor package signing key:
  `wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
 - Add Vitastor package repository to your /etc/apt/sources.list:
  - Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
 - For Debian 10 (Buster) also enable backports repository:
  `deb http://deb.debian.org/debian buster-backports main`
 - Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
 ### CentOS
 - Add Vitastor package repository:
  - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
  - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
 - Enable EPEL: `yum/dnf install epel-release`
 - Enable additional CentOS repositories:
  - CentOS 7: `yum install centos-release-scl`
  - CentOS 8: `dnf install centos-release-advanced-virtualization`
 - Enable elrepo-kernel:
  - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
  - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
 - Install packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
 ### Building from Source
 - Install Linux kernel 5.4 or newer, for io_uring support. 5.8 or later is highly recommended because
  there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
 - Install liburing 0.4 or newer and its headers.
 - Install lp_solve.
 - Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
  for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take 3.4.13
  with this specific fix from here: https://github.com/vitalif/etcd/, branch release-3.4.
 - Install node.js 10 or newer.
 - Install gcc and g++ 8.x or newer.
 - Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
 - Install QEMU 3.0+, get its source, begin to build it, stop the build and copy headers:
   - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
   - Debian:
      * Use qemu packages from the main repository
      * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
      * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
   - CentOS 8:
      * Use qemu packages from the Advanced-Virtualization repository. To enable it, run
        `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
      * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
      * For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
      * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
   - `config-host.h` and `qapi` are required because they contain generated headers
 - You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
  See `patches/qemu-*.*-vitastor.patch`.
 - Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
 - Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
  Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
 ## Running
 Please note that the startup procedure isn't currently simple - you specify configuration
 and calculate disk offsets almost by hand. This will be fixed in the near future.
 - Get some SATA or NVMe SSDs with capacitors (server-grade drives). You can use desktop SSDs
  with lazy fsync, but prepare for inferior single-thread latency.
 - Get a fast network (at least 10 Gbit/s).
 - Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
 - Check `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh` and
  put desired values into the variables at the top of these files.
 - Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
 - Create systemd units for your OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
 - You can edit the units and change OSD configuration. Notable configuration variables:
  - `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
  - `immediate_commit all` - use this if all your drives are server-grade.
  - `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
  - `flusher_count 256` - flusher is a micro-thread that removes old data from the journal.
    You don't have to worry about this parameter anymore, 256 is enough.
  - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
    block size of your SSDs which is 4096 on most drives.
  - `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
    Most (99%) SSDs don't need this option. But Intel D3-4510 does because it doesn't like it when you
    overwrite the same sector twice in a short period of time. The setting forces Vitastor to never
    overwrite the same journal sector twice in a row which makes D3-4510 almost happy. Not totally
    happy, because overwrites of the same block can still happen in the metadata area... When this
    setting is set, it is also required to raise `journal_sector_buffer_count` setting, which is the
    number of dirty journal sectors that may be written to at the same time.
 - `systemctl start vitastor.target` everywhere.
 - Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
  (if all your drives have capacitors).
 - Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
  For jerasure pools the configuration should look like the following: `"2":{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
 - At this point, one of the monitors will configure PGs and OSDs will start them.
 - You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
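 Because `etcdctl put` replaces the whole value of `/vitastor/config/pools`, all pools have to be written as one JSON object. The sketch below builds that object and the matching command line from the fields shown in the steps above. The helper names and the `pg_minsize` sanity check are assumptions made for this example, not part of Vitastor.

```python
import json
import shlex


def replicated_pool(name, pg_size, pg_minsize, pg_count, failure_domain="host"):
    if not 1 <= pg_minsize <= pg_size:  # sanity check assumed for this sketch
        raise ValueError("pg_minsize must be between 1 and pg_size")
    return {"name": name, "scheme": "replicated", "pg_size": pg_size,
            "pg_minsize": pg_minsize, "pg_count": pg_count,
            "failure_domain": failure_domain}


def jerasure_pool(name, pg_size, parity_chunks, pg_minsize, pg_count, failure_domain="host"):
    if not 1 <= pg_minsize <= pg_size:  # sanity check assumed for this sketch
        raise ValueError("pg_minsize must be between 1 and pg_size")
    return {"name": name, "scheme": "jerasure", "pg_size": pg_size,
            "parity_chunks": parity_chunks, "pg_minsize": pg_minsize,
            "pg_count": pg_count, "failure_domain": failure_domain}


def pools_put_command(pools: dict, endpoints: str) -> str:
    """Return an etcdctl command that writes all pools in one JSON object."""
    value = json.dumps({str(pool_id): cfg for pool_id, cfg in pools.items()},
                       separators=(",", ":"))
    return (f"etcdctl --endpoints={shlex.quote(endpoints)} "
            f"put /vitastor/config/pools {shlex.quote(value)}")


if __name__ == "__main__":
    pools = {
        1: replicated_pool("testpool", pg_size=2, pg_minsize=1, pg_count=256),
        2: jerasure_pool("ecpool", pg_size=4, parity_chunks=2, pg_minsize=2, pg_count=256),
    }
    # The endpoint address is the example address used elsewhere in this README.
    print(pools_put_command(pools, "http://10.115.0.10:2379"))
```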
 ### Name an image
 ```
 etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
 ```
 For example:
 ```
 etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
 ```
 If you specify parent_id, the image becomes a CoW clone, i.e. all writes go to the new inode and reads check it first
 and then fall back to its parent layers. You can then make the parent readonly by updating its entry with `"readonly":true`
 for safety and basically treat it as a snapshot.
 So to create a snapshot you basically rename the previous upper layer (for example from testimg to testimg@0), make it readonly
 and create a new top layer with the original name (testimg) and the previous one as a parent.
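 The snapshot procedure described above amounts to two etcd writes. The sketch below only computes the keys and JSON values; it does not talk to etcd. The function names are made up, the new inode number must be a free number chosen by the caller, and the two writes are not made atomic here.

```python
import json


def inode_key(pool: int, inode: int) -> str:
    return f"/vitastor/config/inode/{pool}/{inode}"


def snapshot_puts(pool: int, old_inode: int, old_entry: dict,
                  new_inode: int, snapshot_name: str):
    """Return (key, value) pairs that turn `old_inode` into a readonly snapshot
    and create a new writable top layer carrying the original image name."""
    # Keep all existing fields (including an existing parent_id), rename, freeze.
    frozen = dict(old_entry, name=snapshot_name, readonly=True)
    # New top layer: same name and size, previous layer as parent.
    top = {"name": old_entry["name"], "size": old_entry["size"], "parent_id": old_inode}
    return [
        (inode_key(pool, old_inode), json.dumps(frozen, separators=(",", ":"))),
        (inode_key(pool, new_inode), json.dumps(top, separators=(",", ":"))),
    ]


if __name__ == "__main__":
    current = {"name": "testimg", "size": 2147483648}
    for key, value in snapshot_puts(1, 1, current, new_inode=2, snapshot_name="testimg@0"):
        print(f"etcdctl --endpoints=<etcd> put {key} '{value}'")
```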
 ### Run fio benchmarks
 fio command example:
 ```
 fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
 ```
 If you don't want to access your image by name, you can specify pool number, inode number and size
 (`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
 ### Upload VM image
 Use qemu-img and `vitastor:etcd_host=<HOST>:image=<IMAGE>` disk filename. For example:
 ```
 qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
 ```
 Note that the command must be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
 if you use unmodified QEMU.
 You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
 if you don't want to use inode metadata.
 ### Start a VM
 Run QEMU with `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` and use 4 KB physical block size.
 For example:
 ```
 qemu-system-x86_64 -enable-kvm -m 1024
  -drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
  -vnc 0.0.0.0:0
 ```
 You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`,
 just like in qemu-img.
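 In the examples above the colon inside the etcd address is escaped as `\:` because `:` separates the options in the disk filename. A small helper can build such filenames for both addressing forms. The function name is made up, and only the colon escaping shown in the examples is handled; whether other characters need escaping is not covered here.

```python
def vitastor_drive_filename(etcd_host, image=None, pool=None, inode=None, size=None):
    """Build a 'vitastor:...' disk filename for qemu-img or QEMU's -drive file=."""
    host = etcd_host.replace(":", "\\:")  # ':' separates options, so escape it
    if image is not None:
        return f"vitastor:etcd_host={host}:image={image}"
    if pool is None or inode is None or size is None:
        raise ValueError("specify either image, or pool, inode and size")
    return f"vitastor:etcd_host={host}:pool={pool}:inode={inode}:size={size}"


if __name__ == "__main__":
    print(vitastor_drive_filename("10.115.0.10:2379/v3", image="testimg"))
    print(vitastor_drive_filename("10.115.0.10:2379/v3", pool=1, inode=1, size=2147483648))
```

 Remember to quote the result when passing it through a shell, as the examples above do with single quotes.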
 ### Remove inode
 Use vitastor-rm. For example:
 ```
 vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
 ```
 ### NBD
 To create a local block device for a Vitastor image, use NBD. For example:
 ```
 vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
 ```
 It will output the device name, like /dev/nbd0, which you can then format and mount as a normal block device.
 Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.
 ### Kubernetes
 Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
 To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
 Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](csi/deploy/001-csi-config-map.yaml),
 configure storage class in [csi/deploy/009-storage-class.yaml](csi/deploy/009-storage-class.yaml)
 and apply all `NNN-*.yaml` manifests to your Kubernetes installation:
 ```
 for i in ./???-*.yaml; do kubectl apply -f "$i"; done
 ```
 After that you'll be able to create PersistentVolumes. See example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
 ## Known Problems
 - Object deletion requests may currently lead to 'incomplete' objects in EC pools
  if your OSDs crash during deletion because proper handling of object cleanup
  in a cluster should be "three-phase" and it's currently not implemented.
  Just repeat the removal request in this case.
 ## Implementation Principles
 - I like architecturally simple solutions. Vitastor is and will always be designed
  exactly like that.
 - I also like reinventing the wheel to some extent, like writing my own HTTP client
  for etcd interaction instead of using prebuilt libraries, because in this case
  I'm confident about what my code does and what it doesn't do.
 - I don't care about C++ "best practices" like RAII or proper inheritance or usage of
  smart pointers or whatever and I don't intend to change my mind, so if you're here
  looking for ideal reference C++ code, this probably isn't the right place.
 - I like node.js better than any other dynamically-typed language interpreter
  because it's faster than any other interpreter in the world, has neutral C-like
  syntax and built-in event loop. That's why Monitor is implemented in node.js.
 ## Author and License
 Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
 Join Vitastor Telegram Chat: https://t.me/vitastor
 All server-side code (OSD, Monitor and so on) is licensed under the terms of
 Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
 GNU GPLv3.0 with the additional "Network Interaction" clause which requires
 opensourcing all programs directly or indirectly interacting with Vitastor
 through a computer network and expressly designed to be used in conjunction
 with it ("Proxy Programs"). Proxy Programs may be made public not only under
 the terms of the same license, but also under the terms of any GPL-Compatible
 Free Software License, as listed by the Free Software Foundation.
 This is a stricter copyleft license than the Affero GPL.
 Please note that VNPL doesn't require you to open the code of proprietary
 software running inside a VM if it's not specially designed to be used with
 Vitastor.
 Basically, you can't use the software in a proprietary environment to provide
 its functionality to users without opensourcing all intermediary components
 standing between the user and Vitastor or purchasing a commercial license
 from the author 😀.
 Client libraries (cluster_client and so on) are dual-licensed under the same
 VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
 software like QEMU and fio.
 You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
 GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).
--- a/VNPL-1.1.txt
+++ b/VNPL-1.1.txt
@ -0,0 +1,648 @@
                     VITASTOR NETWORK PUBLIC LICENSE
                     Version 1.1,  6 February 2021
 Copyright (C) 2021 Vitaliy Filippov <vitalif@yourcmc.ru>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
                            Preamble
  The Vitastor Network Public License is a free, copyleft license for
 software and other kinds of works, specifically designed to ensure
 cooperation with the community in the case of network server software.
  The licenses for most software and other practical works are designed
 to take away your freedom to share and change the works.  By contrast,
 GNU General Public Licenses and Vitastor Network Public License are
 intended to guarantee your freedom to share and change all versions
 of a program--to make sure it remains free software for all its users.
  When we speak of free software, we are referring to freedom, not
 price.  GNU General Public Licenses and Vitastor Network Public License
 are designed to make sure that you have the freedom to distribute copies
 of free software (and charge for them if you wish), that you receive
 source code or can get it if you want it, that you can change the software
 or use pieces of it in new free programs, and that you know you can do these
 things.
  Developers that use GNU General Public Licenses and Vitastor
 Network Public License protect your rights with two steps:
 (1) assert copyright on the software, and (2) offer
 you this License which gives you legal permission to copy, distribute
 and/or modify the software.
  A secondary benefit of defending all users' freedom is that
 improvements made in alternate versions of the program, if they
 receive widespread use, become available for other developers to
 incorporate.  Many developers of free software are heartened and
 encouraged by the resulting cooperation.  However, in the case of
 software used on network servers, this result may fail to come about.
 The GNU General Public License permits making a modified version and
 letting the public access it on a server without ever releasing its
 source code to the public. Even the GNU Affero General Public License
 permits running a modified version in a closed environment where
 public users only interact with it through a closed-source proxy, again,
 without making the program and the proxy available to the public
 for free.
  The Vitastor Network Public License is designed specifically to
 ensure that, in such cases, the modified program and the proxy stays
 available to the community. It requires the operator of a network server to
 provide the source code of the original program and all other programs
 communicating with it running there to the users of that server.
 Therefore, public use of a modified version, on a server accessible
 directly or indirectly to the public, gives the public access to the source
 code of the modified version.
  The precise terms and conditions for copying, distribution and
 modification follow.
                       TERMS AND CONDITIONS
  0. Definitions.
  "This License" refers to version 1 of the Vitastor Network Public License.
  "Copyright" also means copyright-like laws that apply to other kinds of
 works, such as semiconductor masks.
  "The Program" refers to any copyrightable work licensed under this
 License.  Each licensee is addressed as "you".  "Licensees" and
 "recipients" may be individuals or organizations.
  To "modify" a work means to copy from or adapt all or part of the work
 in a fashion requiring copyright permission, other than the making of an
 exact copy.  The resulting work is called a "modified version" of the
 earlier work or a work "based on" the earlier work.
  A "covered work" means either the unmodified Program or a work based
 on the Program.
  To "propagate" a work means to do anything with it that, without
 permission, would make you directly or secondarily liable for
 infringement under applicable copyright law, except executing it on a
 computer or modifying a private copy.  Propagation includes copying,
 distribution (with or without modification), making available to the
 public, and in some countries other activities as well.
  To "convey" a work means any kind of propagation that enables other
 parties to make or receive copies.  Mere interaction with a user through
 a computer network, with no transfer of a copy, is not conveying.
  An interactive user interface displays "Appropriate Legal Notices"
 to the extent that it includes a convenient and prominently visible
 feature that (1) displays an appropriate copyright notice, and (2)
 tells the user that there is no warranty for the work (except to the
 extent that warranties are provided), that licensees may convey the
 work under this License, and how to view a copy of this License.  If
 the interface presents a list of user commands or options, such as a
 menu, a prominent item in the list meets this criterion.
  1. Source Code.
  The "source code" for a work means the preferred form of the work
 for making modifications to it.  "Object code" means any non-source
 form of a work.
  A "Standard Interface" means an interface that either is an official
 standard defined by a recognized standards body, or, in the case of
 interfaces specified for a particular programming language, one that
 is widely used among developers working in that language.
  The "System Libraries" of an executable work include anything, other
 than the work as a whole, that (a) is included in the normal form of
 packaging a Major Component, but which is not part of that Major
 Component, and (b) serves only to enable use of the work with that
 Major Component, or to implement a Standard Interface for which an
 implementation is available to the public in source code form.  A
 "Major Component", in this context, means a major essential component
 (kernel, window system, and so on) of the specific operating system
 (if any) on which the executable work runs, or a compiler used to
 produce the work, or an object code interpreter used to run it.
  The "Corresponding Source" for a work in object code form means all
 the source code needed to generate, install, and (for an executable
 work) run the object code and to modify the work, including scripts to
 control those activities.  However, it does not include the work's
 System Libraries, or general-purpose tools or generally available free
 programs which are used unmodified in performing those activities but
 which are not part of the work.  For example, Corresponding Source
 includes interface definition files associated with source files for
 the work, and the source code for shared libraries and dynamically
 linked subprograms that the work is specifically designed to require,
 such as by intimate data communication or control flow between those
 subprograms and other parts of the work.
  The Corresponding Source need not include anything that users
 can regenerate automatically from other parts of the Corresponding
 Source.
  The Corresponding Source for a work in source code form is that
 same work.
  2. Basic Permissions.
  All rights granted under this License are granted for the term of
 copyright on the Program, and are irrevocable provided the stated
 conditions are met.  This License explicitly affirms your unlimited
 permission to run the unmodified Program.  The output from running a
 covered work is covered by this License only if the output, given its
 content, constitutes a covered work.  This License acknowledges your
 rights of fair use or other equivalent, as provided by copyright law.
  You may make, run and propagate covered works that you do not
 convey, without conditions so long as your license otherwise remains
 in force.  You may convey covered works to others for the sole purpose
 of having them make modifications exclusively for you, or provide you
 with facilities for running those works, provided that you comply with
 the terms of this License in conveying all material for which you do
 not control copyright.  Those thus making or running the covered works
 for you must do so exclusively on your behalf, under your direction
 and control, on terms that prohibit them from making any copies of
 your copyrighted material outside their relationship with you.
  Conveying under any other circumstances is permitted solely under
 the conditions stated below.  Sublicensing is not allowed; section 10
 makes it unnecessary.
  3. Protecting Users' Legal Rights From Anti-Circumvention Law.
  No covered work shall be deemed part of an effective technological
 measure under any applicable law fulfilling obligations under article
 11 of the WIPO copyright treaty adopted on 20 December 1996, or
 similar laws prohibiting or restricting circumvention of such
 measures.
  When you convey a covered work, you waive any legal power to forbid
 circumvention of technological measures to the extent such circumvention
 is effected by exercising rights under this License with respect to
 the covered work, and you disclaim any intention to limit operation or
 modification of the work as a means of enforcing, against the work's
 users, your or third parties' legal rights to forbid circumvention of
 technological measures.
  4. Conveying Verbatim Copies.
  You may convey verbatim copies of the Program's source code as you
 receive it, in any medium, provided that you conspicuously and
 appropriately publish on each copy an appropriate copyright notice;
 keep intact all notices stating that this License and any
 non-permissive terms added in accord with section 7 apply to the code;
 keep intact all notices of the absence of any warranty; and give all
 recipients a copy of this License along with the Program.
  You may charge any price or no price for each copy that you convey,
 and you may offer support or warranty protection for a fee.
  5. Conveying Modified Source Versions.
  You may convey a work based on the Program, or the modifications to
 produce it from the Program, in the form of source code under the
 terms of section 4, provided that you also meet all of these conditions:
    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.
    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".
    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.
    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.
  A compilation of a covered work with other separate and independent
 works, which are not by their nature extensions of the covered work,
 and which are not combined with it such as to form a larger program,
 in or on a volume of a storage or distribution medium, is called an
 "aggregate" if the compilation and its resulting copyright are not
 used to limit the access or legal rights of the compilation's users
 beyond what the individual works permit.  Inclusion of a covered work
 in an aggregate does not cause this License to apply to the other
 parts of the aggregate.
  6. Conveying Non-Source Forms.
  You may convey a covered work in object code form under the terms
 of sections 4 and 5, provided that you also convey the
 machine-readable Corresponding Source under the terms of this License,
 in one of these ways:
    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.
    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.
    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.
    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.
    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.
  A separable portion of the object code, whose source code is excluded
 from the Corresponding Source as a System Library, need not be
 included in conveying the object code work.
  A "User Product" is either (1) a "consumer product", which means any
 tangible personal property which is normally used for personal, family,
 or household purposes, or (2) anything designed or sold for incorporation
 into a dwelling.  In determining whether a product is a consumer product,
 doubtful cases shall be resolved in favor of coverage.  For a particular
 product received by a particular user, "normally used" refers to a
 typical or common use of that class of product, regardless of the status
 of the particular user or of the way in which the particular user
 actually uses, or expects or is expected to use, the product.  A product
 is a consumer product regardless of whether the product has substantial
 commercial, industrial or non-consumer uses, unless such uses represent
 the only significant mode of use of the product.
  "Installation Information" for a User Product means any methods,
 procedures, authorization keys, or other information required to install
 and execute modified versions of a covered work in that User Product from
 a modified version of its Corresponding Source.  The information must
 suffice to ensure that the continued functioning of the modified object
 code is in no case prevented or interfered with solely because
 modification has been made.
  If you convey an object code work under this section in, or with, or
 specifically for use in, a User Product, and the conveying occurs as
 part of a transaction in which the right of possession and use of the
 User Product is transferred to the recipient in perpetuity or for a
 fixed term (regardless of how the transaction is characterized), the
 Corresponding Source conveyed under this section must be accompanied
 by the Installation Information.  But this requirement does not apply
 if neither you nor any third party retains the ability to install
 modified object code on the User Product (for example, the work has
 been installed in ROM).
  The requirement to provide Installation Information does not include a
 requirement to continue to provide support service, warranty, or updates
 for a work that has been modified or installed by the recipient, or for
 the User Product in which it has been modified or installed.  Access to a
 network may be denied when the modification itself materially and
 adversely affects the operation of the network or violates the rules and
 protocols for communication across the network.
  Corresponding Source conveyed, and Installation Information provided,
 in accord with this section must be in a format that is publicly
 documented (and with an implementation available to the public in
 source code form), and must require no special password or key for
 unpacking, reading or copying.
  7. Additional Terms.
  "Additional permissions" are terms that supplement the terms of this
 License by making exceptions from one or more of its conditions.
 Additional permissions that are applicable to the entire Program shall
 be treated as though they were included in this License, to the extent
 that they are valid under applicable law.  If additional permissions
 apply only to part of the Program, that part may be used separately
 under those permissions, but the entire Program remains governed by
 this License without regard to the additional permissions.
  When you convey a copy of a covered work, you may at your option
 remove any additional permissions from that copy, or from any part of
 it.  (Additional permissions may be written to require their own
 removal in certain cases when you modify the work.)  You may place
 additional permissions on material, added by you to a covered work,
 for which you have or can give appropriate copyright permission.
  Notwithstanding any other provision of this License, for material you
 add to a covered work, you may (if authorized by the copyright holders of
 that material) supplement the terms of this License with terms:
    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or
    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or
    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or
    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or
    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or
    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.
  All other non-permissive additional terms are considered "further
 restrictions" within the meaning of section 10.  If the Program as you
 received it, or any part of it, contains a notice stating that it is
 governed by this License along with a term that is a further
 restriction, you may remove that term.  If a license document contains
 a further restriction but permits relicensing or conveying under this
 License, you may add to a covered work material governed by the terms
 of that license document, provided that the further restriction does
 not survive such relicensing or conveying.
  If you add terms to a covered work in accord with this section, you
 must place, in the relevant source files, a statement of the
 additional terms that apply to those files, or a notice indicating
 where to find the applicable terms.
  Additional terms, permissive or non-permissive, may be stated in the
 form of a separately written license, or stated as exceptions;
 the above requirements apply either way.
  8. Termination.
  You may not propagate or modify a covered work except as expressly
 provided under this License.  Any attempt otherwise to propagate or
 modify it is void, and will automatically terminate your rights under
 this License (including any patent licenses granted under the third
 paragraph of section 11).
  However, if you cease all violation of this License, then your
 license from a particular copyright holder is reinstated (a)
 provisionally, unless and until the copyright holder explicitly and
 finally terminates your license, and (b) permanently, if the copyright
 holder fails to notify you of the violation by some reasonable means
 prior to 60 days after the cessation.
  Moreover, your license from a particular copyright holder is
 reinstated permanently if the copyright holder notifies you of the
 violation by some reasonable means, this is the first time you have
 received notice of violation of this License (for any work) from that
 copyright holder, and you cure the violation prior to 30 days after
 your receipt of the notice.
  Termination of your rights under this section does not terminate the
 licenses of parties who have received copies or rights from you under
 this License.  If your rights have been terminated and not permanently
 reinstated, you do not qualify to receive new licenses for the same
 material under section 10.
  9. Acceptance Not Required for Having Copies.
  You are not required to accept this License in order to receive or
 run a copy of the Program.  Ancillary propagation of a covered work
 occurring solely as a consequence of using peer-to-peer transmission
 to receive a copy likewise does not require acceptance.  However,
 nothing other than this License grants you permission to propagate or
 modify any covered work.  These actions infringe copyright if you do
 not accept this License.  Therefore, by modifying or propagating a
 covered work, you indicate your acceptance of this License to do so.
  10. Automatic Licensing of Downstream Recipients.
  Each time you convey a covered work, the recipient automatically
 receives a license from the original licensors, to run, modify and
 propagate that work, subject to this License.  You are not responsible
 for enforcing compliance by third parties with this License.
  An "entity transaction" is a transaction transferring control of an
 organization, or substantially all assets of one, or subdividing an
 organization, or merging organizations.  If propagation of a covered
 work results from an entity transaction, each party to that
 transaction who receives a copy of the work also receives whatever
 licenses to the work the party's predecessor in interest had or could
 give under the previous paragraph, plus a right to possession of the
 Corresponding Source of the work from the predecessor in interest, if
 the predecessor has it or can get it with reasonable efforts.
  You may not impose any further restrictions on the exercise of the
 rights granted or affirmed under this License.  For example, you may
 not impose a license fee, royalty, or other charge for exercise of
 rights granted under this License, and you may not initiate litigation
 (including a cross-claim or counterclaim in a lawsuit) alleging that
 any patent claim is infringed by making, using, selling, offering for
 sale, or importing the Program or any portion of it.
  11. Patents.
  A "contributor" is a copyright holder who authorizes use under this
 License of the Program or a work on which the Program is based.  The
 work thus licensed is called the contributor's "contributor version".
  A contributor's "essential patent claims" are all patent claims
 owned or controlled by the contributor, whether already acquired or
 hereafter acquired, that would be infringed by some manner, permitted
 by this License, of making, using, or selling its contributor version,
 but do not include claims that would be infringed only as a
 consequence of further modification of the contributor version.  For
 purposes of this definition, "control" includes the right to grant
 patent sublicenses in a manner consistent with the requirements of
 this License.
  Each contributor grants you a non-exclusive, worldwide, royalty-free
 patent license under the contributor's essential patent claims, to
 make, use, sell, offer for sale, import and otherwise run, modify and
 propagate the contents of its contributor version.
  In the following three paragraphs, a "patent license" is any express
 agreement or commitment, however denominated, not to enforce a patent
 (such as an express permission to practice a patent or covenant not to
 sue for patent infringement).  To "grant" such a patent license to a
 party means to make such an agreement or commitment not to enforce a
 patent against the party.
  If you convey a covered work, knowingly relying on a patent license,
 and the Corresponding Source of the work is not available for anyone
 to copy, free of charge and under the terms of this License, through a
 publicly available network server or other readily accessible means,
 then you must either (1) cause the Corresponding Source to be so
 available, or (2) arrange to deprive yourself of the benefit of the
 patent license for this particular work, or (3) arrange, in a manner
 consistent with the requirements of this License, to extend the patent
 license to downstream recipients.  "Knowingly relying" means you have
 actual knowledge that, but for the patent license, your conveying the
 covered work in a country, or your recipient's use of the covered work
 in a country, would infringe one or more identifiable patents in that
 country that you have reason to believe are valid.
  If, pursuant to or in connection with a single transaction or
 arrangement, you convey, or propagate by procuring conveyance of, a
 covered work, and grant a patent license to some of the parties
 receiving the covered work authorizing them to use, propagate, modify
 or convey a specific copy of the covered work, then the patent license
 you grant is automatically extended to all recipients of the covered
 work and works based on it.
  A patent license is "discriminatory" if it does not include within
 the scope of its coverage, prohibits the exercise of, or is
 conditioned on the non-exercise of one or more of the rights that are
 specifically granted under this License.  You may not convey a covered
 work if you are a party to an arrangement with a third party that is
 in the business of distributing software, under which you make payment
 to the third party based on the extent of your activity of conveying
 the work, and under which the third party grants, to any of the
 parties who would receive the covered work from you, a discriminatory
 patent license (a) in connection with copies of the covered work
 conveyed by you (or copies made from those copies), or (b) primarily
 for and in connection with specific products or compilations that
 contain the covered work, unless you entered into that arrangement,
 or that patent license was granted, prior to 28 March 2007.
  Nothing in this License shall be construed as excluding or limiting
 any implied license or other defenses to infringement that may
 otherwise be available to you under applicable patent law.
  12. No Surrender of Others' Freedom.
  If conditions are imposed on you (whether by court order, agreement or
 otherwise) that contradict the conditions of this License, they do not
 excuse you from the conditions of this License.  If you cannot convey a
 covered work so as to satisfy simultaneously your obligations under this
 License and any other pertinent obligations, then as a consequence you may
 not convey it at all.  For example, if you agree to terms that obligate you
 to collect a royalty for further conveying from those to whom you convey
 the Program, the only way you could satisfy both those terms and this
 License would be to refrain entirely from conveying the Program.
  13. Remote Network Interaction.
  A "Proxy Program" means a separate program which is specially designed to
 be used in conjunction with the covered work and interacts with it directly
 or indirectly through any kind of API (application programming interfaces),
 a computer network, an imitation of such network, or another Proxy Program
 itself.
  Notwithstanding any other provision of this License, if you provide any user
 with an opportunity to interact with the covered work through a computer
 network, an imitation of such network, or any number of "Proxy Programs",
 you must prominently offer that user an opportunity to receive the
 Corresponding Source of the covered work and all Proxy Programs from a
 network server at no charge, through some standard or customary means of
 facilitating copying of software. The Corresponding Source for the covered
 work must be made available under the conditions of this License, and
 the Corresponding Source for all Proxy Programs must be made available
 under the conditions of either this License or any GPL-Compatible
 Free Software License, as described by the Free Software Foundation
 in their "GPL-Compatible License List".
  14. Revised Versions of this License.
  Vitastor Author may publish revised and/or new versions of
 the Vitastor Network Public License from time to time.  Such new versions
 will be similar in spirit to the present version, but may differ in detail to
 address new problems or concerns.
  Each version is given a distinguishing version number.  If the
 Program specifies that a certain numbered version of the Vitastor Network
 Public License "or any later version" applies to it, you have the
 option of following the terms and conditions either of that numbered
 version or of any later version. If the Program does not specify a version
 number of the Vitastor Network Public License, you may choose any version
 ever published.
  Later license versions may give you additional or different
 permissions.  However, no additional obligations are imposed on any
 author or copyright holder as a result of your choosing to follow a
 later version.
  15. Disclaimer of Warranty.
  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
 APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
 HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
 OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
 THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
 IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
 ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
  16. Limitation of Liability.
  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
 THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
 GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
 USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
 DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
 PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
 EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGES.
  17. Interpretation of Sections 15 and 16.
  If the disclaimer of warranty and limitation of liability provided
 above cannot be given local legal effect according to their terms,
 reviewing courts shall apply local law that most closely approximates
 an absolute waiver of all civil liability in connection with the
 Program, unless a warranty or assumption of liability accompanies a
 copy of the Program in return for a fee.
                     END OF TERMS AND CONDITIONS
            How to Apply These Terms to Your New Programs
  If you develop a new program, and you want it to be of the greatest
 possible use to the public, the best way to achieve this is to make it
 free software which everyone can redistribute and change under these terms.
  To do so, attach the following notices to the program.  It is safest
 to attach them to the start of each source file to most effectively
 state the exclusion of warranty; and each file should have at least
 the "copyright" line and a pointer to where the full notice is found.
    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>
    This program is free software: you can redistribute it and/or modify
    it under the terms of the Vitastor Network Public License as published by
    the Vitastor Author, either version 1 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    Vitastor Network Public License for more details.
 Also add information on how to contact you by electronic and paper mail.
  If your software can interact with users remotely through a computer
 network, you should also make sure that it provides a way for users to
 get its source.  For example, if your program is a web application, its
 interface could display a "Source" link that leads users to an archive
 of the code.  There are many ways you could offer source, and different
 solutions will be better for different programs; see section 13 for the
 specific requirements.
--- a/blockstore_write.cpp
+++ b/blockstore_write.cpp
@ -1,488 +0,0 @@
 #include "blockstore_impl.h"
 bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
 {
    // Check or assign version number
    bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
    bool is_inflight_big = false;
    uint64_t version = 1;
    if (dirty_db.size() > 0)
    {
        auto dirty_it = dirty_db.upper_bound((obj_ver_id){
            .oid = op->oid,
            .version = UINT64_MAX,
        });
        dirty_it--; // would segfault on an empty dirty_db, hence the size() check above
        if (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
        {
            found = true;
            version = dirty_it->first.version + 1;
            deleted = IS_DELETE(dirty_it->second.state);
            is_inflight_big = (dirty_it->second.state >= ST_D_IN_FLIGHT &&
                dirty_it->second.state < ST_D_SYNCED) ||
                dirty_it->second.state == ST_J_WAIT_BIG;
        }
    }
    if (!found)
    {
        auto clean_it = clean_db.find(op->oid);
        if (clean_it != clean_db.end())
        {
            version = clean_it->second.version + 1;
        }
        else
        {
            deleted = true;
        }
    }
    if (op->version == 0)
    {
        op->version = version;
    }
    else if (op->version < version)
    {
        // Invalid version requested
        op->retval = -EEXIST;
        return false;
    }
    if (deleted && is_del)
    {
        // Already deleted
        op->retval = 0;
        return false;
    }
    if (is_inflight_big && !is_del && !deleted && op->len < block_size &&
        immediate_commit != IMMEDIATE_ALL)
    {
        // Issue an additional sync so that the previous big write can reach the journal
        blockstore_op_t *sync_op = new blockstore_op_t;
        sync_op->opcode = BS_OP_SYNC;
        sync_op->callback = [this, op](blockstore_op_t *sync_op)
        {
            delete sync_op;
        };
        enqueue_op(sync_op);
    }
 #ifdef BLOCKSTORE_DEBUG
    if (is_del)
        printf("Delete %lu:%lu v%lu\n", op->oid.inode, op->oid.stripe, op->version);
    else
        printf("Write %lu:%lu v%lu offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
 #endif
    // No strict need to add it into dirty_db here, it's just left
    // from the previous implementation where reads waited for writes
    dirty_db.emplace((obj_ver_id){
        .oid = op->oid,
        .version = op->version,
    }, (dirty_entry){
        .state = (uint32_t)(
            is_del
                ? ST_DEL_IN_FLIGHT
                : (op->len == block_size || deleted ? ST_D_IN_FLIGHT : (is_inflight_big ? ST_J_WAIT_BIG : ST_J_IN_FLIGHT))
        ),
        .flags = 0,
        .location = 0,
        .offset = is_del ? 0 : op->offset,
        .len = is_del ? 0 : op->len,
        .journal_sector = 0,
    });
    return true;
 }
 // First step of the write algorithm: dequeue operation and submit initial write(s)
 int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
 {
    if (PRIV(op)->op_state)
    {
        return continue_write(op);
    }
    auto dirty_it = dirty_db.find((obj_ver_id){
        .oid = op->oid,
        .version = op->version,
    });
    if (dirty_it->second.state == ST_J_WAIT_BIG)
    {
        return 0;
    }
    else if (dirty_it->second.state == ST_D_IN_FLIGHT)
    {
        blockstore_journal_check_t space_check(this);
        if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
        {
            return 0;
        }
        // Big (redirect) write
        uint64_t loc = data_alloc->find_free();
        if (loc == UINT64_MAX)
        {
            // no space
            if (flusher->is_active())
            {
                // hope that some space will be available after flush
                PRIV(op)->wait_for = WAIT_FREE;
                return 0;
            }
            op->retval = -ENOSPC;
            FINISH_OP(op);
            return 1;
        }
        BS_SUBMIT_GET_SQE(sqe, data);
        dirty_it->second.location = loc << block_order;
        dirty_it->second.state = ST_D_SUBMITTED;
 #ifdef BLOCKSTORE_DEBUG
        printf("Allocate block %lu\n", loc);
 #endif
        data_alloc->set(loc, true);
        uint64_t stripe_offset = (op->offset % bitmap_granularity);
        uint64_t stripe_end = (op->offset + op->len) % bitmap_granularity;
        // Zero fill up to bitmap_granularity
        int vcnt = 0;
        if (stripe_offset)
        {
            PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_offset };
        }
        PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ op->buf, op->len };
        if (stripe_end)
        {
            stripe_end = bitmap_granularity - stripe_end;
            PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_end };
        }
        data->iov.iov_len = op->len + stripe_offset + stripe_end; // to check it in the callback
        data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        my_uring_prep_writev(
            sqe, data_fd, PRIV(op)->iov_zerofill, vcnt, data_offset + (loc << block_order) + op->offset - stripe_offset
        );
        PRIV(op)->pending_ops = 1;
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        if (immediate_commit != IMMEDIATE_ALL)
        {
            // Remember big write as unsynced
            unsynced_big_writes.push_back((obj_ver_id){
                .oid = op->oid,
                .version = op->version,
            });
            PRIV(op)->op_state = 3;
        }
        else
        {
            PRIV(op)->op_state = 1;
        }
    }
    else
    {
        // Small (journaled) write
        // First check if the journal has sufficient space
        blockstore_journal_check_t space_check(this);
        if ((unsynced_big_writes.size() && !space_check.check_available(op, unsynced_big_writes.size(), sizeof(journal_entry_big_write), 0))
            || !space_check.check_available(op, 1, sizeof(journal_entry_small_write), op->len + JOURNAL_STABILIZE_RESERVATION))
        {
            return 0;
        }
        // There is sufficient space. Get SQE(s)
        struct io_uring_sqe *sqe1 = NULL;
        if (immediate_commit != IMMEDIATE_NONE ||
            ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
            journal.sector_info[journal.cur_sector].dirty))
        {
            // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
            BS_SUBMIT_GET_SQE_DECL(sqe1);
        }
        struct io_uring_sqe *sqe2 = NULL;
        if (op->len > 0)
        {
            BS_SUBMIT_GET_SQE_DECL(sqe2);
        }
        // Got SQEs. Prepare previous journal sector write if required
        auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        if (immediate_commit == IMMEDIATE_NONE)
        {
            if (sqe1)
            {
                prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
                PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
                PRIV(op)->pending_ops++;
            }
            else
            {
                PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
            }
        }
        // Then pre-fill journal entry
        journal_entry_small_write *je = (journal_entry_small_write*)
            prefill_single_journal_entry(journal, JE_SMALL_WRITE, sizeof(journal_entry_small_write));
        dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
        journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
        printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
 #endif
        // Figure out where data will be
        journal.next_free = (journal.next_free + op->len) <= journal.len ? journal.next_free : journal_block_size;
        je->oid = op->oid;
        je->version = op->version;
        je->offset = op->offset;
        je->len = op->len;
        je->data_offset = journal.next_free;
        je->crc32_data = crc32c(0, op->buf, op->len);
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
        if (immediate_commit != IMMEDIATE_NONE)
        {
            prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
            PRIV(op)->pending_ops++;
        }
        if (op->len > 0)
        {
            // Prepare journal data write
            if (journal.inmemory)
            {
                // Copy data
                memcpy(journal.buffer + journal.next_free, op->buf, op->len);
            }
            ring_data_t *data2 = ((ring_data_t*)sqe2->user_data);
            data2->iov = (struct iovec){ op->buf, op->len };
            data2->callback = cb;
            my_uring_prep_writev(
                sqe2, journal.fd, &data2->iov, 1, journal.offset + journal.next_free
            );
            PRIV(op)->pending_ops++;
        }
        else
        {
            // Zero-length overwrite. Allowed to bump object version in EC placement groups without actually writing data
        }
        dirty_it->second.location = journal.next_free;
        dirty_it->second.state = ST_J_SUBMITTED;
        journal.next_free += op->len;
        if (journal.next_free >= journal.len)
        {
            journal.next_free = journal_block_size;
        }
        if (immediate_commit == IMMEDIATE_NONE)
        {
            // Remember small write as unsynced
            unsynced_small_writes.push_back((obj_ver_id){
                .oid = op->oid,
                .version = op->version,
            });
        }
        if (!PRIV(op)->pending_ops)
        {
            PRIV(op)->op_state = 4;
            continue_write(op);
        }
        else
        {
            PRIV(op)->op_state = 3;
        }
    }
    inflight_writes++;
    return 1;
 }
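The small-write path above places data in the journal ring by wrapping `next_free` back to `journal_block_size` rather than to 0. A minimal sketch of that placement rule follows, assuming a simplified, hypothetical `journal_pos_t` struct and `reserve_journal_data()` helper (neither is in the source). It also assumes the free-space check has already passed, which the real code does through `blockstore_journal_check_t`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the journal position fields used above
struct journal_pos_t
{
    uint64_t len;        // total journal size in bytes
    uint64_t block_size; // wrap target; the source wraps to journal_block_size, not 0
    uint64_t next_free;  // offset where the next data chunk would be placed
};

// Reserve data_len bytes in the ring and return the offset to write them at.
// Free-space accounting is deliberately left out of this sketch.
uint64_t reserve_journal_data(journal_pos_t & j, uint64_t data_len)
{
    assert(data_len <= j.len - j.block_size);
    // A chunk is never split: if it doesn't fit before the end, wrap first
    if (j.next_free + data_len > j.len)
    {
        j.next_free = j.block_size;
    }
    uint64_t data_offset = j.next_free;
    j.next_free += data_len;
    // Landing exactly on the end of the journal also wraps
    if (j.next_free >= j.len)
    {
        j.next_free = j.block_size;
    }
    return data_offset;
}
```

Because a chunk is always kept contiguous, the tail of the journal may stay unused when a write does not fit there.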
 int blockstore_impl_t::continue_write(blockstore_op_t *op)
 {
    io_uring_sqe *sqe = NULL;
    journal_entry_big_write *je;
    auto dirty_it = dirty_db.find((obj_ver_id){
        .oid = op->oid,
        .version = op->version,
    });
    if (PRIV(op)->op_state == 2)
        goto resume_2;
    else if (PRIV(op)->op_state == 4)
        goto resume_4;
    else
        return 1;
 resume_2:
    // Only for the immediate_commit mode: prepare and submit big_write journal entry
    sqe = get_sqe();
    if (!sqe)
    {
        return 0;
    }
    je = (journal_entry_big_write*)prefill_single_journal_entry(journal, JE_BIG_WRITE, sizeof(journal_entry_big_write));
    dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
    journal.sector_info[journal.cur_sector].dirty = false;
    journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
    printf("journal offset %lu is used by %lu:%lu v%lu\n", journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version);
 #endif
    je->oid = op->oid;
    je->version = op->version;
    je->offset = op->offset;
    je->len = op->len;
    je->location = dirty_it->second.location;
    je->crc32 = je_crc32((journal_entry*)je);
    journal.crc32_last = je->crc32;
    prepare_journal_sector_write(journal, journal.cur_sector, sqe,
        [this, op](ring_data_t *data) { handle_write_event(data, op); });
    PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
    PRIV(op)->pending_ops = 1;
    PRIV(op)->op_state = 3;
    return 1;
 resume_4:
    // Switch object state
 #ifdef BLOCKSTORE_DEBUG
    printf("Ack write %lu:%lu v%lu = %d\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
 #endif
    bool imm = dirty_it->second.state == ST_D_SUBMITTED
        ? (immediate_commit == IMMEDIATE_ALL)
        : (immediate_commit != IMMEDIATE_NONE);
    if (imm)
    {
        auto & unstab = unstable_writes[op->oid];
        unstab = unstab < op->version ? op->version : unstab;
    }
    if (dirty_it->second.state == ST_J_SUBMITTED)
    {
        dirty_it->second.state = imm ? ST_J_SYNCED : ST_J_WRITTEN;
    }
    else if (dirty_it->second.state == ST_D_SUBMITTED)
    {
        dirty_it->second.state = imm ? ST_D_SYNCED : ST_D_WRITTEN;
    }
    else if (dirty_it->second.state == ST_DEL_SUBMITTED)
    {
        dirty_it->second.state = imm ? ST_DEL_SYNCED : ST_DEL_WRITTEN;
    }
    if (immediate_commit == IMMEDIATE_ALL)
    {
        dirty_it++;
        while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
        {
            if (dirty_it->second.state == ST_J_WAIT_BIG)
            {
                dirty_it->second.state = ST_J_IN_FLIGHT;
            }
            dirty_it++;
        }
    }
    inflight_writes--;
    // Acknowledge write
    op->retval = op->len;
    FINISH_OP(op);
    return 1;
 }
 void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *op)
 {
    live = true;
    if (data->res != data->iov.iov_len)
    {
        inflight_writes--;
         // FIXME: our in-memory state becomes corrupted after a write error. Maybe do something better than just die
        throw std::runtime_error(
            "write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
        );
    }
    PRIV(op)->pending_ops--;
    if (PRIV(op)->pending_ops == 0)
    {
        release_journal_sectors(op);
        PRIV(op)->op_state++;
        if (!continue_write(op))
        {
            submit_queue.push_front(op);
        }
    }
 }
 void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
 {
    // Release flushed journal sectors
    if (PRIV(op)->min_flushed_journal_sector > 0 &&
        PRIV(op)->max_flushed_journal_sector > 0)
    {
        uint64_t s = PRIV(op)->min_flushed_journal_sector;
        while (1)
        {
            journal.sector_info[s-1].usage_count--;
            if (s != (1+journal.cur_sector) && journal.sector_info[s-1].usage_count == 0)
            {
                // We know for sure that we won't write into this sector anymore
                uint64_t new_ds = journal.sector_info[s-1].offset + journal.block_size;
                if ((journal.dirty_start + (journal.dirty_start >= journal.used_start ? 0 : journal.len)) <
                    (new_ds + (new_ds >= journal.used_start ? 0 : journal.len)))
                {
                    journal.dirty_start = new_ds;
                }
            }
            if (s == PRIV(op)->max_flushed_journal_sector)
                break;
            s = 1 + s % journal.sector_count;
        }
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
    }
 }
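`release_journal_sectors()` above stores sector numbers as `1 + index`, so that 0 can mean "no sector", and walks from the minimum to the maximum with `s = 1 + s % journal.sector_count`. The sketch below isolates that iteration. The function name `sectors_in_range()` is hypothetical and returns the visited 0-based indexes only so the traversal order can be inspected.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Visit every 1-based sector number from `first` to `last` inclusive,
// wrapping around after `sector_count`. 0 means "no sector" in this encoding.
std::vector<uint64_t> sectors_in_range(uint64_t first, uint64_t last, uint64_t sector_count)
{
    std::vector<uint64_t> visited;
    if (first == 0 || last == 0)
    {
        return visited;
    }
    assert(first <= sector_count && last <= sector_count);
    uint64_t s = first;
    while (true)
    {
        visited.push_back(s - 1); // convert back to a 0-based array index
        if (s == last)
        {
            break;
        }
        s = 1 + s % sector_count; // sector_count is followed by 1, never by 0
    }
    return visited;
}
```

With 4 sectors, a range from 3 to 1 visits indexes 2, 3 and 0 in that order, so a range that crosses the end of the ring is handled without a special case.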
 int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
 {
    auto dirty_it = dirty_db.find((obj_ver_id){
        .oid = op->oid,
        .version = op->version,
    });
    blockstore_journal_check_t space_check(this);
    if (!space_check.check_available(op, 1, sizeof(journal_entry_del), 0))
    {
        return 0;
    }
    io_uring_sqe *sqe = NULL;
    if (immediate_commit != IMMEDIATE_NONE ||
        (journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
        journal.sector_info[journal.cur_sector].dirty)
    {
        // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
        BS_SUBMIT_GET_SQE_DECL(sqe);
    }
    auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
    // Prepare journal sector write
    if (immediate_commit == IMMEDIATE_NONE)
    {
        if (sqe)
        {
            prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
            PRIV(op)->pending_ops++;
        }
        else
        {
            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        }
    }
    // Pre-fill journal entry
    journal_entry_del *je = (journal_entry_del*)
        prefill_single_journal_entry(journal, JE_DELETE, sizeof(struct journal_entry_del));
    dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
    journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
    printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
 #endif
    je->oid = op->oid;
    je->version = op->version;
    je->crc32 = je_crc32((journal_entry*)je);
    journal.crc32_last = je->crc32;
    dirty_it->second.state = ST_DEL_SUBMITTED;
    if (immediate_commit != IMMEDIATE_NONE)
    {
        prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
        PRIV(op)->pending_ops++;
         // Remember the delete as unsynced (deletes are tracked in the small-write list)
        unsynced_small_writes.push_back((obj_ver_id){
            .oid = op->oid,
            .version = op->version,
        });
    }
    if (!PRIV(op)->pending_ops)
    {
        PRIV(op)->op_state = 4;
        continue_write(op);
    }
    else
    {
        PRIV(op)->op_state = 3;
    }
    return 1;
 }
--- a/cluster_client.cpp
+++ b/cluster_client.cpp
@ -1,358 +0,0 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/socket.h>
 #include <sys/epoll.h>
 #include <netinet/tcp.h>
 #include "cluster_client.h"
 osd_op_t::~osd_op_t()
 {
    assert(!bs_op);
    if (op_data)
    {
        free(op_data);
    }
    if (rmw_buf)
    {
        free(rmw_buf);
    }
    if (buf)
    {
        // Note: reusing osd_op_t WILL currently lead to memory leaks
        // So we don't reuse it, but free it every time
        free(buf);
    }
 }
 void cluster_client_t::connect_peer(uint64_t peer_osd, json11::Json address_list, int port)
 {
    if (wanted_peers.find(peer_osd) == wanted_peers.end())
    {
        wanted_peers[peer_osd] = (osd_wanted_peer_t){
            .address_list = address_list,
            .port = port,
        };
    }
    else
    {
        wanted_peers[peer_osd].address_list = address_list;
        wanted_peers[peer_osd].port = port;
    }
    wanted_peers[peer_osd].address_changed = true;
    if (!wanted_peers[peer_osd].connecting &&
        (time(NULL) - wanted_peers[peer_osd].last_connect_attempt) >= peer_connect_interval)
    {
        try_connect_peer(peer_osd);
    }
 }
 void cluster_client_t::try_connect_peer(uint64_t peer_osd)
 {
    auto wp_it = wanted_peers.find(peer_osd);
    if (wp_it == wanted_peers.end())
    {
        return;
    }
    if (osd_peer_fds.find(peer_osd) != osd_peer_fds.end())
    {
        wanted_peers.erase(peer_osd);
        return;
    }
    auto & wp = wp_it->second;
    if (wp.address_index >= wp.address_list.array_items().size())
    {
        return;
    }
    wp.cur_addr = wp.address_list[wp.address_index].string_value();
    wp.cur_port = wp.port;
    try_connect_peer_addr(peer_osd, wp.cur_addr.c_str(), wp.cur_port);
 }
 void cluster_client_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port)
 {
    struct sockaddr_in addr;
    int r;
    if ((r = inet_pton(AF_INET, peer_host, &addr.sin_addr)) != 1)
    {
        on_connect_peer(peer_osd, -EINVAL);
        return;
    }
    addr.sin_family = AF_INET;
    addr.sin_port = htons(peer_port ? peer_port : 11203);
    int peer_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (peer_fd < 0)
    {
        on_connect_peer(peer_osd, -errno);
        return;
    }
    fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
    int timeout_id = -1;
    if (peer_connect_timeout > 0)
    {
        timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
        {
            osd_num_t peer_osd = clients[peer_fd].osd_num;
            stop_client(peer_fd);
            on_connect_peer(peer_osd, -EIO);
            return;
        });
    }
    r = connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
    if (r < 0 && errno != EINPROGRESS)
    {
        close(peer_fd);
        on_connect_peer(peer_osd, -errno);
        return;
    }
    assert(peer_osd != this->osd_num);
    clients[peer_fd] = (osd_client_t){
        .peer_addr = addr,
        .peer_port = peer_port,
        .peer_fd = peer_fd,
        .peer_state = PEER_CONNECTING,
        .connect_timeout_id = timeout_id,
        .osd_num = peer_osd,
        .in_buf = malloc(receive_buffer_size),
    };
    tfd->set_fd_handler(peer_fd, [this](int peer_fd, int epoll_events)
    {
        // Either OUT (connected) or HUP
        handle_connect_epoll(peer_fd);
    });
 }
 void cluster_client_t::handle_connect_epoll(int peer_fd)
 {
    auto & cl = clients[peer_fd];
    if (cl.connect_timeout_id >= 0)
    {
        tfd->clear_timer(cl.connect_timeout_id);
        cl.connect_timeout_id = -1;
    }
    osd_num_t peer_osd = cl.osd_num;
    int result = 0;
    socklen_t result_len = sizeof(result);
    if (getsockopt(peer_fd, SOL_SOCKET, SO_ERROR, &result, &result_len) < 0)
    {
        result = errno;
    }
    if (result != 0)
    {
        stop_client(peer_fd);
        on_connect_peer(peer_osd, -result);
        return;
    }
    int one = 1;
    setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
    cl.peer_state = PEER_CONNECTED;
    // FIXME Disable EPOLLOUT on this fd
    tfd->set_fd_handler(peer_fd, [this](int peer_fd, int epoll_events)
    {
        handle_peer_epoll(peer_fd, epoll_events);
    });
    // Check OSD number
    check_peer_config(cl);
 }
 void cluster_client_t::handle_peer_epoll(int peer_fd, int epoll_events)
 {
     // Handle events on an established peer connection: hang-up or incoming data
    if (epoll_events & EPOLLRDHUP)
    {
        // Stop client
        printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
        stop_client(peer_fd);
    }
    else if (epoll_events & EPOLLIN)
    {
        // Mark client as ready (i.e. some data is available)
        auto & cl = clients[peer_fd];
        cl.read_ready++;
        if (cl.read_ready == 1)
        {
            read_ready_clients.push_back(cl.peer_fd);
            ringloop->wakeup();
        }
    }
 }
 void cluster_client_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
 {
    auto & wp = wanted_peers.at(peer_osd);
    wp.connecting = false;
    if (peer_fd < 0)
    {
        printf("Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
        if (wp.address_changed)
        {
            wp.address_changed = false;
            wp.address_index = 0;
            try_connect_peer(peer_osd);
        }
        else if (wp.address_index < wp.address_list.array_items().size()-1)
        {
            // Try other addresses
            wp.address_index++;
            try_connect_peer(peer_osd);
        }
        else
        {
            // Retry again in <peer_connect_interval> seconds
            wp.last_connect_attempt = time(NULL);
            wp.address_index = 0;
            tfd->set_timer(1000*peer_connect_interval, false, [this, peer_osd](int)
            {
                try_connect_peer(peer_osd);
            });
        }
        return;
    }
    printf("Connected with peer OSD %lu (fd %d)\n", peer_osd, peer_fd);
    wanted_peers.erase(peer_osd);
    repeer_pgs(peer_osd);
 }
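The failure branch of `on_connect_peer()` above picks one of three actions: restart from the first address if the list changed, try the next address, or wait `peer_connect_interval` seconds and start over. A minimal sketch of that decision, separated from the networking code, is shown below. `peer_retry_state_t`, `retry_action_t` and `on_connect_failed()` are hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <cstddef>

enum class retry_action_t { retry_now, retry_later };

// Hypothetical subset of osd_wanted_peer_t that matters for the retry decision
struct peer_retry_state_t
{
    size_t address_count;
    size_t address_index;
    bool address_changed;
};

// Decide what to do after a failed connection attempt
retry_action_t on_connect_failed(peer_retry_state_t & wp)
{
    if (wp.address_changed)
    {
        // The address list was replaced while connecting: restart from the top
        wp.address_changed = false;
        wp.address_index = 0;
        return retry_action_t::retry_now;
    }
    if (wp.address_index + 1 < wp.address_count)
    {
        // Try the next known address immediately
        wp.address_index++;
        return retry_action_t::retry_now;
    }
    // All addresses failed: rewind and let the caller schedule a delayed retry
    wp.address_index = 0;
    return retry_action_t::retry_later;
}
```

The sketch compares `address_index + 1 < address_count` instead of `address_index < address_count - 1`, so an empty address list falls through to the delayed retry rather than relying on unsigned wrap-around.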
 void cluster_client_t::check_peer_config(osd_client_t & cl)
 {
    osd_op_t *op = new osd_op_t();
    op->op_type = OSD_OP_OUT;
    op->send_list.push_back(op->req.buf, OSD_PACKET_SIZE);
    op->peer_fd = cl.peer_fd;
    op->req = {
        .show_conf = {
            .header = {
                .magic = SECONDARY_OSD_OP_MAGIC,
                .id = this->next_subop_id++,
                .opcode = OSD_OP_SHOW_CONFIG,
            },
        },
    };
    op->callback = [this](osd_op_t *op)
    {
        osd_client_t & cl = clients[op->peer_fd];
        std::string json_err;
        json11::Json config;
        bool err = false;
        if (op->reply.hdr.retval < 0)
        {
            err = true;
            printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl.osd_num, op->reply.hdr.retval);
        }
        else
        {
            config = json11::Json::parse(std::string((char*)op->buf), json_err);
            if (json_err != "")
            {
                err = true;
                printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl.osd_num, json_err.c_str());
            }
            else if (config["osd_num"].uint64_value() != cl.osd_num)
            {
                err = true;
                printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl.osd_num);
                on_connect_peer(cl.osd_num, -1);
            }
        }
        if (err)
        {
            stop_client(op->peer_fd);
            delete op;
            return;
        }
        osd_peer_fds[cl.osd_num] = cl.peer_fd;
        on_connect_peer(cl.osd_num, cl.peer_fd);
        delete op;
    };
    outbox_push(op);
 }
 void cluster_client_t::cancel_osd_ops(osd_client_t & cl)
 {
    for (auto p: cl.sent_ops)
    {
        cancel_out_op(p.second);
    }
    cl.sent_ops.clear();
    for (auto op: cl.outbox)
    {
        cancel_out_op(op);
    }
    cl.outbox.clear();
    if (cl.write_op)
    {
        cancel_out_op(cl.write_op);
        cl.write_op = NULL;
    }
 }
 void cluster_client_t::cancel_out_op(osd_op_t *op)
 {
    op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
    op->reply.hdr.id = op->req.hdr.id;
    op->reply.hdr.opcode = op->req.hdr.opcode;
    op->reply.hdr.retval = -EPIPE;
    // Copy lambda to be unaffected by `delete op`
    std::function<void(osd_op_t*)>(op->callback)(op);
 }
 void cluster_client_t::stop_client(int peer_fd)
 {
    assert(peer_fd != 0);
    auto it = clients.find(peer_fd);
    if (it == clients.end())
    {
        return;
    }
    uint64_t repeer_osd = 0;
    osd_client_t cl = it->second;
    if (cl.peer_state == PEER_CONNECTED)
    {
        if (cl.osd_num)
        {
            // Reload configuration from etcd when the connection is dropped
            printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl.osd_num);
            repeer_osd = cl.osd_num;
        }
        else
        {
            printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
        }
    }
    clients.erase(it);
    tfd->set_fd_handler(peer_fd, NULL);
    if (cl.osd_num)
    {
        osd_peer_fds.erase(cl.osd_num);
        // Cancel outbound operations
        cancel_osd_ops(cl);
    }
    if (cl.read_op)
    {
        delete cl.read_op;
        cl.read_op = NULL;
    }
    for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
    {
        if (*rit == peer_fd)
        {
            read_ready_clients.erase(rit);
            break;
        }
    }
    for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
    {
        if (*wit == peer_fd)
        {
            write_ready_clients.erase(wit);
            break;
        }
    }
    free(cl.in_buf);
    assert(peer_fd != 0);
    close(peer_fd);
    if (repeer_osd)
    {
        repeer_pgs(repeer_osd);
    }
 }
--- a/cluster_client.h
+++ b/cluster_client.h
@ -1,209 +0,0 @@
 #pragma once
 #include <sys/types.h>
 #include <stdint.h>
 #include <arpa/inet.h>
 #include <malloc.h>
 #include <set>
 #include <map>
 #include <deque>
 #include <vector>
 #include "json11/json11.hpp"
 #include "osd_ops.h"
 #include "timerfd_manager.h"
 #include "ringloop.h"
 #define OSD_OP_IN 0
 #define OSD_OP_OUT 1
 #define CL_READ_HDR 1
 #define CL_READ_DATA 2
 #define CL_READ_REPLY_DATA 3
 #define CL_WRITE_READY 1
 #define CL_WRITE_REPLY 2
 #define MAX_EPOLL_EVENTS 64
 #define OSD_OP_INLINE_BUF_COUNT 16
 #define PEER_CONNECTING 1
 #define PEER_CONNECTED 2
 struct osd_op_buf_list_t
 {
    int count = 0, alloc = 0, sent = 0;
    iovec *buf = NULL;
    iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
    ~osd_op_buf_list_t()
    {
        if (buf && buf != inline_buf)
        {
            free(buf);
        }
    }
    inline iovec* get_iovec()
    {
        return (buf ? buf : inline_buf) + sent;
    }
    inline int get_size()
    {
        return count - sent;
    }
    inline void push_back(void *nbuf, size_t len)
    {
        if (count >= alloc)
        {
            if (!alloc)
            {
                alloc = OSD_OP_INLINE_BUF_COUNT;
                buf = inline_buf;
            }
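             else if (buf == inline_buf)
             {
                 // Inline storage is full: move to the heap, growing in steps of 16 entries
                 int old = alloc;
                 alloc = ((alloc/16) + 1)*16;
                 buf = (iovec*)malloc(sizeof(iovec) * alloc);
                 memcpy(buf, inline_buf, sizeof(iovec)*old);
             }
             else
             {
                 // Heap storage is full: grow by another 16 entries
                 alloc = ((alloc/16) + 1)*16;
                 buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
             }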
        }
        buf[count++] = { .iov_base = nbuf, .iov_len = len };
    }
 };
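`osd_op_buf_list_t` above keeps the first `OSD_OP_INLINE_BUF_COUNT` iovecs inside the struct and only moves to the heap when more are pushed. The following is a minimal standalone sketch of that inline-then-heap pattern, not the project's implementation. `iov_list_t`, `INLINE_COUNT` and the fixed growth step of 16 entries are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <new>
#include <sys/uio.h>

constexpr int INLINE_COUNT = 16; // assumed to mirror OSD_OP_INLINE_BUF_COUNT

struct iov_list_t
{
    int count = 0, alloc = INLINE_COUNT;
    iovec inline_buf[INLINE_COUNT];
    iovec *buf = inline_buf;

    iov_list_t() = default;
    // buf may point into this object, so copying is disabled in this sketch
    iov_list_t(const iov_list_t &) = delete;
    iov_list_t & operator = (const iov_list_t &) = delete;

    ~iov_list_t()
    {
        if (buf != inline_buf)
        {
            free(buf);
        }
    }

    void push_back(void *base, size_t len)
    {
        if (count >= alloc)
        {
            int new_alloc = alloc + INLINE_COUNT; // grow by a fixed step
            iovec *new_buf;
            if (buf == inline_buf)
            {
                // First overflow: copy the inline entries to the heap
                new_buf = (iovec*)malloc(sizeof(iovec) * new_alloc);
                if (new_buf)
                {
                    memcpy(new_buf, inline_buf, sizeof(iovec) * count);
                }
            }
            else
            {
                new_buf = (iovec*)realloc(buf, sizeof(iovec) * new_alloc);
            }
            if (!new_buf)
            {
                throw std::bad_alloc();
            }
            buf = new_buf;
            alloc = new_alloc;
        }
        assert(count < alloc);
        buf[count].iov_base = base;
        buf[count].iov_len = len;
        count++;
    }
};
```

Most operations carry only a few buffers, so the inline array avoids a heap allocation on the common path. The key invariant is that `alloc` strictly increases on every growth step, so `buf[count]` is always within the allocation.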
 struct blockstore_op_t;
 struct osd_primary_op_data_t;
 struct osd_op_t
 {
    timespec tv_begin;
    uint64_t op_type = OSD_OP_IN;
    int peer_fd;
    osd_any_op_t req;
    osd_any_reply_t reply;
    blockstore_op_t *bs_op = NULL;
    void *buf = NULL;
    void *rmw_buf = NULL;
    osd_primary_op_data_t* op_data = NULL;
    std::function<void(osd_op_t*)> callback;
    osd_op_buf_list_t send_list;
    ~osd_op_t();
 };
 struct osd_client_t
 {
    sockaddr_in peer_addr;
    int peer_port;
    int peer_fd;
    int peer_state;
    int connect_timeout_id = -1;
    osd_num_t osd_num = 0;
    void *in_buf = NULL;
    // Read state
    int read_ready = 0;
    osd_op_t *read_op = NULL;
    int read_reply_id = 0;
    iovec read_iov;
    msghdr read_msg;
    void *read_buf = NULL;
    int read_remaining = 0;
    int read_state = 0;
    // Outbound operations sent to this peer
    std::map<int, osd_op_t*> sent_ops;
    // Outbound messages (replies or requests)
    std::deque<osd_op_t*> outbox;
    // PGs dirtied by this client's primary-writes (FIXME to drop the connection)
    std::set<pg_num_t> dirty_pgs;
    // Write state
    osd_op_t *write_op = NULL;
    msghdr write_msg;
    int write_state = 0;
 };
 struct osd_wanted_peer_t
 {
    json11::Json address_list;
    int port;
    time_t last_connect_attempt;
    bool connecting, address_changed;
    int address_index;
    std::string cur_addr;
    int cur_port;
 };
 struct osd_op_stats_t
 {
    uint64_t op_stat_sum[OSD_OP_MAX+1] = { 0 };
    uint64_t op_stat_count[OSD_OP_MAX+1] = { 0 };
    uint64_t op_stat_bytes[OSD_OP_MAX+1] = { 0 };
    uint64_t subop_stat_sum[OSD_OP_MAX+1] = { 0 };
    uint64_t subop_stat_count[OSD_OP_MAX+1] = { 0 };
 };
 struct cluster_client_t
 {
    timerfd_manager_t *tfd;
    ring_loop_t *ringloop;
     // osd_num is only used for logging and asserts
    osd_num_t osd_num;
    int receive_buffer_size = 9000;
    int peer_connect_interval = 5;
    int peer_connect_timeout = 5;
    int log_level = 0;
    std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
    std::map<uint64_t, int> osd_peer_fds;
    uint64_t next_subop_id = 1;
    std::map<int, osd_client_t> clients;
    std::vector<int> read_ready_clients;
    std::vector<int> write_ready_clients;
    // op statistics
    osd_op_stats_t stats;
    // public
    void connect_peer(uint64_t osd_num, json11::Json address_list, int port);
    void stop_client(int peer_fd);
    void outbox_push(osd_op_t *cur_op);
    std::function<void(osd_op_t*)> exec_op;
    std::function<void(osd_num_t)> repeer_pgs;
    // private
    void try_connect_peer(uint64_t osd_num);
    void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
    void handle_connect_epoll(int peer_fd);
    void handle_peer_epoll(int peer_fd, int epoll_events);
    void on_connect_peer(osd_num_t peer_osd, int peer_fd);
    void check_peer_config(osd_client_t & cl);
    void cancel_osd_ops(osd_client_t & cl);
    void cancel_out_op(osd_op_t *op);
    bool try_send(osd_client_t & cl);
    void send_replies();
    void handle_send(ring_data_t *data, int peer_fd);
    void read_requests();
    void handle_read(ring_data_t *data, int peer_fd);
    void handle_finished_read(osd_client_t & cl);
    void handle_op_hdr(osd_client_t *cl);
    void handle_reply_hdr(osd_client_t *cl);
 };
--- a/copy-fio-includes.sh
+++ b/copy-fio-includes.sh
@ -0,0 +1,13 @@
 #!/bin/bash
 gcc -I. -E -o fio_headers.i src/fio_headers.h
 rm -rf fio-copy
 for i in `grep -Po 'fio/[^"]+' fio_headers.i | sort | uniq`; do
    j=${i##fio/}
    p=$(dirname $j)
    mkdir -p fio-copy/$p
    cp $i fio-copy/$j
 done
 rm fio_headers.i
--- a/copy-qemu-includes.sh
+++ b/copy-qemu-includes.sh
@ -0,0 +1,18 @@
 #!/bin/bash
 #cd qemu
 #debian/rules b/configure-stamp
 #cd b/qemu; make qapi
 gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
    -I qemu/include -E -o qemu_driver.i src/qemu_driver.c
 rm -rf qemu-copy
 for i in `grep -Po 'qemu/[^"]+' qemu_driver.i | sort | uniq`; do
    j=${i##qemu/}
    p=$(dirname $j)
    mkdir -p qemu-copy/$p
    cp $i qemu-copy/$j
 done
 rm qemu_driver.i
--- a/1
+++ b/1
@ -0,0 +1 @@
 Subproject commit 5dc108754ad40d3b1d024f9bd7cca0595ef1a1db
--- a/csi/.dockerignore
+++ b/csi/.dockerignore
@ -0,0 +1,3 @@
 vitastor-csi
 go.sum
 Dockerfile
--- a/csi/Dockerfile
+++ b/csi/Dockerfile
@ -0,0 +1,32 @@
 # Compile stage
 FROM golang:buster AS build
 ADD go.mod /app/
 RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
 ADD . /app
 RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
 RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
 # Final stage
 FROM debian:buster
 LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
 LABEL description="Vitastor CSI Driver"
 ENV NODE_ID=""
 ENV CSI_ENDPOINT=""
 RUN apt-get update && \
    apt-get install -y wget && \
    wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
    (echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
    (echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
    (echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
    apt-get update && \
    apt-get install -y e2fsprogs xfsprogs vitastor kmod && \
    apt-get clean && \
    (echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
 COPY --from=build /app/vitastor-csi /bin/
 ENTRYPOINT ["/bin/vitastor-csi"]
--- a/csi/Makefile
+++ b/csi/Makefile
@ -0,0 +1,9 @@
 VERSION ?= v0.6.5
 all: build push
 build:
 	@docker build --rm -t vitalif/vitastor-csi:$(VERSION) .
 push:
 	@docker push vitalif/vitastor-csi:$(VERSION)
--- a/csi/deploy/000-csi-namespace.yaml
+++ b/csi/deploy/000-csi-namespace.yaml
@ -0,0 +1,5 @@
 ---
 apiVersion: v1
 kind: Namespace
 metadata:
  name: vitastor-system
--- a/csi/deploy/001-csi-config-map.yaml
+++ b/csi/deploy/001-csi-config-map.yaml
@ -0,0 +1,9 @@
 ---
 apiVersion: v1
 kind: ConfigMap
 data:
  vitastor.conf: |-
    {"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
 metadata:
  namespace: vitastor-system
  name: vitastor-config
--- a/csi/deploy/002-csi-nodeplugin-rbac.yaml
+++ b/csi/deploy/002-csi-nodeplugin-rbac.yaml
@ -0,0 +1,37 @@
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-nodeplugin
 ---
 kind: ClusterRole
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-nodeplugin
 rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get"]
  # allow to read Vault Token and connection options from the Tenants namespace
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
 ---
 kind: ClusterRoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-nodeplugin
 subjects:
  - kind: ServiceAccount
    name: vitastor-csi-nodeplugin
    namespace: vitastor-system
 roleRef:
  kind: ClusterRole
  name: vitastor-csi-nodeplugin
  apiGroup: rbac.authorization.k8s.io
--- a/csi/deploy/003-csi-nodeplugin-psp.yaml
+++ b/csi/deploy/003-csi-nodeplugin-psp.yaml
@ -0,0 +1,72 @@
 ---
 apiVersion: policy/v1beta1
 kind: PodSecurityPolicy
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-nodeplugin-psp
 spec:
  allowPrivilegeEscalation: true
  allowedCapabilities:
    - 'SYS_ADMIN'
  fsGroup:
    rule: RunAsAny
  privileged: true
  hostNetwork: true
  hostPID: true
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'hostPath'
  allowedHostPaths:
    - pathPrefix: '/dev'
      readOnly: false
    - pathPrefix: '/run/mount'
      readOnly: false
    - pathPrefix: '/sys'
      readOnly: false
    - pathPrefix: '/lib/modules'
      readOnly: true
    - pathPrefix: '/var/lib/kubelet/pods'
      readOnly: false
    - pathPrefix: '/var/lib/kubelet/plugins/csi.vitastor.io'
      readOnly: false
    - pathPrefix: '/var/lib/kubelet/plugins_registry'
      readOnly: false
    - pathPrefix: '/var/lib/kubelet/plugins'
      readOnly: false
 ---
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-nodeplugin-psp
 rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs: ['use']
    resourceNames: ['vitastor-csi-nodeplugin-psp']
 ---
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-nodeplugin-psp
 subjects:
  - kind: ServiceAccount
    name: vitastor-csi-nodeplugin
    namespace: vitastor-system
 roleRef:
  kind: Role
  name: vitastor-csi-nodeplugin-psp
  apiGroup: rbac.authorization.k8s.io
--- a/csi/deploy/004-csi-nodeplugin.yaml
+++ b/csi/deploy/004-csi-nodeplugin.yaml
@ -0,0 +1,140 @@
 ---
 kind: DaemonSet
 apiVersion: apps/v1
 metadata:
  namespace: vitastor-system
  name: csi-vitastor
 spec:
  selector:
    matchLabels:
      app: csi-vitastor
  template:
    metadata:
      namespace: vitastor-system
      labels:
        app: csi-vitastor
    spec:
      serviceAccountName: vitastor-csi-nodeplugin
      hostNetwork: true
      hostPID: true
      priorityClassName: system-node-critical
      # pods on the host network need ClusterFirstWithHostNet to resolve
      # names through cluster DNS (e.g. services running inside Kubernetes)
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: driver-registrar
          # This is necessary only for systems with SELinux, where
          # non-privileged sidecar containers cannot access unix domain socket
          # created by privileged CSI driver container.
          securityContext:
            privileged: true
          image: k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.2.0
          args:
            - "--v=5"
            - "--csi-address=/csi/csi.sock"
            - "--kubelet-registration-path=/var/lib/kubelet/plugins/csi.vitastor.io/csi.sock"
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
        - name: csi-vitastor
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
          image: vitalif/vitastor-csi:v0.6.5
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
          env:
            - name: NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
          imagePullPolicy: "IfNotPresent"
          ports:
          - containerPort: 9898
            name: healthz
            protocol: TCP
          livenessProbe:
            failureThreshold: 5
            httpGet:
              path: /healthz
              port: healthz
            initialDelaySeconds: 10
            timeoutSeconds: 3
            periodSeconds: 2
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
            - mountPath: /dev
              name: host-dev
            - mountPath: /sys
              name: host-sys
            - mountPath: /run/mount
              name: host-mount
            - mountPath: /lib/modules
              name: lib-modules
              readOnly: true
            - name: vitastor-config
              mountPath: /etc/vitastor
            - name: plugin-dir
              mountPath: /var/lib/kubelet/plugins
              mountPropagation: "Bidirectional"
            - name: mountpoint-dir
              mountPath: /var/lib/kubelet/pods
              mountPropagation: "Bidirectional"
        - name: liveness-probe
          securityContext:
            privileged: true
          image: quay.io/k8scsi/livenessprobe:v1.1.0
          args:
            - "--csi-address=$(CSI_ENDPOINT)"
            - "--health-port=9898"
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
          volumeMounts:
          - mountPath: /csi
            name: socket-dir
      volumes:
        - name: socket-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi.vitastor.io
            type: DirectoryOrCreate
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins
            type: Directory
        - name: mountpoint-dir
          hostPath:
            path: /var/lib/kubelet/pods
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry/
            type: Directory
        - name: host-dev
          hostPath:
            path: /dev
        - name: host-sys
          hostPath:
            path: /sys
        - name: host-mount
          hostPath:
            path: /run/mount
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: vitastor-config
          configMap:
            name: vitastor-config
--- a/csi/deploy/005-csi-provisioner-rbac.yaml
+++ b/csi/deploy/005-csi-provisioner-rbac.yaml
@ -0,0 +1,102 @@
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-provisioner
 ---
 kind: ClusterRole
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-external-provisioner-runner
 rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims/status"]
    verbs: ["update", "patch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["get", "list"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotcontents"]
    verbs: ["create", "get", "list", "watch", "update", "delete"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments/status"]
    verbs: ["patch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["csinodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotcontents/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
 ---
 kind: ClusterRoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-provisioner-role
 subjects:
  - kind: ServiceAccount
    name: vitastor-csi-provisioner
    namespace: vitastor-system
 roleRef:
  kind: ClusterRole
  name: vitastor-external-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
 ---
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-external-provisioner-cfg
 rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "watch", "list", "delete", "update", "create"]
 ---
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  name: vitastor-csi-provisioner-role-cfg
  namespace: vitastor-system
 subjects:
  - kind: ServiceAccount
    name: vitastor-csi-provisioner
    namespace: vitastor-system
 roleRef:
  kind: Role
  name: vitastor-external-provisioner-cfg
  apiGroup: rbac.authorization.k8s.io
--- a/csi/deploy/006-csi-provisioner-psp.yaml
+++ b/csi/deploy/006-csi-provisioner-psp.yaml
@ -0,0 +1,60 @@
 ---
 apiVersion: policy/v1beta1
 kind: PodSecurityPolicy
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-provisioner-psp
 spec:
  allowPrivilegeEscalation: true
  allowedCapabilities:
    - 'SYS_ADMIN'
  fsGroup:
    rule: RunAsAny
  privileged: true
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'hostPath'
  allowedHostPaths:
    - pathPrefix: '/dev'
      readOnly: false
    - pathPrefix: '/sys'
      readOnly: false
    - pathPrefix: '/lib/modules'
      readOnly: true
 ---
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  namespace: vitastor-system
  name: vitastor-csi-provisioner-psp
 rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs: ['use']
    resourceNames: ['vitastor-csi-provisioner-psp']
 ---
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  name: vitastor-csi-provisioner-psp
  namespace: vitastor-system
 subjects:
  - kind: ServiceAccount
    name: vitastor-csi-provisioner
    namespace: vitastor-system
 roleRef:
  kind: Role
  name: vitastor-csi-provisioner-psp
  apiGroup: rbac.authorization.k8s.io
--- a/csi/deploy/007-csi-provisioner.yaml
+++ b/csi/deploy/007-csi-provisioner.yaml
@ -0,0 +1,159 @@
 ---
 kind: Service
 apiVersion: v1
 metadata:
  namespace: vitastor-system
  name: csi-vitastor-provisioner
  labels:
    app: csi-metrics
 spec:
  selector:
    app: csi-vitastor-provisioner
  ports:
    - name: http-metrics
      port: 8080
      protocol: TCP
      targetPort: 8680
 ---
 kind: Deployment
 apiVersion: apps/v1
 metadata:
  namespace: vitastor-system
  name: csi-vitastor-provisioner
 spec:
  replicas: 3
  selector:
    matchLabels:
      app: csi-vitastor-provisioner
  template:
    metadata:
      namespace: vitastor-system
      labels:
        app: csi-vitastor-provisioner
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - csi-vitastor-provisioner
              topologyKey: "kubernetes.io/hostname"
      serviceAccountName: vitastor-csi-provisioner
      priorityClassName: system-cluster-critical
      containers:
        - name: csi-provisioner
          image: k8s.gcr.io/sig-storage/csi-provisioner:v2.2.0
          args:
            - "--csi-address=$(ADDRESS)"
            - "--v=5"
            - "--timeout=150s"
            - "--retry-interval-start=500ms"
            - "--leader-election=true"
            #  set it to true to use topology based provisioning
            - "--feature-gates=Topology=false"
            # if fstype is not specified in storageclass, ext4 is default
            - "--default-fstype=ext4"
            - "--extra-create-metadata=true"
          env:
            - name: ADDRESS
              value: unix:///csi/csi-provisioner.sock
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-snapshotter
          image: k8s.gcr.io/sig-storage/csi-snapshotter:v4.0.0
          args:
            - "--csi-address=$(ADDRESS)"
            - "--v=5"
            - "--timeout=150s"
            - "--leader-election=true"
          env:
            - name: ADDRESS
              value: unix:///csi/csi-provisioner.sock
          imagePullPolicy: "IfNotPresent"
          securityContext:
            privileged: true
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-attacher
          image: k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
          args:
            - "--v=5"
            - "--csi-address=$(ADDRESS)"
            - "--leader-election=true"
            - "--retry-interval-start=500ms"
          env:
            - name: ADDRESS
              value: /csi/csi-provisioner.sock
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-resizer
          image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0
          args:
            - "--csi-address=$(ADDRESS)"
            - "--v=5"
            - "--timeout=150s"
            - "--leader-election"
            - "--retry-interval-start=500ms"
            - "--handle-volume-inuse-error=false"
          env:
            - name: ADDRESS
              value: unix:///csi/csi-provisioner.sock
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-vitastor
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
          image: vitalif/vitastor-csi:v0.6.5
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
          env:
            - name: NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: unix:///csi/csi-provisioner.sock
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
            - mountPath: /dev
              name: host-dev
            - mountPath: /sys
              name: host-sys
            - mountPath: /lib/modules
              name: lib-modules
              readOnly: true
            - name: vitastor-config
              mountPath: /etc/vitastor
      volumes:
        - name: host-dev
          hostPath:
            path: /dev
        - name: host-sys
          hostPath:
            path: /sys
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: socket-dir
          emptyDir: {
            medium: "Memory"
          }
        - name: vitastor-config
          configMap:
            name: vitastor-config
--- a/csi/deploy/008-csi-driver.yaml
+++ b/csi/deploy/008-csi-driver.yaml
@ -0,0 +1,11 @@
 ---
 # if Kubernetes version is less than 1.18 change
 # apiVersion to storage.k8s.io/v1beta1
 apiVersion: storage.k8s.io/v1
 kind: CSIDriver
 metadata:
  namespace: vitastor-system
  name: csi.vitastor.io
 spec:
  attachRequired: true
  podInfoOnMount: false
--- a/csi/deploy/009-storage-class.yaml
+++ b/csi/deploy/009-storage-class.yaml
@ -0,0 +1,19 @@
 ---
 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
  namespace: vitastor-system
  name: vitastor
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
 provisioner: csi.vitastor.io
 volumeBindingMode: Immediate
 parameters:
  etcdVolumePrefix: ""
  poolId: "1"
  # you can choose another configuration file if it is present in the config map
  #configPath: "/etc/vitastor/vitastor.conf"
  # you can also specify etcdUrl here, e.g. to connect to another Vitastor cluster
  # multiple URLs may be specified, separated by commas
  #etcdUrl: "http://192.168.7.2:2379"
  #etcdPrefix: "/vitastor"
--- a/csi/deploy/example-pvc.yaml
+++ b/csi/deploy/example-pvc.yaml
@ -0,0 +1,12 @@
 ---
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
  name: test-vitastor-pvc
 spec:
  storageClassName: vitastor
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
--- a/csi/go.mod
+++ b/csi/go.mod
@ -0,0 +1,35 @@
 module vitastor.io/csi
 go 1.15
 require (
 	github.com/container-storage-interface/spec v1.4.0
 	github.com/coreos/bbolt v0.0.0-00010101000000-000000000000 // indirect
 	github.com/coreos/etcd v3.3.25+incompatible // indirect
 	github.com/coreos/go-semver v0.3.0 // indirect
 	github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf // indirect
 	github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f // indirect
 	github.com/dustin/go-humanize v1.0.0 // indirect
 	github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
 	github.com/gorilla/websocket v1.4.2 // indirect
 	github.com/grpc-ecosystem/go-grpc-middleware v1.3.0 // indirect
 	github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 // indirect
 	github.com/grpc-ecosystem/grpc-gateway v1.16.0 // indirect
 	github.com/jonboulle/clockwork v0.2.2 // indirect
 	github.com/kubernetes-csi/csi-lib-utils v0.9.1
 	github.com/soheilhy/cmux v0.1.5 // indirect
 	github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802 // indirect
 	github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2 // indirect
 	go.etcd.io/bbolt v0.0.0-00010101000000-000000000000 // indirect
 	go.etcd.io/etcd v3.3.25+incompatible
 	golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb
 	google.golang.org/grpc v1.33.1
 	k8s.io/klog v1.0.0
 	k8s.io/utils v0.0.0-20210305010621-2afb4311ab10
 )
 replace github.com/coreos/bbolt => go.etcd.io/bbolt v1.3.5
 replace go.etcd.io/bbolt => github.com/coreos/bbolt v1.3.5
 replace google.golang.org/grpc => google.golang.org/grpc v1.25.1
--- a/csi/src/config.go
+++ b/csi/src/config.go
@ -0,0 +1,22 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
 package vitastor
 const (
    vitastorCSIDriverName    = "csi.vitastor.io"
    vitastorCSIDriverVersion = "0.6.5"
 )
 // Config holds the driver parameters taken from the request or user input
 type Config struct
 {
    Endpoint string
    NodeID   string
 }
 // NewConfig returns config struct to initialize new driver
 func NewConfig() *Config
 {
    return &Config{}
 }
--- a/csi/src/controllerserver.go
+++ b/csi/src/controllerserver.go
@ -0,0 +1,530 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
 package vitastor
 import (
    "context"
    "encoding/json"
    "strings"
    "bytes"
    "strconv"
    "time"
    "fmt"
    "os"
    "os/exec"
    "io/ioutil"
    "github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
    "k8s.io/klog"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
    "go.etcd.io/etcd/clientv3"
    "github.com/container-storage-interface/spec/lib/go/csi"
 )
 const (
    KB int64 = 1024
    MB int64 = 1024 * KB
    GB int64 = 1024 * MB
    TB int64 = 1024 * GB
    ETCD_TIMEOUT time.Duration = 15*time.Second
 )
 type InodeIndex struct
 {
    Id uint64 `json:"id"`
    PoolId uint64 `json:"pool_id"`
 }
 type InodeConfig struct
 {
    Name string `json:"name"`
    Size uint64 `json:"size,omitempty"`
    ParentPool uint64 `json:"parent_pool,omitempty"`
    ParentId uint64 `json:"parent_id,omitempty"`
    Readonly bool `json:"readonly,omitempty"`
 }
 type ControllerServer struct
 {
    *Driver
 }
 // NewControllerServer creates a new controller server instance
 func NewControllerServer(driver *Driver) *ControllerServer
 {
    return &ControllerServer{
        Driver: driver,
    }
 }
 func GetConnectionParams(params map[string]string) (map[string]string, []string, string)
 {
    ctxVars := make(map[string]string)
    configPath := params["configPath"]
    if (configPath == "")
    {
        configPath = "/etc/vitastor/vitastor.conf"
    }
    else
    {
        ctxVars["configPath"] = configPath
    }
    config := make(map[string]interface{})
    if configFD, err := os.Open(configPath); err == nil
    {
        defer configFD.Close()
        data, _ := ioutil.ReadAll(configFD)
        json.Unmarshal(data, &config)
    }
    // Try to load prefix & etcd URL from the config
    var etcdUrl []string
    if (params["etcdUrl"] != "")
    {
        ctxVars["etcdUrl"] = params["etcdUrl"]
        etcdUrl = strings.Split(params["etcdUrl"], ",")
    }
    if (len(etcdUrl) == 0)
    {
        switch config["etcd_address"].(type)
        {
        case string:
            etcdUrl = strings.Split(config["etcd_address"].(string), ",")
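        case []interface{}:
            // encoding/json decodes JSON arrays into []interface{}, never []string
            for _, addr := range config["etcd_address"].([]interface{})
            {
                if s, ok := addr.(string); ok
                {
                    etcdUrl = append(etcdUrl, s)
                }
            }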
        }
    }
    etcdPrefix := params["etcdPrefix"]
    if (etcdPrefix == "")
    {
        etcdPrefix, _ = config["etcd_prefix"].(string)
        if (etcdPrefix == "")
        {
            etcdPrefix = "/vitastor"
        }
    }
    else
    {
        ctxVars["etcdPrefix"] = etcdPrefix
    }
    return ctxVars, etcdUrl, etcdPrefix
 }
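As background for the etcd_address handling in GetConnectionParams: when the target is a map[string]interface{}, encoding/json decodes a JSON string into a Go string and a JSON array into []interface{}, so a list of addresses has to be converted element by element. The following is a minimal sketch of that normalization. The helpers parseEtcdAddress and decodeEtcdAddress and the addresses are hypothetical and not part of the driver, and the sketch uses standard gofmt brace placement.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseEtcdAddress is a hypothetical helper (not part of the driver).
// It accepts the decoded "etcd_address" value, which may be either a
// comma-separated string or a JSON array of strings.
func parseEtcdAddress(v interface{}) []string {
	switch addr := v.(type) {
	case string:
		return strings.Split(addr, ",")
	case []interface{}:
		// JSON arrays decode to []interface{}; keep only string items.
		urls := make([]string, 0, len(addr))
		for _, item := range addr {
			if s, ok := item.(string); ok {
				urls = append(urls, s)
			}
		}
		return urls
	}
	// Missing key or unsupported type.
	return nil
}

// decodeEtcdAddress parses a JSON config document and extracts the
// address list; it returns nil if the document is not valid JSON.
func decodeEtcdAddress(data string) []string {
	config := make(map[string]interface{})
	if err := json.Unmarshal([]byte(data), &config); err != nil {
		return nil
	}
	return parseEtcdAddress(config["etcd_address"])
}

func main() {
	// Placeholder addresses, chosen only for illustration.
	fmt.Println(decodeEtcdAddress(`{"etcd_address": "http://10.0.0.1:2379,http://10.0.0.2:2379"}`))
	fmt.Println(decodeEtcdAddress(`{"etcd_address": ["http://10.0.0.1:2379", "http://10.0.0.2:2379"]}`))
}
```

Both calls yield the same two-element slice, which is the form clientv3.Config expects in Endpoints.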
 // Create the volume
 func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
 {
    klog.Infof("received controller create volume request %+v", protosanitizer.StripSecrets(req))
    if (req == nil)
    {
        return nil, status.Errorf(codes.InvalidArgument, "request cannot be empty")
    }
    if (req.GetName() == "")
    {
        return nil, status.Error(codes.InvalidArgument, "name is a required field")
    }
    volumeCapabilities := req.GetVolumeCapabilities()
    if (volumeCapabilities == nil)
    {
        return nil, status.Error(codes.InvalidArgument, "volume capabilities is a required field")
    }
    etcdVolumePrefix := req.Parameters["etcdVolumePrefix"]
    poolId, _ := strconv.ParseUint(req.Parameters["poolId"], 10, 64)
    if (poolId == 0)
    {
        return nil, status.Error(codes.InvalidArgument, "poolId is missing in storage class configuration")
    }
    volName := etcdVolumePrefix + req.GetName()
    volSize := 1 * GB
    if capRange := req.GetCapacityRange(); capRange != nil
    {
        volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
    }
    // FIXME: The following should PROBABLY be implemented externally in a management tool
    ctxVars, etcdUrl, etcdPrefix := GetConnectionParams(req.Parameters)
    if (len(etcdUrl) == 0)
    {
        return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
    }
    // Connect to etcd
    cli, err := clientv3.New(clientv3.Config{
        DialTimeout: ETCD_TIMEOUT,
        Endpoints: etcdUrl,
    })
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
    }
    defer cli.Close()
    var imageId uint64 = 0
    for
    {
        // Check if the image exists
        ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
        resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
        cancel()
        if (err != nil)
        {
            return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
        }
        if (len(resp.Kvs) > 0)
        {
            kv := resp.Kvs[0]
            var v InodeIndex
            err := json.Unmarshal(kv.Value, &v)
            if (err != nil)
            {
                return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
            }
            poolId = v.PoolId
            imageId = v.Id
            inodeCfgKey := fmt.Sprintf("/config/inode/%d/%d", poolId, imageId)
            ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
            resp, err := cli.Get(ctx, etcdPrefix+inodeCfgKey)
            cancel()
            if (err != nil)
            {
                return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
            }
            if (len(resp.Kvs) == 0)
            {
                return nil, status.Error(codes.Internal, "missing "+inodeCfgKey+" key in etcd")
            }
            var inodeCfg InodeConfig
            err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
            if (err != nil)
            {
                return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
            }
            if (inodeCfg.Size < uint64(volSize))
            {
                return nil, status.Error(codes.Internal, "image "+volName+" is already created, but size is less than expected")
            }
            // The image already exists and is large enough: reuse it instead of looping again
            break
        }
        else
        {
            // Find a free ID
            // Create image metadata in a transaction verifying that the image doesn't exist yet AND ID is still free
            maxIdKey := fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)
            ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
            resp, err := cli.Get(ctx, maxIdKey)
            cancel()
            if (err != nil)
            {
                return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
            }
            var modRev int64
            var nextId uint64
            if (len(resp.Kvs) > 0)
            {
                var err error
                nextId, err = strconv.ParseUint(string(resp.Kvs[0].Value), 10, 64)
                if (err != nil)
                {
                    return nil, status.Error(codes.Internal, maxIdKey+" contains invalid ID")
                }
                modRev = resp.Kvs[0].ModRevision
                nextId++
            }
            else
            {
                nextId = 1
            }
            inodeIdxJson, _ := json.Marshal(InodeIndex{
                Id: nextId,
                PoolId: poolId,
            })
            inodeCfgJson, _ := json.Marshal(InodeConfig{
                Name: volName,
                Size: uint64(volSize),
            })
            ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
            txnResp, err := cli.Txn(ctx).If(
                clientv3.Compare(clientv3.ModRevision(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)), "=", modRev),
                clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)), "=", 0),
                clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId)), "=", 0),
            ).Then(
                clientv3.OpPut(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId), fmt.Sprintf("%d", nextId)),
                clientv3.OpPut(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName), string(inodeIdxJson)),
                clientv3.OpPut(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId), string(inodeCfgJson)),
            ).Commit()
            cancel()
            if (err != nil)
            {
                return nil, status.Error(codes.Internal, "failed to commit transaction in etcd: "+err.Error())
            }
            if (txnResp.Succeeded)
            {
                imageId = nextId
                break
            }
            // Start over if the transaction fails
        }
    }
    ctxVars["name"] = volName
    volumeIdJson, _ := json.Marshal(ctxVars)
    return &csi.CreateVolumeResponse{
        Volume: &csi.Volume{
            // Ugly, but VolumeContext isn't passed to DeleteVolume :-(
            VolumeId: string(volumeIdJson),
            CapacityBytes: volSize,
        },
    }, nil
 }
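The size computation in CreateVolume rounds the requested capacity up to a whole number of mebibytes before the image is created. A minimal sketch of that arithmetic follows. The helper name roundUpToMB is hypothetical and does not exist in the driver, and the constants repeat the ones declared at the top of this file.

```go
package main

import "fmt"

const (
	KB int64 = 1024
	MB int64 = 1024 * KB
	GB int64 = 1024 * MB
)

// roundUpToMB is a hypothetical helper that mirrors the expression
// ((n + MB - 1) / MB) * MB used in CreateVolume: integer division
// truncates, so adding MB-1 first rounds any partial mebibyte up.
func roundUpToMB(n int64) int64 {
	return ((n + MB - 1) / MB) * MB
}

func main() {
	for _, n := range []int64{1, MB, MB + 1, 10 * GB} {
		fmt.Printf("%d -> %d\n", n, roundUpToMB(n))
	}
}
```

A 1-byte request becomes one full mebibyte, MB+1 bytes become two, and values that are already aligned (such as the 10Gi request in example-pvc.yaml) are unchanged.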
 // DeleteVolume deletes the given volume
 func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error)
 {
    klog.Infof("received controller delete volume request %+v", protosanitizer.StripSecrets(req))
    if (req == nil)
    {
        return nil, status.Error(codes.InvalidArgument, "request cannot be empty")
    }
    ctxVars := make(map[string]string)
    err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "volume ID not in JSON format")
    }
    volName := ctxVars["name"]
    _, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
    if (len(etcdUrl) == 0)
    {
        return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
    }
    cli, err := clientv3.New(clientv3.Config{
        DialTimeout: ETCD_TIMEOUT,
        Endpoints: etcdUrl,
    })
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
    }
    defer cli.Close()
    // Find inode by name
    ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
    resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
    cancel()
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
    }
    if (len(resp.Kvs) == 0)
    {
        return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
    }
    var idx InodeIndex
    err = json.Unmarshal(resp.Kvs[0].Value, &idx)
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
    }
    // Get inode config
    inodeCfgKey := fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)
    ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
    resp, err = cli.Get(ctx, inodeCfgKey)
    cancel()
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
    }
    if (len(resp.Kvs) == 0)
    {
        return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
    }
    var inodeCfg InodeConfig
    err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
    }
    // Delete inode data by invoking vitastor-rm
    args := []string{
        "--etcd_address", strings.Join(etcdUrl, ","),
        "--pool", fmt.Sprintf("%d", idx.PoolId),
        "--inode", fmt.Sprintf("%d", idx.Id),
    }
    if (ctxVars["configPath"] != "")
    {
        args = append(args, "--config_path", ctxVars["configPath"])
    }
    c := exec.Command("/usr/bin/vitastor-rm", args...)
    var stderr bytes.Buffer
    c.Stdout = nil
    c.Stderr = &stderr
    err = c.Run()
    stderrStr := stderr.String()
    if (err != nil)
    {
        klog.Errorf("vitastor-rm failed: %s, status %s\n", stderrStr, err)
        return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
    }
    // Delete inode config in etcd
    ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
    txnResp, err := cli.Txn(ctx).Then(
        clientv3.OpDelete(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)),
        clientv3.OpDelete(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)),
    ).Commit()
    cancel()
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "failed to delete keys in etcd: "+err.Error())
    }
    if (!txnResp.Succeeded)
    {
        return nil, status.Error(codes.Internal, "failed to delete keys in etcd: transaction failed")
    }
    return &csi.DeleteVolumeResponse{}, nil
 }
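CreateVolume serializes the connection parameters together with the image name into the volume ID as JSON, and DeleteVolume parses that string back to locate the image. A minimal sketch of this round trip follows. The helper names encodeVolumeID and decodeVolumeID and the sample values are hypothetical and are not part of the driver.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeVolumeID packs the connection parameters and the image name
// into one JSON string, the way CreateVolume builds VolumeId.
// Hypothetical helper, shown for illustration only.
func encodeVolumeID(ctxVars map[string]string, volName string) (string, error) {
	vars := make(map[string]string, len(ctxVars)+1)
	for k, v := range ctxVars {
		vars[k] = v
	}
	vars["name"] = volName
	data, err := json.Marshal(vars)
	return string(data), err
}

// decodeVolumeID reverses encodeVolumeID, the way DeleteVolume
// recovers the image name and connection parameters.
func decodeVolumeID(id string) (map[string]string, error) {
	vars := make(map[string]string)
	err := json.Unmarshal([]byte(id), &vars)
	return vars, err
}

func main() {
	// Sample values chosen only for illustration.
	id, err := encodeVolumeID(map[string]string{"etcdPrefix": "/vitastor"}, "pvc-example")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)

	vars, err := decodeVolumeID(id)
	if err != nil {
		panic(err)
	}
	fmt.Println(vars["name"], vars["etcdPrefix"])
}
```

Because json.Marshal writes map keys in sorted order, the same parameters always produce the same volume ID string.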
 // ControllerPublishVolume returns an Unimplemented error
 func (cs *ControllerServer) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // ControllerUnpublishVolume returns an Unimplemented error
 func (cs *ControllerServer) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // ValidateVolumeCapabilities checks whether the volume capabilities requested are supported.
 func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error)
 {
    klog.Infof("received controller validate volume capability request %+v", protosanitizer.StripSecrets(req))
    if (req == nil)
    {
        return nil, status.Errorf(codes.InvalidArgument, "request is nil")
    }
    volumeID := req.GetVolumeId()
    if (volumeID == "")
    {
        return nil, status.Error(codes.InvalidArgument, "volumeId is empty")
    }
    volumeCapabilities := req.GetVolumeCapabilities()
    if (volumeCapabilities == nil)
    {
        return nil, status.Error(codes.InvalidArgument, "volumeCapabilities is nil")
    }
    var volumeCapabilityAccessModes []*csi.VolumeCapability_AccessMode
    for _, mode := range []csi.VolumeCapability_AccessMode_Mode{
        csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
        csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
    } {
        volumeCapabilityAccessModes = append(volumeCapabilityAccessModes, &csi.VolumeCapability_AccessMode{Mode: mode})
    }
    capabilitySupport := false
    for _, capability := range volumeCapabilities
    {
        for _, volumeCapabilityAccessMode := range volumeCapabilityAccessModes
        {
            if (volumeCapabilityAccessMode.GetMode() == capability.GetAccessMode().GetMode())
            {
                capabilitySupport = true
            }
        }
    }
    if (!capabilitySupport)
    {
        return nil, status.Errorf(codes.NotFound, "%v not supported", req.GetVolumeCapabilities())
    }
    return &csi.ValidateVolumeCapabilitiesResponse{
        Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
            VolumeCapabilities: req.VolumeCapabilities,
        },
    }, nil
 }
 // ListVolumes returns a list of volumes
 func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // GetCapacity returns the capacity of the storage pool
 func (cs *ControllerServer) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // ControllerGetCapabilities returns the capabilities of the controller service.
 func (cs *ControllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error)
 {
    functionControllerServerCapabilities := func(cap csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability
    {
        return &csi.ControllerServiceCapability{
            Type: &csi.ControllerServiceCapability_Rpc{
                Rpc: &csi.ControllerServiceCapability_RPC{
                    Type: cap,
                },
            },
        }
    }
    var controllerServerCapabilities []*csi.ControllerServiceCapability
    // Only advertise RPCs that are actually implemented; the list, expand and
    // snapshot RPCs in this file still return Unimplemented
    for _, capability := range []csi.ControllerServiceCapability_RPC_Type{
        csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
    } {
        controllerServerCapabilities = append(controllerServerCapabilities, functionControllerServerCapabilities(capability))
    }
    return &csi.ControllerGetCapabilitiesResponse{
        Capabilities: controllerServerCapabilities,
    }, nil
 }
 // CreateSnapshot creates a snapshot of an existing PV (not implemented yet)
 func (cs *ControllerServer) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // DeleteSnapshot deletes the provided snapshot of a PV (not implemented yet)
 func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteSnapshotRequest) (*csi.DeleteSnapshotResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // ListSnapshots lists the snapshots of a PV (not implemented yet)
 func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnapshotsRequest) (*csi.ListSnapshotsResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // ControllerExpandVolume resizes a volume
 func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // ControllerGetVolume returns volume info (not implemented yet)
 func (cs *ControllerServer) ControllerGetVolume(ctx context.Context, req *csi.ControllerGetVolumeRequest) (*csi.ControllerGetVolumeResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
--- a/csi/src/grpc.go
+++ b/csi/src/grpc.go
@ -0,0 +1,137 @@
 /*
 Copyright 2017 The Kubernetes Authors.
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 */
 package vitastor
 import (
    "fmt"
    "net"
    "os"
    "strings"
    "sync"
    "github.com/golang/glog"
    "golang.org/x/net/context"
    "google.golang.org/grpc"
    "github.com/container-storage-interface/spec/lib/go/csi"
    "github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
 )
 // NonBlockingGRPCServer defines the interface of a non-blocking gRPC server
 type NonBlockingGRPCServer interface {
    // Start services at the endpoint
    Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer)
    // Waits for the service to stop
    Wait()
    // Stops the service gracefully
    Stop()
    // Stops the service forcefully
    ForceStop()
 }
 func NewNonBlockingGRPCServer() NonBlockingGRPCServer {
    return &nonBlockingGRPCServer{}
 }
 // NonBlocking server
 type nonBlockingGRPCServer struct {
    wg     sync.WaitGroup
    server *grpc.Server
 }
 func (s *nonBlockingGRPCServer) Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
    s.wg.Add(1)
    go s.serve(endpoint, ids, cs, ns)
 }
 func (s *nonBlockingGRPCServer) Wait() {
    s.wg.Wait()
 }
 func (s *nonBlockingGRPCServer) Stop() {
    s.server.GracefulStop()
 }
 func (s *nonBlockingGRPCServer) ForceStop() {
    s.server.Stop()
 }
 func (s *nonBlockingGRPCServer) serve(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
    // Release Wait() once the server stops serving
    defer s.wg.Done()
    proto, addr, err := ParseEndpoint(endpoint)
    if err != nil {
        glog.Fatal(err.Error())
    }
    if proto == "unix" {
        addr = "/" + addr
        if err := os.Remove(addr); err != nil && !os.IsNotExist(err) {
            glog.Fatalf("Failed to remove %s, error: %s", addr, err.Error())
        }
    }
    listener, err := net.Listen(proto, addr)
    if err != nil {
        glog.Fatalf("Failed to listen: %v", err)
    }
    opts := []grpc.ServerOption{
        grpc.UnaryInterceptor(logGRPC),
    }
    server := grpc.NewServer(opts...)
    s.server = server
    if ids != nil {
        csi.RegisterIdentityServer(server, ids)
    }
    if cs != nil {
        csi.RegisterControllerServer(server, cs)
    }
    if ns != nil {
        csi.RegisterNodeServer(server, ns)
    }
    glog.Infof("Listening for connections on address: %#v", listener.Addr())
    if err := server.Serve(listener); err != nil {
        glog.Errorf("GRPC server stopped with error: %v", err)
    }
 }
 func ParseEndpoint(ep string) (string, string, error) {
    if strings.HasPrefix(strings.ToLower(ep), "unix://") || strings.HasPrefix(strings.ToLower(ep), "tcp://") {
        s := strings.SplitN(ep, "://", 2)
        if s[1] != "" {
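            // The prefix check above is case-insensitive, so normalize the protocol for net.Listen
            return strings.ToLower(s[0]), s[1], nil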
        }
    }
    return "", "", fmt.Errorf("Invalid endpoint: %v", ep)
 }
 func logGRPC(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    glog.V(3).Infof("GRPC call: %s", info.FullMethod)
    glog.V(5).Infof("GRPC request: %s", protosanitizer.StripSecrets(req))
    resp, err := handler(ctx, req)
    if err != nil {
        glog.Errorf("GRPC error: %v", err)
    } else {
        glog.V(5).Infof("GRPC response: %s", protosanitizer.StripSecrets(resp))
    }
    return resp, err
 }
--- a/csi/src/identityserver.go
+++ b/csi/src/identityserver.go
@ -0,0 +1,60 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
 package vitastor
 import (
    "context"
    "github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
    "k8s.io/klog"
    "github.com/container-storage-interface/spec/lib/go/csi"
 )
 // IdentityServer struct of Vitastor CSI driver with supported methods of CSI identity server spec.
 type IdentityServer struct
 {
    *Driver
 }
 // NewIdentityServer creates a new identity server instance
 func NewIdentityServer(driver *Driver) *IdentityServer
 {
    return &IdentityServer{
        Driver: driver,
    }
 }
 // GetPluginInfo returns metadata of the plugin
 func (is *IdentityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error)
 {
    klog.Infof("received identity plugin info request %+v", protosanitizer.StripSecrets(req))
    return &csi.GetPluginInfoResponse{
        Name:          vitastorCSIDriverName,
        VendorVersion: vitastorCSIDriverVersion,
    }, nil
 }
 // GetPluginCapabilities returns available capabilities of the plugin
 func (is *IdentityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error)
 {
    klog.Infof("received identity plugin capabilities request %+v", protosanitizer.StripSecrets(req))
    return &csi.GetPluginCapabilitiesResponse{
        Capabilities: []*csi.PluginCapability{
            {
                Type: &csi.PluginCapability_Service_{
                    Service: &csi.PluginCapability_Service{
                        Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
                    },
                },
            },
        },
    }, nil
 }
 // Probe returns the health and readiness of the plugin
 func (is *IdentityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error)
 {
    return &csi.ProbeResponse{}, nil
 }
--- a/csi/src/nodeserver.go
+++ b/csi/src/nodeserver.go
@ -0,0 +1,279 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
 package vitastor
 import (
    "context"
    "os"
    "os/exec"
    "encoding/json"
    "strings"
    "bytes"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
    "k8s.io/utils/mount"
    utilexec "k8s.io/utils/exec"
    "github.com/container-storage-interface/spec/lib/go/csi"
    "github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
    "k8s.io/klog"
 )
 // NodeServer struct of Vitastor CSI driver with supported methods of CSI node server spec.
 type NodeServer struct
 {
    *Driver
    mounter mount.Interface
 }
 // NewNodeServer creates a new node server instance
 func NewNodeServer(driver *Driver) *NodeServer
 {
    return &NodeServer{
        Driver: driver,
        mounter: mount.New(""),
    }
 }
 // NodeStageVolume is a no-op: the volume is mapped and mounted in NodePublishVolume
 func (ns *NodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error)
 {
    return &csi.NodeStageVolumeResponse{}, nil
 }
 // NodeUnstageVolume unstages the volume from the staging path
 func (ns *NodeServer) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error)
 {
    return &csi.NodeUnstageVolumeResponse{}, nil
 }
 // Contains reports whether list contains the string s
 func Contains(list []string, s string) bool
 {
    for i := 0; i < len(list); i++
    {
        if (list[i] == s)
        {
            return true
        }
    }
    return false
 }
 // NodePublishVolume maps the image as an NBD device, formats it if needed and mounts it to the target path
 func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
 {
    klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
    targetPath := req.GetTargetPath()
    // Check that it's not already mounted
    free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
    if (error != nil)
    {
        if (os.IsNotExist(error))
        {
            error := os.MkdirAll(targetPath, 0777)
            if (error != nil)
            {
                return nil, status.Error(codes.Internal, error.Error())
            }
            free = true
        }
        else
        {
            return nil, status.Error(codes.Internal, error.Error())
        }
    }
    if (!free)
    {
        return &csi.NodePublishVolumeResponse{}, nil
    }
    ctxVars := make(map[string]string)
    err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, "volume ID not in JSON format")
    }
    volName := ctxVars["name"]
    _, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
    if (len(etcdUrl) == 0)
    {
        return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
    }
    // Map NBD device
    // FIXME: Check if already mapped
    args := []string{
        "map", "--etcd_address", strings.Join(etcdUrl, ","),
        "--etcd_prefix", etcdPrefix,
        "--image", volName,
    };
    if (ctxVars["configPath"] != "")
    {
        args = append(args, "--config_path", ctxVars["configPath"])
    }
    if (req.GetReadonly())
    {
        args = append(args, "--readonly", "1")
    }
    c := exec.Command("/usr/bin/vitastor-nbd", args...)
    var stdout, stderr bytes.Buffer
    c.Stdout, c.Stderr = &stdout, &stderr
    err = c.Run()
    stdoutStr, stderrStr := stdout.String(), stderr.String()
    if (err != nil)
    {
        klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err)
        return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
    }
    devicePath := strings.TrimSpace(stdoutStr)
    // Check existing format
    diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
    existingFormat, err := diskMounter.GetDiskFormat(devicePath)
    if (err != nil)
    {
        klog.Errorf("failed to get disk format for path %s, error: %v", devicePath, err)
        // unmap NBD device
        unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
        if (unmapErr != nil)
        {
            klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
        }
        return nil, status.Error(codes.Internal, err.Error())
    }
    // Format the device (ext4 or xfs)
    fsType := req.GetVolumeCapability().GetMount().GetFsType()
    isBlock := req.GetVolumeCapability().GetBlock() != nil
    opt := req.GetVolumeCapability().GetMount().GetMountFlags()
    opt = append(opt, "_netdev")
    if ((req.GetVolumeCapability().GetAccessMode().GetMode() == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||
        req.GetVolumeCapability().GetAccessMode().GetMode() == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY) &&
        !Contains(opt, "ro"))
    {
        opt = append(opt, "ro")
    }
    if (fsType == "xfs")
    {
        opt = append(opt, "nouuid")
    }
    readOnly := Contains(opt, "ro")
    if (existingFormat == "" && !readOnly)
    {
        args := []string{}
        switch fsType
        {
            case "ext4":
                args = []string{"-m0", "-Enodiscard,lazy_itable_init=1,lazy_journal_init=1", devicePath}
            case "xfs":
                args = []string{"-K", devicePath}
        }
        if (len(args) > 0)
        {
            cmdOut, cmdErr := diskMounter.Exec.Command("mkfs."+fsType, args...).CombinedOutput()
            if (cmdErr != nil)
            {
                klog.Errorf("failed to run mkfs error: %v, output: %v", cmdErr, string(cmdOut))
                // unmap NBD device
                unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
                if (unmapErr != nil)
                {
                    klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
                }
                return nil, status.Error(codes.Internal, cmdErr.Error())
            }
        }
    }
    if (isBlock)
    {
        opt = append(opt, "bind")
        err = diskMounter.Mount(devicePath, targetPath, fsType, opt)
    }
    else
    {
        err = diskMounter.FormatAndMount(devicePath, targetPath, fsType, opt)
    }
    if (err != nil)
    {
        klog.Errorf(
            "failed to mount device path (%s) to path (%s) for volume (%s) error: %s",
            devicePath, targetPath, volName, err,
        )
        // unmap NBD device
        unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
        if (unmapErr != nil)
        {
            klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
        }
        return nil, status.Error(codes.Internal, err.Error())
    }
    return &csi.NodePublishVolumeResponse{}, nil
 }
 // NodeUnpublishVolume unmounts the volume from the target path
 func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error)
 {
    klog.Infof("received node unpublish volume request %+v", protosanitizer.StripSecrets(req))
    targetPath := req.GetTargetPath()
    devicePath, refCount, err := mount.GetDeviceNameFromMount(ns.mounter, targetPath)
    if (err != nil)
    {
        if (os.IsNotExist(err))
        {
            return nil, status.Error(codes.NotFound, "Target path not found")
        }
        return nil, status.Error(codes.Internal, err.Error())
    }
    if (devicePath == "")
    {
        return nil, status.Error(codes.NotFound, "Volume not mounted")
    }
    // unmount
    err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
    if (err != nil)
    {
        return nil, status.Error(codes.Internal, err.Error())
    }
    // unmap NBD device
    if (refCount == 1)
    {
        unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
        if (unmapErr != nil)
        {
            klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
        }
    }
    return &csi.NodeUnpublishVolumeResponse{}, nil
 }
 // NodeGetVolumeStats returns volume capacity statistics available for the volume
 func (ns *NodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // NodeExpandVolume expands the file system on the node (not implemented yet)
 func (ns *NodeServer) NodeExpandVolume(ctx context.Context, req *csi.NodeExpandVolumeRequest) (*csi.NodeExpandVolumeResponse, error)
 {
    return nil, status.Error(codes.Unimplemented, "")
 }
 // NodeGetCapabilities returns the supported capabilities of the node server
 func (ns *NodeServer) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error)
 {
    return &csi.NodeGetCapabilitiesResponse{}, nil
 }
 // NodeGetInfo returns NodeGetInfoResponse for CO.
 func (ns *NodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error)
 {
    klog.Infof("received node get info request %+v", protosanitizer.StripSecrets(req))
    return &csi.NodeGetInfoResponse{
        NodeId: ns.NodeID,
    }, nil
 }
--- a/csi/src/server.go
+++ b/csi/src/server.go
@ -0,0 +1,36 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
 package vitastor
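 import (
    "errors"
    "k8s.io/klog"
 )
 type Driver struct
 {
    *Config
 }
 // NewDriver creates a new driver instance
 func NewDriver(config *Config) (*Driver, error)
 {
    if (config == nil)
    {
        klog.Errorf("Vitastor CSI driver initialization failed")
        // Return a real error so that callers do not dereference a nil driver
        return nil, errors.New("driver configuration is nil")
    }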
    }
    driver := &Driver{
        Config: config,
    }
    klog.Infof("Vitastor CSI driver initialized")
    return driver, nil
 }
 // Run starts the gRPC server and blocks until it stops
 func (driver *Driver) Run()
 {
    server := NewNonBlockingGRPCServer()
    server.Start(driver.Endpoint, NewIdentityServer(driver), NewControllerServer(driver), NewNodeServer(driver))
    server.Wait()
 }
--- a/csi/vitastor-csi.go
+++ b/csi/vitastor-csi.go
@ -0,0 +1,39 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
 package main
 import (
    "flag"
    "fmt"
    "os"
    "k8s.io/klog"
    "vitastor.io/csi/src"
 )
 func main()
 {
    var config = vitastor.NewConfig()
    flag.StringVar(&config.Endpoint, "endpoint", "", "CSI endpoint")
    flag.StringVar(&config.NodeID, "node", "", "Node ID")
    flag.Parse()
    if (config.Endpoint == "")
    {
        config.Endpoint = os.Getenv("CSI_ENDPOINT")
    }
    if (config.NodeID == "")
    {
        config.NodeID = os.Getenv("NODE_ID")
    }
    if (config.Endpoint == "" || config.NodeID == "")
    {
        fmt.Fprintf(os.Stderr, "Please set -endpoint and -node / CSI_ENDPOINT & NODE_ID env vars\n")
        os.Exit(1)
    }
    drv, err := vitastor.NewDriver(config)
    if (err != nil)
    {
        klog.Fatalln(err)
    }
    drv.Run()
 }
--- a/debian/build-vitastor-bullseye.sh
+++ b/debian/build-vitastor-bullseye.sh
@ -0,0 +1,7 @@
 #!/bin/bash
 sed 's/$REL/bullseye/g' < vitastor.Dockerfile > ../Dockerfile
 cd ..
 mkdir -p packages
 sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
 rm Dockerfile
--- a/debian/build-vitastor-buster.sh
+++ b/debian/build-vitastor-buster.sh
@ -0,0 +1,7 @@
 #!/bin/bash
 sed 's/$REL/buster/g' < vitastor.Dockerfile > ../Dockerfile
 cd ..
 mkdir -p packages
 sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
 rm Dockerfile
--- a/debian/changelog
+++ b/debian/changelog
@ -0,0 +1,27 @@
 vitastor (0.6.5-1) unstable; urgency=medium
  * RDMA support
  * Bugfixes
 -- Vitaliy Filippov <vitalif@yourcmc.ru>  Sat, 01 May 2021 18:46:10 +0300
 vitastor (0.6.0-1) unstable; urgency=medium
  * Snapshots and Copy-on-Write clones
  * Image metadata in etcd (name, size)
  * Image I/O and space statistics in etcd
  * Write throttling for smoothing random write workloads in SSD+HDD configurations
 -- Vitaliy Filippov <vitalif@yourcmc.ru>  Sun, 11 Apr 2021 00:49:18 +0300
 vitastor (0.5.1-1) unstable; urgency=medium
  * Add jerasure support
 -- Vitaliy Filippov <vitalif@yourcmc.ru>  Sat, 05 Dec 2020 17:02:26 +0300
 vitastor (0.5-1) unstable; urgency=medium
  * First packaging for Debian
 -- Vitaliy Filippov <vitalif@yourcmc.ru>  Thu, 05 Nov 2020 02:20:59 +0300
--- a/debian/compat
+++ b/debian/compat
@ -0,0 +1 @@
 13
--- a/debian/control
+++ b/debian/control
@ -0,0 +1,17 @@
 Source: vitastor
 Section: admin
 Priority: optional
 Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru>
 Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev, libibverbs-dev
 Standards-Version: 4.5.0
 Homepage: https://vitastor.io/
 Rules-Requires-Root: no
 Package: vitastor
 Architecture: amd64
 Depends: ${shlibs:Depends}, ${misc:Depends}, fio (= ${dep:fio}), qemu (= ${dep:qemu}), nodejs (>= 10), node-sprintf-js, node-ws (>= 7), libjerasure2, lp-solve
 Description: Vitastor, a fast software-defined clustered block storage
 Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
 architecturally similar to Ceph, which means strong consistency, primary-replication,
 symmetric clustering and automatic data distribution over any number of drives of any
 size with configurable redundancy (replication or erasure codes/XOR).
--- a/debian/copyright
+++ b/debian/copyright
@ -0,0 +1,21 @@
 Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
 Upstream-Name: vitastor
 Upstream-Contact: Vitaliy Filippov <vitalif@yourcmc.ru>
 Source: https://vitastor.io
 Files: *
 Copyright: 2019+ Vitaliy Filippov <vitalif@yourcmc.ru>
 License: Multiple licenses VNPL-1.1 and/or GPL-2.0+
 All server-side code (OSD, Monitor and so on) is licensed under the terms of
 Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
 GNU GPLv3.0 with the additional "Network Interaction" clause which requires
 opensourcing all programs directly or indirectly interacting with Vitastor
 through a computer network and expressly designed to be used in conjunction
 with it ("Proxy Programs"). Proxy Programs may be made public not only under
 the terms of the same license, but also under the terms of any GPL-Compatible
 Free Software License, as listed by the Free Software Foundation.
 This is a stricter copyleft license than the Affero GPL.
 .
 Client libraries (cluster_client and so on) are dual-licensed under the same
 VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
 software like QEMU and fio.
--- a/debian/install
+++ b/debian/install
@ -0,0 +1,3 @@
 VNPL-1.1.txt usr/share/doc/vitastor
 GPL-2.0.txt usr/share/doc/vitastor
 mon usr/lib/vitastor
--- a/debian/patched-qemu.Dockerfile
+++ b/debian/patched-qemu.Dockerfile
@ -0,0 +1,50 @@
 # Build patched QEMU for Debian Buster or Bullseye/Sid inside a container
 # cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/patched-qemu.Dockerfile .
 ARG REL=
 FROM debian:$REL
 ARG REL=
 WORKDIR /root
 RUN if [ "$REL" = "buster" ]; then \
        echo 'deb http://deb.debian.org/debian buster-backports main' >> /etc/apt/sources.list; \
        echo >> /etc/apt/preferences; \
        echo 'Package: *' >> /etc/apt/preferences; \
        echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
        echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
        echo >> /etc/apt/preferences; \
        echo 'Package: libglvnd* libgles* libglx* libgl1 libegl* libopengl* mesa*' >> /etc/apt/preferences; \
        echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
        echo 'Pin-Priority: 50' >> /etc/apt/preferences; \
    fi; \
    grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
    echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
    echo 'APT::Install-Suggests false;' >> /etc/apt/apt.conf
 RUN apt-get update
 RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
 RUN apt-get -y build-dep qemu
 RUN apt-get -y build-dep fio
 # To build a custom version
 #RUN cp /root/packages/qemu-orig/* /root
 RUN apt-get --download-only source qemu
 RUN apt-get --download-only source fio
 ADD patches/qemu-5.0-vitastor.patch patches/qemu-5.1-vitastor.patch /root/vitastor/patches/
 RUN set -e; \
    mkdir -p /root/packages/qemu-$REL; \
    rm -rf /root/packages/qemu-$REL/*; \
    cd /root/packages/qemu-$REL; \
    dpkg-source -x /root/qemu*.dsc; \
    if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
        cp /root/vitastor/patches/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
        echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
    else \
        cp /root/vitastor/patches/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
        P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
        echo qemu-5.1-vitastor.patch >> $P/series; \
    fi; \
    cd /root/packages/qemu-$REL/qemu-*/; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor1; \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Plug Vitastor block driver'; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
    rm -rf /root/packages/qemu-$REL/qemu-*/
--- a/debian/rules
+++ b/debian/rules
@ -0,0 +1,9 @@
 #!/usr/bin/make -f
 export DH_VERBOSE = 1
 %:
 	dh $@
 override_dh_installdeb:
 	cat debian/substvars >> debian/vitastor.substvars
 	dh_installdeb
--- a/debian/source/format
+++ b/debian/source/format
@ -0,0 +1 @@
 3.0 (quilt)
--- a/debian/substvars
+++ b/debian/substvars
@ -0,0 +1,2 @@
 dep:fio=3.16-1
 dep:qemu=1:5.1+dfsg-4+vitastor1
--- a/debian/vitastor.Dockerfile
+++ b/debian/vitastor.Dockerfile
@ -0,0 +1,67 @@
 # Build Vitastor packages for Debian Buster or Bullseye/Sid inside a container
 # cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
 ARG REL=
 FROM debian:$REL
 ARG REL=
 WORKDIR /root
 RUN if [ "$REL" = "buster" ]; then \
        echo 'deb http://deb.debian.org/debian buster-backports main' >> /etc/apt/sources.list; \
        echo >> /etc/apt/preferences; \
        echo 'Package: *' >> /etc/apt/preferences; \
        echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
        echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
    fi; \
    grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
    echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
    echo 'APT::Install-Suggests false;' >> /etc/apt/apt.conf
 RUN apt-get update
 RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
 RUN apt-get -y build-dep qemu
 RUN apt-get -y build-dep fio
 RUN apt-get --download-only source qemu
 RUN apt-get --download-only source fio
 RUN apt-get update && apt-get -y install libjerasure-dev cmake libibverbs-dev
 ADD . /root/vitastor
 RUN set -e -x; \
    mkdir -p /root/fio-build/; \
    cd /root/fio-build/; \
    rm -rf /root/fio-build/*; \
    dpkg-source -x /root/fio*.dsc; \
    cd /root/packages/qemu-$REL/; \
    rm -rf qemu*/; \
    dpkg-source -x qemu*.dsc; \
    cd /root/packages/qemu-$REL/qemu*/; \
    debian/rules b/configure-stamp; \
    cd b/qemu; \
    make -j8 qapi/qapi-builtin-types.h; \
    mkdir -p /root/packages/vitastor-$REL; \
    rm -rf /root/packages/vitastor-$REL/*; \
    cd /root/packages/vitastor-$REL; \
    cp -r /root/vitastor vitastor-0.6.5; \
    ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.5/qemu; \
    ln -s /root/fio-build/fio-*/ vitastor-0.6.5/fio; \
    cd vitastor-0.6.5; \
    FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    sh copy-qemu-includes.sh; \
    sh copy-fio-includes.sh; \
    rm qemu fio; \
    mkdir -p a b debian/patches; \
    mv qemu-copy b/qemu; \
    mv fio-copy b/fio; \
    diff -NaurpbB a b > debian/patches/qemu-fio-headers.patch || true; \
    echo qemu-fio-headers.patch >> debian/patches/series; \
    rm -rf a b; \
    rm -rf /root/packages/qemu-$REL/qemu*/; \
    echo "dep:fio=$FIO" > debian/substvars; \
    echo "dep:qemu=$QEMU" >> debian/substvars; \
    cd /root/packages/vitastor-$REL; \
    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.5.orig.tar.xz vitastor-0.6.5; \
    cd vitastor-0.6.5; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
    rm -rf /root/packages/vitastor-$REL/vitastor-*/
--- a/dump_journal.cpp
+++ b/dump_journal.cpp
@ -1,165 +0,0 @@
 #define _LARGEFILE64_SOURCE
 #include <sys/types.h>
 #include <sys/ioctl.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <stdint.h>
 #include <malloc.h>
 #include <linux/fs.h>
 #include <string.h>
 #include <errno.h>
 #include <assert.h>
 #include <stdio.h>
 #include "blockstore_impl.h"
 #include "crc32c.h"
 struct journal_dump_t
 {
    char *journal_device;
    uint32_t journal_block;
    uint64_t journal_offset;
    uint64_t journal_len;
    uint64_t journal_pos;
    int fd;
    void dump_block(void *buf);
 };
 int main(int argc, char *argv[])
 {
    if (argc < 5)
    {
        printf("USAGE: %s <journal_file> <journal_block_size> <offset> <size>\n", argv[0]);
        return 1;
    }
    journal_dump_t self;
    self.journal_device = argv[1];
    self.journal_block = strtoul(argv[2], NULL, 10);
    self.journal_offset = strtoull(argv[3], NULL, 10);
    self.journal_len = strtoull(argv[4], NULL, 10);
    if (self.journal_block < MEM_ALIGNMENT || (self.journal_block % MEM_ALIGNMENT) ||
        self.journal_block > 128*1024)
    {
        printf("Invalid journal block size\n");
        return 1;
    }
    self.fd = open(self.journal_device, O_DIRECT|O_RDONLY);
    if (self.fd == -1)
    {
        printf("Failed to open journal\n");
        return 1;
    }
    void *data = memalign(MEM_ALIGNMENT, self.journal_block);
    self.journal_pos = 0;
    while (self.journal_pos < self.journal_len)
    {
        int r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
        assert(r == self.journal_block);
        uint64_t s;
        for (s = 0; s < self.journal_block; s += 8)
        {
            if (*((uint64_t*)(data+s)) != 0)
                break;
        }
        if (s == self.journal_block)
        {
            printf("offset %08lx: zeroes\n", self.journal_pos);
            self.journal_pos += self.journal_block;
        }
        else if (((journal_entry*)data)->magic == JOURNAL_MAGIC)
        {
            printf("offset %08lx:\n", self.journal_pos);
            self.dump_block(data);
        }
        else
        {
            printf("offset %08lx: no magic in the beginning, looks like random data (pattern=%lx)\n", self.journal_pos, *((uint64_t*)data));
            self.journal_pos += self.journal_block;
        }
    }
    free(data);
    close(self.fd);
    return 0;
 }
 void journal_dump_t::dump_block(void *buf)
 {
    uint32_t pos = 0;
    journal_pos += journal_block;
    int entry = 0;
    bool wrapped = false;
    while (pos < journal_block)
    {
        journal_entry *je = (journal_entry*)(buf + pos);
        if (je->magic != JOURNAL_MAGIC || je->type < JE_START || je->type > JE_DELETE)
        {
            break;
        }
        const char *crc32_valid = je_crc32(je) == je->crc32 ? "(valid)" : "(invalid)";
        printf("entry % 3d: crc32=%08x %s prev=%08x ", entry, je->crc32, crc32_valid, je->crc32_prev);
        if (je->type == JE_START)
        {
            printf("je_start start=%08lx\n", je->start.journal_start);
        }
        else if (je->type == JE_SMALL_WRITE)
        {
            printf(
                "je_small_write oid=%lu:%lu ver=%lu offset=%u len=%u loc=%08lx",
                je->small_write.oid.inode, je->small_write.oid.stripe,
                je->small_write.version, je->small_write.offset, je->small_write.len,
                je->small_write.data_offset
            );
            if (journal_pos + je->small_write.len > journal_len)
            {
                // data continues from the beginning of the journal
                journal_pos = journal_block;
                wrapped = true;
            }
            if (journal_pos != je->small_write.data_offset)
            {
                printf(" (mismatched, calculated = %lu)", journal_pos);
            }
            journal_pos += je->small_write.len;
            if (journal_pos >= journal_len)
            {
                journal_pos = journal_block;
                wrapped = true;
            }
            uint32_t data_crc32 = 0;
            void *data = memalign(MEM_ALIGNMENT, je->small_write.len);
            // Keep the read outside assert() so it still runs when NDEBUG is defined
            ssize_t r = pread(fd, data, je->small_write.len, journal_offset+je->small_write.data_offset);
            assert(r == (ssize_t)je->small_write.len);
            data_crc32 = crc32c(0, data, je->small_write.len);
            free(data);
            printf(
                " data_crc32=%08x%s", je->small_write.crc32_data,
                (data_crc32 != je->small_write.crc32_data) ? " (invalid)" : " (valid)"
            );
            printf("\n");
        }
        else if (je->type == JE_BIG_WRITE)
        {
            printf("je_big_write oid=%lu:%lu ver=%lu loc=%08lx\n", je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location);
        }
        else if (je->type == JE_STABLE)
        {
            printf("je_stable oid=%lu:%lu ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
        }
        else if (je->type == JE_ROLLBACK)
        {
            printf("je_rollback oid=%lu:%lu ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
        }
        else if (je->type == JE_DELETE)
        {
            printf("je_delete oid=%lu:%lu ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
        }
        pos += je->size;
        entry++;
    }
    if (wrapped)
    {
        journal_pos = journal_len;
    }
 }
--- a/etcd_state_client.cpp
+++ b/etcd_state_client.cpp
@ -1,374 +0,0 @@
 #include "osd_ops.h"
 #include "pg_states.h"
 #include "etcd_state_client.h"
 #include "http_client.h"
 #include "base64.h"
 json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
 {
    json_kv_t kv;
    kv.key = base64_decode(kv_json["key"].string_value());
    std::string json_err, json_text = base64_decode(kv_json["value"].string_value());
    kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
    if (json_err != "")
    {
        printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
        kv.key = "";
    }
    return kv;
 }
 void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback)
 {
    etcd_call("/kv/txn", txn, timeout, callback);
 }
 void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback)
 {
    std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
    std::string etcd_api_path;
    int pos = etcd_address.find('/');
    if (pos >= 0)
    {
        etcd_api_path = etcd_address.substr(pos);
        etcd_address = etcd_address.substr(0, pos);
    }
    std::string req = payload.dump();
    req = "POST "+etcd_api_path+api+" HTTP/1.1\r\n"
        "Host: "+etcd_address+"\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: "+std::to_string(req.size())+"\r\n"
        "Connection: close\r\n"
        "\r\n"+req;
    http_request_json(tfd, etcd_address, req, timeout, callback);
 }
 void etcd_state_client_t::start_etcd_watcher()
 {
    std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
    std::string etcd_api_path;
    int pos = etcd_address.find('/');
    if (pos >= 0)
    {
        etcd_api_path = etcd_address.substr(pos);
        etcd_address = etcd_address.substr(0, pos);
    }
    etcd_watches_initialised = 0;
    etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", ETCD_SLOW_TIMEOUT, [this](const http_response_t *msg)
    {
        if (msg->body.length())
        {
            std::string json_err;
            json11::Json data = json11::Json::parse(msg->body, json_err);
            if (json_err != "")
            {
                printf("Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
            }
            else
            {
                if (data["result"]["created"].bool_value())
                {
                    etcd_watches_initialised++;
                }
                if (etcd_watches_initialised == 4)
                {
                    etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
                }
                // First gather all changes into a hash to remove multiple overwrites
                json11::Json::object changes;
                for (auto & ev: data["result"]["events"].array_items())
                {
                    auto kv = parse_etcd_kv(ev["kv"]);
                    if (kv.key != "")
                    {
                        changes[kv.key] = kv.value;
                    }
                }
                for (auto & kv: changes)
                {
                    if (this->log_level > 0)
                    {
                        printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
                    }
                    parse_state(kv.first, kv.second);
                }
                // React to changes
                on_change_hook(changes);
            }
        }
        if (msg->eof)
        {
            etcd_watch_ws = NULL;
            if (etcd_watches_initialised == 0)
            {
                // Connection not established, retry in <ETCD_SLOW_TIMEOUT>
                tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int)
                {
                    start_etcd_watcher();
                });
            }
            else
            {
                // Connection was live, retry immediately
                start_etcd_watcher();
            }
        }
    });
    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/config/") },
            { "range_end", base64_encode(etcd_prefix+"/config0") },
            { "start_revision", etcd_watch_revision+1 },
            { "watch_id", ETCD_CONFIG_WATCH_ID },
        } }
    }).dump());
    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/osd/state/") },
            { "range_end", base64_encode(etcd_prefix+"/osd/state0") },
            { "start_revision", etcd_watch_revision+1 },
            { "watch_id", ETCD_OSD_STATE_WATCH_ID },
        } }
    }).dump());
    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/pg/state/") },
            { "range_end", base64_encode(etcd_prefix+"/pg/state0") },
            { "start_revision", etcd_watch_revision+1 },
            { "watch_id", ETCD_PG_STATE_WATCH_ID },
        } }
    }).dump());
    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/pg/history/") },
            { "range_end", base64_encode(etcd_prefix+"/pg/history0") },
            { "start_revision", etcd_watch_revision+1 },
            { "watch_id", ETCD_PG_HISTORY_WATCH_ID },
        } }
    }).dump());
 }
 void etcd_state_client_t::load_global_config()
 {
    etcd_call("/kv/range", json11::Json::object {
        { "key", base64_encode(etcd_prefix+"/config/global") }
    }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
    {
        if (err != "")
        {
            printf("Error reading OSD configuration from etcd: %s\n", err.c_str());
            tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
            {
                load_global_config();
            });
            return;
        }
        if (!etcd_watch_revision)
        {
            etcd_watch_revision = data["header"]["revision"].uint64_value();
        }
        json11::Json::object global_config;
        if (data["kvs"].array_items().size() > 0)
        {
            auto kv = parse_etcd_kv(data["kvs"][0]);
            if (kv.value.is_object())
            {
                global_config = kv.value.object_items();
            }
        }
        on_load_config_hook(global_config);
    });
 }
 void etcd_state_client_t::load_pgs()
 {
    json11::Json::array txn = {
        json11::Json::object {
            { "request_range", json11::Json::object {
                { "key", base64_encode(etcd_prefix+"/config/pgs") },
            } }
        },
        json11::Json::object {
            { "request_range", json11::Json::object {
                { "key", base64_encode(etcd_prefix+"/pg/history/") },
                { "range_end", base64_encode(etcd_prefix+"/pg/history0") },
            } }
        },
        json11::Json::object {
            { "request_range", json11::Json::object {
                { "key", base64_encode(etcd_prefix+"/pg/state/") },
                { "range_end", base64_encode(etcd_prefix+"/pg/state0") },
            } }
        },
        json11::Json::object {
            { "request_range", json11::Json::object {
                { "key", base64_encode(etcd_prefix+"/osd/state/") },
                { "range_end", base64_encode(etcd_prefix+"/osd/state0") },
            } }
        },
    };
    json11::Json::object req = { { "success", txn } };
    json11::Json checks = load_pgs_checks_hook();
    if (checks.array_items().size() > 0)
    {
        req["compare"] = checks;
    }
    etcd_txn(req, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
    {
        if (err != "")
        {
            printf("Error loading PGs from etcd: %s\n", err.c_str());
            tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
            {
                load_pgs();
            });
            return;
        }
        if (!data["succeeded"].bool_value())
        {
            on_load_pgs_hook(false);
            return;
        }
        for (auto & res: data["responses"].array_items())
        {
            for (auto & kv_json: res["response_range"]["kvs"].array_items())
            {
                auto kv = parse_etcd_kv(kv_json);
                parse_state(kv.key, kv.value);
            }
        }
        on_load_pgs_hook(true);
    });
 }
 void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
 {
    if (key == etcd_prefix+"/config/pgs")
    {
        for (auto & pg_item: this->pg_config)
        {
            pg_item.second.exists = false;
        }
        for (auto & pg_item: value["items"].object_items())
        {
            pg_num_t pg_num = stoull_full(pg_item.first);
            if (!pg_num)
            {
                printf("Bad key in PG configuration: %s (must be a number), skipped\n", pg_item.first.c_str());
                continue;
            }
            this->pg_config[pg_num].exists = true;
            this->pg_config[pg_num].pause = pg_item.second["pause"].bool_value();
            this->pg_config[pg_num].primary = pg_item.second["primary"].uint64_value();
            this->pg_config[pg_num].target_set.clear();
            for (auto pg_osd: pg_item.second["osd_set"].array_items())
            {
                this->pg_config[pg_num].target_set.push_back(pg_osd.uint64_value());
            }
            if (this->pg_config[pg_num].target_set.size() != 3)
            {
                printf("Bad PG %u config format: incorrect osd_set = %s\n", pg_num, pg_item.second["osd_set"].dump().c_str());
                this->pg_config[pg_num].target_set.resize(3);
                this->pg_config[pg_num].pause = true;
            }
        }
    }
    else if (key.substr(0, etcd_prefix.length()+12) == etcd_prefix+"/pg/history/")
    {
        // <etcd_prefix>/pg/history/%d
        pg_num_t pg_num = stoull_full(key.substr(etcd_prefix.length()+12));
        if (!pg_num)
        {
            printf("Bad etcd key %s, ignoring\n", key.c_str());
        }
        else
        {
            auto & pg_cfg = this->pg_config[pg_num];
            pg_cfg.target_history.clear();
            pg_cfg.all_peers.clear();
            // Refuse to start PG if any set of the <osd_sets> has no live OSDs
            for (auto hist_item: value["osd_sets"].array_items())
            {
                std::vector<osd_num_t> history_set;
                for (auto pg_osd: hist_item.array_items())
                {
                    history_set.push_back(pg_osd.uint64_value());
                }
                pg_cfg.target_history.push_back(history_set);
            }
            // Include these additional OSDs when peering the PG
            for (auto pg_osd: value["all_peers"].array_items())
            {
                pg_cfg.all_peers.push_back(pg_osd.uint64_value());
            }
        }
    }
    else if (key.substr(0, etcd_prefix.length()+10) == etcd_prefix+"/pg/state/")
    {
        // <etcd_prefix>/pg/state/%d
        pg_num_t pg_num = stoull_full(key.substr(etcd_prefix.length()+10));
        if (!pg_num)
        {
            printf("Bad etcd key %s, ignoring\n", key.c_str());
        }
        else if (value.is_null())
        {
            this->pg_config[pg_num].cur_primary = 0;
            this->pg_config[pg_num].cur_state = 0;
        }
        else
        {
            osd_num_t cur_primary = value["primary"].uint64_value();
            int state = 0;
            for (auto & e: value["state"].array_items())
            {
                int i;
                for (i = 0; i < pg_state_bit_count; i++)
                {
                    if (e.string_value() == pg_state_names[i])
                    {
                        state = state | pg_state_bits[i];
                        break;
                    }
                }
                if (i >= pg_state_bit_count)
                {
                    printf("Unexpected PG %u state keyword in etcd: %s\n", pg_num, e.dump().c_str());
                    return;
                }
            }
            if (!cur_primary || !value["state"].is_array() || !state ||
                (state & PG_OFFLINE) && state != PG_OFFLINE ||
                (state & PG_PEERING) && state != PG_PEERING ||
                (state & PG_INCOMPLETE) && state != PG_INCOMPLETE)
            {
                printf("Unexpected PG %u state in etcd: primary=%lu, state=%s\n", pg_num, cur_primary, value["state"].dump().c_str());
                return;
            }
            this->pg_config[pg_num].cur_primary = cur_primary;
            this->pg_config[pg_num].cur_state = state;
        }
    }
    else if (key.substr(0, etcd_prefix.length()+11) == etcd_prefix+"/osd/state/")
    {
        // <etcd_prefix>/osd/state/%d
        osd_num_t peer_osd = stoull_full(key.substr(etcd_prefix.length()+11));
        if (peer_osd > 0)
        {
            if (value.is_object() && value["state"] == "up" &&
                value["addresses"].is_array() &&
                value["port"].int64_value() > 0 && value["port"].int64_value() < 65536)
            {
                this->peer_states[peer_osd] = value;
            }
            else
            {
                this->peer_states.erase(peer_osd);
            }
            if (on_change_osd_state_hook != NULL)
            {
                on_change_osd_state_hook(peer_osd);
            }
        }
    }
 }
--- a/etcd_state_client.h
+++ b/etcd_state_client.h
@ -1,59 +0,0 @@
 #pragma once
 #include "http_client.h"
 #include "timerfd_manager.h"
 #define ETCD_CONFIG_WATCH_ID 1
 #define ETCD_PG_STATE_WATCH_ID 2
 #define ETCD_PG_HISTORY_WATCH_ID 3
 #define ETCD_OSD_STATE_WATCH_ID 4
 #define MAX_ETCD_ATTEMPTS 5
 #define ETCD_SLOW_TIMEOUT 5000
 #define ETCD_QUICK_TIMEOUT 1000
 struct pg_config_t
 {
    bool exists;
    osd_num_t primary;
    std::vector<osd_num_t> target_set;
    std::vector<std::vector<osd_num_t>> target_history;
    std::vector<osd_num_t> all_peers;
    bool pause;
    osd_num_t cur_primary;
    int cur_state;
 };
 struct json_kv_t
 {
    std::string key;
    json11::Json value;
 };
 struct etcd_state_client_t
 {
    std::vector<std::string> etcd_addresses;
    std::string etcd_prefix;
    int log_level = 0;
    timerfd_manager_t *tfd = NULL;
    int etcd_watches_initialised = 0;
    uint64_t etcd_watch_revision = 0;
    websocket_t *etcd_watch_ws = NULL;
    std::map<pg_num_t, pg_config_t> pg_config;
    std::map<osd_num_t, json11::Json> peer_states;
    std::function<void(json11::Json::object &)> on_change_hook;
    std::function<void(json11::Json::object &)> on_load_config_hook;
    std::function<json11::Json()> load_pgs_checks_hook;
    std::function<void(bool)> on_load_pgs_hook;
    std::function<void(uint64_t)> on_change_osd_state_hook;
    json_kv_t parse_etcd_kv(const json11::Json & kv_json);
    void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
    void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
    void start_etcd_watcher();
    void load_global_config();
    void load_pgs();
    void parse_state(const std::string & key, const json11::Json & value);
 };
--- a/1
+++ b/1
@ -0,0 +1 @@
 Subproject commit 97f06cb20c1e136fd37d58fb40f57dd8f8a3a4a7
--- a/lambda_size.cpp
+++ b/lambda_size.cpp
@ -1,48 +0,0 @@
 #include <iostream>
 #include <functional>
 #include <array>
 #include <cstdlib> // for malloc() and free()
 using namespace std;
 // replace operator new and delete to log allocations
 void* operator new(std::size_t n)
 {
    cout << "Allocating " << n << " bytes" << endl;
    return malloc(n);
 }
 void operator delete(void* p) throw()
 {
    free(p);
 }
 class test
 {
 public:
    std::string s;
    void a(std::function<void()> & f, const char *str)
    {
        auto l = [this, str]() { cout << str << " ? " << s << " from this\n"; };
        cout << "Assigning lambda3 of size " << sizeof(l) << endl;
        f = l;
    }
 };
 int main()
 {
    std::array<char, 16> arr1;
    auto lambda1 = [arr1](){};
    cout << "Assigning lambda1 of size " << sizeof(lambda1) << endl;
    std::function<void()> f1 = lambda1;
    std::array<char, 17> arr2;
    auto lambda2 = [arr2](){};
    cout << "Assigning lambda2 of size " << sizeof(lambda2) << endl;
    std::function<void()> f2 = lambda2;
    test t;
    std::function<void()> f3;
    t.s = "str";
    t.a(f3, "huyambda");
    f3();
 }
--- a/lp/mon.js
+++ b/lp/mon.js
@ -1,858 +0,0 @@
 const http = require('http');
 const os = require('os');
 const WebSocket = require('ws');
 const LPOptimizer = require('./lp-optimizer.js');
 const stableStringify = require('./stable-stringify.js');
 class Mon
 {
    static etcd_tree = {
        config: {
            global: null,
            /* placement_tree = {
                levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
                nodes: { host1: { level: 'host', parent: 'rack1' }, ... },
                failure_domain: 'host',
            } */
            placement_tree: null,
            osd: {},
            pgs: {},
        },
        osd: {
            state: {},
            stats: {},
        },
        mon: {
            master: null,
        },
        pg: {
            change_stamp: null,
            state: {},
            stats: {},
            history: {},
        },
    }
    constructor(config)
    {
        // FIXME: Maybe prefer local etcd
        this.etcd_urls = [];
        for (let url of config.etcd_url.split(/,/))
        {
            let scheme = 'http';
            url = url.trim().replace(/^(https?):\/\//, (m, m1) => { scheme = m1; return ''; });
            if (!/\/[^\/]/.exec(url))
                url += '/v3';
            this.etcd_urls.push(scheme+'://'+url);
        }
        this.etcd_prefix = config.etcd_prefix || '/rage';
        this.etcd_prefix = this.etcd_prefix.replace(/\/\/+/g, '/').replace(/^\/?(.*[^\/])\/?$/, '/$1');
        this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
        this.state = JSON.parse(JSON.stringify(Mon.etcd_tree));
    }
    async start()
    {
        await this.load_config();
        await this.get_lease();
        await this.become_master();
        await this.load_cluster_state();
        await this.start_watcher();
        await this.recheck_pgs();
    }
    async load_config()
    {
        const res = await this.etcd_call('/txn', { success: [
            { requestRange: { key: b64(this.etcd_prefix+'/config/global') } }
        ] }, this.etcd_start_timeout, -1);
        this.parse_kv(res.responses[0].response_range.kvs[0]);
        this.check_config();
    }
    check_config()
    {
        this.config.etcd_mon_timeout = Number(this.config.etcd_mon_timeout) || 0;
        if (this.config.etcd_mon_timeout <= 0)
        {
            this.config.etcd_mon_timeout = 1000;
        }
        this.config.etcd_mon_retries = Number(this.config.etcd_mon_retries) || 5;
        if (this.config.etcd_mon_retries < 0)
        {
            this.config.etcd_mon_retries = 0;
        }
        this.config.mon_change_timeout = Number(this.config.mon_change_timeout) || 1000;
        if (this.config.mon_change_timeout < 100)
        {
            this.config.mon_change_timeout = 100;
        }
        this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
        if (this.config.mon_stats_timeout < 100)
        {
            this.config.mon_stats_timeout = 100;
        }
        // After this number of seconds, a dead OSD will be removed from PG distribution
        this.config.osd_out_time = Number(this.config.osd_out_time) || 0;
        if (!this.config.osd_out_time)
        {
            this.config.osd_out_time = 30*60; // 30 minutes by default
        }
        this.config.max_osd_combinations = Number(this.config.max_osd_combinations) || 10000;
        if (this.config.max_osd_combinations < 100)
        {
            this.config.max_osd_combinations = 100;
        }
    }
    async start_watcher(retries)
    {
        let retry = 0;
        if (retries >= 0 && retries < 1)
        {
            retries = 1;
        }
        while (retries < 0 || retry < retries)
        {
            const base = 'ws'+this.etcd_urls[Math.floor(Math.random()*this.etcd_urls.length)].substr(4);
            const ok = await new Promise((ok, no) =>
            {
                const timer_id = setTimeout(() =>
                {
                    this.ws.close();
                    ok(false);
                }, this.config.etcd_mon_timeout);
                this.ws = new WebSocket(base+'/watch');
                this.ws.on('open', () =>
                {
                    if (timer_id)
                        clearTimeout(timer_id);
                    ok(true);
                });
            });
            if (!ok)
            {
                this.ws = null;
            }
            retry++;
        }
        if (!this.ws)
        {
            this.die('Failed to open etcd watch websocket');
        }
        this.ws.send(JSON.stringify({
            create_request: {
                key: b64(this.etcd_prefix+'/'),
                range_end: b64(this.etcd_prefix+'0'),
                start_revision: ''+this.etcd_watch_revision,
                watch_id: 1,
            },
        }));
        this.ws.on('message', (msg) =>
        {
            let data;
            try
            {
                data = JSON.parse(msg);
            }
            catch (e)
            {
            }
            if (!data || !data.result || !data.result.events)
            {
                console.error('Garbage received from watch websocket: '+msg);
            }
            else
            {
                let stats_changed = false, changed = false;
                console.log('Revision '+data.result.header.revision+' events: ');
                for (const e of data.result.events)
                {
                    this.parse_kv(e.kv);
                    const key = e.kv.key.substr(this.etcd_prefix.length);
                    if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/')
                    {
                        stats_changed = true;
                    }
                    else if (key != '/stats')
                    {
                        changed = true;
                    }
                    console.log(e);
                }
                if (stats_changed)
                {
                    this.schedule_update_stats();
                }
                if (changed)
                {
                    this.schedule_recheck();
                }
            }
        });
    }
    async get_lease()
    {
        const max_ttl = this.config.etcd_mon_ttl + this.config.etcd_mon_timeout/1000*this.config.etcd_mon_retries;
        const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
        this.etcd_lease_id = res.ID;
        setInterval(async () =>
        {
            const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
            if (!res.result.TTL)
            {
                this.die('Lease expired');
            }
        }, this.config.etcd_mon_timeout);
    }
    async become_master()
    {
        const state = { ip: this.local_ips() };
        while (1)
        {
            const res = await this.etcd_call('/txn', {
                compare: [ { target: 'CREATE', create_revision: 0, key: b64(this.etcd_prefix+'/mon/master') } ],
                success: [ { key: b64(this.etcd_prefix+'/mon/master'), value: b64(JSON.stringify(state)), lease: ''+this.etcd_lease_id } ],
            }, this.etcd_start_timeout, 0);
            if (res.succeeded)
            {
                break;
            }
            // Another monitor holds the master key; wait and retry
            await new Promise(ok => setTimeout(ok, this.etcd_start_timeout));
        }
    }
    async load_cluster_state()
    {
        const res = await this.etcd_call('/txn', { success: [
            { requestRange: { key: b64(this.etcd_prefix+'/'), range_end: b64(this.etcd_prefix+'0') } },
        ] }, this.etcd_start_timeout, -1);
        this.etcd_watch_revision = BigInt(res.header.revision)+BigInt(1);
        const data = JSON.parse(JSON.stringify(Mon.etcd_tree));
        for (const response of res.responses)
        {
            for (const kv of response.response_range.kvs)
            {
                this.parse_kv(kv);
            }
        }
        this.state = data;
    }
    all_osds()
    {
        return Object.keys(this.state.osd.stats);
    }
    get_osd_tree()
    {
        this.state.config.placement_tree = this.state.config.placement_tree||{};
        const levels = this.state.config.placement_tree.levels||{};
        levels.host = levels.host || 100;
        levels.osd = levels.osd || 101;
        const tree = { '': { children: [] } };
        for (const node_id in this.state.config.placement_tree.nodes||{})
        {
            const node_cfg = this.state.config.placement_tree.nodes[node_id];
            if (!node_id || /^\d/.exec(node_id) ||
                !node_cfg.level || !levels[node_cfg.level])
            {
                // All nodes must have non-empty non-numeric IDs and valid levels
                continue;
            }
            tree[node_id] = { id: node_id, level: node_cfg.level, parent: node_cfg.parent, children: [] };
        }
        // This requires monitor system time to be in sync with OSD system times (at least to some extent)
        const down_time = Date.now()/1000 - this.config.osd_out_time;
        for (const osd_num of this.all_osds().sort((a, b) => a - b))
        {
            const stat = this.state.osd.stats[osd_num];
            if (stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
            {
                // Numeric IDs are reserved for OSDs
                const reweight = this.state.config.osd[osd_num] && Number(this.state.config.osd[osd_num].reweight) || 1;
                tree[osd_num] = tree[osd_num] || { id: osd_num, parent: stat.host };
                tree[osd_num].level = 'osd';
                tree[osd_num].size = reweight * stat.size / 1024 / 1024 / 1024 / 1024; // terabytes
                delete tree[osd_num].children;
            }
        }
        for (const node_id in tree)
        {
            if (node_id === '')
            {
                continue;
            }
            const node_cfg = tree[node_id];
            const node_level = levels[node_cfg.level] || node_cfg.level;
            let parent_level = node_cfg.parent && tree[node_cfg.parent] && tree[node_cfg.parent].children
                && tree[node_cfg.parent].level;
            parent_level = parent_level ? (levels[parent_level] || parent_level) : null;
            // Parent's level must be less than child's; OSDs must be leaves
            const parent = parent_level && parent_level < node_level ? tree[node_cfg.parent] : '';
            tree[parent].children.push(tree[node_id]);
            delete node_cfg.parent;
        }
        return LPOptimizer.flatten_tree(tree[''].children, levels, this.state.config.failure_domain, 'osd');
    }
    async stop_all_pgs()
    {
        let has_online = false, paused = true;
        for (const pg in this.state.config.pgs.items||{})
        {
            const cur_state = ((this.state.pg.state[pg]||{}).state||[]).join(',');
            if (cur_state != '' && cur_state != 'offline')
            {
                has_online = true;
            }
            if (!this.state.config.pgs.items[pg].pause)
            {
                paused = false;
            }
        }
        if (!paused)
        {
            console.log('Stopping all PGs before changing PG count');
            const new_cfg = JSON.parse(JSON.stringify(this.state.config.pgs));
            for (const pg in new_cfg.items)
            {
                new_cfg.items[pg].pause = true;
            }
            // Check that no OSDs change their state before we pause PGs
            // Doing this we make sure that OSDs don't wake up in the middle of our "transaction"
            // and can't see the old PG configuration
            const checks = [];
            for (const osd_num of this.all_osds())
            {
                const key = b64(this.etcd_prefix+'/osd/state/'+osd_num);
                checks.push({ key, target: 'MOD', result: 'LESS', mod_revision: ''+this.etcd_watch_revision });
            }
            const res = await this.etcd_call('/txn', {
                compare: [
                    { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
                    { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
                    ...checks,
                ],
                success: [
                    { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_cfg)) } },
                ],
            }, this.config.etcd_mon_timeout, 0);
            if (!res.succeeded)
            {
                return false;
            }
            this.state.config.pgs = new_cfg;
        }
        return !has_online;
    }
    scale_pg_count(prev_pgs, pg_history, new_pg_count)
    {
        const old_pg_count = prev_pgs.length;
        // Add all possibly intersecting PGs into the history of new PGs
        if (!(new_pg_count % old_pg_count))
        {
            // New PG count is a multiple of the old PG count
            const mul = (new_pg_count / old_pg_count);
            for (let i = 0; i < new_pg_count; i++)
            {
                const old_i = Math.floor(i / mul);
                pg_history[i] = JSON.parse(JSON.stringify(this.state.pg.history[1+old_i]));
            }
        }
        else if (!(old_pg_count % new_pg_count))
        {
            // Old PG count is a multiple of the new PG count
            const mul = (old_pg_count / new_pg_count);
            for (let i = 0; i < new_pg_count; i++)
            {
                pg_history[i] = {
                    osd_sets: [],
                    all_peers: [],
                };
                for (let j = 0; j < mul; j++)
                {
                    pg_history[i].osd_sets.push(prev_pgs[i*mul+j]);
                    const hist = this.state.pg.history[1+i*mul+j];
                    if (hist && hist.osd_sets && hist.osd_sets.length)
                    {
                        Array.prototype.push.apply(pg_history[i].osd_sets, hist.osd_sets);
                    }
                    if (hist && hist.all_peers && hist.all_peers.length)
                    {
                        Array.prototype.push.apply(pg_history[i].all_peers, hist.all_peers);
                    }
                }
            }
        }
        else
        {
            // Any PG may intersect with any PG after non-multiple PG count change
            // So, merge ALL PGs history
            let all_sets = {};
            let all_peers = {};
            for (const pg of prev_pgs)
            {
                all_sets[pg.join(' ')] = pg;
            }
            for (const pg in this.state.pg.history)
            {
                const hist = this.state.pg.history[pg];
                if (hist && hist.osd_sets)
                {
                    for (const pg of hist.osd_sets)
                    {
                        all_sets[pg.join(' ')] = pg;
                    }
                }
                if (hist && hist.all_peers)
                {
                    for (const osd_num of hist.all_peers)
                    {
                        all_peers[osd_num] = Number(osd_num);
                    }
                }
            }
            all_sets = Object.values(all_sets);
            all_peers = Object.values(all_peers);
            for (let i = 0; i < new_pg_count; i++)
            {
                pg_history[i] = { osd_sets: all_sets, all_peers };
            }
        }
        // Mark history keys for removed PGs as removed
        for (let i = new_pg_count; i < old_pg_count; i++)
        {
            pg_history[i] = null;
        }
        if (old_pg_count < new_pg_count)
        {
            for (let i = new_pg_count-1; i >= 0; i--)
            {
                prev_pgs[i] = prev_pgs[Math.floor(i/new_pg_count*old_pg_count)];
            }
        }
        else if (old_pg_count > new_pg_count)
        {
            for (let i = 0; i < new_pg_count; i++)
            {
                prev_pgs[i] = prev_pgs[Math.round(i/new_pg_count*old_pg_count)];
            }
            prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
        }
    }
    async save_new_pgs(prev_pgs, new_pgs, pg_history, tree_hash)
    {
        const txn = [], checks = [];
        const pg_items = {};
        new_pgs.map((osd_set, i) =>
        {
            osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
            const alive_set = osd_set.filter(osd_num => osd_num);
            pg_items[i+1] = {
                osd_set,
                primary: alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0,
            };
            if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' '))
            {
                pg_history[i] = pg_history[i] || {};
                pg_history[i].osd_sets = pg_history[i].osd_sets || [];
                pg_history[i].osd_sets.push(prev_pgs[i]);
            }
        });
        for (let i = 0; i < new_pgs.length || i < prev_pgs.length; i++)
        {
            checks.push({
                key: b64(this.etcd_prefix+'/pg/history/'+(i+1)),
                target: 'MOD',
                mod_revision: ''+this.etcd_watch_revision,
                result: 'LESS',
            });
            if (pg_history[i])
            {
                txn.push({
                    requestPut: {
                        key: b64(this.etcd_prefix+'/pg/history/'+(i+1)),
                        value: b64(JSON.stringify(pg_history[i])),
                    },
                });
            }
            else
            {
                txn.push({
                    requestDeleteRange: {
                        key: b64(this.etcd_prefix+'/pg/history/'+(i+1)),
                    },
                });
            }
        }
        this.state.config.pgs = {
            hash: tree_hash,
            items: pg_items,
        };
        const res = await this.etcd_call('/txn', {
            compare: [
                { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
                { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
                ...checks,
            ],
            success: [
                { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(this.state.config.pgs)) } },
                ...txn,
            ],
        }, this.config.etcd_mon_timeout, 0);
        return res.succeeded;
    }
    async recheck_pgs()
    {
        // Take configuration and state, check it against the stored configuration hash
        // Recalculate PGs and save them to etcd if the configuration is changed
        const tree_cfg = {
            osd_tree: this.get_osd_tree(),
            pg_count: this.config.pg_count || Object.keys(this.state.config.pgs.items||{}).length || 128,
            max_osd_combinations: this.config.max_osd_combinations,
        };
        const tree_hash = sha1hex(stableStringify(tree_cfg));
        if (this.state.config.pgs.hash != tree_hash)
        {
            // Something has changed
            const prev_pgs = [];
            for (const pg in this.state.config.pgs.items||{})
            {
                prev_pgs[pg-1] = this.state.config.pgs.items[pg].osd_set;
            }
            const pg_history = [];
            const old_pg_count = prev_pgs.length;
            let optimize_result;
            if (old_pg_count > 0)
            {
                if (old_pg_count != tree_cfg.pg_count)
                {
                    // PG count changed. Need to bring all PGs down.
                    if (!await this.stop_all_pgs())
                    {
                        this.schedule_recheck();
                        return;
                    }
                    this.scale_pg_count(prev_pgs, pg_history, tree_cfg.pg_count);
                }
                optimize_result = await LPOptimizer.optimize_change(prev_pgs, tree_cfg.osd_tree, tree_cfg.max_osd_combinations);
            }
            else
            {
                optimize_result = await LPOptimizer.optimize_initial(tree_cfg.osd_tree, tree_cfg.pg_count, tree_cfg.max_osd_combinations);
            }
            if (!await this.save_new_pgs(prev_pgs, optimize_result.int_pgs, pg_history, tree_hash))
            {
                console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
                this.schedule_recheck();
                return;
            }
            console.log('PG configuration successfully changed');
            if (old_pg_count != optimize_result.int_pgs.length)
            {
                console.log(`PG count changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`);
            }
            LPOptimizer.print_change_stats(optimize_result);
        }
    }
    schedule_recheck()
    {
        if (this.recheck_timer)
        {
            clearTimeout(this.recheck_timer);
            this.recheck_timer = null;
        }
        this.recheck_timer = setTimeout(() =>
        {
            this.recheck_timer = null;
            this.recheck_pgs().catch(console.error);
        }, this.config.mon_change_timeout || 1000);
    }
    sum_stats()
    {
        let overflow = false;
        this.prev_stats = this.prev_stats || { op_stats: {}, subop_stats: {}, recovery_stats: {} };
        const op_stats = {}, subop_stats = {}, recovery_stats = {};
        for (const osd in this.state.osd.stats)
        {
            const st = this.state.osd.stats[osd];
            for (const op in st.op_stats||{})
            {
                op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n };
                op_stats[op].count += BigInt(st.op_stats[op].count||0);
                op_stats[op].usec += BigInt(st.op_stats[op].usec||0);
                op_stats[op].bytes += BigInt(st.op_stats[op].bytes||0);
            }
            for (const op in st.subop_stats||{})
            {
                subop_stats[op] = subop_stats[op] || { count: 0n, usec: 0n };
                subop_stats[op].count += BigInt(st.subop_stats[op].count||0);
                subop_stats[op].usec += BigInt(st.subop_stats[op].usec||0);
            }
            for (const op in st.recovery_stats||{})
            {
                recovery_stats[op] = recovery_stats[op] || { count: 0n, bytes: 0n };
                recovery_stats[op].count += BigInt(st.recovery_stats[op].count||0);
                recovery_stats[op].bytes += BigInt(st.recovery_stats[op].bytes||0);
            }
        }
        for (const op in op_stats)
        {
            if (op_stats[op].count >= 0x10000000000000000n)
            {
                if (!this.prev_stats.op_stats[op])
                {
                    overflow = true;
                }
                else
                {
                    op_stats[op].count -= this.prev_stats.op_stats[op].count;
                    op_stats[op].usec -= this.prev_stats.op_stats[op].usec;
                    op_stats[op].bytes -= this.prev_stats.op_stats[op].bytes;
                }
            }
        }
        for (const op in subop_stats)
        {
            if (subop_stats[op].count >= 0x10000000000000000n)
            {
                if (!this.prev_stats.subop_stats[op])
                {
                    overflow = true;
                }
                else
                {
                    subop_stats[op].count -= this.prev_stats.subop_stats[op].count;
                    subop_stats[op].usec -= this.prev_stats.subop_stats[op].usec;
                }
            }
        }
        for (const op in recovery_stats)
        {
            if (recovery_stats[op].count >= 0x10000000000000000n)
            {
                if (!this.prev_stats.recovery_stats[op])
                {
                    overflow = true;
                }
                else
                {
                    recovery_stats[op].count -= this.prev_stats.recovery_stats[op].count;
                    recovery_stats[op].bytes -= this.prev_stats.recovery_stats[op].bytes;
                }
            }
        }
        const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
        for (const pg_num in this.state.pg.stats)
        {
            const st = this.state.pg.stats[pg_num];
            for (const k in object_counts)
            {
                if (st[k+'_count'])
                {
                    object_counts[k] += BigInt(st[k+'_count']);
                }
            }
        }
        return (this.prev_stats = { overflow, op_stats, subop_stats, recovery_stats, object_counts });
    }
    async update_total_stats()
    {
        const stats = this.sum_stats();
        if (!stats.overflow)
        {
            // Convert to strings, serialize and save
            const ser = {};
            for (const st of [ 'op_stats', 'subop_stats', 'recovery_stats' ])
            {
                ser[st] = {};
                for (const op in stats[st])
                {
                    ser[st][op] = {};
                    for (const k in stats[st][op])
                    {
                        ser[st][op][k] = ''+stats[st][op][k];
                    }
                }
            }
            ser.object_counts = {};
            for (const k in stats.object_counts)
            {
                ser.object_counts[k] = ''+stats.object_counts[k];
            }
            await this.etcd_call('/txn', {
                success: [ { requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(ser)) } } ],
            }, this.config.etcd_mon_timeout, 0);
        }
    }
    schedule_update_stats()
    {
        if (this.stats_timer)
        {
            clearTimeout(this.stats_timer);
            this.stats_timer = null;
        }
        this.stats_timer = setTimeout(() =>
        {
            this.stats_timer = null;
            this.update_total_stats().catch(console.error);
        }, this.config.mon_stats_timeout || 1000);
    }
    parse_kv(kv)
    {
        if (!kv || !kv.key)
        {
            return;
        }
        kv.key = de64(kv.key);
        kv.value = kv.value ? JSON.parse(de64(kv.value)) : null;
        const key = kv.key.substr(this.etcd_prefix.length).replace(/^\/+/, '').split('/');
        let cur = this.state, orig = Mon.etcd_tree;
        for (let i = 0; i < key.length-1; i++)
        {
            if (!orig[key[i]])
            {
                console.log('Bad key in etcd: '+kv.key+' = '+kv.value);
                return;
            }
            orig = orig[key[i]];
            cur = (cur[key[i]] = cur[key[i]] || {});
        }
        if (orig[key.length-1])
        {
            console.log('Bad key in etcd: '+kv.key+' = '+kv.value);
            return;
        }
        cur[key[key.length-1]] = kv.value;
        if (key.join('/') === 'config/global')
        {
            this.state.config.global = this.state.config.global || {};
            this.config = this.state.config.global;
            this.check_config();
        }
    }
    async etcd_call(path, body, timeout, retries)
    {
        let retry = 0;
        if (retries >= 0 && retries < 1)
        {
            retries = 1;
        }
        while (retries < 0 || retry < retries)
        {
            const base = this.etcd_urls[Math.floor(Math.random()*this.etcd_urls.length)];
            const res = await POST(base+path, body, timeout);
            if (res.json)
            {
                if (res.json.error)
                {
                    console.log('etcd returned error: '+res.json.error);
                    break;
                }
                return res.json;
            }
            retry++;
        }
        this.die();
    }
    die(err)
    {
        // In fact we can just try to rejoin
        console.error(err || 'Cluster connection failed');
        process.exit(1);
    }
    local_ips()
    {
        const ips = [];
        const ifaces = os.networkInterfaces();
        for (const ifname in ifaces)
        {
            for (const iface of ifaces[ifname])
            {
                if (iface.family == 'IPv4' && !iface.internal)
                {
                    ips.push(iface.address);
                }
            }
        }
        return ips;
    }
 }
 function POST(url, body, timeout)
 {
    return new Promise((ok, no) =>
    {
        const body_text = Buffer.from(JSON.stringify(body));
        let timer_id = timeout > 0 ? setTimeout(() =>
        {
            if (req)
                req.abort();
            req = null;
            ok({ error: 'timeout' });
        }, timeout) : null;
        let req = http.request(url, { method: 'POST', headers: {
            'Content-Type': 'application/json',
            'Content-Length': body_text.length,
        } }, (res) =>
        {
            if (!req)
            {
                return;
            }
            clearTimeout(timer_id);
            if (res.statusCode != 200)
            {
                ok({ error: res.statusCode, response: res });
                return;
            }
            let res_body = '';
            res.setEncoding('utf8');
            res.on('data', chunk => { res_body += chunk });
            res.on('end', () =>
            {
                try
                {
                    res_body = JSON.parse(res_body);
                    ok({ response: res, json: res_body });
                }
                catch (e)
                {
                    ok({ error: e, response: res, body: res_body });
                }
            });
        });
        req.write(body_text);
        req.end();
    });
 }
 function b64(str)
 {
    return Buffer.from(str).toString('base64');
 }
 function de64(str)
 {
    return Buffer.from(str, 'base64').toString();
 }
 function sha1hex(str)
 {
    const hash = crypto.createHash('sha1');
    hash.update(str);
    return hash.digest('hex');
 }
--- a/mon/PGUtil.js
+++ b/mon/PGUtil.js
@ -0,0 +1,104 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 module.exports = {
    scale_pg_count,
 };
 function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
 {
    if (!new_pg_history[new_pg])
    {
        new_pg_history[new_pg] = {
            osd_sets: {},
            all_peers: {},
            epoch: 0,
        };
    }
    const nh = new_pg_history[new_pg], oh = prev_pg_history[old_pg];
    nh.osd_sets[prev_pgs[old_pg].join(' ')] = prev_pgs[old_pg];
    if (oh && oh.osd_sets && oh.osd_sets.length)
    {
        for (const pg of oh.osd_sets)
        {
            nh.osd_sets[pg.join(' ')] = pg;
        }
    }
    if (oh && oh.all_peers && oh.all_peers.length)
    {
        for (const osd_num of oh.all_peers)
        {
            nh.all_peers[osd_num] = Number(osd_num);
        }
    }
    if (oh && oh.epoch)
    {
        nh.epoch = nh.epoch < oh.epoch ? oh.epoch : nh.epoch;
    }
 }
 function finish_pg_history(merged_history)
 {
    merged_history.osd_sets = Object.values(merged_history.osd_sets);
    merged_history.all_peers = Object.values(merged_history.all_peers);
 }
 function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
 {
    const old_pg_count = prev_pgs.length;
    // Add all possibly intersecting PGs to the history of new PGs
    if (!(new_pg_count % old_pg_count))
    {
        // New PG count is a multiple of old PG count
        for (let i = 0; i < new_pg_count; i++)
        {
            add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
            finish_pg_history(new_pg_history[i]);
        }
    }
    else if (!(old_pg_count % new_pg_count))
    {
        // Old PG count is a multiple of the new PG count
        const mul = (old_pg_count / new_pg_count);
        for (let i = 0; i < new_pg_count; i++)
        {
            for (let j = 0; j < mul; j++)
            {
                add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
            }
            finish_pg_history(new_pg_history[i]);
        }
    }
    else
    {
        // Any PG may intersect with any PG after non-multiple PG count change
        // So, merge ALL PGs history
        let merged_history = {};
        for (let i = 0; i < old_pg_count; i++)
        {
            add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
        }
        finish_pg_history(merged_history[1]);
        for (let i = 0; i < new_pg_count; i++)
        {
            new_pg_history[i] = { ...merged_history[1] };
        }
    }
    // Mark history keys for removed PGs as removed
    for (let i = new_pg_count; i < old_pg_count; i++)
    {
        new_pg_history[i] = null;
    }
    // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
    if (old_pg_count < new_pg_count)
    {
        for (let i = old_pg_count; i < new_pg_count; i++)
        {
            prev_pgs[i] = prev_pgs[i % old_pg_count];
        }
    }
    else if (old_pg_count > new_pg_count)
    {
        prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
    }
 }
--- a/mon/afr.js
+++ b/mon/afr.js
@ -0,0 +1,89 @@
 // Functions to calculate Annualized Failure Rate of your cluster
 // if you know AFR of your drives, number of drives, expected rebalance time
 // and replication factor
 // License: VNPL-1.1 (see https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md for details) or AGPL-3.0
 // Author: Vitaliy Filippov, 2020+
 module.exports = {
    cluster_afr_fullmesh,
    failure_rate_fullmesh,
    cluster_afr,
    c_n_k,
 };
 /******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/
 // Estimate AFR of the cluster
 // n - number of drives
 // afr - annualized failure rate of a single drive
 // l - expected rebalance time in days after a single drive failure
 // k - replication factor / number of drives that must fail at the same time for the cluster to fail
 function cluster_afr_fullmesh(n, afr, l, k)
 {
    return 1 - (1 - afr * failure_rate_fullmesh(n-(k-1), afr*l/365, k-1)) ** (n-(k-1));
 }
 // Probability of at least <f> failures in a cluster with <n> drives with AFR=<a>
 function failure_rate_fullmesh(n, a, f)
 {
    if (f <= 0)
    {
        return (1-a)**n;
    }
    let p = 1;
    for (let i = 0; i < f; i++)
    {
        p -= c_n_k(n, i) * (1-a)**(n-i) * a**i;
    }
    return p;
 }
 /******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/
 // <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s,
 // <k> replicas, <pgs> unique peer PGs per OSD (~50 for 100 PG-per-OSD in a big cluster)
 //
 // For each of n*m drives: P(drive fails in a year) * P(any of its peers fail in <l*365> next days).
 // More peers per OSD increase rebalance speed (more drives work together to resilver) if you
 // let them finish rebalance BEFORE replacing the failed drive (degraded_replacement=false).
 // At the same time, more peers per OSD increase probability of any of them to fail!
 // osd_rm=true means that failed OSDs' data is rebalanced over all other hosts,
 // not over the same host as it's in Ceph by default (dead OSDs are marked 'out').
 //
 // The probability that all but one of the drives in a replica group fail is (AFR^(k-1)).
 // So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)). An interesting but reasonable consequence
 // is that, with k=2, the total failure rate doesn't depend on the number of peers per OSD,
 // because it is increased linearly by the larger number of peers that may fail
 // and decreased linearly by the shorter rebalance time.
 function cluster_afr({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec, ec_data, ec_parity, replicas, pgs = 1, osd_rm, degraded_replacement, down_out_interval = 600 })
 {
    const pg_size = (ec ? ec_data+ec_parity : replicas);
    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(pg_size-1));
    const host_pgs = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(pg_size-1));
    const resilver_disk = n_drives == 1 || osd_rm ? pgs : (n_drives-1);
    const disk_heal_time = (down_out_interval + capacity/(degraded_replacement ? 1 : resilver_disk)/speed)/86400/365;
    const host_heal_time = (down_out_interval + n_drives*capacity/pgs/speed)/86400/365;
    const disk_heal_fail = ((afr_drive+afr_host/n_drives)*disk_heal_time);
    const host_heal_fail = ((afr_drive+afr_host/n_drives)*host_heal_time);
    const disk_pg_fail = ec
        ? failure_rate_fullmesh(ec_data+ec_parity-1, disk_heal_fail, ec_parity)
        : disk_heal_fail**(replicas-1);
    const host_pg_fail = ec
        ? failure_rate_fullmesh(ec_data+ec_parity-1, host_heal_fail, ec_parity)
        : host_heal_fail**(replicas-1);
    return 1 - ((1 - afr_drive * (1-(1-disk_pg_fail)**pgs)) ** (n_hosts*n_drives))
        * ((1 - afr_host * (1-(1-host_pg_fail)**host_pgs)) ** n_hosts);
 }
 /******** UTILITY ********/
 // Combination count
 function c_n_k(n, k)
 {
    let r = 1;
    for (let i = 0; i < k; i++)
    {
        r *= (n-i) / (i+1);
    }
    return r;
 }
--- a/mon/afr_test.js
+++ b/mon/afr_test.js
@ -0,0 +1,28 @@
 const { sprintf } = require('sprintf-js');
 const { cluster_afr } = require('./afr.js');
 print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
 print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, replicas: 2 });
 print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
 print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
 print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
 print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 2 });
 print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
 print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3 });
 print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
 print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
 print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
 print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
 function print_cluster_afr(config)
 {
    console.log(
        `${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
        `, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
        (config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
        (config.ec ? `, EC ${config.ec_data}+${config.ec_parity}` : `, ${config.replicas} replicas`)+
        `, ${config.pgs||1} PG per OSD`+
        (config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
    );
    console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
 }
--- a/mon/lp-optimizer.js
+++ b/mon/lp-optimizer.js
@ -1,3 +1,6 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 // Data distribution optimizer using linear programming (lp_solve)
 const child_process = require('child_process');
@ -25,7 +28,7 @@ async function lp_solve(text)
    let vars = {};
    for (const line of stdout.split(/\n/))
    {
-        let m = /^(^Value of objective function: ([\d\.]+)|Actual values of the variables:)\s*$/.exec(line);
+        let m = /^(^Value of objective function: (-?[\d\.]+)|Actual values of the variables:)\s*$/.exec(line);
        if (m)
        {
            if (m[2])
@ -47,34 +50,34 @@ async function lp_solve(text)
    return { score, vars };
 }
-async function optimize_initial(osd_tree, pg_count, max_combinations)
+async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
 {
-    max_combinations = max_combinations || 10000;
+    if (!pg_count || !osd_tree)
    {
        return null;
    }
    const all_weights = Object.assign({}, ...Object.values(osd_tree));
    const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
-    let all_pgs = all_combinations(osd_tree, null, true);
+    const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1));
    if (all_pgs.length > max_combinations)
    {
        const prob = max_combinations/all_pgs.length;
        all_pgs = all_pgs.filter(pg => Math.random() < prob);
    }
    const pg_per_osd = {};
    for (const pg of all_pgs)
    {
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
            const osd = pg[i];
            pg_per_osd[osd] = pg_per_osd[osd] || [];
-            pg_per_osd[osd].push("pg_"+pg.join("_"));
+            pg_per_osd[osd].push((i >= pg_minsize ? parity_space+'*' : '')+"pg_"+pg.join("_"));
        }
    }
-    const pg_size = Math.min(Object.keys(osd_tree).length, 3);
+    const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
        + Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
    let lp = '';
    lp += "max: "+all_pgs.map(pg => 'pg_'+pg.join('_')).join(' + ')+";\n";
    for (const osd in pg_per_osd)
    {
        if (osd !== NO_OSD)
        {
-            let osd_pg_count = all_weights[osd]/total_weight*pg_size*pg_count;
+            let osd_pg_count = all_weights[osd]/total_weight*pg_effsize*pg_count;
            lp += pg_per_osd[osd].join(' + ')+' <= '+osd_pg_count+';\n';
        }
    }
@ -86,11 +89,30 @@ async function optimize_initial(osd_tree, pg_count, max_combinations)
    const lp_result = await lp_solve(lp);
    if (!lp_result)
    {
        console.log(lp);
        throw new Error('Problem is infeasible or unbounded - is it a bug?');
    }
    const int_pgs = make_int_pgs(lp_result.vars, pg_count);
-    const eff = pg_list_space_efficiency(int_pgs, all_weights);
+    const eff = pg_list_space_efficiency(int_pgs, all_weights, pg_minsize, parity_space);
-    return { score: lp_result.score, weights: lp_result.vars, int_pgs, space: eff*pg_size, total_space: total_weight };
+    const res = {
        score: lp_result.score,
        weights: lp_result.vars,
        int_pgs,
        space: eff * pg_effsize,
        total_space: total_weight,
    };
    return res;
 }
 function shuffle(array)
 {
    for (let i = array.length - 1, j, x; i > 0; i--)
    {
        j = Math.floor(Math.random() * (i + 1));
        x = array[i];
        array[i] = array[j];
        array[j] = x;
    }
 }
 function make_int_pgs(weights, pg_count)
@ -109,14 +131,122 @@ function make_int_pgs(weights, pg_count)
        weight_left -= weights[pg_name];
        pg_left -= n;
    }
    shuffle(int_pgs);
    return int_pgs;
 }
-// Try to minimize data movement
+function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
 async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
 {
-    max_combinations = max_combinations || 10000;
+    const move_weights = {};
-    const pg_size = Math.min(Object.keys(osd_tree).length, 3);
+    if ((1 << pg_size) < pg_count)
    {
        const intersect = {};
        for (const pg_name in prev_weights)
        {
            const pg = pg_name.substr(3).split(/_/);
            for (let omit = 1; omit < (1 << pg_size); omit++)
            {
                let pg_omit = [ ...pg ];
                let intersect_count = pg_size;
                for (let i = 0; i < pg_size; i++)
                {
                    if (omit & (1 << i))
                    {
                        pg_omit[i] = '';
                        intersect_count--;
                    }
                }
                pg_omit = pg_omit.join(':');
                intersect[pg_omit] = Math.max(intersect[pg_omit] || 0, intersect_count);
            }
        }
        for (const pg of all_pgs)
        {
            let max_int = 0;
            for (let omit = 1; omit < (1 << pg_size); omit++)
            {
                let pg_omit = [ ...pg ];
                for (let i = 0; i < pg_size; i++)
                {
                    if (omit & (1 << i))
                    {
                        pg_omit[i] = '';
                    }
                }
                pg_omit = pg_omit.join(':');
                max_int = Math.max(max_int, intersect[pg_omit] || 0);
            }
            move_weights['pg_'+pg.join('_')] = pg_size-max_int;
        }
    }
    else
    {
        const prev_pg_hashed = Object.keys(prev_weights).map(pg_name => pg_name.substr(3).split(/_/).reduce((a, c) => { a[c] = 1; return a; }, {}));
        for (const pg of all_pgs)
        {
            if (!prev_weights['pg_'+pg.join('_')])
            {
                let max_int = 0;
                for (const prev_hash of prev_pg_hashed)
                {
                    const intersect_count = pg.reduce((a, osd) => a + (prev_hash[osd] ? 1 : 0), 0);
                    if (max_int < intersect_count)
                    {
                        max_int = intersect_count;
                        if (max_int >= pg_size)
                        {
                            break;
                        }
                    }
                }
                move_weights['pg_'+pg.join('_')] = pg_size-max_int;
            }
        }
    }
    return move_weights;
 }
 function add_valid_previous(osd_tree, prev_weights, all_pgs)
 {
    // Add previous combinations that are still valid
    const hosts = Object.keys(osd_tree).sort();
    const host_per_osd = {};
    for (const host in osd_tree)
    {
        for (const osd in osd_tree[host])
        {
            host_per_osd[osd] = host;
        }
    }
    skip_pg: for (const pg_name in prev_weights)
    {
        const seen_hosts = {};
        const pg = pg_name.substr(3).split(/_/);
        for (const osd of pg)
        {
            if (!host_per_osd[osd] || seen_hosts[host_per_osd[osd]])
            {
                continue skip_pg;
            }
            seen_hosts[host_per_osd[osd]] = true;
        }
        if (!all_pgs[pg_name])
        {
            all_pgs[pg_name] = pg;
        }
    }
 }
 // Try to minimize data movement
 async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
 {
    if (!osd_tree)
    {
        return null;
    }
    // FIXME: use parity_chunks with parity_space instead of pg_minsize
    const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
        + Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
    const pg_count = prev_int_pgs.length;
    const prev_weights = {};
    const prev_pg_per_osd = {};
@ -124,70 +254,55 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
    {
        const pg_name = 'pg_'+pg.join('_');
        prev_weights[pg_name] = (prev_weights[pg_name]||0) + 1;
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
            const osd = pg[i];
            prev_pg_per_osd[osd] = prev_pg_per_osd[osd] || [];
-            prev_pg_per_osd[osd].push(pg_name);
+            prev_pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
        }
    }
    // Get all combinations
-    let all_pgs = all_combinations(osd_tree, null, true);
+    let all_pgs = random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1);
-    if (all_pgs.length > max_combinations)
+    add_valid_previous(osd_tree, prev_weights, all_pgs);
-    {
+    all_pgs = Object.values(all_pgs);
        const intersecting = all_pgs.filter(pg => prev_weights['pg_'+pg.join('_')]);
        if (intersecting.length > max_combinations)
        {
            const prob = max_combinations/intersecting.length;
            all_pgs = intersecting.filter(pg => Math.random() < prob);
        }
        else
        {
            const prob = (max_combinations-intersecting.length)/all_pgs.length;
            all_pgs = all_pgs.filter(pg => Math.random() < prob || prev_weights['pg_'+pg.join('_')]);
        }
    }
    const pg_per_osd = {};
    for (const pg of all_pgs)
    {
        const pg_name = 'pg_'+pg.join('_');
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
            const osd = pg[i];
            pg_per_osd[osd] = pg_per_osd[osd] || [];
-            pg_per_osd[osd].push(pg_name);
+            pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
        }
    }
    // Penalize PGs based on their similarity to old PGs
-    const intersect = {};
+    const move_weights = calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs);
    for (const pg_name in prev_weights)
    {
        const pg = pg_name.substr(3).split(/_/);
        intersect[pg[0]+'::'] = intersect[':'+pg[1]+':'] = intersect['::'+pg[2]] = 2;
        intersect[pg[0]+'::'+pg[2]] = intersect[':'+pg[1]+':'+pg[2]] = intersect[pg[0]+':'+pg[1]+':'] = 1;
    }
    const move_weights = {};
    for (const pg of all_pgs)
    {
        move_weights['pg_'+pg.join('_')] =
            intersect[pg[0]+'::'+pg[2]] || intersect[':'+pg[1]+':'+pg[2]] || intersect[pg[0]+':'+pg[1]+':'] ||
            intersect[pg[0]+'::'] || intersect[':'+pg[1]+':'] || intersect['::'+pg[2]] ||
            3;
    }
    // Calculate total weight - old PG weights
    const all_pg_names = all_pgs.map(pg => 'pg_'+pg.join('_'));
    const all_pgs_hash = all_pg_names.reduce((a, c) => { a[c] = true; return a; }, {});
    const all_weights = Object.assign({}, ...Object.values(osd_tree));
    const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
    // Generate the LP problem
    let lp = '';
    lp += 'max: '+all_pg_names.map(pg_name => (
-        prev_weights[pg_name] ? `${4-move_weights[pg_name]}*add_${pg_name} - 4*del_${pg_name}` : `${4-move_weights[pg_name]}*${pg_name}`
+        prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}`
    )).join(' + ')+';\n';
    lp += all_pg_names
        .map(pg_name => (prev_weights[pg_name] ? `add_${pg_name} - del_${pg_name}` : `${pg_name}`))
        .join(' + ')+' = '+(pg_count
            - Object.keys(prev_weights).reduce((a, old_pg_name) => (a + (all_pgs_hash[old_pg_name] ? prev_weights[old_pg_name] : 0)), 0)
        )+';\n';
    for (const osd in pg_per_osd)
    {
        if (osd !== NO_OSD)
        {
-            const osd_sum = (pg_per_osd[osd]||[]).map(pg_name => prev_weights[pg_name] ? `add_${pg_name} - del_${pg_name}` : pg_name).join(' + ');
+            const osd_sum = (pg_per_osd[osd]||[]).map(([ pg_name, space ]) => (
-            const rm_osd_pg_count = (prev_pg_per_osd[osd]||[]).filter(old_pg_name => move_weights[old_pg_name]).length;
+                prev_weights[pg_name] ? `${space} * add_${pg_name} - ${space} * del_${pg_name}` : `${space} * ${pg_name}`
-            let osd_pg_count = all_weights[osd]*3/total_weight*pg_count - rm_osd_pg_count;
+            )).join(' + ');
            const rm_osd_pg_count = (prev_pg_per_osd[osd]||[])
                .reduce((a, [ old_pg_name, space ]) => (a + (all_pgs_hash[old_pg_name] ? space : 0)), 0);
            const osd_pg_count = all_weights[osd]*pg_effsize/total_weight*pg_count - rm_osd_pg_count;
            lp += osd_sum + ' <= ' + osd_pg_count + ';\n';
        }
    }
@ -221,7 +336,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
    const weights = { ...prev_weights };
    for (const k in prev_weights)
    {
-        if (!move_weights[k])
+        if (!all_pgs_hash[k])
        {
            delete weights[k];
        }
@ -236,7 +351,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
        {
            weights[k.substr(4)] = (weights[k.substr(4)] || 0) - Number(lp_result.vars[k]);
        }
-        else
+        else if (k.substr(0, 3) === 'pg_')
        {
            weights[k] = Number(lp_result.vars[k]);
        }
@ -258,7 +373,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
        {
            differs++;
        }
-        for (let j = 0; j < 3; j++)
+        for (let j = 0; j < pg_size; j++)
        {
            if (new_pgs[i][j] != prev_int_pgs[i][j])
            {
@ -273,7 +388,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
        int_pgs: new_pgs,
        differs,
        osd_differs,
-        space: pg_size * pg_list_space_efficiency(new_pgs, all_weights),
+        space: pg_effsize * pg_list_space_efficiency(new_pgs, all_weights, pg_minsize, parity_space),
        total_space: total_weight,
    };
 }
@ -391,64 +506,155 @@ function extract_osds(osd_tree, levels, osd_level, osds = {})
    return osds;
 }
-// FIXME: support different pg_sizes, not just 3
+// ordered = don't treat (x,y) and (y,x) as equal
-// osd_tree = { failure_domain1: { osd1: size1, ... }, ... }
+function random_combinations(osd_tree, pg_size, count, ordered)
 function all_combinations(osd_tree, count, ordered)
 {
    let seed = 0x5f020e43;
    let rng = () =>
    {
        seed ^= seed << 13;
        seed ^= seed >> 17;
        seed ^= seed << 5;
        return seed + 2147483648;
    };
    const hosts = Object.keys(osd_tree).sort();
    const osds = Object.keys(osd_tree).reduce((a, c) => { a[c] = Object.keys(osd_tree[c]).sort(); return a; }, {});
-    while (hosts.length < 3)
+    const r = {};
    // Generate random combinations including each OSD at least once
    for (let h = 0; h < hosts.length; h++)
    {
-        osds[NO_OSD] = [ NO_OSD ];
+        for (let o = 0; o < osds[hosts[h]].length; o++)
        hosts.push(NO_OSD);
    }
    let host_idx = [ 0, 1, 2 ];
    let osd_idx = [ 0, 0, 0 ];
    const r = [];
    while (!count || count < 0 || r.length < count)
    {
        let inc;
        if (host_idx[2] != host_idx[1] && host_idx[2] != host_idx[0] && host_idx[1] != host_idx[0])
        {
-            r.push(host_idx.map((hi, i) => osds[hosts[hi]][osd_idx[i]]));
+            const pg = [ osds[hosts[h]][o] ];
-            inc = 2;
+            const cur_hosts = [ ...hosts ];
-            while (inc >= 0)
+            cur_hosts.splice(h, 1);
            for (let i = 1; i < pg_size && i < hosts.length; i++)
            {
-                osd_idx[inc]++;
+                const next_host = rng() % cur_hosts.length;
-                if (osd_idx[inc] >= osds[hosts[host_idx[inc]]].length)
+                const next_osd = rng() % osds[cur_hosts[next_host]].length;
                pg.push(osds[cur_hosts[next_host]][next_osd]);
                cur_hosts.splice(next_host, 1);
            }
            const cyclic_pgs = [ pg ];
            if (ordered)
            {
                for (let i = 1; i < pg.length; i++)
                {
-                    osd_idx[inc] = 0;
+                    cyclic_pgs.push([ ...pg.slice(i), ...pg.slice(0, i) ]);
                    inc--;
                }
-                else
+            }
            for (const pg of cyclic_pgs)
            {
                while (pg.length < pg_size)
                {
-                    break;
+                    pg.push(NO_OSD);
                }
                r['pg_'+pg.join('_')] = pg;
            }
        }
    }
    // Generate purely random combinations
    while (count > 0)
    {
        let host_idx = [];
        const cur_hosts = [ ...hosts.map((h, i) => i) ];
        const max_hosts = pg_size < hosts.length ? pg_size : hosts.length;
        if (ordered)
        {
            for (let i = 0; i < max_hosts; i++)
            {
                const r = rng() % cur_hosts.length;
                host_idx[i] = cur_hosts[r];
                cur_hosts.splice(r, 1);
            }
        }
        else
        {
-            inc = -1;
+            for (let i = 0; i < max_hosts; i++)
            {
                const r = rng() % (cur_hosts.length - (max_hosts - i - 1));
                host_idx[i] = cur_hosts[r];
                cur_hosts.splice(0, r+1);
            }
        }
        let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]);
        while (pg.length < pg_size)
        {
            pg.push(NO_OSD);
        }
        r['pg_'+pg.join('_')] = pg;
        count--;
    }
    return r;
 }
 // Super-stupid algorithm. Given the current OSD tree, generate all possible OSD combinations
 // osd_tree = { failure_domain1: { osd1: size1, ... }, ... }
 // ordered = return combinations without duplicates having different order
 function all_combinations(osd_tree, pg_size, ordered, count)
 {
    const hosts = Object.keys(osd_tree).sort();
    const osds = Object.keys(osd_tree).reduce((a, c) => { a[c] = Object.keys(osd_tree[c]).sort(); return a; }, {});
    while (hosts.length < pg_size)
    {
        osds[NO_OSD] = [ NO_OSD ];
        hosts.push(NO_OSD);
    }
    let host_idx = [];
    let osd_idx = [];
    for (let i = 0; i < pg_size; i++)
    {
        host_idx.push(i);
        osd_idx.push(0);
    }
    const r = [];
    while (!count || count < 0 || r.length < count)
    {
        r.push(host_idx.map((hi, i) => osds[hosts[hi]][osd_idx[i]]));
        let inc = pg_size-1;
        while (inc >= 0)
        {
            osd_idx[inc]++;
            if (osd_idx[inc] >= osds[hosts[host_idx[inc]]].length)
            {
                osd_idx[inc] = 0;
                inc--;
            }
            else
            {
                break;
            }
        }
        if (inc < 0)
        {
-            // no osds left in current host combination, select the next one
+            // no osds left in the current host combination, select the next one
-            osd_idx = [ 0, 0, 0 ];
+            inc = pg_size-1;
-            host_idx[2]++;
+            same_again: while (inc >= 0)
            if (host_idx[2] >= hosts.length)
            {
-                host_idx[1]++;
+                host_idx[inc]++;
-                host_idx[2] = ordered ? host_idx[1]+1 : 0;
+                for (let prev_host = 0; prev_host < inc; prev_host++)
                if ((ordered ? host_idx[2] : host_idx[1]) >= hosts.length)
                {
-                    host_idx[0]++;
+                    if (host_idx[prev_host] == host_idx[inc])
                    host_idx[1] = ordered ? host_idx[0]+1 : 0;
                    host_idx[2] = ordered ? host_idx[1]+1 : 0;
                    if ((ordered ? host_idx[2] : host_idx[0]) >= hosts.length)
                    {
-                        break;
+                        continue same_again;
                    }
                }
                if (host_idx[inc] < (ordered ? hosts.length-(pg_size-1-inc) : hosts.length))
                {
                    while ((++inc) < pg_size)
                    {
                        host_idx[inc] = (ordered ? host_idx[inc-1]+1 : 0);
                    }
                    break;
                }
                else
                {
                    inc--;
                }
            }
            if (inc < 0)
            {
                break;
            }
        }
    }
@ -468,14 +674,15 @@ function pg_weights_space_efficiency(weights, pg_count, osd_sizes)
    return pg_per_osd_space_efficiency(per_osd, pg_count, osd_sizes);
 }
-function pg_list_space_efficiency(pgs, osd_sizes)
+function pg_list_space_efficiency(pgs, osd_sizes, pg_minsize, parity_space)
 {
    const per_osd = {};
    for (const pg of pgs)
    {
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
-            per_osd[osd] = (per_osd[osd]||0) + 1;
+            const osd = pg[i];
            per_osd[osd] = (per_osd[osd]||0) + (i >= pg_minsize ? (parity_space||1) : 1);
        }
    }
    return pg_per_osd_space_efficiency(per_osd, pgs.length, osd_sizes);
@ -517,5 +724,6 @@ module.exports = {
    lp_solve,
    make_int_pgs,
    align_pgs,
    random_combinations,
    all_combinations,
 };
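Note on the `pg_effsize` expression used in optimize_initial() and optimize_change() above: the first `pg_minsize` chunks of a PG count as 1 unit of space each, and every remaining chunk counts as `parity_space` units, with both counts capped by the number of failure domains. The following is only a minimal standalone sketch of that arithmetic; `pgEffSize` and its camelCase parameter names are hypothetical and are not part of lp-optimizer.js.

```javascript
// Hypothetical helper mirroring the pg_effsize expression from the diff above.
// hostCount stands in for Object.keys(osd_tree).length.
function pgEffSize(hostCount, pgSize = 3, pgMinsize = 2, paritySpace = 1)
{
    // Chunks weighted as 1 unit each, limited by the number of failure domains
    const dataChunks = Math.min(pgMinsize, hostCount);
    // Remaining chunks, each weighted by paritySpace
    const parityChunks = Math.max(0, Math.min(pgSize, hostCount) - pgMinsize);
    return dataChunks + parityChunks * paritySpace;
}

console.log(pgEffSize(4));             // 3
console.log(pgEffSize(4, 3, 2, 2));    // 4
console.log(pgEffSize(2, 3, 2, 2));    // 2
```

With `parity_space = 1` the result is the plain PG size capped by the host count, which matches the old `Math.min(Object.keys(osd_tree).length, 3)` for the default `pg_size = 3`.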
--- a/mon/make-osd.sh
+++ b/mon/make-osd.sh
@ -0,0 +1,75 @@
 #!/bin/bash
 # Very simple systemd unit generator for vitastor-osd services
 # Not the final solution yet, mostly for tests
 # Copyright (c) Vitaliy Filippov, 2019+
 # License: MIT
 # USAGE: ./make-osd.sh /dev/disk/by-partuuid/xxx [ /dev/disk/by-partuuid/yyy]...
 IP_SUBSTR="10.200.1."
 ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
 set -e -x
 IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
 [ "$IP" != "" ] || exit 1
 ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
 D=`dirname $0`
 # Create OSDs on all passed devices
 OSD_NUM=1
 for DEV in $*; do
 # Ugly :) -> node.js rework pending
 while true; do
    ST=$(etcdctl --endpoints="$ETCD_MON" get --print-value-only /vitastor/osd/stats/$OSD_NUM)
    if [ "$ST" = "" ]; then
        break
    fi
    OSD_NUM=$((OSD_NUM+1))
 done
 etcdctl --endpoints="$ETCD_MON" put /vitastor/osd/stats/$OSD_NUM '{}'
 echo Creating OSD $OSD_NUM on $DEV
 OPT=`node $D/simple-offsets.js --device $DEV --format options | tr '\n' ' '`
 META=`echo $OPT | grep -Po '(?<=data_offset )\d+'`
 dd if=/dev/zero of=$DEV bs=1048576 count=$(((META+1048575)/1048576)) oflag=direct
 cat >/etc/systemd/system/vitastor-osd$OSD_NUM.service <<EOF
 [Unit]
 Description=Vitastor object storage daemon osd.$OSD_NUM
 After=network-online.target local-fs.target time-sync.target
 Wants=network-online.target local-fs.target time-sync.target
 PartOf=vitastor.target
 [Service]
 LimitNOFILE=1048576
 LimitNPROC=1048576
 LimitMEMLOCK=infinity
 ExecStart=/usr/bin/vitastor-osd \\
    --etcd_address $IP:2379/v3 \\
    --bind_address $IP \\
    --osd_num $OSD_NUM \\
    --disable_data_fsync 1 \\
    --immediate_commit all \\
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
    --journal_no_same_sector_overwrites true \\
    --journal_sector_buffer_count 1024 \\
    $OPT
 WorkingDirectory=/
 ExecStartPre=+chown vitastor:vitastor $DEV
 User=vitastor
 PrivateTmp=false
 TasksMax=infinity
 Restart=always
 StartLimitInterval=0
 RestartSec=10
 [Install]
 WantedBy=vitastor.target
 EOF
 systemctl enable vitastor-osd$OSD_NUM
 done
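Note on `ETCD_MON` in make-osd.sh: the perl one-liner rewrites the etcd peer list (port 2380, `etcdN=` name prefixes) into a plain list of client endpoints on port 2379. A rough JavaScript equivalent, given only as an illustration of the transformation; `etcdClientEndpoints` is a hypothetical name and the script itself does not use it.

```javascript
// Hypothetical JS equivalent of: perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;'
function etcdClientEndpoints(etcdHosts)
{
    return etcdHosts
        .replace(/:2380/g, ':2379')   // peer port -> client port
        .replace(/etcd\d*=/g, '');    // drop the "etcdN=" member names
}

console.log(etcdClientEndpoints('etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380'));
// http://10.200.1.10:2379,http://10.200.1.11:2379
```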
--- a/mon/make-units.sh
+++ b/mon/make-units.sh
@ -0,0 +1,86 @@
 #!/bin/bash
 # Very simple systemd unit generator for etcd & vitastor-mon services
 # Not the final solution yet, mostly for tests
 # Copyright (c) Vitaliy Filippov, 2019+
 # License: MIT
 # USAGE: ./make-units.sh
 IP_SUBSTR="10.200.1."
 ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
 # determine IP
 IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
 [ "$IP" != "" ] || exit 1
 ETCD_NUM=${ETCD_HOSTS/$IP*/}
 [ "$ETCD_NUM" != "$ETCD_HOSTS" ] || exit 1
 ETCD_NUM=$(echo $ETCD_NUM | tr -d -c , | wc -c)
 # etcd
 useradd etcd
 mkdir -p /var/lib/etcd$ETCD_NUM.etcd
 cat >/etc/systemd/system/etcd.service <<EOF
 [Unit]
 Description=etcd for vitastor
 After=network-online.target local-fs.target time-sync.target
 Wants=network-online.target local-fs.target time-sync.target
 [Service]
 Restart=always
 ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_NUM.etcd \\
    --advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
    --initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
    --initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
    --initial-cluster-state new --max-txn-ops=100000 --max-request-bytes=104857600 \\
    --auto-compaction-retention=10 --auto-compaction-mode=revision
 WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
 ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
 User=etcd
 PrivateTmp=false
 TasksMax=infinity
 Restart=always
 StartLimitInterval=0
 RestartSec=10
 [Install]
 WantedBy=local.target
 EOF
 systemctl daemon-reload
 systemctl enable etcd
 systemctl start etcd
 useradd vitastor
 chmod 755 /root
 # Vitastor target
 cat >/etc/systemd/system/vitastor.target <<EOF
 [Unit]
 Description=vitastor target
 [Install]
 WantedBy=multi-user.target
 EOF
 # Monitor unit
 ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
 cat >/etc/systemd/system/vitastor-mon.service <<EOF
 [Unit]
 Description=Vitastor monitor
 After=network-online.target local-fs.target time-sync.target
 Wants=network-online.target local-fs.target time-sync.target
 [Service]
 Restart=always
 ExecStart=node /usr/lib/vitastor/mon/mon-main.js --etcd_url '$ETCD_MON' --etcd_prefix '/vitastor' --etcd_start_timeout 5
 WorkingDirectory=/
 User=vitastor
 PrivateTmp=false
 TasksMax=infinity
 Restart=always
 StartLimitInterval=0
 RestartSec=10
 [Install]
 WantedBy=vitastor.target
 EOF
--- a/mon/merge.js
+++ b/mon/merge.js
@ -0,0 +1,23 @@
 const fsp = require('fs').promises;
 async function merge(file1, file2, out)
 {
    if (!out)
    {
        console.error('USAGE: nodejs merge.js layer1 layer2 output');
        process.exit(1);
    }
    const layer1 = await fsp.readFile(file1);
    const layer2 = await fsp.readFile(file2);
    const zero = Buffer.alloc(4096);
    for (let i = 0; i < layer2.length; i += 4096)
    {
        if (zero.compare(layer2, i, i+4096) != 0)
        {
            layer2.copy(layer1, i, i, i+4096);
        }
    }
    await fsp.writeFile(out, layer1);
 }
 merge(process.argv[2], process.argv[3], process.argv[4]);
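Note on merge.js: the loop overlays every 4096-byte block of layer2 that is not all zeroes onto layer1. An in-memory sketch of the same idea follows, assuming both layers fit in Buffers; `mergeLayers` is a hypothetical name, and unlike the script it copies layer1 first and clamps the last block to the end of layer2.

```javascript
// Sketch: overlay non-zero blocks of layer2 onto a copy of layer1.
function mergeLayers(layer1, layer2, blockSize = 4096)
{
    const out = Buffer.from(layer1); // copy, so the input buffer is left untouched
    const zero = Buffer.alloc(blockSize);
    for (let i = 0; i < layer2.length && i < out.length; i += blockSize)
    {
        const end = Math.min(i + blockSize, layer2.length);
        // compare(target, targetStart, targetEnd, sourceStart, sourceEnd)
        if (zero.compare(layer2, i, end, 0, end - i) !== 0)
        {
            // copy(target, targetStart, sourceStart, sourceEnd)
            layer2.copy(out, i, i, end);
        }
    }
    return out;
}

const base = Buffer.alloc(8192, 1);
const overlay = Buffer.alloc(8192);
overlay.fill(2, 4096, 8192); // only the second block is non-zero
const merged = mergeLayers(base, overlay);
console.log(merged[0], merged[4096]); // 1 2
```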
--- a/mon/mon-main.js
+++ b/mon/mon-main.js
@ -1,5 +1,8 @@
 #!/usr/bin/node
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 const Mon = require('./mon.js');
 const options = {};
@ -15,8 +18,8 @@ for (let i = 2; i < process.argv.length; i++)
 if (!options.etcd_url)
 {
-    console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' --etcd_url "http://127.0.0.1:2379,..." --etcd_prefix "/rage" --etcd_start_timeout 5');
+    console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' --etcd_url "http://127.0.0.1:2379,..." --etcd_prefix "/vitastor" --etcd_start_timeout 5 [--verbose 1]');
    process.exit();
 }
-new Mon(options).start();
+new Mon(options).start().catch(e => { console.error(e); process.exit(); });
--- a/mon/mon.js
+++ b/mon/mon.js
--- a/mon/package.json
+++ b/mon/package.json
@ -1,14 +1,15 @@
 {
-  "name": "rage-mon",
+  "name": "vitastor-mon",
  "version": "1.0.0",
-  "description": "RAGE storage monitor service",
+  "description": "Vitastor SDS monitor service",
-  "main": "mon.js",
+  "main": "mon-main.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Vitaliy Filippov",
  "license": "UNLICENSED",
  "dependencies": {
    "sprintf-js": "^1.1.2",
    "ws": "^7.2.5"
  }
 }
--- a/mon/simple-offsets.js
+++ b/mon/simple-offsets.js
@ -0,0 +1,96 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: MIT
 // Simple tool to calculate journal and metadata offsets for a single device
 // Will be replaced by smarter tools in the future
 const fs = require('fs').promises;
 const child_process = require('child_process');
 async function run()
 {
    const options = {
        object_size: 128*1024,
        bitmap_granularity: 4096,
        journal_size: 16*1024*1024,
        device_block_size: 4096,
        journal_offset: 0,
        device_size: 0,
        format: 'text',
    };
    for (let i = 2; i < process.argv.length; i++)
    {
        if (process.argv[i].substr(0, 2) == '--')
        {
            options[process.argv[i].substr(2)] = process.argv[i+1];
            i++;
        }
    }
    if (!options.device)
    {
        process.stderr.write('USAGE: nodejs '+process.argv[1]+' --device /dev/sdXXX\n');
        process.exit(1);
    }
    options.device_size = Number(options.device_size);
    let device_size = options.device_size;
    if (!device_size)
    {
        const st = await fs.stat(options.device);
        options.device_block_size = st.blksize;
        if (st.isBlockDevice())
            device_size = Number(await system("/sbin/blockdev --getsize64 "+options.device))
        else
            device_size = st.size;
    }
    if (!device_size)
    {
        process.stderr.write('Failed to get device size\n');
        process.exit(1);
    }
    options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size;
    const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
    const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
    const object_count = Math.floor((device_size-meta_offset)/options.object_size);
    const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
    const data_offset = meta_offset + meta_size;
    const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
        : Math.round(meta_size/1024/1024*100)/100+" MB");
    if (options.format == 'text' || options.format == 'options')
    {
        if (options.format == 'text')
        {
            process.stderr.write(
                `Metadata size: ${meta_size_fmt}\n`+
                `Options for the OSD:\n`
            );
        }
        process.stdout.write(
            (options.device_block_size != 4096 ?
                `    --meta_block_size ${options.device_block_size}\n`+
                `    --journal_block_size ${options.device_block_size}\n` : '')+
            `    --data_device ${options.device}\n`+
            `    --journal_offset ${options.journal_offset}\n`+
            `    --meta_offset ${meta_offset}\n`+
            `    --data_offset ${data_offset}\n`+
            (options.device_size ? `    --data_size ${device_size-data_offset}\n` : '')
        );
    }
    else if (options.format == 'env')
    {
        process.stdout.write(
            `journal_offset=${options.journal_offset}\n`+
            `meta_offset=${meta_offset}\n`+
            `data_offset=${data_offset}\n`+
            `data_size=${device_size-data_offset}\n`
        );
    }
    else
        process.stdout.write('Unknown format: '+options.format+'\n');
 }
 function system(cmd)
 {
    return new Promise((ok, no) => child_process.exec(cmd, { maxBuffer: 64*1024*1024 }, (err, stdout, stderr) => (err ? no(err.message) : ok(stdout))));
 }
 run().catch(err => { console.error(err); process.exit(1); });
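Note on the layout computed by simple-offsets.js: the journal comes first, then the metadata area (one extra block plus enough blocks to hold one entry per object), then data. The sketch below repeats the same formulas as a pure function so they can be checked by hand; `calcOffsets` is a hypothetical name, and the defaults are the ones from the `options` object above.

```javascript
// Sketch of the offset arithmetic from simple-offsets.js as a pure function.
function calcOffsets({ device_size, object_size = 128*1024, bitmap_granularity = 4096,
    journal_size = 16*1024*1024, device_block_size = 4096, journal_offset = 0 })
{
    // Round a byte count up to a whole number of device blocks
    const align = (n) => Math.ceil(n/device_block_size)*device_block_size;
    journal_offset = align(journal_offset);
    const meta_offset = journal_offset + align(journal_size);
    // Entry size: 24 bytes plus two bitmaps of object_size/bitmap_granularity bits each
    const entries_per_block = Math.floor(device_block_size / (24 + 2*object_size/bitmap_granularity/8));
    const object_count = Math.floor((device_size-meta_offset)/object_size);
    const meta_size = Math.ceil(1 + object_count / entries_per_block) * device_block_size;
    return { journal_offset, meta_offset, meta_size, data_offset: meta_offset + meta_size };
}

// 1 GiB device with default settings
const layout = calcOffsets({ device_size: 1024*1024*1024 });
console.log(layout.meta_offset);  // 16777216
console.log(layout.meta_size);    // 262144
console.log(layout.data_offset);  // 17039360
```

For this input there are 128 metadata entries per 4096-byte block and 8064 objects, so the metadata area takes 64 blocks.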
--- a/mon/stable-stringify.js
+++ b/mon/stable-stringify.js
@ -0,0 +1,78 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: MIT
 function stableStringify(obj, opts)
 {
    if (!opts)
        opts = {};
    if (typeof opts === 'function')
        opts = { cmp: opts };
    let space = opts.space || '';
    if (typeof space === 'number')
        space = Array(space+1).join(' ');
    const cycles = (typeof opts.cycles === 'boolean') ? opts.cycles : false;
    const cmp = opts.cmp && (function (f)
    {
        return function (node)
        {
            return function (a, b)
            {
                let aobj = { key: a, value: node[a] };
                let bobj = { key: b, value: node[b] };
                return f(aobj, bobj);
            };
        };
    })(opts.cmp);
    const seen = new Map();
    return (function stringify (parent, key, node, level)
    {
        const indent = space ? ('\n' + new Array(level + 1).join(space)) : '';
        const colonSeparator = space ? ': ' : ':';
        if (node === undefined)
        {
            return;
        }
        if (typeof node !== 'object' || node === null)
        {
            return JSON.stringify(node);
        }
        if (node instanceof Array)
        {
            const out = [];
            for (let i = 0; i < node.length; i++)
            {
                const item = stringify(node, i, node[i], level+1) || JSON.stringify(null);
                out.push(indent + space + item);
            }
            return '[' + out.join(',') + indent + ']';
        }
        else
        {
            if (seen.has(node))
            {
                if (cycles)
                    return JSON.stringify('__cycle__');
                throw new TypeError('Converting circular structure to JSON');
            }
            else
                seen.set(node, true);
            const keys = Object.keys(node).sort(cmp && cmp(node));
            const out = [];
            for (let i = 0; i < keys.length; i++)
            {
                const key = keys[i];
                const value = stringify(node, key, node[key], level+1);
                if (!value)
                    continue;
                const keyValue = JSON.stringify(key)
                    + colonSeparator
                    + value;
                out.push(indent + space + keyValue);
            }
            seen.delete(node);
            return '{' + out.join(',') + indent + '}';
        }
    })({ '': obj }, '', obj, 0);
 }
 module.exports = stableStringify;
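Note on stable-stringify.js: the point of the module is that object keys are emitted in sorted order, so two objects with the same content always serialize to the same string. A reduced sketch of just that property is shown below; `sortedStringify` is a hypothetical name, and it leaves out the `space`, `cmp` and `cycles` options of stableStringify().

```javascript
// Reduced sketch: JSON serialization with object keys in sorted order.
function sortedStringify(node)
{
    if (node === undefined)
        return undefined;
    if (typeof node !== 'object' || node === null)
        return JSON.stringify(node);
    if (Array.isArray(node))
        return '[' + node.map(item => sortedStringify(item) || 'null').join(',') + ']';
    const parts = [];
    for (const key of Object.keys(node).sort())
    {
        const value = sortedStringify(node[key]);
        // Like JSON.stringify, skip values that cannot be serialized
        if (value !== undefined)
            parts.push(JSON.stringify(key) + ':' + value);
    }
    return '{' + parts.join(',') + '}';
}

console.log(sortedStringify({ b: 1, a: [ 2, { d: 3, c: undefined } ] }));
// {"a":[2,{"d":3}],"b":1}
console.log(sortedStringify({ x: 1, y: 2 }) === sortedStringify({ y: 2, x: 1 }));
// true
```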
--- a/mon/test-nonuniform.js
+++ b/mon/test-nonuniform.js
@ -0,0 +1,130 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 // Interesting real-world example coming from Ceph with EC and compression enabled.
 // EC parity chunks can't be compressed as efficiently as data chunks,
 // thus they occupy more space (2.26x more space) in OSD object stores.
 // This leads to really uneven OSD fill ratio in Ceph even when PGs are perfectly balanced.
 // But we support this case with the "parity_space" parameter in optimize_initial()/optimize_change().
 const LPOptimizer = require('./lp-optimizer.js');
 const osd_tree = {
    ripper5: {
        osd0: 3.493144989013672,
        osd1: 3.493144989013672,
        osd2: 3.454082489013672,
        osd12: 3.461894989013672,
    },
    ripper7: {
        osd4: 3.638690948486328,
        osd5: 3.638690948486328,
        osd6: 3.638690948486328,
    },
    ripper4: {
        osd9: 3.4609375,
        osd10: 3.4609375,
        osd11: 3.4609375,
    },
    ripper6: {
        osd3: 3.5849609375,
        osd7: 3.5859336853027344,
        osd8: 3.638690948486328,
        osd13: 3.461894989013672
    },
 };
 const prev_pgs = [[12,7,5],[6,11,12],[3,6,9],[10,0,5],[2,5,13],[9,8,6],[3,4,12],[7,4,12],[12,11,13],[13,6,0],[4,13,10],[9,7,6],[7,10,0],[10,8,0],[3,10,2],[3,0,4],[6,13,0],[13,10,0],[13,10,5],[8,11,6],[3,9,2],[2,8,5],[8,9,5],[3,12,11],[0,7,4],[13,11,1],[11,3,12],[12,8,10],[7,5,12],[2,13,5],[7,11,0],[13,2,6],[0,6,8],[13,1,6],[0,13,4],[0,8,10],[4,10,0],[8,12,4],[8,12,9],[12,7,4],[13,9,5],[3,2,11],[1,9,7],[1,8,5],[5,12,9],[3,5,12],[2,8,10],[0,8,4],[1,4,11],[7,10,2],[12,13,5],[3,1,11],[7,1,4],[4,12,8],[7,0,9],[11,1,8],[3,0,5],[11,13,0],[1,13,5],[12,7,10],[12,8,4],[11,13,5],[0,11,6],[2,11,3],[13,1,11],[2,7,10],[7,10,12],[7,12,10],[12,11,5],[13,12,10],[2,3,9],[4,3,9],[13,2,5],[7,12,6],[12,10,13],[9,8,1],[13,1,5],[9,5,12],[5,11,7],[6,2,9],[8,11,6],[12,5,8],[6,13,1],[7,6,11],[2,3,6],[8,5,9],[1,13,6],[9,3,2],[7,11,1],[3,10,1],[0,11,7],[3,0,5],[1,3,6],[6,0,9],[3,11,4],[8,10,2],[13,1,9],[12,6,9],[3,12,9],[12,8,9],[7,5,0],[8,12,5],[0,11,3],[12,11,13],[0,7,11],[0,3,10],[1,3,11],[2,7,11],[13,2,6],[9,12,13],[8,2,4],[0,7,4],[5,13,0],[13,12,9],[1,9,8],[0,10,3],[3,5,10],[7,12,9],[2,13,4],[12,7,5],[9,2,7],[3,2,9],[6,2,7],[3,1,9],[4,3,2],[5,3,11],[0,7,6],[1,6,13],[7,10,2],[12,4,8],[13,12,6],[7,5,11],[6,2,3],[2,7,6],[2,3,10],[2,7,10],[11,12,6],[0,13,5],[10,2,4],[13,0,11],[7,0,6],[8,9,4],[8,4,11],[7,11,2],[3,4,2],[6,1,3],[7,2,11],[8,9,4],[11,4,8],[10,3,1],[2,10,13],[1,7,11],[13,11,12],[2,6,9],[10,0,13],[7,10,4],[0,11,13],[13,10,1],[7,5,0],[7,12,10],[3,1,4],[7,1,5],[3,11,5],[7,5,0],[1,3,5],[10,5,12],[0,3,9],[7,1,11],[11,8,12],[3,6,2],[7,12,9],[7,11,12],[4,11,3],[0,11,13],[13,2,5],[1,5,8],[0,11,8],[3,5,1],[11,0,6],[3,11,2],[11,8,12],[4,1,3],[10,13,4],[13,9,6],[2,3,10],[12,7,9],[10,0,4],[10,13,2],[3,11,1],[7,2,9],[1,7,4],[13,1,4],[7,0,6],[5,3,9],[10,0,7],[0,7,10],[3,6,10],[13,0,5],[8,4,1],[3,1,10],[2,10,13],[13,0,5],[13,10,2],[12,7,9],[6,8,10],[6,1,8],[10,8,1],[13,5,0],[5,11,3],[7,6,1],[8,5,9],[2,13,11],[10,12,4],[13,4,1],[2,13,4],[11,7,0],[2,9,7],[1,7,6],[8,0,4],[8,1,9],[7,10,12],[13,9,6],
[7,6,11],[13,0,4],[1,8,4],[3,12,5],[10,3,1],[10,2,13],[2,4,8],[6,2,3],[3,0,10],[6,7,12],[8,12,5],[3,0,6],[13,12,10],[11,3,6],[9,0,13],[10,0,6],[7,5,2],[1,3,11],[7,10,2],[2,9,8],[11,13,12],[0,8,4],[8,12,11],[6,0,3],[1,13,4],[11,8,2],[12,3,6],[4,7,1],[7,6,12],[3,10,6],[0,10,7],[8,9,1],[0,10,6],[8,10,1]]
    .map(pg => pg.map(n => 'osd'+n));
 const by_osd = {};
 for (let i = 0; i < prev_pgs.length; i++)
 {
    for (let j = 0; j < prev_pgs[i].length; j++)
    {
        by_osd[prev_pgs[i][j]] = by_osd[prev_pgs[i][j]] || [];
        by_osd[prev_pgs[i][j]][j] = (by_osd[prev_pgs[i][j]][j] || 0) + 1;
    }
 }
 /*
 This set of PGs was balanced by hand, by heavily tuning OSD weights in Ceph:
 {
  osd0: 4.2,
  osd1: 3.5,
  osd2: 3.45409,
  osd3: 4.5,
  osd4: 1.4,
  osd5: 1.4,
  osd6: 1.75,
  osd7: 4.5,
  osd8: 4.4,
  osd9: 2.2,
  osd10: 2.7,
  osd11: 2,
  osd12: 3.4,
  osd13: 3.4,
 }
 EC+compression is a nightmare in Ceph, yeah :))
 To estimate the average size ratio between data chunks and parity chunks, we
 count the number of PG chunks of each chunk role on each OSD:
 {
  osd12: [ 18, 22, 17 ],
  osd7: [ 35, 22, 8 ],
  osd5: [ 6, 17, 27 ],
  osd6: [ 13, 12, 28 ],
  osd11: [ 13, 26, 20 ],
  osd3: [ 30, 20, 10 ],
  osd9: [ 8, 12, 26 ],
  osd10: [ 15, 23, 20 ],
  osd0: [ 22, 22, 14 ],
  osd2: [ 22, 16, 16 ],
  osd13: [ 29, 19, 13 ],
  osd8: [ 20, 18, 12 ],
  osd4: [ 8, 10, 28 ],
  osd1: [ 17, 17, 17 ]
 }
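 And now we can pick a pair of OSDs and determine the ratio by solving the following,
 where X is the average size of a data chunk (first two roles) and Y is the average size of a parity chunk (third role):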
 osd5 = 23*X + 27*Y = 3249728140
 osd13 = 48*X + 13*Y = 2991675992
 =>
 osd5 - 27/13*osd13 = 23*X - 27/13*48*X = -76.6923076923077*X = -2963752766.46154
 =>
 X = 38644720.1243731
 Y = (osd5-23*X)/27 = 87440725.0792377
 Y/X = 2.26268232239284 ~= 2.26
 Which means that parity chunks are compressed ~2.26 times worse than data chunks.
 Fine, let's try to optimize for it.
 */
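 // Illustrative sketch, not part of the original test: it re-derives the ~2.26
 // ratio from the comment above by solving the same two equations with Cramer's rule.
 // The name solve_chunk_sizes is hypothetical; the coefficients and totals are
 // the ones quoted above for osd5 and osd13.
 function solve_chunk_sizes(a1, b1, c1, a2, b2, c2)
 {
    // Solves a1*X + b1*Y = c1 and a2*X + b2*Y = c2 for X and Y
    const det = a1*b2 - a2*b1;
    if (det === 0)
        throw new Error('Equations are linearly dependent');
    return { X: (c1*b2 - c2*b1) / det, Y: (a1*c2 - a2*c1) / det };
 }
 const chunk_sizes = solve_chunk_sizes(23, 27, 3249728140, 48, 13, 2991675992);
 console.log('Parity/data chunk size ratio: '+(chunk_sizes.Y/chunk_sizes.X).toFixed(2));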
 async function run()
 {
    const all_weights = Object.assign({}, ...Object.values(osd_tree));
    const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
    const eff = LPOptimizer.pg_list_space_efficiency(prev_pgs, all_weights, 2, 2.26);
    const orig = eff*4.26 / total_weight;
    console.log('Original efficiency was: '+Math.round(orig*10000)/100+' %');
    let prev = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 256, parity_space: 2.26 });
    LPOptimizer.print_change_stats(prev);
    let next = await LPOptimizer.optimize_change({ prev_pgs, osd_tree, pg_size: 3, max_combinations: 10000, parity_space: 2.26 });
    LPOptimizer.print_change_stats(next);
 }
 run().catch(console.error);
--- a/mon/test-optimize-simple.js
+++ b/mon/test-optimize-simple.js
@ -0,0 +1,25 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 const LPOptimizer = require('./lp-optimizer.js');
 async function run()
 {
    const osd_tree = { a: { 1: 1 }, b: { 2: 1 }, c: { 3: 1 } };
    let res;
    console.log('16 PGs, size=3');
    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nReduce PG size to 2');
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs.map(pg => pg.slice(0, 2)), osd_tree, pg_size: 2 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemove OSD 3');
    delete osd_tree['c'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
    LPOptimizer.print_change_stats(res, false);
 }
 run().catch(console.error);
--- a/mon/test-optimize-undersized.js
+++ b/mon/test-optimize-undersized.js
@ -1,3 +1,6 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 const LPOptimizer = require('./lp-optimizer.js');
 const crush_tree = [
@ -40,31 +43,31 @@ async function run()
 {
    const cur_tree = {};
    console.log('Empty tree:');
-    let res = await LPOptimizer.optimize_initial(cur_tree, 256);
+    let res = await LPOptimizer.optimize_initial({ osd_tree: cur_tree, pg_size: 3, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nAdding 1st failure domain:');
    cur_tree['dom1'] = osd_tree['dom1'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nAdding 2nd failure domain:');
    cur_tree['dom2'] = osd_tree['dom2'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nAdding 3rd failure domain:');
    cur_tree['dom3'] = osd_tree['dom3'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemoving 3rd failure domain:');
    delete cur_tree['dom3'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemoving 2nd failure domain:');
    delete cur_tree['dom2'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemoving 1st failure domain:');
    delete cur_tree['dom1'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
 }
--- a/mon/test-optimize.js
+++ b/mon/test-optimize.js
@ -1,3 +1,6 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 const LPOptimizer = require('./lp-optimizer.js');
 const osd_tree = {
@ -75,19 +78,37 @@ const crush_tree = [
 async function run()
 {
    let res;
    // Test: add 1 OSD of almost the same size. Ideal data movement could be 1/12 = 8.33%. Actual is ~13%
-    // Space efficiency is ~99.5% in both cases.
+    // Space efficiency is ~99% in all cases.
-    let res = await LPOptimizer.optimize_initial(osd_tree, 256);
+
    console.log('256 PGs, size=2');
    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 2, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
-    console.log('adding osd.8');
+    console.log('\nAdding osd.8');
    osd_tree[500][8] = 3.58589;
-    res = await LPOptimizer.optimize_change(res.int_pgs, osd_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
    LPOptimizer.print_change_stats(res, false);
-    console.log('removing osd.8');
+    console.log('\nRemoving osd.8');
    delete osd_tree[500][8];
-    res = await LPOptimizer.optimize_change(res.int_pgs, osd_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
    LPOptimizer.print_change_stats(res, false);
-    res = await LPOptimizer.optimize_initial(LPOptimizer.flatten_tree(crush_tree, {}, 1, 3), 256);
+
    console.log('\n256 PGs, size=3');
    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nAdding osd.8');
    osd_tree[500][8] = 3.58589;
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemoving osd.8');
    delete osd_tree[500][8];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\n256 PGs, size=3, failure domain=rack');
    res = await LPOptimizer.optimize_initial({ osd_tree: LPOptimizer.flatten_tree(crush_tree, {}, 1, 3), pg_size: 3, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
 }
--- a/osd.cpp
+++ b/osd.cpp
@ -1,464 +0,0 @@
 #include <sys/socket.h>
 #include <sys/epoll.h>
 #include <sys/poll.h>
 #include <netinet/in.h>
 #include <netinet/tcp.h>
 #include <arpa/inet.h>
 #include "osd.h"
 const char* osd_op_names[] = {
    "",
    "read",
    "write",
    "sync",
    "stabilize",
    "rollback",
    "delete",
    "sync_stab_all",
    "list",
    "show_config",
    "primary_read",
    "primary_write",
    "primary_sync",
    "primary_delete",
 };
 osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop)
 {
    this->config = config;
    this->bs = bs;
    this->ringloop = ringloop;
    this->bs_block_size = bs->get_block_size();
    // FIXME: use bitmap granularity instead
    this->bs_disk_alignment = bs->get_disk_alignment();
    parse_config(config);
    epoll_fd = epoll_create(1);
    if (epoll_fd < 0)
    {
        throw std::runtime_error(std::string("epoll_create: ") + strerror(errno));
    }
    this->tfd = new timerfd_manager_t([this](int fd, std::function<void(int, int)> handler) { set_fd_handler(fd, handler); });
    this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
    {
        print_stats();
    });
    c_cli.tfd = this->tfd;
    c_cli.ringloop = this->ringloop;
    c_cli.exec_op = [this](osd_op_t *op) { exec_op(op); };
    c_cli.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
    init_cluster();
    consumer.loop = [this]() { loop(); };
    ringloop->register_consumer(&consumer);
 }
 osd_t::~osd_t()
 {
    if (tfd)
    {
        delete tfd;
        tfd = NULL;
    }
    ringloop->unregister_consumer(&consumer);
    close(epoll_fd);
    close(listen_fd);
 }
 void osd_t::parse_config(blockstore_config_t & config)
 {
    int pos;
    // Initial startup configuration
    {
        std::string ea = config["etcd_address"];
        while (1)
        {
            pos = ea.find(',');
            std::string addr = pos >= 0 ? ea.substr(0, pos) : ea;
            if (addr.length() > 0)
            {
                if (addr.find('/') == std::string::npos)
                    addr += "/v3";
                st_cli.etcd_addresses.push_back(addr);
            }
            if (pos >= 0)
                ea = ea.substr(pos+1);
            else
                break;
        }
    }
    st_cli.etcd_prefix = config["etcd_prefix"];
    if (st_cli.etcd_prefix == "")
        st_cli.etcd_prefix = "/microceph";
    etcd_report_interval = strtoull(config["etcd_report_interval"].c_str(), NULL, 10);
    if (etcd_report_interval <= 0)
        etcd_report_interval = 30;
    osd_num = strtoull(config["osd_num"].c_str(), NULL, 10);
    if (!osd_num)
        throw std::runtime_error("osd_num is required in the configuration");
    c_cli.osd_num = osd_num;
    run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no";
    // Cluster configuration
    bind_address = config["bind_address"];
    if (bind_address == "")
        bind_address = "0.0.0.0";
    bind_port = stoull_full(config["bind_port"]);
    if (bind_port <= 0 || bind_port > 65535)
        bind_port = 0;
    if (config["immediate_commit"] == "all")
        immediate_commit = IMMEDIATE_ALL;
    else if (config["immediate_commit"] == "small")
        immediate_commit = IMMEDIATE_SMALL;
    if (config.find("autosync_interval") != config.end())
    {
        autosync_interval = strtoull(config["autosync_interval"].c_str(), NULL, 10);
        if (autosync_interval > MAX_AUTOSYNC_INTERVAL)
            autosync_interval = DEFAULT_AUTOSYNC_INTERVAL;
    }
    if (config.find("client_queue_depth") != config.end())
    {
        client_queue_depth = strtoull(config["client_queue_depth"].c_str(), NULL, 10);
        if (client_queue_depth < 128)
            client_queue_depth = 128;
    }
    if (config.find("pg_stripe_size") != config.end())
    {
        pg_stripe_size = strtoull(config["pg_stripe_size"].c_str(), NULL, 10);
        if (!pg_stripe_size || !bs_block_size || pg_stripe_size < bs_block_size || (pg_stripe_size % bs_block_size) != 0)
            pg_stripe_size = DEFAULT_PG_STRIPE_SIZE;
    }
    recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
    if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
        recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
    if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes")
        readonly = true;
    print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
    if (!print_stats_interval)
        print_stats_interval = 3;
    c_cli.peer_connect_interval = strtoull(config["peer_connect_interval"].c_str(), NULL, 10);
    if (!c_cli.peer_connect_interval)
        c_cli.peer_connect_interval = 5;
    c_cli.peer_connect_timeout = strtoull(config["peer_connect_timeout"].c_str(), NULL, 10);
    if (!c_cli.peer_connect_timeout)
        c_cli.peer_connect_timeout = 5;
    log_level = strtoull(config["log_level"].c_str(), NULL, 10);
    st_cli.log_level = log_level;
    c_cli.log_level = log_level;
 }
 void osd_t::bind_socket()
 {
    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0)
    {
        throw std::runtime_error(std::string("socket: ") + strerror(errno));
    }
    int enable = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable));
    sockaddr_in addr;
    int r;
    if ((r = inet_pton(AF_INET, bind_address.c_str(), &addr.sin_addr)) != 1)
    {
        close(listen_fd);
        throw std::runtime_error("bind address "+bind_address+(r == 0 ? " is not valid" : ": no ipv4 support"));
    }
    addr.sin_family = AF_INET;
    addr.sin_port = htons(bind_port);
    if (bind(listen_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
    {
        close(listen_fd);
        throw std::runtime_error(std::string("bind: ") + strerror(errno));
    }
    if (bind_port == 0)
    {
        socklen_t len = sizeof(addr);
        if (getsockname(listen_fd, (sockaddr *)&addr, &len) == -1)
        {
            close(listen_fd);
            throw std::runtime_error(std::string("getsockname: ") + strerror(errno));
        }
        listening_port = ntohs(addr.sin_port);
    }
    else
    {
        listening_port = bind_port;
    }
    if (listen(listen_fd, listen_backlog) < 0)
    {
        close(listen_fd);
        throw std::runtime_error(std::string("listen: ") + strerror(errno));
    }
    fcntl(listen_fd, F_SETFL, fcntl(listen_fd, F_GETFL, 0) | O_NONBLOCK);
    epoll_event ev;
    ev.data.fd = listen_fd;
    ev.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
    {
        close(listen_fd);
        close(epoll_fd);
        throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
    }
 }
 bool osd_t::shutdown()
 {
    stopping = true;
    if (inflight_ops > 0)
    {
        return false;
    }
    return bs->is_safe_to_stop();
 }
 void osd_t::loop()
 {
    if (!wait_state)
    {
        handle_epoll_events();
        wait_state = 1;
    }
    handle_peers();
    c_cli.read_requests();
    c_cli.send_replies();
    ringloop->submit();
 }
 void osd_t::set_fd_handler(int fd, std::function<void(int, int)> handler)
 {
    if (handler != NULL)
    {
        bool exists = epoll_handlers.find(fd) != epoll_handlers.end();
        epoll_event ev;
        ev.data.fd = fd;
        ev.events = EPOLLOUT | EPOLLIN | EPOLLRDHUP | EPOLLET;
        if (epoll_ctl(epoll_fd, exists ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev) < 0)
        {
            throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
        }
        epoll_handlers[fd] = handler;
    }
    else
    {
        if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL) < 0 && errno != ENOENT)
        {
            throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
        }
        epoll_handlers.erase(fd);
    }
 }
 void osd_t::handle_epoll_events()
 {
    io_uring_sqe *sqe = ringloop->get_sqe();
    if (!sqe)
    {
        throw std::runtime_error("can't get SQE, will fall out of sync with EPOLLET");
    }
    ring_data_t *data = ((ring_data_t*)sqe->user_data);
    my_uring_prep_poll_add(sqe, epoll_fd, POLLIN);
    data->callback = [this](ring_data_t *data)
    {
        if (data->res < 0)
        {
            throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
        }
        handle_epoll_events();
    };
    ringloop->submit();
    int nfds;
    epoll_event events[MAX_EPOLL_EVENTS];
 restart:
    nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, 0);
    for (int i = 0; i < nfds; i++)
    {
        if (events[i].data.fd == listen_fd)
        {
            // Accept new connections
            sockaddr_in addr;
            socklen_t peer_addr_size = sizeof(addr);
            int peer_fd;
            while ((peer_fd = accept(listen_fd, (sockaddr*)&addr, &peer_addr_size)) >= 0)
            {
                assert(peer_fd != 0);
                char peer_str[256];
                printf("[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
                    inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
                fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
                int one = 1;
                setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
                c_cli.clients[peer_fd] = {
                    .peer_addr = addr,
                    .peer_port = ntohs(addr.sin_port),
                    .peer_fd = peer_fd,
                    .peer_state = PEER_CONNECTED,
                    .in_buf = malloc(c_cli.receive_buffer_size),
                };
                // Add FD to epoll
                set_fd_handler(peer_fd, [this](int peer_fd, int epoll_events)
                {
                    c_cli.handle_peer_epoll(peer_fd, epoll_events);
                });
                // Try to accept next connection
                peer_addr_size = sizeof(addr);
            }
            if (peer_fd == -1 && errno != EAGAIN)
            {
                throw std::runtime_error(std::string("accept: ") + strerror(errno));
            }
        }
        else
        {
            auto & cb = epoll_handlers[events[i].data.fd];
            cb(events[i].data.fd, events[i].events);
        }
    }
    if (nfds == MAX_EPOLL_EVENTS)
    {
        goto restart;
    }
 }
 void osd_t::exec_op(osd_op_t *cur_op)
 {
    clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
    if (stopping)
    {
        // Throw operation away
        delete cur_op;
        return;
    }
    inflight_ops++;
    cur_op->send_list.push_back(cur_op->reply.buf, OSD_PACKET_SIZE);
    if (cur_op->req.hdr.magic != SECONDARY_OSD_OP_MAGIC ||
        cur_op->req.hdr.opcode < OSD_OP_MIN || cur_op->req.hdr.opcode > OSD_OP_MAX ||
        (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ || cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE) &&
        (cur_op->req.sec_rw.len > OSD_RW_MAX || cur_op->req.sec_rw.len % bs_disk_alignment || cur_op->req.sec_rw.offset % bs_disk_alignment) ||
        (cur_op->req.hdr.opcode == OSD_OP_READ || cur_op->req.hdr.opcode == OSD_OP_WRITE || cur_op->req.hdr.opcode == OSD_OP_DELETE) &&
        (cur_op->req.rw.len > OSD_RW_MAX || cur_op->req.rw.len % bs_disk_alignment || cur_op->req.rw.offset % bs_disk_alignment))
    {
        // Bad command
        finish_op(cur_op, -EINVAL);
        return;
    }
    if (readonly &&
        cur_op->req.hdr.opcode != OSD_OP_SECONDARY_READ &&
        cur_op->req.hdr.opcode != OSD_OP_SECONDARY_LIST &&
        cur_op->req.hdr.opcode != OSD_OP_READ &&
        cur_op->req.hdr.opcode != OSD_OP_SHOW_CONFIG)
    {
        // Readonly mode
        finish_op(cur_op, -EROFS);
        return;
    }
    if (cur_op->req.hdr.opcode == OSD_OP_TEST_SYNC_STAB_ALL)
    {
        exec_sync_stab_all(cur_op);
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
    {
        exec_show_config(cur_op);
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_READ)
    {
        continue_primary_read(cur_op);
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
    {
        continue_primary_write(cur_op);
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SYNC)
    {
        continue_primary_sync(cur_op);
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
    {
        continue_primary_del(cur_op);
    }
    else
    {
        exec_secondary(cur_op);
    }
 }
 void osd_t::reset_stats()
 {
    c_cli.stats = { 0 };
    prev_stats = { 0 };
    memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
    memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
 }
 void osd_t::print_stats()
 {
    for (int i = 0; i <= OSD_OP_MAX; i++)
    {
        if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i])
        {
            uint64_t avg = (c_cli.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(c_cli.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
            uint64_t bw = (c_cli.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
            if (c_cli.stats.op_stat_bytes[i] != 0)
            {
                printf(
                    "[OSD %lu] avg latency for op %d (%s): %lu us, B/W: %.2f %s\n", osd_num, i, osd_op_names[i], avg,
                    (bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
                    (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
                );
            }
            else
            {
                printf("[OSD %lu] avg latency for op %d (%s): %lu us\n", osd_num, i, osd_op_names[i], avg);
            }
            prev_stats.op_stat_count[i] = c_cli.stats.op_stat_count[i];
            prev_stats.op_stat_sum[i] = c_cli.stats.op_stat_sum[i];
            prev_stats.op_stat_bytes[i] = c_cli.stats.op_stat_bytes[i];
        }
    }
    for (int i = 0; i <= OSD_OP_MAX; i++)
    {
        if (c_cli.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
        {
            uint64_t avg = (c_cli.stats.subop_stat_sum[i] - prev_stats.subop_stat_sum[i])/(c_cli.stats.subop_stat_count[i] - prev_stats.subop_stat_count[i]);
            printf("[OSD %lu] avg latency for subop %d (%s): %lu us\n", osd_num, i, osd_op_names[i], avg);
            prev_stats.subop_stat_count[i] = c_cli.stats.subop_stat_count[i];
            prev_stats.subop_stat_sum[i] = c_cli.stats.subop_stat_sum[i];
        }
    }
    for (int i = 0; i < 2; i++)
    {
        if (recovery_stat_count[0][i] != recovery_stat_count[1][i])
        {
            uint64_t bw = (recovery_stat_bytes[0][i] - recovery_stat_bytes[1][i]) / print_stats_interval;
            printf(
                "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s\n", osd_num, recovery_stat_names[i],
                (recovery_stat_count[0][i] - recovery_stat_count[1][i]) * 1.0 / print_stats_interval,
                (bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
                (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
            );
            recovery_stat_count[1][i] = recovery_stat_count[0][i];
            recovery_stat_bytes[1][i] = recovery_stat_bytes[0][i];
        }
    }
    if (incomplete_objects > 0)
    {
        printf("[OSD %lu] %lu object(s) incomplete\n", osd_num, incomplete_objects);
    }
    if (degraded_objects > 0)
    {
        printf("[OSD %lu] %lu object(s) degraded\n", osd_num, degraded_objects);
    }
    if (misplaced_objects > 0)
    {
        printf("[OSD %lu] %lu object(s) misplaced\n", osd_num, misplaced_objects);
    }
 }
--- a/osd_client.cpp
+++ b/osd_client.cpp
@ -1,40 +0,0 @@
 void slice()
 {
    // Slice the request into blockstore requests to individual objects
    // Primary OSD still operates on individual stripes, except they're twice the size of the blockstore's stripe.
    std::vector read_parts;
    int block = bs->get_block_size();
    uint64_t stripe1 = cur_op->req.rw.offset / block / 2;
    uint64_t stripe2 = (cur_op->req.rw.offset + cur_op->req.rw.len + block*2 - 1) / block / 2 - 1;
    for (uint64_t s = stripe1; s <= stripe2; s++)
    {
        uint64_t start = s == stripe1 ? cur_op->req.rw.offset - stripe1*block*2 : 0;
        uint64_t end = s == stripe2 ? cur_op->req.rw.offset + cur_op->req.rw.len - stripe2*block*2 : block*2;
        if (start < block)
        {
            read_parts.push_back({
                .role = 1,
                .oid = {
                    .inode = cur_op->req.rw.inode,
                    .stripe = (s << STRIPE_ROLE_BITS) | 1,
                },
                .version = UINT64_MAX,
                .offset = start,
                .len = (block < end ? block : end) - start,
            });
        }
        if (end > block)
        {
            read_parts.push_back({
                .role = 2,
                .oid = {
                    .inode = cur_op->req.rw.inode,
                    .stripe = (s << STRIPE_ROLE_BITS) | 2,
                },
                .version = UINT64_MAX,
                .offset = (start > block ? start-block : 0),
                .len = end - block - (start > block ? start-block : 0),
            });
        }
    }
 }
--- a/osd_id.h
+++ b/osd_id.h
@ -1,4 +0,0 @@
 #pragma once
 typedef uint64_t osd_num_t;
 typedef uint32_t pg_num_t;
--- a/osd_primary.cpp
+++ b/osd_primary.cpp
@ -1,669 +0,0 @@
 #include "osd_primary.h"
 // read: read directly or read paired stripe(s), reconstruct, return
 // write: read paired stripe(s), reconstruct, modify, calculate parity, write
 //
 // nuance: take care to read the same version from paired stripes!
 // to do so, we remember "last readable" version until a write request completes
 // and we postpone other write requests to the same stripe until completion of previous ones
 //
 // sync: sync peers, get unstable versions, stabilize them
 bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
 {
    // PG number is calculated from the offset
    // Our EC scheme stores data in fixed chunks equal to (K*block size)
    // But we must not use K in the process of calculating the PG number
    // So we calculate the PG number using a separate setting which should be per-inode (FIXME)
    pg_num_t pg_num = (cur_op->req.rw.inode + cur_op->req.rw.offset / pg_stripe_size) % pg_count + 1;
    auto pg_it = pgs.find(pg_num);
    if (pg_it == pgs.end() || !(pg_it->second.state & PG_ACTIVE))
    {
        // This OSD is not primary for this PG or the PG is inactive
        finish_op(cur_op, -EPIPE);
        return false;
    }
    uint64_t pg_block_size = bs_block_size * pg_it->second.pg_minsize;
    object_id oid = {
        .inode = cur_op->req.rw.inode,
        // oid.stripe = starting offset of the parity stripe, so it can be mapped back to the PG
        .stripe = (cur_op->req.rw.offset / pg_stripe_size) * pg_stripe_size +
            ((cur_op->req.rw.offset % pg_stripe_size) / pg_block_size) * pg_block_size
    };
    if ((cur_op->req.rw.offset + cur_op->req.rw.len) > (oid.stripe + pg_block_size) ||
        (cur_op->req.rw.offset % bs_disk_alignment) != 0 ||
        (cur_op->req.rw.len % bs_disk_alignment) != 0)
    {
        finish_op(cur_op, -EINVAL);
        return false;
    }
    osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc(
        sizeof(osd_primary_op_data_t) + sizeof(osd_rmw_stripe_t) * pg_it->second.pg_size, 1
    );
    op_data->pg_num = pg_num;
    op_data->oid = oid;
    op_data->stripes = ((osd_rmw_stripe_t*)(op_data+1));
    cur_op->op_data = op_data;
    split_stripes(pg_it->second.pg_minsize, bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes);
    pg_it->second.inflight++;
    return true;
 }
 static uint64_t* get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state)
 {
    if (!(pg.state & (PG_HAS_INCOMPLETE | PG_HAS_DEGRADED | PG_HAS_MISPLACED)))
    {
        *object_state = NULL;
        return def;
    }
    auto st_it = pg.incomplete_objects.find(oid);
    if (st_it != pg.incomplete_objects.end())
    {
        *object_state = st_it->second;
        return st_it->second->read_target.data();
    }
    st_it = pg.degraded_objects.find(oid);
    if (st_it != pg.degraded_objects.end())
    {
        *object_state = st_it->second;
        return st_it->second->read_target.data();
    }
    st_it = pg.misplaced_objects.find(oid);
    if (st_it != pg.misplaced_objects.end())
    {
        *object_state = st_it->second;
        return st_it->second->read_target.data();
    }
    *object_state = NULL;
    return def;
 }
 void osd_t::continue_primary_read(osd_op_t *cur_op)
 {
    if (!cur_op->op_data && !prepare_primary_rw(cur_op))
    {
        return;
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
    if (op_data->st == 1)      goto resume_1;
    else if (op_data->st == 2) goto resume_2;
    {
        auto & pg = pgs[op_data->pg_num];
        for (int role = 0; role < pg.pg_minsize; role++)
        {
            op_data->stripes[role].read_start = op_data->stripes[role].req_start;
            op_data->stripes[role].read_end = op_data->stripes[role].req_end;
        }
        // Determine version
        auto vo_it = pg.ver_override.find(op_data->oid);
        op_data->target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
        if (pg.state == PG_ACTIVE)
        {
            // Fast happy-path
            cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_minsize, 0);
            submit_primary_subops(SUBMIT_READ, pg.pg_minsize, pg.cur_set.data(), cur_op);
            cur_op->send_list.push_back(cur_op->buf, cur_op->req.rw.len);
            op_data->st = 1;
        }
        else
        {
            // PG may be degraded or have misplaced objects
            uint64_t* cur_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
            if (extend_missing_stripes(op_data->stripes, cur_set, pg.pg_minsize, pg.pg_size) < 0)
            {
                finish_op(cur_op, -EIO);
                return;
            }
            // Submit reads
            op_data->pg_minsize = pg.pg_minsize;
            op_data->pg_size = pg.pg_size;
            op_data->degraded = 1;
            cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_size, 0);
            submit_primary_subops(SUBMIT_READ, pg.pg_size, cur_set, cur_op);
            op_data->st = 1;
        }
    }
 resume_1:
    return;
 resume_2:
    if (op_data->errors > 0)
    {
        finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
        return;
    }
    if (op_data->degraded)
    {
        // Reconstruct missing stripes
        // FIXME: Only EC(k+1) is supported for now. Add other coding schemes
        osd_rmw_stripe_t *stripes = op_data->stripes;
        for (int role = 0; role < op_data->pg_minsize; role++)
        {
            if (stripes[role].read_end != 0 && stripes[role].missing)
            {
                reconstruct_stripe(stripes, op_data->pg_size, role);
            }
            if (stripes[role].req_end != 0)
            {
                // Send buffer in parts to avoid copying
                cur_op->send_list.push_back(
                    stripes[role].read_buf + (stripes[role].req_start - stripes[role].read_start),
                    stripes[role].req_end - stripes[role].req_start
                );
            }
        }
    }
    finish_op(cur_op, cur_op->req.rw.len);
 }
 bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
    // Check if actions are pending for this object
    auto act_it = pg.flush_actions.lower_bound((obj_piece_id_t){
        .oid = op_data->oid,
        .osd_num = 0,
    });
    if (act_it != pg.flush_actions.end() &&
        act_it->first.oid.inode == op_data->oid.inode &&
        (act_it->first.oid.stripe & ~STRIPE_MASK) == op_data->oid.stripe)
    {
        pg.write_queue.emplace(op_data->oid, cur_op);
        return false;
    }
    // Check if there are other write requests to the same object
    auto vo_it = pg.write_queue.find(op_data->oid);
    if (vo_it != pg.write_queue.end())
    {
        op_data->st = 1;
        pg.write_queue.emplace(op_data->oid, cur_op);
        return false;
    }
    pg.write_queue.emplace(op_data->oid, cur_op);
    return true;
 }
 void osd_t::continue_primary_write(osd_op_t *cur_op)
 {
    if (!cur_op->op_data && !prepare_primary_rw(cur_op))
    {
        return;
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
    auto & pg = pgs[op_data->pg_num];
    if (op_data->st == 1)      goto resume_1;
    else if (op_data->st == 2) goto resume_2;
    else if (op_data->st == 3) goto resume_3;
    else if (op_data->st == 4) goto resume_4;
    else if (op_data->st == 5) goto resume_5;
    else if (op_data->st == 6) goto resume_6;
    else if (op_data->st == 7) goto resume_7;
    else if (op_data->st == 8) goto resume_8;
    assert(op_data->st == 0);
    if (!check_write_queue(cur_op, pg))
    {
        return;
    }
 resume_1:
    // Determine blocks to read and write
    // Missing chunks are allowed to be overwritten even in incomplete objects
    // FIXME: Allow small writes to the old (degraded/misplaced) OSD set to reduce the performance impact
    op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
    cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
        pg.pg_size, pg.pg_minsize, pg.pg_cursize, pg.cur_set.data(), bs_block_size);
    // Read required blocks
    submit_primary_subops(SUBMIT_RMW_READ, pg.pg_size, op_data->prev_set, cur_op);
 resume_2:
    op_data->st = 2;
    return;
 resume_3:
    if (op_data->errors > 0)
    {
        pg_cancel_write_queue(pg, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
        return;
    }
    // Save version override for parallel reads
    pg.ver_override[op_data->oid] = op_data->fact_ver;
    // Recover missing stripes, calculate parity
    calc_rmw_parity(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size);
    // Send writes
    submit_primary_subops(SUBMIT_WRITE, pg.pg_size, pg.cur_set.data(), cur_op);
 resume_4:
    op_data->st = 4;
    return;
 resume_5:
    if (op_data->errors > 0)
    {
        pg_cancel_write_queue(pg, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
        return;
    }
    if (op_data->fact_ver == 1)
    {
        // Object is created
        pg.clean_count++;
        pg.total_count++;
    }
    if (op_data->object_state)
    {
        {
            int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
            recovery_stat_count[0][recovery_type]++;
            if (!recovery_stat_count[0][recovery_type])
            {
                recovery_stat_count[0][recovery_type]++;
                recovery_stat_bytes[0][recovery_type] = 0;
            }
            for (int role = 0; role < pg.pg_size; role++)
            {
                recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
            }
        }
        if (op_data->object_state->state & OBJ_MISPLACED)
        {
            // Remove extra chunks
            submit_primary_del_subops(cur_op, pg.cur_set.data(), op_data->object_state->osd_set);
            if (op_data->n_subops > 0)
            {
                op_data->st = 8;
                return;
 resume_8:
                if (op_data->errors > 0)
                {
                    pg_cancel_write_queue(pg, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
                    return;
                }
            }
        }
        // Clear object state
        remove_object_from_state(op_data->oid, op_data->object_state, pg);
        pg.clean_count++;
    }
    // Remove version override
    pg.ver_override.erase(op_data->oid);
    // FIXME: Check for immediate_commit == IMMEDIATE_SMALL
 resume_6:
 resume_7:
    if (!finalize_primary_write(cur_op, pg, pg.cur_loc_set, 6))
    {
        return;
    }
    object_id oid = op_data->oid;
    finish_op(cur_op, cur_op->req.rw.len);
    // Continue other write operations to the same object
    auto next_it = pg.write_queue.find(oid);
    auto this_it = next_it;
    next_it++;
    pg.write_queue.erase(this_it);
    if (next_it != pg.write_queue.end() &&
        next_it->first == oid)
    {
        osd_op_t *next_op = next_it->second;
        continue_primary_write(next_op);
    }
 }
 bool osd_t::finalize_primary_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
    if (op_data->st == base_state)
    {
        goto resume_6;
    }
    else if (op_data->st == base_state+1)
    {
        goto resume_7;
    }
    if (immediate_commit == IMMEDIATE_ALL)
    {
        op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
        op_data->unstable_writes = new obj_ver_id[loc_set.size()];
        {
            int last_start = 0;
            for (auto & chunk: loc_set)
            {
                op_data->unstable_writes[last_start] = (obj_ver_id){
                    .oid = {
                        .inode = op_data->oid.inode,
                        .stripe = op_data->oid.stripe | chunk.role,
                    },
                    .version = op_data->fact_ver,
                };
                op_data->unstable_write_osds->push_back((unstable_osd_num_t){
                    .osd_num = chunk.osd_num,
                    .start = last_start,
                    .len = 1,
                });
                last_start++;
            }
        }
        submit_primary_stab_subops(cur_op);
 resume_6:
        op_data->st = 6;
        return false;
 resume_7:
        // FIXME: Free those in the destructor?
        delete op_data->unstable_write_osds;
        delete[] op_data->unstable_writes;
        op_data->unstable_writes = NULL;
        op_data->unstable_write_osds = NULL;
        if (op_data->errors > 0)
        {
            pg_cancel_write_queue(pg, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
            return false;
        }
    }
    else
    {
        // Remember version as unstable
        for (auto & chunk: loc_set)
        {
            this->unstable_writes[(osd_object_id_t){
                .osd_num = chunk.osd_num,
                .oid = {
                    .inode = op_data->oid.inode,
                    .stripe = op_data->oid.stripe | chunk.role,
                },
            }] = op_data->fact_ver;
        }
        // Remember PG as dirty to drop the connection when PG goes offline
        // (this is required because of the "lazy sync")
        c_cli.clients[cur_op->peer_fd].dirty_pgs.insert(op_data->pg_num);
        dirty_pgs.insert(op_data->pg_num);
    }
    return true;
 }
 // Save and clear unstable_writes -> SYNC all -> STABLE all
 void osd_t::continue_primary_sync(osd_op_t *cur_op)
 {
    if (!cur_op->op_data)
    {
        cur_op->op_data = (osd_primary_op_data_t*)calloc(sizeof(osd_primary_op_data_t), 1);
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
    if (op_data->st == 1)      goto resume_1;
    else if (op_data->st == 2) goto resume_2;
    else if (op_data->st == 3) goto resume_3;
    else if (op_data->st == 4) goto resume_4;
    else if (op_data->st == 5) goto resume_5;
    else if (op_data->st == 6) goto resume_6;
    assert(op_data->st == 0);
    if (syncs_in_progress.size() > 0)
    {
        // Wait for previous syncs, if any
        // FIXME: We may try to execute the current one in parallel, like in Blockstore, but I'm not sure if it matters at all
        syncs_in_progress.push_back(cur_op);
        op_data->st = 1;
 resume_1:
        return;
    }
    else
    {
        syncs_in_progress.push_back(cur_op);
    }
 resume_2:
    if (unstable_writes.size() == 0)
    {
        // Nothing to sync
        goto finish;
    }
    // Save and clear unstable_writes
    // In theory it is possible to do it on a per-client basis, but this seems to be an unnecessary complication
    // It would be cool not to copy these here at all, but someone has to deduplicate them by object IDs anyway
    {
        op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
        op_data->unstable_writes = new obj_ver_id[this->unstable_writes.size()];
        op_data->dirty_pgs = new pg_num_t[dirty_pgs.size()];
        op_data->dirty_pg_count = dirty_pgs.size();
        osd_num_t last_osd = 0;
        int last_start = 0, last_end = 0;
        for (auto it = this->unstable_writes.begin(); it != this->unstable_writes.end(); it++)
        {
            if (last_osd != it->first.osd_num)
            {
                if (last_osd != 0)
                {
                    op_data->unstable_write_osds->push_back((unstable_osd_num_t){
                        .osd_num = last_osd,
                        .start = last_start,
                        .len = last_end - last_start,
                    });
                }
                last_osd = it->first.osd_num;
                last_start = last_end;
            }
            op_data->unstable_writes[last_end] = (obj_ver_id){
                .oid = it->first.oid,
                .version = it->second,
            };
            last_end++;
        }
        if (last_osd != 0)
        {
            op_data->unstable_write_osds->push_back((unstable_osd_num_t){
                .osd_num = last_osd,
                .start = last_start,
                .len = last_end - last_start,
            });
        }
        int dpg = 0;
        for (auto dirty_pg_num: dirty_pgs)
        {
            pgs[dirty_pg_num].inflight++;
            op_data->dirty_pgs[dpg++] = dirty_pg_num;
        }
        dirty_pgs.clear();
        this->unstable_writes.clear();
    }
    if (immediate_commit != IMMEDIATE_ALL)
    {
        // SYNC
        submit_primary_sync_subops(cur_op);
 resume_3:
        op_data->st = 3;
        return;
 resume_4:
        if (op_data->errors > 0)
        {
            goto resume_6;
        }
    }
    // Stabilize version sets
    submit_primary_stab_subops(cur_op);
 resume_5:
    op_data->st = 5;
    return;
 resume_6:
    if (op_data->errors > 0)
    {
        // Return objects back into the unstable write set
        for (auto unstable_osd: *(op_data->unstable_write_osds))
        {
            for (int i = 0; i < unstable_osd.len; i++)
            {
                // Except those from peered PGs
                auto & w = op_data->unstable_writes[unstable_osd.start + i];
                pg_num_t wpg = map_to_pg(w.oid);
                if (pgs[wpg].state & PG_ACTIVE)
                {
                    uint64_t & dest = this->unstable_writes[(osd_object_id_t){
                        .osd_num = unstable_osd.osd_num,
                        .oid = w.oid,
                    }];
                    dest = dest < w.version ? w.version : dest;
                    dirty_pgs.insert(wpg);
                }
            }
        }
    }
    for (int i = 0; i < op_data->dirty_pg_count; i++)
    {
        auto & pg = pgs.at(op_data->dirty_pgs[i]);
        pg.inflight--;
        if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
        {
            finish_stop_pg(pg);
        }
    }
    // FIXME: Free those in the destructor?
    delete[] op_data->dirty_pgs;
    delete op_data->unstable_write_osds;
    delete[] op_data->unstable_writes;
    op_data->unstable_writes = NULL;
    op_data->unstable_write_osds = NULL;
    if (op_data->errors > 0)
    {
        finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
    }
    else
    {
 finish:
        if (cur_op->peer_fd)
        {
            auto it = c_cli.clients.find(cur_op->peer_fd);
            if (it != c_cli.clients.end())
                it->second.dirty_pgs.clear();
        }
        finish_op(cur_op, 0);
    }
    assert(syncs_in_progress.front() == cur_op);
    syncs_in_progress.pop_front();
    if (syncs_in_progress.size() > 0)
    {
        cur_op = syncs_in_progress.front();
        op_data = cur_op->op_data;
        op_data->st++;
        goto resume_2;
    }
 }
 // Decrement pg_osd_set_state_t's object_count and change PG state accordingly
 void osd_t::remove_object_from_state(object_id & oid, pg_osd_set_state_t *object_state, pg_t & pg)
 {
    if (object_state->state & OBJ_INCOMPLETE)
    {
        // Successful write means that object is not incomplete anymore
        this->incomplete_objects--;
        pg.incomplete_objects.erase(oid);
        if (!pg.incomplete_objects.size())
        {
            pg.state = pg.state & ~PG_HAS_INCOMPLETE;
            report_pg_state(pg);
        }
    }
    else if (object_state->state & OBJ_DEGRADED)
    {
        this->degraded_objects--;
        pg.degraded_objects.erase(oid);
        if (!pg.degraded_objects.size())
        {
            pg.state = pg.state & ~PG_HAS_DEGRADED;
            report_pg_state(pg);
        }
    }
    else if (object_state->state & OBJ_MISPLACED)
    {
        this->misplaced_objects--;
        pg.misplaced_objects.erase(oid);
        if (!pg.misplaced_objects.size())
        {
            pg.state = pg.state & ~PG_HAS_MISPLACED;
            report_pg_state(pg);
        }
    }
    else
    {
        throw std::runtime_error("BUG: Invalid object state: "+std::to_string(object_state->state));
    }
    object_state->object_count--;
    if (!object_state->object_count)
    {
        pg.state_dict.erase(object_state->osd_set);
    }
 }
 void osd_t::continue_primary_del(osd_op_t *cur_op)
 {
    if (!cur_op->op_data && !prepare_primary_rw(cur_op))
    {
        return;
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
    auto & pg = pgs[op_data->pg_num];
    if (op_data->st == 1)      goto resume_1;
    else if (op_data->st == 2) goto resume_2;
    else if (op_data->st == 3) goto resume_3;
    else if (op_data->st == 4) goto resume_4;
    else if (op_data->st == 5) goto resume_5;
    else if (op_data->st == 6) goto resume_6;
    else if (op_data->st == 7) goto resume_7;
    assert(op_data->st == 0);
    // Delete is forbidden even in active PGs if they're also degraded or have previous dead OSDs
    if (pg.state & (PG_DEGRADED | PG_LEFT_ON_DEAD))
    {
        finish_op(cur_op, -EBUSY);
        return;
    }
    if (!check_write_queue(cur_op, pg))
    {
        return;
    }
 resume_1:
    // Determine which OSDs contain this object and delete it
    op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
    // Submit 1 read to determine the actual version number
    submit_primary_subops(SUBMIT_RMW_READ, pg.pg_size, op_data->prev_set, cur_op);
 resume_2:
    op_data->st = 2;
    return;
 resume_3:
    if (op_data->errors > 0)
    {
        pg_cancel_write_queue(pg, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
        return;
    }
    // Save version override for parallel reads
    pg.ver_override[op_data->oid] = op_data->fact_ver;
    // Submit deletes
    op_data->fact_ver++;
    submit_primary_del_subops(cur_op, NULL, op_data->object_state ? op_data->object_state->osd_set : pg.cur_loc_set);
 resume_4:
    op_data->st = 4;
    return;
 resume_5:
    if (op_data->errors > 0)
    {
        pg_cancel_write_queue(pg, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
        return;
    }
    // Remove version override
    pg.ver_override.erase(op_data->oid);
 resume_6:
 resume_7:
    if (!finalize_primary_write(cur_op, pg, op_data->object_state ? op_data->object_state->osd_set : pg.cur_loc_set, 6))
    {
        return;
    }
    // Adjust PG stats after "instant stabilize", because we need object_state above
    if (!op_data->object_state)
    {
        pg.clean_count--;
    }
    else
    {
        remove_object_from_state(op_data->oid, op_data->object_state, pg);
    }
    pg.total_count--;
    object_id oid = op_data->oid;
    finish_op(cur_op, cur_op->req.rw.len);
    // Continue other write operations to the same object
    auto next_it = pg.write_queue.find(oid);
    auto this_it = next_it;
    next_it++;
    pg.write_queue.erase(this_it);
    if (next_it != pg.write_queue.end() &&
        next_it->first == oid)
    {
        osd_op_t *next_op = next_it->second;
        continue_primary_write(next_op);
    }
 }
--- a/osd_primary.h
+++ b/osd_primary.h
@ -1,35 +0,0 @@
 #pragma once
 #include "osd.h"
 #include "osd_rmw.h"
 #define SUBMIT_READ 0
 #define SUBMIT_RMW_READ 1
 #define SUBMIT_WRITE 2
 struct unstable_osd_num_t
 {
    osd_num_t osd_num;
    int start, len;
 };
 struct osd_primary_op_data_t
 {
    int st = 0;
    pg_num_t pg_num;
    object_id oid;
    uint64_t target_ver;
    uint64_t fact_ver = 0;
    int n_subops = 0, done = 0, errors = 0, epipe = 0;
    int degraded = 0, pg_size, pg_minsize;
    osd_rmw_stripe_t *stripes;
    osd_op_t *subops = NULL;
    uint64_t *prev_set = NULL;
    pg_osd_set_state_t *object_state = NULL;
    // for sync. oops, requires freeing
    std::vector<unstable_osd_num_t> *unstable_write_osds = NULL;
    pg_num_t *dirty_pgs = NULL;
    int dirty_pg_count = 0;
    obj_ver_id *unstable_writes = NULL;
 };
--- a/osd_primary_subops.cpp
+++ b/osd_primary_subops.cpp
@ -1,489 +0,0 @@
 #include "osd_primary.h"
 void osd_t::autosync()
 {
    // FIXME Autosync based on the number of unstable writes to prevent
    // "journal_sector_buffer_count is too low for this batch" errors
    if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
    {
        autosync_op = new osd_op_t();
        autosync_op->op_type = OSD_OP_IN;
        autosync_op->req = {
            .sync = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = 1,
                    .opcode = OSD_OP_SYNC,
                },
            },
        };
        autosync_op->callback = [this](osd_op_t *op)
        {
            if (op->reply.hdr.retval < 0)
            {
                printf("Warning: automatic sync resulted in an error: %ld (%s)\n", -op->reply.hdr.retval, strerror(-op->reply.hdr.retval));
            }
            delete autosync_op;
            autosync_op = NULL;
        };
        exec_op(autosync_op);
    }
 }
 void osd_t::finish_op(osd_op_t *cur_op, int retval)
 {
    inflight_ops--;
    if (cur_op->op_data && cur_op->op_data->pg_num > 0)
    {
        auto & pg = pgs[cur_op->op_data->pg_num];
        pg.inflight--;
        assert(pg.inflight >= 0);
        if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
        {
            finish_stop_pg(pg);
        }
    }
    if (!cur_op->peer_fd)
    {
        // Copy lambda to be unaffected by `delete op`
        std::function<void(osd_op_t*)>(cur_op->callback)(cur_op);
    }
    else
    {
        // FIXME add separate magic number
        auto cl_it = c_cli.clients.find(cur_op->peer_fd);
        if (cl_it != c_cli.clients.end())
        {
            cur_op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
            cur_op->reply.hdr.id = cur_op->req.hdr.id;
            cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
            cur_op->reply.hdr.retval = retval;
            c_cli.outbox_push(cur_op);
        }
        else
        {
            delete cur_op;
        }
    }
 }
 void osd_t::submit_primary_subops(int submit_type, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
 {
    bool w = submit_type == SUBMIT_WRITE;
    osd_primary_op_data_t *op_data = cur_op->op_data;
    osd_rmw_stripe_t *stripes = op_data->stripes;
    // Allocate subops
    int n_subops = 0, zero_read = -1;
    for (int role = 0; role < pg_size; role++)
    {
        if (osd_set[role] == this->osd_num || osd_set[role] != 0 && zero_read == -1)
        {
            zero_read = role;
        }
        if (osd_set[role] != 0 && (w || stripes[role].read_end != 0))
        {
            n_subops++;
        }
    }
    if (!n_subops && submit_type == SUBMIT_RMW_READ)
    {
        n_subops = 1;
    }
    else
    {
        zero_read = -1;
    }
    uint64_t op_version = w ? op_data->fact_ver+1 : (submit_type == SUBMIT_RMW_READ ? UINT64_MAX : op_data->target_ver);
    osd_op_t *subops = new osd_op_t[n_subops];
    op_data->fact_ver = 0;
    op_data->done = op_data->errors = 0;
    op_data->n_subops = n_subops;
    op_data->subops = subops;
    int i = 0;
    for (int role = 0; role < pg_size; role++)
    {
        // We always submit zero-length writes to all replicas, even if the stripe is not modified
        if (!(w || stripes[role].read_end != 0 || zero_read == role))
        {
            continue;
        }
        osd_num_t role_osd_num = osd_set[role];
        if (role_osd_num != 0)
        {
            if (role_osd_num == this->osd_num)
            {
                clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
                subops[i].op_type = (uint64_t)cur_op;
                subops[i].bs_op = new blockstore_op_t({
                    .opcode = (uint64_t)(w ? BS_OP_WRITE : BS_OP_READ),
                    .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
                    {
                        handle_primary_bs_subop(subop);
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
                        .stripe = op_data->oid.stripe | role,
                    },
                    .version = op_version,
                    .offset = w ? stripes[role].write_start : stripes[role].read_start,
                    .len = w ? stripes[role].write_end - stripes[role].write_start : stripes[role].read_end - stripes[role].read_start,
                    .buf = w ? stripes[role].write_buf : stripes[role].read_buf,
                });
                bs->enqueue_op(subops[i].bs_op);
            }
            else
            {
                subops[i].op_type = OSD_OP_OUT;
                subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
                subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
                subops[i].req.sec_rw = {
                    .header = {
                        .magic = SECONDARY_OSD_OP_MAGIC,
                        .id = c_cli.next_subop_id++,
                        .opcode = (uint64_t)(w ? OSD_OP_SECONDARY_WRITE : OSD_OP_SECONDARY_READ),
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
                        .stripe = op_data->oid.stripe | role,
                    },
                    .version = op_version,
                    .offset = w ? stripes[role].write_start : stripes[role].read_start,
                    .len = w ? stripes[role].write_end - stripes[role].write_start : stripes[role].read_end - stripes[role].read_start,
                };
                subops[i].buf = w ? stripes[role].write_buf : stripes[role].read_buf;
                if (w && stripes[role].write_end > 0)
                {
                    subops[i].send_list.push_back(stripes[role].write_buf, stripes[role].write_end - stripes[role].write_start);
                }
                subops[i].callback = [cur_op, this](osd_op_t *subop)
                {
                    int fail_fd = subop->req.hdr.opcode == OSD_OP_SECONDARY_WRITE &&
                        subop->reply.hdr.retval != subop->req.sec_rw.len ? subop->peer_fd : -1;
                    // so it doesn't get freed
                    subop->buf = NULL;
                    handle_primary_subop(
                        subop->req.hdr.opcode, cur_op, subop->reply.hdr.retval,
                        subop->req.sec_rw.len, subop->reply.sec_rw.version
                    );
                    if (fail_fd >= 0)
                    {
                        // write operation failed, drop the connection
                        c_cli.stop_client(fail_fd);
                    }
                };
                c_cli.outbox_push(&subops[i]);
            }
            i++;
        }
    }
 }
 static uint64_t bs_op_to_osd_op[] = {
    0,
    OSD_OP_SECONDARY_READ,      // BS_OP_READ
    OSD_OP_SECONDARY_WRITE,     // BS_OP_WRITE
    OSD_OP_SECONDARY_SYNC,      // BS_OP_SYNC
    OSD_OP_SECONDARY_STABILIZE, // BS_OP_STABLE
    OSD_OP_SECONDARY_DELETE,    // BS_OP_DELETE
    OSD_OP_SECONDARY_LIST,      // BS_OP_LIST
    OSD_OP_SECONDARY_ROLLBACK,  // BS_OP_ROLLBACK
    OSD_OP_TEST_SYNC_STAB_ALL,  // BS_OP_SYNC_STAB_ALL
 };
 void osd_t::handle_primary_bs_subop(osd_op_t *subop)
 {
    osd_op_t *cur_op = (osd_op_t*)subop->op_type;
    blockstore_op_t *bs_op = subop->bs_op;
    int expected = bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE ? bs_op->len : 0;
    if (bs_op->retval != expected && bs_op->opcode != BS_OP_READ)
    {
        // die
        throw std::runtime_error(
            "local blockstore modification failed (opcode = "+std::to_string(bs_op->opcode)+
            " retval = "+std::to_string(bs_op->retval)+")"
        );
    }
    add_bs_subop_stats(subop);
    uint64_t opcode = bs_op_to_osd_op[bs_op->opcode];
    int retval = bs_op->retval;
    uint64_t version = bs_op->version;
    delete bs_op;
    subop->bs_op = NULL;
    handle_primary_subop(opcode, cur_op, retval, expected, version);
 }
 void osd_t::add_bs_subop_stats(osd_op_t *subop)
 {
    // Include local blockstore ops in statistics
    uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
    timespec tv_end;
    clock_gettime(CLOCK_REALTIME, &tv_end);
    c_cli.stats.op_stat_count[opcode]++;
    if (!c_cli.stats.op_stat_count[opcode])
    {
        c_cli.stats.op_stat_count[opcode] = 1;
        c_cli.stats.op_stat_sum[opcode] = 0;
        c_cli.stats.op_stat_bytes[opcode] = 0;
    }
    c_cli.stats.op_stat_sum[opcode] += (
        (tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
        (tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
    );
    if (opcode == OSD_OP_SECONDARY_READ || opcode == OSD_OP_SECONDARY_WRITE)
    {
        c_cli.stats.op_stat_bytes[opcode] += subop->bs_op->len;
    }
 }
 void osd_t::handle_primary_subop(uint64_t opcode, osd_op_t *cur_op, int retval, int expected, uint64_t version)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
    if (retval != expected)
    {
        printf("%s subop failed: retval = %d (expected %d)\n", osd_op_names[opcode], retval, expected);
        if (retval == -EPIPE)
        {
            op_data->epipe++;
        }
        op_data->errors++;
    }
    else
    {
        op_data->done++;
        if (opcode == OSD_OP_SECONDARY_READ || opcode == OSD_OP_SECONDARY_WRITE)
        {
            if (op_data->fact_ver != 0 && op_data->fact_ver != version)
            {
                throw std::runtime_error(
                    "different fact_versions returned from "+std::string(osd_op_names[opcode])+
                    " subops: "+std::to_string(version)+" vs "+std::to_string(op_data->fact_ver)
                );
            }
            op_data->fact_ver = version;
        }
    }
    if ((op_data->errors + op_data->done) >= op_data->n_subops)
    {
        delete[] op_data->subops;
        op_data->subops = NULL;
        op_data->st++;
        if (cur_op->req.hdr.opcode == OSD_OP_READ)
        {
            continue_primary_read(cur_op);
        }
        else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
        {
            continue_primary_write(cur_op);
        }
        else if (cur_op->req.hdr.opcode == OSD_OP_SYNC)
        {
            continue_primary_sync(cur_op);
        }
        else if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
        {
            continue_primary_del(cur_op);
        }
        else
        {
            throw std::runtime_error("BUG: unknown opcode");
        }
    }
 }
 void osd_t::submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, pg_osd_set_t & loc_set)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
    int extra_chunks = 0;
    for (auto & chunk: loc_set)
    {
        if (!cur_set || chunk.osd_num != cur_set[chunk.role])
        {
            extra_chunks++;
        }
    }
    op_data->n_subops = extra_chunks;
    op_data->done = op_data->errors = 0;
    if (!extra_chunks)
    {
        return;
    }
    osd_op_t *subops = new osd_op_t[extra_chunks];
    op_data->subops = subops;
    int i = 0;
    for (auto & chunk: loc_set)
    {
        if (!cur_set || chunk.osd_num != cur_set[chunk.role])
        {
            if (chunk.osd_num == this->osd_num)
            {
                clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
                subops[i].op_type = (uint64_t)cur_op;
                subops[i].bs_op = new blockstore_op_t({
                    .opcode = BS_OP_DELETE,
                    .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
                    {
                        handle_primary_bs_subop(subop);
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
                        .stripe = op_data->oid.stripe | chunk.role,
                    },
                    // Same version as write
                    .version = op_data->fact_ver,
                });
                bs->enqueue_op(subops[i].bs_op);
            }
            else
            {
                subops[i].op_type = OSD_OP_OUT;
                subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
                subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
                subops[i].req.sec_del = {
                    .header = {
                        .magic = SECONDARY_OSD_OP_MAGIC,
                        .id = c_cli.next_subop_id++,
                        .opcode = OSD_OP_SECONDARY_DELETE,
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
                        .stripe = op_data->oid.stripe | chunk.role,
                    },
                    // Same version as write
                    .version = op_data->fact_ver,
                };
                subops[i].callback = [cur_op, this](osd_op_t *subop)
                {
                    int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
                    handle_primary_subop(OSD_OP_SECONDARY_DELETE, cur_op, subop->reply.hdr.retval, 0, 0);
                    if (fail_fd >= 0)
                    {
                        // delete operation failed, drop the connection
                        c_cli.stop_client(fail_fd);
                    }
                };
                c_cli.outbox_push(&subops[i]);
            }
            i++;
        }
    }
 }
 void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
    int n_osds = op_data->unstable_write_osds->size();
    osd_op_t *subops = new osd_op_t[n_osds];
    op_data->done = op_data->errors = 0;
    op_data->n_subops = n_osds;
    op_data->subops = subops;
    for (int i = 0; i < n_osds; i++)
    {
        osd_num_t sync_osd = (*(op_data->unstable_write_osds))[i].osd_num;
        if (sync_osd == this->osd_num)
        {
            clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
            subops[i].op_type = (uint64_t)cur_op;
            subops[i].bs_op = new blockstore_op_t({
                .opcode = BS_OP_SYNC,
                .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
                {
                    handle_primary_bs_subop(subop);
                },
            });
            bs->enqueue_op(subops[i].bs_op);
        }
        else
        {
            subops[i].op_type = OSD_OP_OUT;
            subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
            subops[i].peer_fd = c_cli.osd_peer_fds.at(sync_osd);
            subops[i].req.sec_sync = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
                    .opcode = OSD_OP_SECONDARY_SYNC,
                },
            };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
                int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
                handle_primary_subop(OSD_OP_SECONDARY_SYNC, cur_op, subop->reply.hdr.retval, 0, 0);
                if (fail_fd >= 0)
                {
                    // sync operation failed, drop the connection
                    c_cli.stop_client(fail_fd);
                }
            };
            c_cli.outbox_push(&subops[i]);
        }
    }
 }
 void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
    int n_osds = op_data->unstable_write_osds->size();
    osd_op_t *subops = new osd_op_t[n_osds];
    op_data->done = op_data->errors = 0;
    op_data->n_subops = n_osds;
    op_data->subops = subops;
    for (int i = 0; i < n_osds; i++)
    {
        auto & stab_osd = (*(op_data->unstable_write_osds))[i];
        if (stab_osd.osd_num == this->osd_num)
        {
            clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
            subops[i].op_type = (uint64_t)cur_op;
            subops[i].bs_op = new blockstore_op_t({
                .opcode = BS_OP_STABLE,
                .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
                {
                    handle_primary_bs_subop(subop);
                },
                .len = (uint32_t)stab_osd.len,
                .buf = (void*)(op_data->unstable_writes + stab_osd.start),
            });
            bs->enqueue_op(subops[i].bs_op);
        }
        else
        {
            subops[i].op_type = OSD_OP_OUT;
            subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
            subops[i].peer_fd = c_cli.osd_peer_fds.at(stab_osd.osd_num);
            subops[i].req.sec_stab = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
                    .opcode = OSD_OP_SECONDARY_STABILIZE,
                },
                .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
            };
            subops[i].send_list.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
                int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
                handle_primary_subop(OSD_OP_SECONDARY_STABILIZE, cur_op, subop->reply.hdr.retval, 0, 0);
                if (fail_fd >= 0)
                {
                    // stabilize operation failed, drop the connection
                    c_cli.stop_client(fail_fd);
                }
            };
            c_cli.outbox_push(&subops[i]);
        }
    }
 }
 void osd_t::pg_cancel_write_queue(pg_t & pg, object_id oid, int retval)
 {
    auto st_it = pg.write_queue.find(oid), it = st_it;
    while (it != pg.write_queue.end() && it->first == oid)
    {
        finish_op(it->second, retval);
        it++;
    }
    if (st_it != it)
    {
        pg.write_queue.erase(st_it, it);
    }
 }
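Note on pg_cancel_write_queue() above: it finishes every queued entry that shares one key and then erases them as a single iterator range. Below is a minimal standalone sketch of that pattern, assuming the queue behaves like a std::multimap. The int key and value types, the `finished` vector and the name cancel_queue are hypothetical stand-ins for object_id, osd_op_t* and finish_op(). The sketch uses equal_range() instead of find(), because the standard allows multimap::find() to return any of the elements with an equal key.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for pg_cancel_write_queue(): report every entry queued
// under `key` through `finished`, then erase the whole equal-key range at once.
// Returns the number of cancelled entries.
static int cancel_queue(std::multimap<int, int> & queue, int key, std::vector<int> & finished)
{
    auto range = queue.equal_range(key);
    int n = 0;
    for (auto it = range.first; it != range.second; it++)
    {
        finished.push_back(it->second);
        n++;
    }
    // Erasing an empty range is a no-op, so no extra check is needed
    queue.erase(range.first, range.second);
    assert(queue.count(key) == 0);
    return n;
}
```

Entries with equal keys keep their insertion order in a std::multimap, so the cancelled entries are reported in the order they were queued.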
--- a/osd_receive.cpp
+++ b/osd_receive.cpp
@ -1,272 +0,0 @@
 #include "cluster_client.h"
 void cluster_client_t::read_requests()
 {
    for (int i = 0; i < read_ready_clients.size(); i++)
    {
        int peer_fd = read_ready_clients[i];
        auto & cl = clients[peer_fd];
        io_uring_sqe* sqe = ringloop->get_sqe();
        if (!sqe)
        {
            read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + i);
            return;
        }
        ring_data_t* data = ((ring_data_t*)sqe->user_data);
        if (!cl.read_op || cl.read_remaining < receive_buffer_size)
        {
            cl.read_iov.iov_base = cl.in_buf;
            cl.read_iov.iov_len = receive_buffer_size;
        }
        else
        {
            cl.read_iov.iov_base = cl.read_buf;
            cl.read_iov.iov_len = cl.read_remaining;
        }
        cl.read_msg.msg_iov = &cl.read_iov;
        cl.read_msg.msg_iovlen = 1;
        data->callback = [this, peer_fd](ring_data_t *data) { handle_read(data, peer_fd); };
        my_uring_prep_recvmsg(sqe, peer_fd, &cl.read_msg, 0);
    }
    read_ready_clients.clear();
 }
 void cluster_client_t::handle_read(ring_data_t *data, int peer_fd)
 {
    auto cl_it = clients.find(peer_fd);
    if (cl_it != clients.end())
    {
        auto & cl = cl_it->second;
        if (data->res < 0 && data->res != -EAGAIN)
        {
            // this is a client socket, so don't panic. just disconnect it
            printf("Client %d socket read error: %d (%s). Disconnecting client\n", peer_fd, -data->res, strerror(-data->res));
            stop_client(peer_fd);
            return;
        }
        if (data->res == -EAGAIN || (cl.read_iov.iov_base == cl.in_buf && data->res < receive_buffer_size))
        {
            cl.read_ready--;
            if (cl.read_ready > 0)
                read_ready_clients.push_back(peer_fd);
        }
        else
        {
            read_ready_clients.push_back(peer_fd);
        }
        if (data->res == -EAGAIN)
        {
            return;
        }
        if (data->res > 0)
        {
            if (cl.read_iov.iov_base == cl.in_buf)
            {
                // Compose operation(s) from the buffer
                int remain = data->res;
                void *curbuf = cl.in_buf;
                while (remain > 0)
                {
                    if (!cl.read_op)
                    {
                        cl.read_op = new osd_op_t;
                        cl.read_op->peer_fd = peer_fd;
                        cl.read_op->op_type = OSD_OP_IN;
                        cl.read_buf = cl.read_op->req.buf;
                        cl.read_remaining = OSD_PACKET_SIZE;
                        cl.read_state = CL_READ_HDR;
                    }
                    if (cl.read_remaining > remain)
                    {
                        memcpy(cl.read_buf, curbuf, remain);
                        cl.read_remaining -= remain;
                        cl.read_buf += remain;
                        remain = 0;
                        if (cl.read_remaining <= 0)
                            handle_finished_read(cl);
                    }
                    else
                    {
                        memcpy(cl.read_buf, curbuf, cl.read_remaining);
                        curbuf += cl.read_remaining;
                        remain -= cl.read_remaining;
                        cl.read_remaining = 0;
                        cl.read_buf = NULL;
                        handle_finished_read(cl);
                    }
                }
            }
            else
            {
                // Long data
                cl.read_remaining -= data->res;
                cl.read_buf += data->res;
                if (cl.read_remaining <= 0)
                {
                    handle_finished_read(cl);
                }
            }
        }
    }
 }
 void cluster_client_t::handle_finished_read(osd_client_t & cl)
 {
    if (cl.read_state == CL_READ_HDR)
    {
        if (cl.read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
            handle_reply_hdr(&cl);
        else
            handle_op_hdr(&cl);
    }
    else if (cl.read_state == CL_READ_DATA)
    {
        // Operation is ready
        exec_op(cl.read_op);
        cl.read_op = NULL;
        cl.read_state = 0;
    }
    else if (cl.read_state == CL_READ_REPLY_DATA)
    {
        // Reply is ready
        auto req_it = cl.sent_ops.find(cl.read_reply_id);
        osd_op_t *request = req_it->second;
        cl.sent_ops.erase(req_it);
        cl.read_reply_id = 0;
        delete cl.read_op;
        cl.read_op = NULL;
        cl.read_state = 0;
        // Measure subop latency
        timespec tv_end;
        clock_gettime(CLOCK_REALTIME, &tv_end);
        stats.subop_stat_count[request->req.hdr.opcode]++;
        // The counter wrapped around to 0: restart the statistics
        if (!stats.subop_stat_count[request->req.hdr.opcode])
        {
            stats.subop_stat_count[request->req.hdr.opcode]++;
            stats.subop_stat_sum[request->req.hdr.opcode] = 0;
        }
        stats.subop_stat_sum[request->req.hdr.opcode] += (
            (tv_end.tv_sec - request->tv_begin.tv_sec)*1000000 +
            (tv_end.tv_nsec - request->tv_begin.tv_nsec)/1000
        );
        request->callback(request);
    }
    else
    {
        assert(0);
    }
 }
 void cluster_client_t::handle_op_hdr(osd_client_t *cl)
 {
    osd_op_t *cur_op = cl->read_op;
    if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ)
    {
        if (cur_op->req.sec_rw.len > 0)
            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
        cl->read_remaining = 0;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
    {
        if (cur_op->req.sec_rw.len > 0)
            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
        cl->read_remaining = cur_op->req.sec_rw.len;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ||
        cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK)
    {
        if (cur_op->req.sec_stab.len > 0)
            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.sec_stab.len);
        cl->read_remaining = cur_op->req.sec_stab.len;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_READ)
    {
        if (cur_op->req.rw.len > 0)
            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.rw.len);
        cl->read_remaining = 0;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
    {
        if (cur_op->req.rw.len > 0)
            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.rw.len);
        cl->read_remaining = cur_op->req.rw.len;
    }
    if (cl->read_remaining > 0)
    {
        // Read data
        cl->read_buf = cur_op->buf;
        cl->read_state = CL_READ_DATA;
    }
    else
    {
        // Operation is ready
        cl->read_op = NULL;
        cl->read_state = 0;
        exec_op(cur_op);
    }
 }
 void cluster_client_t::handle_reply_hdr(osd_client_t *cl)
 {
    osd_op_t *cur_op = cl->read_op;
    auto req_it = cl->sent_ops.find(cur_op->req.hdr.id);
    if (req_it == cl->sent_ops.end())
    {
        // Command out of sync. Drop connection
        printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cur_op->req.hdr.id);
        stop_client(cl->peer_fd);
        return;
    }
    osd_op_t *op = req_it->second;
    memcpy(op->reply.buf, cur_op->req.buf, OSD_PACKET_SIZE);
    if (op->reply.hdr.opcode == OSD_OP_SECONDARY_READ &&
        op->reply.hdr.retval > 0)
    {
        // Read data. In this case we assume that the buffer is preallocated by the caller (!)
        assert(op->buf);
        cl->read_state = CL_READ_REPLY_DATA;
        cl->read_reply_id = op->req.hdr.id;
        cl->read_buf = op->buf;
        cl->read_remaining = op->reply.hdr.retval;
    }
    else if (op->reply.hdr.opcode == OSD_OP_SECONDARY_LIST &&
        op->reply.hdr.retval > 0)
    {
        op->buf = memalign(MEM_ALIGNMENT, sizeof(obj_ver_id) * op->reply.hdr.retval);
        cl->read_state = CL_READ_REPLY_DATA;
        cl->read_reply_id = op->req.hdr.id;
        cl->read_buf = op->buf;
        cl->read_remaining = sizeof(obj_ver_id) * op->reply.hdr.retval;
    }
    else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG &&
        op->reply.hdr.retval > 0)
    {
        op->buf = malloc(op->reply.hdr.retval);
        cl->read_state = CL_READ_REPLY_DATA;
        cl->read_reply_id = op->req.hdr.id;
        cl->read_buf = op->buf;
        cl->read_remaining = op->reply.hdr.retval;
    }
    else
    {
        delete cl->read_op;
        cl->read_state = 0;
        cl->read_op = NULL;
        cl->sent_ops.erase(req_it);
        // Measure subop latency
        timespec tv_end;
        clock_gettime(CLOCK_REALTIME, &tv_end);
        stats.subop_stat_count[op->req.hdr.opcode]++;
        // The counter wrapped around to 0: restart the statistics
        if (!stats.subop_stat_count[op->req.hdr.opcode])
        {
            stats.subop_stat_count[op->req.hdr.opcode]++;
            stats.subop_stat_sum[op->req.hdr.opcode] = 0;
        }
        stats.subop_stat_sum[op->req.hdr.opcode] += (
            (tv_end.tv_sec - op->tv_begin.tv_sec)*1000000 +
            (tv_end.tv_nsec - op->tv_begin.tv_nsec)/1000
        );
        // Copy lambda to be unaffected by `delete op`
        std::function<void(osd_op_t*)>(op->callback)(op);
    }
 }
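Note on handle_read() above: the "Compose operation(s) from the buffer" loop cuts an arbitrary run of received bytes into fixed-size packets, carrying a partially filled packet over to the next read. A minimal sketch of that loop follows, under simplifying assumptions: it handles only fixed-size headers (no payload state), and sketch_reader_t, SKETCH_PACKET_SIZE and feed() are hypothetical names that do not exist in the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical packet size; stands in for OSD_PACKET_SIZE
constexpr size_t SKETCH_PACKET_SIZE = 8;

struct sketch_reader_t
{
    uint8_t packet[SKETCH_PACKET_SIZE];
    size_t filled = 0; // bytes of the current packet received so far
    std::vector<std::vector<uint8_t>> done; // completed packets

    // Consume `remain` bytes from `buf`, completing as many packets as possible
    void feed(const uint8_t *buf, size_t remain)
    {
        while (remain > 0)
        {
            assert(filled < SKETCH_PACKET_SIZE);
            size_t want = SKETCH_PACKET_SIZE - filled;
            size_t take = remain < want ? remain : want;
            std::memcpy(packet + filled, buf, take);
            filled += take;
            buf += take;
            remain -= take;
            if (filled == SKETCH_PACKET_SIZE)
            {
                // Packet is complete: hand it over and start the next one
                done.emplace_back(packet, packet + SKETCH_PACKET_SIZE);
                filled = 0;
            }
        }
    }
};
```

Feeding 5 bytes and then 11 bytes yields two complete packets, which mirrors how one socket read may end in the middle of a header and the next read finishes it.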
--- a/osd_rmw.cpp
+++ b/osd_rmw.cpp
@ -1,450 +0,0 @@
 #include <malloc.h>
 #include <string.h>
 #include <assert.h>
 #include "xor.h"
 #include "osd_rmw.h"
 static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & stripe)
 {
    if (stripe.read_end == 0)
    {
        stripe.read_start = start;
        stripe.read_end = end;
    }
    else
    {
        if (stripe.read_end < end)
            stripe.read_end = end;
        if (stripe.read_start > start)
            stripe.read_start = start;
    }
 }
 static inline void cover_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & stripe)
 {
    // Subtract the requested write range [req_start, req_end) from [start, end)
    // and extend the read range of the stripe with what remains
    if (start >= stripe.req_start &&
        end <= stripe.req_end)
    {
        return;
    }
    if (start <= stripe.req_start &&
        end >= stripe.req_start &&
        end <= stripe.req_end)
    {
        end = stripe.req_start;
    }
    else if (start >= stripe.req_start &&
        start <= stripe.req_end &&
        end >= stripe.req_end)
    {
        start = stripe.req_end;
    }
    if (stripe.read_end == 0)
    {
        stripe.read_start = start;
        stripe.read_end = end;
    }
    else
    {
        if (stripe.read_end < end)
            stripe.read_end = end;
        if (stripe.read_start > start)
            stripe.read_start = start;
    }
 }
 void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t end, osd_rmw_stripe_t *stripes)
 {
    if (end == 0)
    {
        // Zero length request - offset doesn't matter
        return;
    }
    end = start+end;
    for (int role = 0; role < pg_minsize; role++)
    {
        if (start < (1+role)*bs_block_size && end > role*bs_block_size)
        {
            stripes[role].req_start = start < role*bs_block_size ? 0 : start-role*bs_block_size;
            stripes[role].req_end = end > (role+1)*bs_block_size ? bs_block_size : end-role*bs_block_size;
        }
    }
 }
 void reconstruct_stripe(osd_rmw_stripe_t *stripes, int pg_size, int role)
 {
    int prev = -2;
    for (int other = 0; other < pg_size; other++)
    {
        if (other != role)
        {
            if (prev == -2)
            {
                prev = other;
            }
            else if (prev >= 0)
            {
                assert(stripes[role].read_start >= stripes[prev].read_start &&
                    stripes[role].read_start >= stripes[other].read_start);
                memxor(
                    stripes[prev].read_buf + (stripes[role].read_start - stripes[prev].read_start),
                    stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                    stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                );
                prev = -1;
            }
            else
            {
                assert(stripes[role].read_start >= stripes[other].read_start);
                memxor(
                    stripes[role].read_buf,
                    stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                    stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                );
            }
        }
    }
 }
 int extend_missing_stripes(osd_rmw_stripe_t *stripes, osd_num_t *osd_set, int minsize, int size)
 {
    for (int role = 0; role < minsize; role++)
    {
        if (stripes[role].read_end != 0 && osd_set[role] == 0)
        {
            stripes[role].missing = true;
            // Stripe is missing. Extend read to other stripes.
            // We need at least pg_minsize stripes to recover the lost part.
            // FIXME: LRC EC and similar don't require reading all other stripes.
            int exist = 0;
            for (int j = 0; j < size; j++)
            {
                if (osd_set[j] != 0)
                {
                    extend_read(stripes[role].read_start, stripes[role].read_end, stripes[j]);
                    exist++;
                    if (exist >= minsize)
                    {
                        break;
                    }
                }
            }
            if (exist < minsize)
            {
                // Fewer than minsize stripes are available for this object
                return -1;
            }
        }
    }
    return 0;
 }
 void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t add_size)
 {
    // Calculate buffer size
    uint64_t buf_size = add_size;
    for (int role = 0; role < read_pg_size; role++)
    {
        if (stripes[role].read_end != 0)
        {
            buf_size += stripes[role].read_end - stripes[role].read_start;
        }
    }
    // Allocate buffer
    void *buf = memalign(MEM_ALIGNMENT, buf_size);
    uint64_t buf_pos = add_size;
    for (int role = 0; role < read_pg_size; role++)
    {
        if (stripes[role].read_end != 0)
        {
            stripes[role].read_buf = buf + buf_pos;
            buf_pos += stripes[role].read_end - stripes[role].read_start;
        }
    }
    return buf;
 }
 void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
    uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size)
 {
    // Generic parity modification (read-modify-write) algorithm
    // Read -> Reconstruct missing chunks -> Calc parity chunks -> Write
    // Currently we always read contiguous ranges. This means that an update of the beginning
    // of one data stripe and the end of another will lead to a read of full paired stripes.
    // FIXME: (Maybe) read small individual ranges in that case instead.
    uint32_t start = 0, end = 0;
    for (int role = 0; role < pg_minsize; role++)
    {
        if (stripes[role].req_end != 0)
        {
            start = !end || stripes[role].req_start < start ? stripes[role].req_start : start;
            end = std::max(stripes[role].req_end, end);
            stripes[role].write_start = stripes[role].req_start;
            stripes[role].write_end = stripes[role].req_end;
        }
    }
    int write_parity = 0;
    for (int role = pg_minsize; role < pg_size; role++)
    {
        if (write_osd_set[role] != 0)
        {
            write_parity = 1;
            stripes[role].write_start = start;
            stripes[role].write_end = end;
        }
    }
    if (write_parity)
    {
        for (int role = 0; role < pg_minsize; role++)
        {
            cover_read(start, end, stripes[role]);
        }
    }
    if (write_osd_set != read_osd_set)
    {
        pg_cursize = 0;
        // Object is degraded/misplaced and will be moved to <write_osd_set>
        for (int role = 0; role < pg_size; role++)
        {
            if (write_osd_set[role] != read_osd_set[role])
            {
                // FIXME: For EC more than 2+1: handle case when write_osd_set == 0 and read_osd_set != 0
                // We need to get data for any moved / recovered chunk
                // And we need a continuous write buffer so we'll only optimize
                // for the case when the whole chunk is overwritten in the request
                if (stripes[role].req_start != 0 ||
                    stripes[role].req_end != chunk_size)
                {
                    stripes[role].read_start = 0;
                    stripes[role].read_end = chunk_size;
                    // Warning: We don't modify write_start/write_end here, we do it in calc_rmw_parity()
                }
            }
            if (read_osd_set[role] != 0)
            {
                pg_cursize++;
            }
        }
    }
    if (pg_cursize < pg_size)
    {
        // Some stripe(s) are missing, so we need to read parity
        for (int role = 0; role < pg_size; role++)
        {
            if (read_osd_set[role] == 0)
            {
                stripes[role].missing = true;
                if (stripes[role].read_end != 0)
                {
                    int found = 0;
                    for (int r2 = 0; r2 < pg_size && found < pg_minsize; r2++)
                    {
                        // Read the non-covered range of <role> from at least <minsize> other stripes to reconstruct it
                        if (read_osd_set[r2] != 0)
                        {
                            extend_read(stripes[role].read_start, stripes[role].read_end, stripes[r2]);
                            found++;
                        }
                    }
                    if (found < pg_minsize)
                    {
                        // FIXME Object is incomplete - refuse partial overwrite
                        assert(0);
                    }
                }
            }
        }
    }
    // Allocate read buffers
    void *rmw_buf = alloc_read_buffer(stripes, pg_size, (write_parity ? pg_size-pg_minsize : 0) * (end - start));
    // Position write buffers
    uint64_t buf_pos = 0, in_pos = 0;
    for (int role = 0; role < pg_size; role++)
    {
        if (stripes[role].req_end != 0)
        {
            stripes[role].write_buf = request_buf + in_pos;
            in_pos += stripes[role].req_end - stripes[role].req_start;
        }
        else if (role >= pg_minsize && write_osd_set[role] != 0 && end != 0)
        {
            stripes[role].write_buf = rmw_buf + buf_pos;
            buf_pos += end - start;
        }
    }
    return rmw_buf;
 }
 static void get_old_new_buffers(osd_rmw_stripe_t & stripe, uint32_t wr_start, uint32_t wr_end, buf_len_t *bufs, int & nbufs)
 {
    uint32_t ns = 0, ne = 0, os = 0, oe = 0;
    if (stripe.req_end > wr_start &&
        stripe.req_start < wr_end)
    {
        ns = std::max(stripe.req_start, wr_start);
        ne = std::min(stripe.req_end, wr_end);
    }
    if (stripe.read_end > wr_start &&
        stripe.read_start < wr_end)
    {
        os = std::max(stripe.read_start, wr_start);
        oe = std::min(stripe.read_end, wr_end);
    }
    if (ne && (!oe || ns <= os))
    {
        // NEW or NEW->OLD
        bufs[nbufs++] = { .buf = stripe.write_buf + ns - stripe.req_start, .len = ne-ns };
        if (os < ne)
            os = ne;
        if (oe > os)
        {
            // NEW->OLD
            bufs[nbufs++] = { .buf = stripe.read_buf + os - stripe.read_start, .len = oe-os };
        }
    }
    else if (oe)
    {
        // OLD or OLD->NEW or OLD->NEW->OLD
        if (ne)
        {
            // OLD->NEW or OLD->NEW->OLD
            bufs[nbufs++] = { .buf = stripe.read_buf + os - stripe.read_start, .len = ns-os };
            bufs[nbufs++] = { .buf = stripe.write_buf + ns - stripe.req_start, .len = ne-ns };
            if (oe > ne)
            {
                // OLD->NEW->OLD
                bufs[nbufs++] = { .buf = stripe.read_buf + ne - stripe.read_start, .len = oe-ne };
            }
        }
        else
        {
            // OLD
            bufs[nbufs++] = { .buf = stripe.read_buf + os - stripe.read_start, .len = oe-os };
        }
    }
 }
 static void xor_multiple_buffers(buf_len_t *xor1, int n1, buf_len_t *xor2, int n2, void *dest, uint32_t len)
 {
    assert(n1 > 0 && n2 > 0);
    int i1 = 0, i2 = 0;
    uint32_t start1 = 0, start2 = 0, end1 = xor1[0].len, end2 = xor2[0].len;
    uint32_t pos = 0;
    while (pos < len)
    {
        // We know for sure that ranges overlap
        uint32_t end = std::min(end1, end2);
        memxor(xor1[i1].buf + pos-start1, xor2[i2].buf + pos-start2, dest+pos, end-pos);
        pos = end;
        if (pos >= end1)
        {
            i1++;
            if (i1 >= n1)
            {
                assert(pos >= end2);
                return;
            }
            start1 = end1;
            end1 += xor1[i1].len;
        }
        if (pos >= end2)
        {
            i2++;
            start2 = end2;
            end2 += xor2[i2].len;
        }
    }
 }
 void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size)
 {
    int pg_minsize = pg_size-1;
    for (int role = 0; role < pg_size; role++)
    {
        if (stripes[role].read_end != 0 && stripes[role].missing)
        {
            // Reconstruct missing stripe (EC k+1)
            reconstruct_stripe(stripes, pg_size, role);
            break;
        }
    }
    uint32_t start = 0, end = 0;
    if (!stripes[pg_minsize].missing || write_osd_set != read_osd_set)
    {
        for (int role = 0; role < pg_minsize; role++)
        {
            if (stripes[role].req_end != 0)
            {
                start = !end || stripes[role].req_start < start ? stripes[role].req_start : start;
                end = std::max(stripes[role].req_end, end);
            }
        }
    }
    if (write_osd_set != read_osd_set)
    {
        for (int role = 0; role < pg_minsize; role++)
        {
            if (write_osd_set[role] != read_osd_set[role] &&
                (stripes[role].req_start != 0 || stripes[role].req_end != chunk_size))
            {
                // FIXME again, handle case when write_osd_set[role] is 0
                // Copy modified chunk into the read buffer to write it back
                memcpy(
                    stripes[role].read_buf + stripes[role].req_start,
                    stripes[role].write_buf,
                    stripes[role].req_end - stripes[role].req_start
                );
                stripes[role].write_buf = stripes[role].read_buf;
                stripes[role].write_start = 0;
                stripes[role].write_end = chunk_size;
            }
        }
    }
    if (!stripes[pg_minsize].missing && end != 0)
    {
        // Calculate new parity (EC k+1)
        int parity = pg_minsize, prev = -2;
        for (int other = 0; other < pg_minsize; other++)
        {
            if (prev == -2)
            {
                prev = other;
            }
            else
            {
                int n1 = 0, n2 = 0;
                buf_len_t xor1[3], xor2[3];
                if (prev == -1)
                {
                    xor1[n1++] = { .buf = stripes[parity].write_buf, .len = end-start };
                }
                else
                {
                    get_old_new_buffers(stripes[prev], start, end, xor1, n1);
                    prev = -1;
                }
                get_old_new_buffers(stripes[other], start, end, xor2, n2);
                xor_multiple_buffers(xor1, n1, xor2, n2, stripes[parity].write_buf, end-start);
            }
        }
    }
    if (write_osd_set != read_osd_set)
    {
        for (int role = pg_minsize; role < pg_size; role++)
        {
            if (write_osd_set[role] != read_osd_set[role] && (start != 0 || end != chunk_size))
            {
                // Copy new parity into the read buffer to write it back
                memcpy(
                    stripes[role].read_buf + start,
                    stripes[role].write_buf,
                    end - start
                );
                stripes[role].write_buf = stripes[role].read_buf;
                stripes[role].write_start = 0;
                stripes[role].write_end = chunk_size;
            }
        }
    }
 }
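Note on reconstruct_stripe() and calc_rmw_parity() above: both rely on the EC k+1 property that the parity chunk is the XOR of all data chunks, so any single missing chunk equals the XOR of the remaining ones. A minimal sketch of that property follows. It assumes equal-sized chunks held in std::vector and ignores the read_start/read_end offsets used by the real code. xor_chunks() is a hypothetical helper and is not the memxor() from xor.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: XOR all chunks except index `skip` into a new buffer.
// Pass skip = -1 to include every chunk (this computes the parity).
static std::vector<uint8_t> xor_chunks(const std::vector<std::vector<uint8_t>> & chunks, int skip)
{
    assert(!chunks.empty());
    std::vector<uint8_t> out(chunks[0].size(), 0);
    for (int i = 0; i < (int)chunks.size(); i++)
    {
        if (i == skip)
            continue;
        assert(chunks[i].size() == out.size());
        for (size_t j = 0; j < out.size(); j++)
            out[j] ^= chunks[i][j];
    }
    return out;
}
```

With data chunks d0 and d1, xor_chunks({ d0, d1 }, -1) gives the parity, and calling it on { d0, <lost>, parity } with skip = 1 gives d1 back, whatever the lost slot contains.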
--- a/osd_rmw.h
+++ b/osd_rmw.h
@ -1,37 +0,0 @@
 #pragma once
 #include <stdint.h>
 #include "object_id.h"
 #include "osd_id.h"
 #ifndef MEM_ALIGNMENT
 #define MEM_ALIGNMENT 512
 #endif
 struct buf_len_t
 {
    void *buf;
    uint64_t len;
 };
 struct osd_rmw_stripe_t
 {
    void *read_buf, *write_buf;
    uint32_t req_start, req_end;
    uint32_t read_start, read_end;
    uint32_t write_start, write_end;
    bool missing;
 };
 void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t len, osd_rmw_stripe_t *stripes);
 void reconstruct_stripe(osd_rmw_stripe_t *stripes, int pg_size, int role);
 int extend_missing_stripes(osd_rmw_stripe_t *stripes, osd_num_t *osd_set, int minsize, int size);
 void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t add_size);
 void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
    uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size);
 void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size);
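Note on split_stripes() declared above: it maps a request (offset, length) within an object onto per-stripe [req_start, req_end) ranges. A standalone sketch of the same arithmetic follows, assuming 128K blocks as in test case 1 of osd_rmw_test.cpp. sketch_range_t and sketch_split() are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the req_start/req_end pair of osd_rmw_stripe_t
struct sketch_range_t
{
    uint32_t start = 0, end = 0;
};

// Split the request [offset, offset+len) across n_data stripes of `block` bytes each
static void sketch_split(uint32_t n_data, uint32_t block, uint32_t offset, uint32_t len, sketch_range_t *out)
{
    assert(block > 0);
    if (len == 0)
    {
        // Zero length request: nothing to map
        return;
    }
    uint32_t end = offset + len;
    for (uint32_t role = 0; role < n_data; role++)
    {
        uint32_t lo = role*block, hi = (role+1)*block;
        if (offset < hi && end > lo)
        {
            // Clamp the request to this stripe and make it stripe-relative
            out[role].start = offset < lo ? 0 : offset - lo;
            out[role].end = end > hi ? block : end - lo;
        }
    }
}
```

For offset = 128K-4K and len = 8K with two data stripes, this gives [128K-4K, 128K] for stripe 0 and [0, 4K] for stripe 1, and leaves the parity stripe at [0, 0].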
--- a/osd_rmw_test.cpp
+++ b/osd_rmw_test.cpp
@ -1,360 +0,0 @@
 #include <string.h>
 #include "osd_rmw.cpp"
 #include "test_pattern.h"
 void dump_stripes(osd_rmw_stripe_t *stripes, int pg_size);
 void test1();
 void test4();
 void test5();
 void test6();
 void test7();
 void test8();
 void test9();
 /***
 Cases:
 1. split(offset=128K-4K, len=8K)
   = [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 0 ] ]
 2. read(offset=128K-4K, len=8K, osd_set=[1,0,3])
   = { read: [ [ 0, 128K ], [ 0, 4K ], [ 0, 4K ] ] }
 3. cover_read(0, 128K, { req: [ 128K-4K, 128K ] })
   = { read: [ 0, 128K-4K ] }
 4. write(offset=128K-4K, len=8K, osd_set=[1,0,3])
   = {
     read: [ [ 0, 128K ], [ 4K, 128K ], [ 4K, 128K ] ],
     write: [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 128K ] ],
     input buffer: [ write0, write1 ],
     rmw buffer: [ write2, read0, read1, read2 ],
   }
   + check write2 buffer
 5. write(offset=0, len=128K+64K, osd_set=[1,0,3])
   = {
     req: [ [ 0, 128K ], [ 0, 64K ], [ 0, 0 ] ],
     read: [ [ 64K, 128K ], [ 64K, 128K ], [ 64K, 128K ] ],
     write: [ [ 0, 128K ], [ 0, 64K ], [ 0, 128K ] ],
     input buffer: [ write0, write1 ],
     rmw buffer: [ write2, read0, read1, read2 ],
   }
 6. write(offset=0, len=128K+64K, osd_set=[1,2,3])
   = {
     req: [ [ 0, 128K ], [ 0, 64K ], [ 0, 0 ] ],
     read: [ [ 0, 0 ], [ 64K, 128K ], [ 0, 0 ] ],
     write: [ [ 0, 128K ], [ 0, 64K ], [ 0, 128K ] ],
     input buffer: [ write0, write1 ],
     rmw buffer: [ write2, read1 ],
   }
 7. calc_rmw(offset=128K-4K, len=8K, osd_set=[1,0,3], write_set=[1,2,3])
   = {
     read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 128K ] ],
     write: [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 128K ] ],
     input buffer: [ write0, write1 ],
     rmw buffer: [ write2, read0, read1, read2 ],
   }
   then, after calc_rmw_parity(): {
     write: [ [ 128K-4K, 128K ], [ 0, 128K ], [ 0, 128K ] ],
     write1==read1,
   }
   + check write1 buffer
   + check write2 buffer
 8. calc_rmw(offset=0, len=128K+4K, osd_set=[0,2,3], write_set=[1,2,3])
   = {
     read: [ [ 0, 0 ], [ 4K, 128K ], [ 0, 0 ] ],
     write: [ [ 0, 128K ], [ 0, 4K ], [ 0, 128K ] ],
     input buffer: [ write0, write1 ],
     rmw buffer: [ write2, read1 ],
   }
   + check write2 buffer
 9. object recovery case:
   calc_rmw(offset=0, len=0, read_osd_set=[0,2,3], write_osd_set=[1,2,3])
   = {
     read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 128K ] ],
     write: [ [ 0, 0 ], [ 0, 0 ], [ 0, 0 ] ],
     input buffer: NULL,
     rmw buffer: [ read0, read1, read2 ],
   }
   then, after calc_rmw_parity(): {
     write: [ [ 0, 128K ], [ 0, 0 ], [ 0, 0 ] ],
     write0==read0,
   }
   + check write0 buffer
 ***/
 int main(int narg, char *args[])
 {
    // Test 1
    test1();
    // Test 4
    test4();
    // Test 5
    test5();
    // Test 6
    test6();
    // Test 7
    test7();
    // Test 8
    test8();
    // Test 9
    test9();
    // End
    printf("all ok\n");
    return 0;
 }
 void dump_stripes(osd_rmw_stripe_t *stripes, int pg_size)
 {
    printf("request");
    for (int i = 0; i < pg_size; i++)
    {
        printf(" {%uK-%uK}", stripes[i].req_start/1024, stripes[i].req_end/1024);
    }
    printf("\n");
    printf("read");
    for (int i = 0; i < pg_size; i++)
    {
        printf(" {%uK-%uK}", stripes[i].read_start/1024, stripes[i].read_end/1024);
    }
    printf("\n");
    printf("write");
    for (int i = 0; i < pg_size; i++)
    {
        printf(" {%uK-%uK}", stripes[i].write_start/1024, stripes[i].write_end/1024);
    }
    printf("\n");
 }
 void test1()
 {
    osd_num_t osd_set[3] = { 1, 0, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 1.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
    assert(stripes[0].req_start == 128*1024-4096 && stripes[0].req_end == 128*1024);
    assert(stripes[1].req_start == 0 && stripes[1].req_end == 4096);
    assert(stripes[2].req_end == 0);
    // Test 1.2
    for (int i = 0; i < 3; i++)
    {
        stripes[i].read_start = stripes[i].req_start;
        stripes[i].read_end = stripes[i].req_end;
    }
    assert(extend_missing_stripes(stripes, osd_set, 2, 3) == 0);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 4096);
    // Test 1.3
    stripes[0] = { .req_start = 128*1024-4096, .req_end = 128*1024 };
    cover_read(0, 128*1024, stripes[0]);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096);
 }
 void test4()
 {
    osd_num_t osd_set[3] = { 1, 0, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 4.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
    void* write_buf = malloc(8192);
    void* rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 4096 && stripes[2].read_end == 128*1024);
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].read_buf == rmw_buf+128*1024);
    assert(stripes[1].read_buf == rmw_buf+128*1024*2);
    assert(stripes[2].read_buf == rmw_buf+128*1024*3-4096);
    assert(stripes[0].write_buf == write_buf);
    assert(stripes[1].write_buf == write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 4.2
    set_pattern(write_buf, 8192, PATTERN0);
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
    set_pattern(stripes[1].read_buf, 128*1024-4096, UINT64_MAX); // didn't read it, it's missing
    set_pattern(stripes[2].read_buf, 128*1024-4096, 0); // old parity = 0
    calc_rmw_parity(stripes, 3, osd_set, osd_set, 128*1024);
    check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
    check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
    check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
    free(rmw_buf);
    free(write_buf);
 }
 void test5()
 {
    osd_num_t osd_set[3] = { 1, 0, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 5.1
    split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
    assert(stripes[1].req_start == 0 && stripes[1].req_end == 64*1024);
    assert(stripes[2].req_end == 0);
    // Test 5.2
    void *write_buf = malloc(64*1024*3);
    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
    assert(stripes[0].read_start == 64*1024 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 64*1024 && stripes[2].read_end == 128*1024);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].read_buf == rmw_buf+128*1024);
    assert(stripes[1].read_buf == rmw_buf+64*3*1024);
    assert(stripes[2].read_buf == rmw_buf+64*4*1024);
    assert(stripes[0].write_buf == write_buf);
    assert(stripes[1].write_buf == write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    free(rmw_buf);
    free(write_buf);
 }
 void test6()
 {
    osd_num_t osd_set[3] = { 1, 2, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 6.1
    split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
    void *write_buf = malloc(64*1024*3);
    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, osd_set, 128*1024);
    assert(stripes[0].read_end == 0);
    assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_end == 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].read_buf == 0);
    assert(stripes[1].read_buf == rmw_buf+128*1024);
    assert(stripes[2].read_buf == 0);
    assert(stripes[0].write_buf == write_buf);
    assert(stripes[1].write_buf == write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    free(rmw_buf);
    free(write_buf);
 }
 void test7()
 {
    osd_num_t osd_set[3] = { 1, 0, 3 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 7.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
    void *write_buf = malloc(8192);
    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].read_buf == rmw_buf+128*1024);
    assert(stripes[1].read_buf == rmw_buf+128*1024*2);
    assert(stripes[2].read_buf == rmw_buf+128*1024*3);
    assert(stripes[0].write_buf == write_buf);
    assert(stripes[1].write_buf == write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 7.2
    set_pattern(write_buf, 8192, PATTERN0);
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
    set_pattern(stripes[1].read_buf, 128*1024, UINT64_MAX); // didn't read it, it's missing
    set_pattern(stripes[2].read_buf, 128*1024, 0); // old parity = 0
    calc_rmw_parity(stripes, 3, osd_set, write_osd_set, 128*1024);
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[1].write_buf == stripes[1].read_buf);
    check_pattern(stripes[1].write_buf, 4096, PATTERN0);
    check_pattern(stripes[1].write_buf+4096, 128*1024-4096, PATTERN1);
    check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
    check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
    check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
    free(rmw_buf);
    free(write_buf);
 }
 void test8()
 {
    osd_num_t osd_set[3] = { 0, 2, 3 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 8.1
    split_stripes(2, 128*1024, 0, 128*1024+4096, stripes);
    void *write_buf = malloc(128*1024+4096);
    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 0);
    assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].read_buf == NULL);
    assert(stripes[1].read_buf == rmw_buf+128*1024);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[0].write_buf == write_buf);
    assert(stripes[1].write_buf == write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 8.2
    set_pattern(write_buf, 128*1024+4096, PATTERN0);
    set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN1);
    calc_rmw_parity(stripes, 3, osd_set, write_osd_set, 128*1024);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); // recheck again
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);     // recheck again
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again
    assert(stripes[0].write_buf == write_buf);                               // recheck again
    assert(stripes[1].write_buf == write_buf+128*1024);                      // recheck again
    assert(stripes[2].write_buf == rmw_buf);                                 // recheck again
    check_pattern(stripes[2].write_buf, 4096, 0); // new parity
    check_pattern(stripes[2].write_buf+4096, 128*1024-4096, PATTERN0^PATTERN1); // new parity
    free(rmw_buf);
    free(write_buf);
 }
 void test9()
 {
    osd_num_t osd_set[3] = { 0, 2, 3 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 9.0
    split_stripes(2, 128*1024, 64*1024, 0, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 0);
    assert(stripes[1].req_start == 0 && stripes[1].req_end == 0);
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    // Test 9.1
    void *write_buf = NULL;
    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
    assert(stripes[0].read_buf == rmw_buf);
    assert(stripes[1].read_buf == rmw_buf+128*1024);
    assert(stripes[2].read_buf == rmw_buf+128*1024*2);
    assert(stripes[0].write_buf == NULL);
    assert(stripes[1].write_buf == NULL);
    assert(stripes[2].write_buf == NULL);
    // Test 9.2
    set_pattern(stripes[1].read_buf, 128*1024, 0);
    set_pattern(stripes[2].read_buf, 128*1024, PATTERN1);
    calc_rmw_parity(stripes, 3, osd_set, write_osd_set, 128*1024);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
    assert(stripes[0].write_buf == rmw_buf);
    assert(stripes[1].write_buf == NULL);
    assert(stripes[2].write_buf == NULL);
    check_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
    check_pattern(stripes[0].write_buf, 128*1024, PATTERN1);
    free(rmw_buf);
 }
--- a/osd_secondary.cpp
+++ b/osd_secondary.cpp
@ -1,134 +0,0 @@
 #include "osd.h"
 #include "json11/json11.hpp"
 void osd_t::secondary_op_callback(osd_op_t *op)
 {
    if (op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
        op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
    {
        op->reply.sec_rw.version = op->bs_op->version;
    }
    else if (op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE)
    {
        op->reply.sec_del.version = op->bs_op->version;
    }
    if (op->req.hdr.opcode == OSD_OP_SECONDARY_READ &&
        op->bs_op->retval > 0)
    {
        op->send_list.push_back(op->buf, op->bs_op->retval);
    }
    else if (op->req.hdr.opcode == OSD_OP_SECONDARY_LIST)
    {
        // allocated by blockstore
        op->buf = op->bs_op->buf;
        if (op->bs_op->retval > 0)
        {
            op->send_list.push_back(op->buf, op->bs_op->retval * sizeof(obj_ver_id));
        }
        op->reply.sec_list.stable_count = op->bs_op->version;
    }
    int retval = op->bs_op->retval;
    delete op->bs_op;
    op->bs_op = NULL;
    finish_op(op, retval);
 }
 void osd_t::exec_secondary(osd_op_t *cur_op)
 {
    cur_op->bs_op = new blockstore_op_t();
    cur_op->bs_op->callback = [this, cur_op](blockstore_op_t* bs_op) { secondary_op_callback(cur_op); };
    cur_op->bs_op->opcode = (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ? BS_OP_READ
        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE ? BS_OP_WRITE
        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_SYNC ? BS_OP_SYNC
        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ? BS_OP_STABLE
        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK ? BS_OP_ROLLBACK
        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE ? BS_OP_DELETE
        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_LIST ? BS_OP_LIST
        : -1)))))));
    if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
        cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
    {
        cur_op->bs_op->oid = cur_op->req.sec_rw.oid;
        cur_op->bs_op->version = cur_op->req.sec_rw.version;
        cur_op->bs_op->offset = cur_op->req.sec_rw.offset;
        cur_op->bs_op->len = cur_op->req.sec_rw.len;
        cur_op->bs_op->buf = cur_op->buf;
 #ifdef OSD_STUB
        cur_op->bs_op->retval = cur_op->bs_op->len;
 #endif
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE)
    {
        cur_op->bs_op->oid = cur_op->req.sec_del.oid;
        cur_op->bs_op->version = cur_op->req.sec_del.version;
 #ifdef OSD_STUB
        cur_op->bs_op->retval = 0;
 #endif
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ||
        cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK)
    {
        cur_op->bs_op->len = cur_op->req.sec_stab.len/sizeof(obj_ver_id);
        cur_op->bs_op->buf = cur_op->buf;
 #ifdef OSD_STUB
        cur_op->bs_op->retval = 0;
 #endif
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_LIST)
    {
        if (cur_op->req.sec_list.pg_count < cur_op->req.sec_list.list_pg)
        {
            // requested pg number is greater than total pg count
            cur_op->bs_op->retval = -EINVAL;
            secondary_op_callback(cur_op);
            return;
        }
        cur_op->bs_op->oid.stripe = cur_op->req.sec_list.pg_stripe_size;
        cur_op->bs_op->len = cur_op->req.sec_list.pg_count;
        cur_op->bs_op->offset = cur_op->req.sec_list.list_pg - 1;
 #ifdef OSD_STUB
        cur_op->bs_op->retval = 0;
        cur_op->bs_op->buf = NULL;
 #endif
    }
 #ifdef OSD_STUB
    secondary_op_callback(cur_op);
 #else
    bs->enqueue_op(cur_op->bs_op);
 #endif
 }
 void osd_t::exec_show_config(osd_op_t *cur_op)
 {
    // FIXME: Send the real config, not its source
    std::string cfg_str = json11::Json(config).dump();
    cur_op->buf = malloc(cfg_str.size()+1);
    memcpy(cur_op->buf, cfg_str.c_str(), cfg_str.size()+1);
    cur_op->send_list.push_back(cur_op->buf, cfg_str.size()+1);
    finish_op(cur_op, cfg_str.size()+1);
 }
 void osd_t::exec_sync_stab_all(osd_op_t *cur_op)
 {
    // Sync and stabilize all objects
    // This command is only valid for tests
    cur_op->bs_op = new blockstore_op_t();
    if (!allow_test_ops)
    {
        cur_op->bs_op->retval = -EINVAL;
        secondary_op_callback(cur_op);
        return;
    }
    cur_op->bs_op->opcode = BS_OP_SYNC_STAB_ALL;
    cur_op->bs_op->callback = [this, cur_op](blockstore_op_t *bs_op)
    {
        secondary_op_callback(cur_op);
    };
 #ifdef OSD_STUB
    cur_op->bs_op->retval = 0;
    secondary_op_callback(cur_op);
 #else
    bs->enqueue_op(cur_op->bs_op);
 #endif
 }
--- a/osd_send.cpp
+++ b/osd_send.cpp
@ -1,138 +0,0 @@
 #include "cluster_client.h"
 void cluster_client_t::outbox_push(osd_op_t *cur_op)
 {
    assert(cur_op->peer_fd);
    auto & cl = clients.at(cur_op->peer_fd);
    if (cur_op->op_type == OSD_OP_OUT)
    {
        clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
    }
    cl.outbox.push_back(cur_op);
    if (cl.write_op || cl.outbox.size() > 1 || !try_send(cl))
    {
        if (cl.write_state == 0)
        {
            cl.write_state = CL_WRITE_READY;
            write_ready_clients.push_back(cur_op->peer_fd);
        }
        ringloop->wakeup();
    }
 }
 bool cluster_client_t::try_send(osd_client_t & cl)
 {
    int peer_fd = cl.peer_fd;
    io_uring_sqe* sqe = ringloop->get_sqe();
    if (!sqe)
    {
        return false;
    }
    ring_data_t* data = ((ring_data_t*)sqe->user_data);
    if (!cl.write_op)
    {
        // pick next command
        cl.write_op = cl.outbox.front();
        cl.outbox.pop_front();
        cl.write_state = CL_WRITE_REPLY;
        if (cl.write_op->op_type == OSD_OP_IN)
        {
            // Measure execution latency
            timespec tv_end;
            clock_gettime(CLOCK_REALTIME, &tv_end);
            stats.op_stat_count[cl.write_op->req.hdr.opcode]++;
            // The counter wrapped around to zero: restart the statistics from this operation
            if (!stats.op_stat_count[cl.write_op->req.hdr.opcode])
            {
                stats.op_stat_count[cl.write_op->req.hdr.opcode]++;
                stats.op_stat_sum[cl.write_op->req.hdr.opcode] = 0;
                stats.op_stat_bytes[cl.write_op->req.hdr.opcode] = 0;
            }
            stats.op_stat_sum[cl.write_op->req.hdr.opcode] += (
                (tv_end.tv_sec - cl.write_op->tv_begin.tv_sec)*1000000 +
                (tv_end.tv_nsec - cl.write_op->tv_begin.tv_nsec)/1000
            );
            if (cl.write_op->req.hdr.opcode == OSD_OP_READ ||
                cl.write_op->req.hdr.opcode == OSD_OP_WRITE)
            {
                stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.rw.len;
            }
            else if (cl.write_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
                cl.write_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
            {
                stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.sec_rw.len;
            }
        }
    }
    cl.write_msg.msg_iov = cl.write_op->send_list.get_iovec();
    cl.write_msg.msg_iovlen = cl.write_op->send_list.get_size();
    data->callback = [this, peer_fd](ring_data_t *data) { handle_send(data, peer_fd); };
    my_uring_prep_sendmsg(sqe, peer_fd, &cl.write_msg, 0);
    return true;
 }
 void cluster_client_t::send_replies()
 {
    for (int i = 0; i < write_ready_clients.size(); i++)
    {
        int peer_fd = write_ready_clients[i];
        if (!try_send(clients[peer_fd]))
        {
            write_ready_clients.erase(write_ready_clients.begin(), write_ready_clients.begin() + i);
            return;
        }
    }
    write_ready_clients.clear();
 }
 void cluster_client_t::handle_send(ring_data_t *data, int peer_fd)
 {
    auto cl_it = clients.find(peer_fd);
    if (cl_it != clients.end())
    {
        auto & cl = cl_it->second;
        if (data->res < 0 && data->res != -EAGAIN)
        {
            // This is a client socket, so don't panic; just disconnect it
            printf("Client %d socket write error: %d (%s). Disconnecting client\n", peer_fd, -data->res, strerror(-data->res));
            stop_client(peer_fd);
            return;
        }
        if (data->res >= 0)
        {
            osd_op_t *cur_op = cl.write_op;
            while (data->res > 0 && cur_op->send_list.sent < cur_op->send_list.count)
            {
                iovec & iov = cur_op->send_list.buf[cur_op->send_list.sent];
                if (iov.iov_len <= data->res)
                {
                    data->res -= iov.iov_len;
                    cur_op->send_list.sent++;
                }
                else
                {
                    iov.iov_len -= data->res;
                    iov.iov_base += data->res;
                    break;
                }
            }
            if (cur_op->send_list.sent >= cur_op->send_list.count)
            {
                // Done
                if (cur_op->op_type == OSD_OP_IN)
                {
                    delete cur_op;
                }
                else
                {
                    cl.sent_ops[cl.write_op->req.hdr.id] = cl.write_op;
                }
                cl.write_op = NULL;
                cl.write_state = cl.outbox.size() > 0 ? CL_WRITE_READY : 0;
            }
        }
        if (cl.write_state != 0)
        {
            write_ready_clients.push_back(peer_fd);
        }
    }
 }
--- a/patches/cinder-vitastor.py
+++ b/patches/cinder-vitastor.py
@ -0,0 +1,948 @@
 # Vitastor Driver for OpenStack Cinder
 #
 # --------------------------------------------
 # Install as cinder/volume/drivers/vitastor.py
 # --------------------------------------------
 #
 # Copyright 2020 Vitaliy Filippov
 #
 # Licensed under the Apache License, Version 2.0 (the "License"); you may
 # not use this file except in compliance with the License. You may obtain
 # a copy of the License at
 #
 # http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 # WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 # License for the specific language governing permissions and limitations
 # under the License.
 """Cinder Vitastor Driver"""
 import binascii
 import base64
 import errno
 import json
 import math
 import os
 import tempfile
 from castellan import key_manager
 from oslo_config import cfg
 from oslo_log import log as logging
 from oslo_service import loopingcall
 from oslo_concurrency import processutils
 from oslo_utils import encodeutils
 from oslo_utils import excutils
 from oslo_utils import fileutils
 from oslo_utils import units
 import six
 from six.moves.urllib import request
 from cinder import exception
 from cinder.i18n import _
 from cinder.image import image_utils
 from cinder import interface
 from cinder import objects
 from cinder.objects import fields
 from cinder import utils
 from cinder.volume import configuration
 from cinder.volume import driver
 from cinder.volume import volume_utils
 VERSION = '0.6.5'
 LOG = logging.getLogger(__name__)
 VITASTOR_OPTS = [
    cfg.StrOpt(
        'vitastor_config_path',
        default='/etc/vitastor/vitastor.conf',
        help='Vitastor configuration file path'
    ),
    cfg.StrOpt(
        'vitastor_etcd_address',
        default='',
        help='Vitastor etcd address(es)'
    ),
    cfg.StrOpt(
        'vitastor_etcd_prefix',
        default='/vitastor',
        help='Vitastor etcd prefix'
    ),
    cfg.StrOpt(
        'vitastor_pool_id',
        default='',
        help='Vitastor pool ID to use for volumes'
    ),
    # FIXME exclusive_cinder_pool ?
 ]
 CONF = cfg.CONF
 CONF.register_opts(VITASTOR_OPTS, group = configuration.SHARED_CONF_GROUP)
 class VitastorDriverException(exception.VolumeDriverException):
    message = _("Vitastor Cinder driver failure: %(reason)s")
 @interface.volumedriver
 class VitastorDriver(driver.CloneableImageVD,
    driver.ManageableVD, driver.ManageableSnapshotsVD,
    driver.BaseVD):
    """Implements Vitastor volume commands."""
    cfg = {}
    _etcd_urls = []
    def __init__(self, active_backend_id = None, *args, **kwargs):
        super(VitastorDriver, self).__init__(*args, **kwargs)
        self.configuration.append_config_values(VITASTOR_OPTS)
    @classmethod
    def get_driver_options(cls):
        additional_opts = cls._get_oslo_driver_opts(
            'reserved_percentage',
            'max_over_subscription_ratio',
            'volume_dd_blocksize'
        )
        return VITASTOR_OPTS + additional_opts
    def do_setup(self, context):
        """Performs initialization steps that could raise exceptions."""
        super(VitastorDriver, self).do_setup(context)
        # Make sure configuration is in UTF-8
        for attr in [ 'config_path', 'etcd_address', 'etcd_prefix', 'pool_id' ]:
            val = self.configuration.safe_get('vitastor_'+attr)
            if val is not None:
                self.cfg[attr] = utils.convert_str(val)
        self.cfg = self._load_config(self.cfg)
    def _load_config(self, cfg):
        # Try to load configuration file
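        try:
            # Use a context manager so the file is closed even if parsing fails
            with open(cfg['config_path'] or '/etc/vitastor/vitastor.conf') as f:
                conf = json.loads(f.read())
            for k in conf:
                cfg[k] = cfg.get(k, conf[k])
        except Exception:
            # The file is optional: on any read or parse error keep the driver options
            pass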
        if isinstance(cfg['etcd_address'], str):
            cfg['etcd_address'] = cfg['etcd_address'].split(',')
        # Sanitize etcd URLs
        for i, etcd_url in enumerate(cfg['etcd_address']):
            ssl = False
            if etcd_url.lower().startswith('http://'):
                etcd_url = etcd_url[7:]
            elif etcd_url.lower().startswith('https://'):
                etcd_url = etcd_url[8:]
                ssl = True
            if etcd_url.find('/') < 0:
                etcd_url += '/v3'
            if ssl:
                etcd_url = 'https://'+etcd_url
            else:
                etcd_url = 'http://'+etcd_url
            cfg['etcd_address'][i] = etcd_url
        return cfg
    def check_for_setup_error(self):
        """Returns an error if prerequisites aren't met."""
    def _encode_etcd_key(self, key):
        if not isinstance(key, bytes):
            key = str(key).encode('utf-8')
        return base64.b64encode(self.cfg['etcd_prefix'].encode('utf-8')+b'/'+key).decode('utf-8')
    def _encode_etcd_value(self, value):
        if not isinstance(value, bytes):
            value = str(value).encode('utf-8')
        return base64.b64encode(value).decode('utf-8')
    def _encode_etcd_requests(self, obj):
        for v in obj:
            for rt in v:
                if 'key' in v[rt]:
                    v[rt]['key'] = self._encode_etcd_key(v[rt]['key'])
                if 'range_end' in v[rt]:
                    v[rt]['range_end'] = self._encode_etcd_key(v[rt]['range_end'])
                if 'value' in v[rt]:
                    v[rt]['value'] = self._encode_etcd_value(v[rt]['value'])
    def _etcd_txn(self, params):
        if 'compare' in params:
            for v in params['compare']:
                if 'key' in v:
                    v['key'] = self._encode_etcd_key(v['key'])
        if 'failure' in params:
            self._encode_etcd_requests(params['failure'])
        if 'success' in params:
            self._encode_etcd_requests(params['success'])
        body = json.dumps(params).encode('utf-8')
        headers = {
            'Content-Type': 'application/json'
        }
        err = None
        for etcd_url in self.cfg['etcd_address']:
            try:
                resp = request.urlopen(request.Request(etcd_url+'/kv/txn', body, headers), timeout = 5)
                data = json.loads(resp.read())
                if 'responses' not in data:
                    data['responses'] = []
                for i, resp in enumerate(data['responses']):
                    if 'response_range' in resp:
                        if 'kvs' not in resp['response_range']:
                            resp['response_range']['kvs'] = []
                        for kv in resp['response_range']['kvs']:
                            kv['key'] = base64.b64decode(kv['key'].encode('utf-8')).decode('utf-8')
                            if kv['key'].startswith(self.cfg['etcd_prefix']+'/'):
                                kv['key'] = kv['key'][len(self.cfg['etcd_prefix'])+1 : ]
                            kv['value'] = json.loads(base64.b64decode(kv['value'].encode('utf-8')))
                    if len(resp.keys()) != 1:
                        LOG.error('unknown responses['+str(i)+'] format: '+json.dumps(resp))
                    else:
                        resp = data['responses'][i] = resp[list(resp.keys())[0]]
                return data
            except Exception as e:
                LOG.exception('error calling etcd transaction: '+body.decode('utf-8')+'\nerror: '+str(e))
                err = e
        raise err
    def _etcd_foreach(self, prefix, add_fn):
        total = 0
        batch = 1000
        begin = prefix+'/'
        while True:
            resp = self._etcd_txn({ 'success': [
                { 'request_range': {
                    'key': begin,
                    'range_end': prefix+'0',
                    'limit': batch+1,
                } },
            ] })
            i = 0
            while i < batch and i < len(resp['responses'][0]['kvs']):
                kv = resp['responses'][0]['kvs'][i]
                add_fn(kv)
                i += 1
                total += 1
            if len(resp['responses'][0]['kvs']) <= batch:
                break
            begin = resp['responses'][0]['kvs'][batch]['key']
        return total
    def _update_volume_stats(self):
        location_info = json.dumps({
            'config': self.configuration.vitastor_config_path,
            'etcd_address': self.configuration.vitastor_etcd_address,
            'etcd_prefix': self.configuration.vitastor_etcd_prefix,
            'pool_id': self.configuration.vitastor_pool_id,
        })
        stats = {
            'vendor_name': 'Vitastor',
            'driver_version': self.VERSION,
            'storage_protocol': 'vitastor',
            'total_capacity_gb': 'unknown',
            'free_capacity_gb': 'unknown',
            # FIXME check if safe_get is required
            'reserved_percentage': self.configuration.safe_get('reserved_percentage'),
            'multiattach': True,
            'thin_provisioning_support': True,
            'max_over_subscription_ratio': self.configuration.safe_get('max_over_subscription_ratio'),
            'location_info': location_info,
            'backend_state': 'down',
            'volume_backend_name': self.configuration.safe_get('volume_backend_name') or 'vitastor',
            'replication_enabled': False,
        }
        try:
            pool_stats = self._etcd_txn({ 'success': [
                { 'request_range': { 'key': 'pool/stats/'+str(self.cfg['pool_id']) } }
            ] })
            total_provisioned = 0
            def add_total(kv):
                nonlocal total_provisioned
                if kv['key'].find('@') >= 0:
                    total_provisioned += kv['value']['size']
            self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']), lambda kv: add_total(kv))
            stats['provisioned_capacity_gb'] = round(total_provisioned/1024.0/1024.0/1024.0, 2)
            pool_stats = pool_stats['responses'][0]['kvs']
            if len(pool_stats):
                pool_stats = pool_stats[0]
                stats['free_capacity_gb'] = round(1024.0*(pool_stats['total_raw_tb']-pool_stats['used_raw_tb'])/pool_stats['raw_to_usable'], 2)
                stats['total_capacity_gb'] = round(1024.0*pool_stats['total_raw_tb'], 2)
            stats['backend_state'] = 'up'
        except Exception as e:
            # just log and return unknown capacities
            LOG.exception('error getting vitastor pool stats: '+str(e))
        self._stats = stats
    def _next_id(self, resp):
        if len(resp['kvs']) == 0:
            return (1, 0)
        else:
            return (1 + resp['kvs'][0]['value'], resp['kvs'][0]['mod_revision'])
    def create_volume(self, volume):
        """Creates a logical volume."""
        size = int(volume.size) * units.Gi
        # FIXME: Check if convert_str is really required
        vol_name = utils.convert_str(volume.name)
        if vol_name.find('@') >= 0 or vol_name.find('/') >= 0:
            raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
        LOG.debug("creating volume '%s'", vol_name)
        self._create_image(vol_name, { 'size': size })
        if volume.encryption_key_id:
            self._create_encrypted_volume(volume, volume.obj_context)
        volume_update = {}
        return volume_update
    def _create_encrypted_volume(self, volume, context):
        """Create a new LUKS encrypted image directly in Vitastor."""
        vol_name = utils.convert_str(volume.name)
        f, opts = self._encrypt_opts(volume, context)
        # FIXME: Check if it works at all :-)
        self._execute(
            'qemu-img', 'convert', '-f', 'luks', *opts,
            'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
            '%sM' % (volume.size * 1024)
        )
        f.close()
    def _encrypt_opts(self, volume, context):
        encryption = volume_utils.check_encryption_provider(self.db, volume, context)
        # Fetch the key associated with the volume and decode the passphrase
        keymgr = key_manager.API(CONF)
        key = keymgr.get(context, encryption['encryption_key_id'])
        passphrase = binascii.hexlify(key.get_encoded()).decode('utf-8')
        # Decode the dm-crypt style cipher spec into something qemu-img can use
        cipher_spec = image_utils.decode_cipher(encryption['cipher'], encryption['key_size'])
        tmp_dir = volume_utils.image_conversion_dir()
        f = tempfile.NamedTemporaryFile(prefix = 'luks_', dir = tmp_dir)
        f.write(passphrase.encode('utf-8'))
        f.flush()
        return (f, [
            '--object', 'secret,id=luks_sec,format=raw,file=%(passfile)s' % {'passfile': f.name},
            '-o', 'key-secret=luks_sec,cipher-alg=%(cipher_alg)s,cipher-mode=%(cipher_mode)s,ivgen-alg=%(ivgen_alg)s' % cipher_spec,
        ])
    def create_snapshot(self, snapshot):
        """Creates a volume snapshot."""
        vol_name = utils.convert_str(snapshot.volume_name)
        snap_name = utils.convert_str(snapshot.name)
        if snap_name.find('@') >= 0 or snap_name.find('/') >= 0:
            raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
        self._create_snapshot(vol_name, vol_name+'@'+snap_name)
    def snapshot_revert_use_temp_snapshot(self):
        """Disable the use of a temporary snapshot on revert."""
        return False
    def revert_to_snapshot(self, context, volume, snapshot):
        """Revert a volume to a given snapshot."""
        # FIXME Delete the image, then recreate it from the snapshot
    def delete_snapshot(self, snapshot):
        """Deletes a snapshot."""
        vol_name = utils.convert_str(snapshot.volume_name)
        snap_name = utils.convert_str(snapshot.name)
        # Find the snapshot
        resp = self._etcd_txn({ 'success': [
            { 'request_range': { 'key': 'index/image/'+vol_name+'@'+snap_name } },
        ] })
        if len(resp['responses'][0]['kvs']) == 0:
            raise exception.SnapshotNotFound(snapshot_id = snap_name)
        inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
        pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
        parents = {}
        parents[(pool_id << 48) | (inode_id & 0xffffffffffff)] = True
        # Check if there are child volumes
        children = self._child_count(parents)
        if children > 0:
            raise exception.SnapshotIsBusy(snapshot_name = snap_name)
        # FIXME: We can't delete snapshots because we can't merge layers yet
        raise exception.VolumeBackendAPIException(data = 'Snapshot delete (layer merge) is not implemented yet')
    def _child_count(self, parents):
        children = 0
        def add_child(kv):
            nonlocal children
            children += self._check_parent(kv, parents)
        self._etcd_foreach('config/inode', lambda kv: add_child(kv))
        return children
    def _check_parent(self, kv, parents):
        if 'parent_id' not in kv['value']:
            return 0
        parent_id = kv['value']['parent_id']
        _, _, pool_id, inode_id = kv['key'].split('/')
        parent_pool_id = pool_id
        if 'parent_pool_id' in kv['value'] and kv['value']['parent_pool_id']:
            parent_pool_id = kv['value']['parent_pool_id']
        inode = (int(pool_id) << 48) | (int(inode_id) & 0xffffffffffff)
        parent = (int(parent_pool_id) << 48) | (int(parent_id) & 0xffffffffffff)
        if parent in parents and inode not in parents:
            return 1
        return 0
    def create_cloned_volume(self, volume, src_vref):
        """Create a cloned volume from another volume."""
        size = int(volume.size) * units.Gi
        src_name = utils.convert_str(src_vref.name)
        dest_name = utils.convert_str(volume.name)
        if dest_name.find('@') >= 0 or dest_name.find('/') >= 0:
            raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
        # FIXME Do full copy if requested (cfg.disable_clone)
        if src_vref.admin_metadata.get('readonly') == 'True':
            # source volume is a volume-image cache entry or other readonly volume
            # clone without intermediate snapshot
            src = self._get_image(src_name)
            LOG.debug("creating image '%s' from '%s'", dest_name, src_name)
            new_cfg = self._create_image(dest_name, {
                'size': size,
                'parent_id': src['idx']['id'],
                'parent_pool_id': src['idx']['pool_id'],
            })
            return {}
        clone_snap = "%s@%s.clone_snap" % (src_name, dest_name)
        make_img = True
        if (volume.display_name and
            volume.display_name.startswith('image-') and
            src_vref.project_id != volume.project_id):
            # idiotic openstack creates image-volume cache entries
            # as clones of normal VM volumes... :-X prevent it :-D
            clone_snap = dest_name
            make_img = False
        LOG.debug("creating layer '%s' under '%s'", clone_snap, src_name)
        new_cfg = self._create_snapshot(src_name, clone_snap, True)
        if make_img:
            # Then create a clone from it
            new_cfg = self._create_image(dest_name, {
                'size': size,
                'parent_id': new_cfg['parent_id'],
                'parent_pool_id': new_cfg['parent_pool_id'],
            })
        return {}
    def create_volume_from_snapshot(self, volume, snapshot):
        """Creates a cloned volume from an existing snapshot."""
        vol_name = utils.convert_str(volume.name)
        snap_name = utils.convert_str(snapshot.name)
        snap = self._get_image(utils.convert_str(snapshot.volume_name)+'@'+snap_name)
        if not snap:
            raise exception.SnapshotNotFound(snapshot_id = snap_name)
        snap_inode_id = int(snap['idx']['id'])
        snap_pool_id = int(snap['idx']['pool_id'])
        size = snap['cfg']['size']
        if int(volume.size):
            size = int(volume.size) * units.Gi
        new_cfg = self._create_image(vol_name, {
            'size': size,
            'parent_id': snap['idx']['id'],
            'parent_pool_id': snap['idx']['pool_id'],
        })
        return {}
    def _vitastor_args(self):
        args = []
        for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
            v = self.configuration.safe_get('vitastor_'+k)
            if v:
                args.extend(['--'+k, v])
        return args
    def _qemu_args(self):
        args = ''
        for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
            v = self.configuration.safe_get('vitastor_'+k)
            kk = k
            if kk == 'etcd_address':
                # FIXME use etcd_address in qemu driver
                kk = 'etcd_host'
            if v:
                args += ':'+kk+'='+v.replace(':', '\\:')
        return args
    def delete_volume(self, volume):
        """Deletes a logical volume."""
        vol_name = utils.convert_str(volume.name)
        # Find the volume and all its snapshots
        range_end = b'index/image/' + vol_name.encode('utf-8')
        range_end = range_end[0 : len(range_end)-1] + six.int2byte(range_end[len(range_end)-1] + 1)
        resp = self._etcd_txn({ 'success': [
            { 'request_range': { 'key': 'index/image/'+vol_name, 'range_end': range_end } },
        ] })
        if len(resp['responses'][0]['kvs']) == 0:
            # already deleted
            LOG.info("volume %s no longer exists in backend", vol_name)
            return
        layers = resp['responses'][0]['kvs']
        layer_ids = {}
        for kv in layers:
            inode_id = int(kv['value']['id'])
            pool_id = int(kv['value']['pool_id'])
            inode_pool_id = (pool_id << 48) | (inode_id & 0xffffffffffff)
            layer_ids[inode_pool_id] = True
        # Check if the volume has clones and raise 'busy' if so
        children = self._child_count(layer_ids)
        if children > 0:
            raise exception.VolumeIsBusy(volume_name = vol_name)
        # Clear data
        for kv in layers:
            args = [
                'vitastor-rm', '--pool', str(kv['value']['pool_id']),
                '--inode', str(kv['value']['id']), '--progress', '0',
                *(self._vitastor_args())
            ]
            try:
                self._execute(*args)
            except processutils.ProcessExecutionError as exc:
                LOG.error("Failed to remove layer "+kv['key']+": "+str(exc))
                raise exception.VolumeBackendAPIException(data = exc.stderr)
        # Delete all layers from etcd
        requests = []
        for kv in layers:
            requests.append({ 'request_delete_range': { 'key': kv['key'] } })
            requests.append({ 'request_delete_range': { 'key': 'config/inode/'+str(kv['value']['pool_id'])+'/'+str(kv['value']['id']) } })
        self._etcd_txn({ 'success': requests })
    def retype(self, context, volume, new_type, diff, host):
        """Change extra type specifications for a volume."""
        # FIXME Maybe (in the future) support multiple pools as different types
        return True, {}
    def ensure_export(self, context, volume):
        """Synchronously recreates an export for a logical volume."""
        pass
    def create_export(self, context, volume, connector):
        """Exports the volume."""
        pass
    def remove_export(self, context, volume):
        """Removes an export for a logical volume."""
        pass
    def _create_image(self, vol_name, cfg):
        pool_s = str(self.cfg['pool_id'])
        image_id = 0
        while image_id == 0:
            # check if the image already exists and find a free ID
            resp = self._etcd_txn({ 'success': [
                { 'request_range': { 'key': 'index/image/'+vol_name } },
                { 'request_range': { 'key': 'index/maxid/'+pool_s } },
            ] })
            if len(resp['responses'][0]['kvs']) > 0:
                # already exists
                raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' already exists')
            image_id, id_mod = self._next_id(resp['responses'][1])
            # try to create the image
            resp = self._etcd_txn({ 'compare': [
                { 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+pool_s },
                { 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+vol_name },
                { 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+pool_s+'/'+str(image_id) },
            ], 'success': [
                { 'request_put': { 'key': 'index/maxid/'+pool_s, 'value': image_id } },
                { 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
                    'id': image_id, 'pool_id': self.cfg['pool_id']
                }) } },
                { 'request_put': { 'key': 'config/inode/'+pool_s+'/'+str(image_id), 'value': json.dumps({
                    **cfg, 'name': vol_name,
                }) } },
            ] })
            if not resp.get('succeeded'):
                # repeat
                image_id = 0
    def _create_snapshot(self, vol_name, snap_vol_name, allow_existing = False):
        while True:
            # check if the image already exists and snapshot doesn't
            resp = self._etcd_txn({ 'success': [
                { 'request_range': { 'key': 'index/image/'+vol_name } },
                { 'request_range': { 'key': 'index/image/'+snap_vol_name } },
            ] })
            if len(resp['responses'][0]['kvs']) == 0:
                raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
            if len(resp['responses'][1]['kvs']) > 0:
                if allow_existing:
                    snap_idx = resp['responses'][1]['kvs'][0]['value']
                    resp = self._etcd_txn({ 'success': [
                        { 'request_range': { 'key': 'config/inode/'+str(snap_idx['pool_id'])+'/'+str(snap_idx['id']) } },
                    ] })
                    if len(resp['responses'][0]['kvs']) == 0:
                        raise exception.VolumeBackendAPIException(data =
                            'Volume '+snap_vol_name+' is already indexed, but does not exist'
                        )
                    return resp['responses'][0]['kvs'][0]['value']
                raise exception.VolumeBackendAPIException(
                    data = 'Volume '+snap_vol_name+' already exists'
                )
            vol_idx = resp['responses'][0]['kvs'][0]['value']
            vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
            # get image inode config and find a new ID
            resp = self._etcd_txn({ 'success': [
                { 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
                { 'request_range': { 'key': 'index/maxid/'+str(self.cfg['pool_id']) } },
            ] })
            if len(resp['responses'][0]['kvs']) == 0:
                raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
            vol_cfg = resp['responses'][0]['kvs'][0]['value']
            vol_mod = resp['responses'][0]['kvs'][0]['mod_revision']
            new_id, id_mod = self._next_id(resp['responses'][1])
            # try to redirect image to the new inode
            new_cfg = {
                **vol_cfg, 'name': vol_name, 'parent_id': vol_idx['id'], 'parent_pool_id': vol_idx['pool_id']
            }
            resp = self._etcd_txn({ 'compare': [
                { 'target': 'MOD', 'mod_revision': vol_idx_mod, 'key': 'index/image/'+vol_name },
                { 'target': 'MOD', 'mod_revision': vol_mod, 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) },
                { 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+str(self.cfg['pool_id']) },
                { 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+snap_vol_name },
                { 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id) },
            ], 'success': [
                { 'request_put': { 'key': 'index/maxid/'+str(self.cfg['pool_id']), 'value': new_id } },
                { 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
                    'id': new_id, 'pool_id': self.cfg['pool_id']
                }) } },
                { 'request_put': { 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id), 'value': json.dumps(new_cfg) } },
                { 'request_put': { 'key': 'index/image/'+snap_vol_name, 'value': json.dumps({
                    'id': vol_idx['id'], 'pool_id': vol_idx['pool_id']
                }) } },
                { 'request_put': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']), 'value': json.dumps({
                    **vol_cfg, 'name': snap_vol_name, 'readonly': True
                }) } }
            ] })
            if resp.get('succeeded'):
                return new_cfg
    def initialize_connection(self, volume, connector):
        data = {
            'driver_volume_type': 'vitastor',
            'data': {
                'config_path': self.configuration.vitastor_config_path,
                'etcd_address': self.configuration.vitastor_etcd_address,
                'etcd_prefix': self.configuration.vitastor_etcd_prefix,
                'name': volume.name,
                'logical_block_size': 512,
                'physical_block_size': 4096,
            }
        }
        LOG.debug('connection data: %s', data)
        return data
    def terminate_connection(self, volume, connector, **kwargs):
        pass
    def clone_image(self, context, volume, image_location, image_meta, image_service):
        if image_location:
            # Note: image_location[0] is glance image direct_url.
            # image_location[1] contains the list of all locations (including
            # direct_url) or None if show_multiple_locations is False in
            # glance configuration.
            if image_location[1]:
                url_locations = [location['url'] for location in image_location[1]]
            else:
                url_locations = [image_location[0]]
            # iterate all locations to look for a cloneable one.
            for url_location in url_locations:
                if url_location and url_location.startswith('cinder://'):
                    # The idea is to use cinder://<volume-id> Glance volumes as base images
                    base_vol = self.db.volume_get(context, url_location[len('cinder://') : ])
                    if not base_vol or base_vol.volume_type_id != volume.volume_type_id:
                        continue
                    size = int(volume.size) * units.Gi
                    dest_name = utils.convert_str(volume.name)
                    # Find or create the base snapshot
                    snap_cfg = self._create_snapshot(base_vol.name, base_vol.name+'@.clone_snap', True)
                    # Then create a clone from it
                    new_cfg = self._create_image(dest_name, {
                        'size': size,
                        'parent_id': snap_cfg['parent_id'],
                        'parent_pool_id': snap_cfg['parent_pool_id'],
                    })
                    return ({}, True)
        return ({}, False)
    def copy_image_to_encrypted_volume(self, context, volume, image_service, image_id):
        self.copy_image_to_volume(context, volume, image_service, image_id, encrypted = True)
    def copy_image_to_volume(self, context, volume, image_service, image_id, encrypted = False):
        tmp_dir = volume_utils.image_conversion_dir()
        with tempfile.NamedTemporaryFile(dir = tmp_dir) as tmp:
            image_utils.fetch_to_raw(
                context, image_service, image_id, tmp.name,
                self.configuration.volume_dd_blocksize, size = volume.size
            )
            out_format = [ '-O', 'raw' ]
            if encrypted:
                key_file, opts = self._encrypt_opts(volume, context)
                out_format = [ '-O', 'luks', *opts ]
            dest_name = utils.convert_str(volume.name)
            self._try_execute(
                'qemu-img', 'convert', '-f', 'raw', tmp.name, *out_format,
                'vitastor:image='+dest_name.replace(':', '\\:')+self._qemu_args()
            )
            if encrypted:
                key_file.close()
    def copy_volume_to_image(self, context, volume, image_service, image_meta):
        tmp_dir = volume_utils.image_conversion_dir()
        tmp_file = os.path.join(tmp_dir, volume.name + '-' + image_meta['id'])
        with fileutils.remove_path_on_error(tmp_file):
            vol_name = utils.convert_str(volume.name)
            self._try_execute(
                'qemu-img', 'convert', '-f', 'raw',
                'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
                '-O', 'raw', tmp_file
            )
            # FIXME: Copy directly if the destination image is also in Vitastor
            volume_utils.upload_volume(context, image_service, image_meta, tmp_file, volume)
        os.unlink(tmp_file)
    def _get_image(self, vol_name):
        # find the image
        resp = self._etcd_txn({ 'success': [
            { 'request_range': { 'key': 'index/image/'+vol_name } },
        ] })
        if len(resp['responses'][0]['kvs']) == 0:
            return None
        vol_idx = resp['responses'][0]['kvs'][0]['value']
        vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
        # get image inode config
        resp = self._etcd_txn({ 'success': [
            { 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
        ] })
        if len(resp['responses'][0]['kvs']) == 0:
            return None
        vol_cfg = resp['responses'][0]['kvs'][0]['value']
        vol_cfg_mod = resp['responses'][0]['kvs'][0]['mod_revision']
        return {
            'cfg': vol_cfg,
            'cfg_mod': vol_cfg_mod,
            'idx': vol_idx,
            'idx_mod': vol_idx_mod,
        }
    def extend_volume(self, volume, new_size):
        """Extend an existing volume."""
        vol_name = utils.convert_str(volume.name)
        while True:
            vol = self._get_image(vol_name)
            if not vol:
                raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
            # change size
            size = int(new_size) * units.Gi
            if size == vol['cfg']['size']:
                break
            resp = self._etcd_txn({ 'compare': [ {
                'target': 'MOD',
                'mod_revision': vol['cfg_mod'],
                'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
            } ], 'success': [
                { 'request_put': {
                    'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
                    'value': json.dumps({ **vol['cfg'], 'size': size }),
                } },
            ] })
            if resp.get('succeeded'):
                break
        LOG.debug(
            "Extend volume from %(old_size)s GB to %(new_size)s GB.",
            {'old_size': volume.size, 'new_size': new_size}
        )
    def _add_manageable_volume(self, kv, manageable_volumes, cinder_ids):
        cfg = kv['value']
        if kv['key'].find('@') >= 0:
            # snapshot
            return
        image_id = volume_utils.extract_id_from_volume_name(cfg['name'])
        image_info = {
            'reference': {'source-name': cfg['name']},
            'size': int(math.ceil(float(cfg['size']) / units.Gi)),
            'cinder_id': None,
            'extra_info': None,
        }
        if image_id in cinder_ids:
            image_info['cinder_id'] = image_id
            image_info['safe_to_manage'] = False
            image_info['reason_not_safe'] = 'already managed'
        else:
            image_info['safe_to_manage'] = True
            image_info['reason_not_safe'] = None
        manageable_volumes.append(image_info)
    def get_manageable_volumes(self, cinder_volumes, marker, limit, offset, sort_keys, sort_dirs):
        manageable_volumes = []
        cinder_ids = [resource['id'] for resource in cinder_volumes]
        # List all volumes
        # FIXME: It's possible to use pagination in our case, but.. do we want it?
        self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
            lambda kv: self._add_manageable_volume(kv, manageable_volumes, cinder_ids))
        return volume_utils.paginate_entries_list(
            manageable_volumes, marker, limit, offset, sort_keys, sort_dirs)
    def _get_existing_name(self, existing_ref):
        if not isinstance(existing_ref, dict):
            existing_ref = {"source-name": existing_ref}
        if 'source-name' not in existing_ref:
            reason = _('Reference must contain source-name element.')
            raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
        src_name = utils.convert_str(existing_ref['source-name'])
        if not src_name:
            reason = _('Reference must contain source-name element.')
            raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
        return src_name
    def manage_existing_get_size(self, volume, existing_ref):
        """Return size of an existing image for manage_existing.
        :param volume: volume ref info to be set
        :param existing_ref: {'source-name': <image name>}
        """
        src_name = self._get_existing_name(existing_ref)
        vol = self._get_image(src_name)
        if not vol:
            raise exception.VolumeBackendAPIException(data = 'Volume '+src_name+' does not exist')
        return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
    def manage_existing(self, volume, existing_ref):
        """Manages an existing image.
        Renames the image to match the expected name for the volume.
        :param volume: volume ref info to be set
        :param existing_ref: {'source-name': <image name>}
        """
        from_name = self._get_existing_name(existing_ref)
        to_name = utils.convert_str(volume.name)
        self._rename(from_name, to_name)
    def _rename(self, from_name, to_name):
        while True:
            vol = self._get_image(from_name)
            if not vol:
                raise exception.VolumeBackendAPIException(data = 'Volume '+from_name+' does not exist')
            to = self._get_image(to_name)
            if to:
                raise exception.VolumeBackendAPIException(data = 'Volume '+to_name+' already exists')
            resp = self._etcd_txn({ 'compare': [
                { 'target': 'MOD', 'mod_revision': vol['idx_mod'], 'key': 'index/image/'+vol['cfg']['name'] },
                { 'target': 'MOD', 'mod_revision': vol['cfg_mod'], 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']) },
                { 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+to_name },
            ], 'success': [
                { 'request_delete_range': { 'key': 'index/image/'+vol['cfg']['name'] } },
                { 'request_put': { 'key': 'index/image/'+to_name, 'value': json.dumps(vol['idx']) } },
                { 'request_put': { 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
                    'value': json.dumps({ **vol['cfg'], 'name': to_name }) } },
            ] })
            if resp.get('succeeded'):
                break
    def unmanage(self, volume):
        pass
    def _add_manageable_snapshot(self, kv, manageable_snapshots, cinder_ids):
        cfg = kv['value']
        dog = kv['key'].find('@')
        if dog < 0:
            # not a snapshot
            return
        image_name = kv['key'][0 : dog]
        snap_name = kv['key'][dog+1 : ]
        snapshot_id = volume_utils.extract_id_from_snapshot_name(snap_name)
        snapshot_info = {
            'reference': {'source-name': snap_name},
            'size': int(math.ceil(float(cfg['size']) / units.Gi)),
            'cinder_id': None,
            'extra_info': None,
            'safe_to_manage': False,
            'reason_not_safe': None,
            'source_reference': {'source-name': image_name}
        }
        if snapshot_id in cinder_ids:
            # Exclude snapshots already managed.
            snapshot_info['reason_not_safe'] = ('already managed')
            snapshot_info['cinder_id'] = snapshot_id
        elif snap_name.endswith('.clone_snap'):
            # Exclude clone snapshot.
            snapshot_info['reason_not_safe'] = ('used for clone snap')
        else:
            snapshot_info['safe_to_manage'] = True
        manageable_snapshots.append(snapshot_info)
    def get_manageable_snapshots(self, cinder_snapshots, marker, limit, offset, sort_keys, sort_dirs):
        """List manageable snapshots in Vitastor."""
        manageable_snapshots = []
        cinder_snapshot_ids = [resource['id'] for resource in cinder_snapshots]
        # List all snapshots
        # FIXME: It's possible to use pagination in our case, but.. do we want it?
        self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
            lambda kv: self._add_manageable_snapshot(kv, manageable_snapshots, cinder_snapshot_ids))
        return volume_utils.paginate_entries_list(
            manageable_snapshots, marker, limit, offset, sort_keys, sort_dirs)
    def manage_existing_snapshot_get_size(self, snapshot, existing_ref):
        """Return size of an existing snapshot for manage_existing_snapshot.
        :param snapshot: snapshot ref info to be set
        :param existing_ref: {'source-name': <name of snapshot>}
        """
        vol_name = utils.convert_str(snapshot.volume_name)
        snap_name = self._get_existing_name(existing_ref)
        vol = self._get_image(vol_name+'@'+snap_name)
        if not vol:
            raise exception.ManageExistingInvalidReference(
                existing_ref=snap_name, reason='Specified snapshot does not exist.'
            )
        return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
    def manage_existing_snapshot(self, snapshot, existing_ref):
        """Manages an existing snapshot.
        Renames the snapshot to match the name Cinder expects for it.
        Error checking done by manage_existing_snapshot_get_size is not repeated.
        :param snapshot: snapshot ref info to be set
        :param existing_ref: {'source-name': <name of snapshot>}
        """
        vol_name = utils.convert_str(snapshot.volume_name)
        snap_name = self._get_existing_name(existing_ref)
        from_name = vol_name+'@'+snap_name
        to_name = vol_name+'@'+utils.convert_str(snapshot.name)
        self._rename(from_name, to_name)
    def unmanage_snapshot(self, snapshot):
        """Removes the specified snapshot from Cinder management."""
        pass
    def _dumps(self, obj):
        return json.dumps(obj, separators=(',', ':'), sort_keys=True)
--- a/patches/devstack-local.conf
+++ b/patches/devstack-local.conf
@ -0,0 +1,23 @@
 # Devstack configuration for bridged networking
 [[local|localrc]]
 ADMIN_PASSWORD=secret
 DATABASE_PASSWORD=$ADMIN_PASSWORD
 RABBIT_PASSWORD=$ADMIN_PASSWORD
 SERVICE_PASSWORD=$ADMIN_PASSWORD
 HOST_IP=10.0.2.15
 Q_USE_SECGROUP=True
 FLOATING_RANGE="10.0.2.0/24"
 IPV4_ADDRS_SAFE_TO_USE="10.0.5.0/24"
 Q_FLOATING_ALLOCATION_POOL=start=10.0.2.50,end=10.0.2.100
 PUBLIC_NETWORK_GATEWAY=10.0.2.2
 PUBLIC_INTERFACE=ens3
 Q_USE_PROVIDERNET_FOR_PUBLIC=True
 Q_AGENT=linuxbridge
 Q_ML2_PLUGIN_MECHANISM_DRIVERS=linuxbridge
 LB_PHYSICAL_INTERFACE=ens3
 PUBLIC_PHYSICAL_NETWORK=default
 LB_INTERFACE_MAPPINGS=default:ens3
 Q_SERVICE_PLUGIN_CLASSES=
 Q_ML2_PLUGIN_TYPE_DRIVERS=flat
 Q_ML2_PLUGIN_EXT_DRIVERS=
--- a/patches/libvirt-5.0-vitastor.diff
+++ b/patches/libvirt-5.0-vitastor.diff
@ -0,0 +1,609 @@
 commit bd283191b3e7a4c6d1c100d3d96e348a1ebffe55
 Author: Vitaliy Filippov <vitalif@yourcmc.ru>
 Date:   Sun Jun 27 12:52:40 2021 +0300
    Add Vitastor support
 diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
 index aa50eac..082b4f8 100644
 --- a/docs/schemas/domaincommon.rng
 +++ b/docs/schemas/domaincommon.rng
@@ -1728,6 +1728,35 @@
     </element>
   </define>
 +  <define name="diskSourceNetworkProtocolVitastor">
 +    <element name="source">
 +      <interleave>
 +        <attribute name="protocol">
 +          <value>vitastor</value>
 +        </attribute>
 +        <ref name="diskSourceCommon"/>
 +        <optional>
 +          <attribute name="name"/>
 +        </optional>
 +        <optional>
 +          <attribute name="query"/>
 +        </optional>
 +        <zeroOrMore>
 +          <ref name="diskSourceNetworkHost"/>
 +        </zeroOrMore>
 +        <optional>
 +          <element name="config">
 +            <attribute name="file">
 +              <ref name="absFilePath"/>
 +            </attribute>
 +            <empty/>
 +          </element>
 +        </optional>
 +        <empty/>
 +      </interleave>
 +    </element>
 +  </define>
 +
   <define name="diskSourceNetworkProtocolISCSI">
     <element name="source">
       <attribute name="protocol">
@@ -1851,6 +1880,7 @@
       <ref name="diskSourceNetworkProtocolHTTP"/>
       <ref name="diskSourceNetworkProtocolSimple"/>
       <ref name="diskSourceNetworkProtocolVxHS"/>
 +      <ref name="diskSourceNetworkProtocolVitastor"/>
     </choice>
   </define>
 diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
 index 4bf2b5f..dbc011b 100644
 --- a/include/libvirt/libvirt-storage.h
 +++ b/include/libvirt/libvirt-storage.h
@@ -240,6 +240,7 @@ typedef enum {
     VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER       = 1 << 16,
     VIR_CONNECT_LIST_STORAGE_POOLS_ZFS           = 1 << 17,
     VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE      = 1 << 18,
 +    VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR      = 1 << 20,
 } virConnectListAllStoragePoolsFlags;
 int                     virConnectListAllStoragePools(virConnectPtr conn,
 diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
 index 222bb8c..685d255 100644
 --- a/src/conf/domain_conf.c
 +++ b/src/conf/domain_conf.c
@@ -8653,6 +8653,10 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
         goto cleanup;
     }
 +    if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
 +        src->relPath = virXMLPropString(node, "query");
 +    }
 +
     if ((haveTLS = virXMLPropString(node, "tls")) &&
         (src->haveTLS = virTristateBoolTypeFromString(haveTLS)) <= 0) {
         virReportError(VIR_ERR_XML_ERROR,
@@ -23849,6 +23853,10 @@ virDomainDiskSourceFormatNetwork(virBufferPtr attrBuf,
     virBufferEscapeString(attrBuf, " name='%s'", path ? path : src->path);
 +    if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR && src->relPath != NULL) {
 +        virBufferEscapeString(attrBuf, " query='%s'", src->relPath);
 +    }
 +
     VIR_FREE(path);
     if (src->haveTLS != VIR_TRISTATE_BOOL_ABSENT &&
@@ -30930,6 +30938,7 @@ virDomainDiskTranslateSourcePool(virDomainDiskDefPtr def)
     case VIR_STORAGE_POOL_MPATH:
     case VIR_STORAGE_POOL_RBD:
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_SHEEPDOG:
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_LAST:
 diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
 index 55db7a9..7cbe937 100644
 --- a/src/conf/storage_conf.c
 +++ b/src/conf/storage_conf.c
@@ -58,7 +58,7 @@ VIR_ENUM_IMPL(virStoragePool,
               "logical", "disk", "iscsi",
               "iscsi-direct", "scsi", "mpath",
               "rbd", "sheepdog", "gluster",
 -              "zfs", "vstorage")
 +              "zfs", "vstorage", "vitastor")
 VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
               VIR_STORAGE_POOL_FS_LAST,
@@ -232,6 +232,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
           .formatToString = virStorageFileFormatTypeToString,
       }
     },
 +    {.poolType = VIR_STORAGE_POOL_VITASTOR,
 +     .poolOptions = {
 +         .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
 +                   VIR_STORAGE_POOL_SOURCE_NETWORK |
 +                   VIR_STORAGE_POOL_SOURCE_NAME),
 +      },
 +      .volOptions = {
 +          .defaultFormat = VIR_STORAGE_FILE_RAW,
 +          .formatFromString = virStorageVolumeFormatFromString,
 +          .formatToString = virStorageFileFormatTypeToString,
 +      }
 +    },
     {.poolType = VIR_STORAGE_POOL_SHEEPDOG,
      .poolOptions = {
          .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -434,6 +446,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
                        _("element 'name' is mandatory for RBD pool"));
         goto cleanup;
     }
 +    if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
 +        virReportError(VIR_ERR_XML_ERROR, "%s",
 +                       _("element 'name' is mandatory for Vitastor pool"));
 +        goto cleanup;
 +    }
     if (options->formatFromString) {
         char *format = virXPathString("string(./format/@type)", ctxt);
@@ -1009,6 +1026,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
     /* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
      * files, so they don't have a target */
     if (def->type != VIR_STORAGE_POOL_RBD &&
 +        def->type != VIR_STORAGE_POOL_VITASTOR &&
         def->type != VIR_STORAGE_POOL_SHEEPDOG &&
         def->type != VIR_STORAGE_POOL_GLUSTER &&
         def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
 diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
 index dc0aa2a..ed4983d 100644
 --- a/src/conf/storage_conf.h
 +++ b/src/conf/storage_conf.h
@@ -91,6 +91,7 @@ typedef enum {
     VIR_STORAGE_POOL_GLUSTER,  /* Gluster device */
     VIR_STORAGE_POOL_ZFS,      /* ZFS */
     VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
 +    VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
     VIR_STORAGE_POOL_LAST,
 } virStoragePoolType;
@@ -422,6 +423,7 @@ VIR_ENUM_DECL(virStoragePartedFs)
                  VIR_CONNECT_LIST_STORAGE_POOLS_SCSI     | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_MPATH    | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_RBD      | \
 +                 VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER  | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_ZFS      | \
 diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
 index 6ea6a97..3ba45b9 100644
 --- a/src/conf/virstorageobj.c
 +++ b/src/conf/virstorageobj.c
@@ -1478,6 +1478,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
             return 1;
         break;
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_RBD:
     case VIR_STORAGE_POOL_LAST:
         break;
@@ -1971,6 +1972,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
                (obj->def->type == VIR_STORAGE_POOL_MPATH))   ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
                (obj->def->type == VIR_STORAGE_POOL_RBD))     ||
 +              (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
 +               (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
                (obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
 diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
 index 2ea3e94..d5d2273 100644
 --- a/src/libvirt-storage.c
 +++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
  * VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
  * VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
  * VIR_CONNECT_LIST_STORAGE_POOLS_RBD
 + * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
  * VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
  *
  * Returns the number of storage pools found or -1 and sets @pools to
 diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
 index 73e988a..ab7bb81 100644
 --- a/src/libxl/libxl_conf.c
 +++ b/src/libxl/libxl_conf.c
@@ -905,6 +905,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
 index cbf0aa4..096700d 100644
 --- a/src/qemu/qemu_block.c
 +++ b/src/qemu/qemu_block.c
@@ -959,6 +959,42 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src)
 }
 +static virJSONValuePtr
 +qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
 +{
 +    virJSONValuePtr ret = NULL;
 +    virStorageNetHostDefPtr host;
 +    size_t i;
 +    virBuffer buf = VIR_BUFFER_INITIALIZER;
 +    char *etcd = NULL;
 +
 +    for (i = 0; i < src->nhosts; i++) {
 +        host = src->hosts + i;
 +        if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
 +            goto cleanup;
 +        }
 +        virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
 +    }
 +    if (src->nhosts > 0) {
 +        etcd = virBufferContentAndReset(&buf);
 +    }
 +
 +    if (virJSONValueObjectCreate(&ret,
 +                                 "s:driver", "vitastor",
 +                                 "S:etcd_host", etcd,
 +                                 "S:etcd_prefix", src->relPath,
 +                                 "S:config_path", src->configFile,
 +                                 "s:image", src->path,
 +                                 NULL) < 0)
 +        goto cleanup;
 +
 +cleanup:
 +    VIR_FREE(etcd);
 +    virBufferFreeAndReset(&buf);
 +    return ret;
 +}
 +
 +
 static virJSONValuePtr
 qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
 {
@@ -1174,6 +1210,11 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
                 return NULL;
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
 +                return NULL;
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
                 return NULL;
 diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
 index 822d5f8..e375cef 100644
 --- a/src/qemu/qemu_command.c
 +++ b/src/qemu/qemu_command.c
@@ -975,6 +975,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
             ret = virBufferContentAndReset(&buf);
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            if (strchr(src->path, ':')) {
 +                virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
 +                               _("':' not allowed in Vitastor source volume name '%s'"),
 +                               src->path);
 +                return NULL;
 +            }
 +
 +            virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
 +
 +            if (src->nhosts > 0) {
 +                virBufferAddLit(&buf, ":etcd_host=");
 +                for (i = 0; i < src->nhosts; i++) {
 +                    if (i)
 +                        virBufferAddLit(&buf, ",");
 +
 +                    /* assume host containing : is ipv6 */
 +                    if (strchr(src->hosts[i].name, ':'))
 +                        virBufferEscape(&buf, '\\', ":", "[%s]",
 +                                        src->hosts[i].name);
 +                    else
 +                        virBufferAsprintf(&buf, "%s", src->hosts[i].name);
 +
 +                    if (src->hosts[i].port)
 +                        virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
 +                }
 +            }
 +
 +            if (src->configFile)
 +                virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
 +
 +            if (src->relPath)
 +                virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->relPath);
 +
 +            ret = virBufferContentAndReset(&buf);
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_VXHS:
             virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                            _("VxHS protocol does not support URI syntax"));
 diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
 index ec6b340..f399efa 100644
 --- a/src/qemu/qemu_domain.c
 +++ b/src/qemu/qemu_domain.c
@@ -10881,6 +10881,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
         break;
     case VIR_STORAGE_NET_PROTOCOL_RBD:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
     case VIR_STORAGE_NET_PROTOCOL_ISCSI:
 diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
 index 1d96170..2d24396 100644
 --- a/src/qemu/qemu_driver.c
 +++ b/src/qemu/qemu_driver.c
@@ -14687,6 +14687,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi
         case VIR_STORAGE_NET_PROTOCOL_TFTP:
         case VIR_STORAGE_NET_PROTOCOL_SSH:
         case VIR_STORAGE_NET_PROTOCOL_VXHS:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_LAST:
             virReportError(VIR_ERR_INTERNAL_ERROR,
                            _("external inactive snapshots are not supported on "
@@ -14764,6 +14765,7 @@ qemuDomainSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDefPtr snapdisk
         case VIR_STORAGE_NET_PROTOCOL_TFTP:
         case VIR_STORAGE_NET_PROTOCOL_SSH:
         case VIR_STORAGE_NET_PROTOCOL_VXHS:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_LAST:
             virReportError(VIR_ERR_INTERNAL_ERROR,
                            _("external active snapshots are not supported on "
@@ -14887,6 +14889,7 @@ qemuDomainSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
         case VIR_STORAGE_NET_PROTOCOL_TFTP:
         case VIR_STORAGE_NET_PROTOCOL_SSH:
         case VIR_STORAGE_NET_PROTOCOL_VXHS:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_LAST:
             virReportError(VIR_ERR_INTERNAL_ERROR,
                            _("internal inactive snapshots are not supported on "
 diff --git a/src/qemu/qemu_parse_command.c b/src/qemu/qemu_parse_command.c
 index c4650f0..551da41 100644
 --- a/src/qemu/qemu_parse_command.c
 +++ b/src/qemu/qemu_parse_command.c
@@ -2184,6 +2184,7 @@ qemuParseCommandLine(virFileCachePtr capsCache,
                 case VIR_STORAGE_NET_PROTOCOL_TFTP:
                 case VIR_STORAGE_NET_PROTOCOL_SSH:
                 case VIR_STORAGE_NET_PROTOCOL_LAST:
 +                case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
                 case VIR_STORAGE_NET_PROTOCOL_NONE:
                     /* ignored for now */
                     break;
 diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
 index 4a13e90..33301c7 100644
 --- a/src/storage/storage_driver.c
 +++ b/src/storage/storage_driver.c
@@ -1568,6 +1568,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
         case VIR_STORAGE_POOL_RBD:
         case VIR_STORAGE_POOL_SHEEPDOG:
         case VIR_STORAGE_POOL_ZFS:
 +        case VIR_STORAGE_POOL_VITASTOR:
         case VIR_STORAGE_POOL_LAST:
             ignore_value(VIR_STRDUP(stable_path, data->path));
             break;
 diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
 index bd4b027..b323cd6 100644
 --- a/src/util/virstoragefile.c
 +++ b/src/util/virstoragefile.c
@@ -84,7 +84,8 @@ VIR_ENUM_IMPL(virStorageNetProtocol, VIR_STORAGE_NET_PROTOCOL_LAST,
               "ftps",
               "tftp",
               "ssh",
 -              "vxhs")
 +              "vxhs",
 +              "vitastor")
 VIR_ENUM_IMPL(virStorageNetHostTransport, VIR_STORAGE_NET_HOST_TRANS_LAST,
               "tcp",
@@ -2839,6 +2840,83 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
 }
 +static int
 +virStorageSourceParseVitastorColonString(const char *colonstr,
 +                                         virStorageSourcePtr src)
 +{
 +    char *p, *e, *next;
 +    char *options = NULL;
 +
 +    /* optionally skip the "vitastor:" prefix if provided */
 +    if (STRPREFIX(colonstr, "vitastor:"))
 +        colonstr += strlen("vitastor:");
 +
 +    if (VIR_STRDUP(options, colonstr) < 0)
 +        return -1;
 +
 +    p = options;
 +    while (*p) {
 +        /* find : delimiter or end of string */
 +        for (e = p; *e && *e != ':'; ++e) {
 +            if (*e == '\\') {
 +                e++;
 +                if (*e == '\0')
 +                    break;
 +            }
 +        }
 +        if (*e == '\0') {
 +            next = e;    /* last kv pair */
 +        } else {
 +            next = e + 1;
 +            *e = '\0';
 +        }
 +
 +        if (STRPREFIX(p, "image=")) {
 +            if (VIR_STRDUP(src->path, p + strlen("image=")) < 0)
 +                goto error;
 +        } else if (STRPREFIX(p, "etcd_prefix=")) {
 +            if (VIR_STRDUP(src->relPath, p + strlen("etcd_prefix=")) < 0)
 +                goto error;
 +        } else if (STRPREFIX(p, "config_path=")) {
 +            if (VIR_STRDUP(src->configFile, p + strlen("config_path=")) < 0)
 +                goto error;
 +        } else if (STRPREFIX(p, "etcd_host=")) {
 +            char *h, *sep;
 +
 +            h = p + strlen("etcd_host=");
 +            while (h < e) {
 +                for (sep = h; sep < e; ++sep) {
 +                    if (*sep == '\\' && (sep[1] == ',' ||
 +                                         sep[1] == ';' ||
 +                                         sep[1] == ' ')) {
 +                        *sep = '\0';
 +                        sep += 2;
 +                        break;
 +                    }
 +                }
 +
 +                if (virStorageSourceRBDAddHost(src, h) < 0)
 +                    goto error;
 +
 +                h = sep;
 +            }
 +        }
 +
 +        p = next;
 +    }
 +
 +    if (!src->path)
 +        goto error;
 +
 +    VIR_FREE(options);
 +    return 0;
 +
 +error:
 +    VIR_FREE(options);
 +    return -1;
 +}
 +
 +
 static int
 virStorageSourceParseNBDColonString(const char *nbdstr,
                                     virStorageSourcePtr src)
@@ -2942,6 +3020,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
             goto cleanup;
         break;
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +        if (virStorageSourceParseVitastorColonString(path, src) < 0)
 +            return -1;
 +        break;
 +
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3441,6 +3524,56 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
     return ret;
 }
 +static int
 +virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
 +                                         virJSONValuePtr json,
 +                                         int opaque ATTRIBUTE_UNUSED)
 +{
 +    const char *filename;
 +    const char *image = virJSONValueObjectGetString(json, "image");
 +    const char *conf = virJSONValueObjectGetString(json, "config_path");
 +    const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
 +    virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
 +    size_t nservers;
 +    size_t i;
 +
 +    src->type = VIR_STORAGE_TYPE_NETWORK;
 +    src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
 +
 +    /* legacy syntax passed via 'filename' option */
 +    if ((filename = virJSONValueObjectGetString(json, "filename")))
 +        return virStorageSourceParseVitastorColonString(filename, src);
 +
 +    if (!image) {
 +        virReportError(VIR_ERR_INVALID_ARG, "%s",
 +                       _("missing image name in Vitastor backing volume "
 +                         "JSON specification"));
 +        return -1;
 +    }
 +
 +    if (VIR_STRDUP(src->path, image) < 0 ||
 +        VIR_STRDUP(src->configFile, conf) < 0 ||
 +        VIR_STRDUP(src->relPath, etcd_prefix) < 0)
 +        return -1;
 +
 +    if (servers) {
 +        nservers = virJSONValueArraySize(servers);
 +
 +        if (VIR_ALLOC_N(src->hosts, nservers) < 0)
 +            return -1;
 +
 +        src->nhosts = nservers;
 +
 +        for (i = 0; i < nservers; i++) {
 +            if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
 +                                                                  virJSONValueArrayGet(servers, i)) < 0)
 +                return -1;
 +        }
 +    }
 +
 +    return 0;
 +}
 +
 static int
 virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
                                     virJSONValuePtr json,
@@ -3507,6 +3640,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
     {"sheepdog", virStorageSourceParseBackingJSONSheepdog, 0},
     {"ssh", virStorageSourceParseBackingJSONSSH, 0},
     {"rbd", virStorageSourceParseBackingJSONRBD, 0},
 +    {"vitastor", virStorageSourceParseBackingJSONVitastor, 0},
     {"raw", virStorageSourceParseBackingJSONRaw, 0},
     {"vxhs", virStorageSourceParseBackingJSONVxHS, 0},
 };
@@ -4276,6 +4410,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
             return 24007;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
             /* we don't provide a default for RBD */
             return 0;
 diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
 index 1d6161a..8d83bf3 100644
 --- a/src/util/virstoragefile.h
 +++ b/src/util/virstoragefile.h
@@ -134,6 +134,7 @@ typedef enum {
     VIR_STORAGE_NET_PROTOCOL_TFTP,
     VIR_STORAGE_NET_PROTOCOL_SSH,
     VIR_STORAGE_NET_PROTOCOL_VXHS,
 +    VIR_STORAGE_NET_PROTOCOL_VITASTOR,
     VIR_STORAGE_NET_PROTOCOL_LAST
 } virStorageNetProtocol;
 diff --git a/src/xenconfig/xen_xl.c b/src/xenconfig/xen_xl.c
 index accfc3a..a18f9c3 100644
 --- a/src/xenconfig/xen_xl.c
 +++ b/src/xenconfig/xen_xl.c
@@ -1535,6 +1535,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
 index 70ca39b..9caef51 100644
 --- a/tools/virsh-pool.c
 +++ b/tools/virsh-pool.c
@@ -1219,6 +1219,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd ATTRIBUTE_UNUSED)
             case VIR_STORAGE_POOL_VSTORAGE:
                 flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
                 break;
 +            case VIR_STORAGE_POOL_VITASTOR:
 +                flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
 +                break;
             case VIR_STORAGE_POOL_LAST:
                 break;
             }
--- a/patches/libvirt-7.0-vitastor.diff
+++ b/patches/libvirt-7.0-vitastor.diff
@ -0,0 +1,657 @@
 commit 41cdfe8317d98f70aadedfdbb381effed2641bdd
 Author: Vitaliy Filippov <vitalif@yourcmc.ru>
 Date:   Fri Jul 9 01:31:57 2021 +0300
    Add Vitastor support
 diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
 index 7dc419b..875433b 100644
 --- a/docs/schemas/domaincommon.rng
 +++ b/docs/schemas/domaincommon.rng
@@ -1827,6 +1827,35 @@
     </element>
   </define>
 +  <define name="diskSourceNetworkProtocolVitastor">
 +    <element name="source">
 +      <interleave>
 +        <attribute name="protocol">
 +          <value>vitastor</value>
 +        </attribute>
 +        <ref name="diskSourceCommon"/>
 +        <optional>
 +          <attribute name="name"/>
 +        </optional>
 +        <optional>
 +          <attribute name="query"/>
 +        </optional>
 +        <zeroOrMore>
 +          <ref name="diskSourceNetworkHost"/>
 +        </zeroOrMore>
 +        <optional>
 +          <element name="config">
 +            <attribute name="file">
 +              <ref name="absFilePath"/>
 +            </attribute>
 +            <empty/>
 +          </element>
 +        </optional>
 +        <empty/>
 +      </interleave>
 +    </element>
 +  </define>
 +
   <define name="diskSourceNetworkProtocolISCSI">
     <element name="source">
       <attribute name="protocol">
@@ -2083,6 +2112,7 @@
       <ref name="diskSourceNetworkProtocolSimple"/>
       <ref name="diskSourceNetworkProtocolVxHS"/>
       <ref name="diskSourceNetworkProtocolNFS"/>
 +      <ref name="diskSourceNetworkProtocolVitastor"/>
     </choice>
   </define>
 diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
 index 089e1e0..d7e7ef4 100644
 --- a/include/libvirt/libvirt-storage.h
 +++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
     VIR_CONNECT_LIST_STORAGE_POOLS_ZFS           = 1 << 17,
     VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE      = 1 << 18,
     VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT  = 1 << 19,
 +    VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR      = 1 << 20,
 } virConnectListAllStoragePoolsFlags;
 int                     virConnectListAllStoragePools(virConnectPtr conn,
 diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
 index 01b7187..c6e9702 100644
 --- a/src/conf/domain_conf.c
 +++ b/src/conf/domain_conf.c
@@ -8261,7 +8261,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
     src->configFile = virXPathString("string(./config/@file)", ctxt);
     if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
 -        src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
 +        src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
 +        src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
         src->query = virXMLPropString(node, "query");
     if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -31392,6 +31393,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSourcePtr src,
     case VIR_STORAGE_POOL_MPATH:
     case VIR_STORAGE_POOL_RBD:
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_SHEEPDOG:
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_LAST:
 diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
 index 0c50529..fe97574 100644
 --- a/src/conf/storage_conf.c
 +++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
               "logical", "disk", "iscsi",
               "iscsi-direct", "scsi", "mpath",
               "rbd", "sheepdog", "gluster",
 -              "zfs", "vstorage",
 +              "zfs", "vstorage", "vitastor",
 );
 VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -249,6 +249,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
           .formatToString = virStorageFileFormatTypeToString,
       }
     },
 +    {.poolType = VIR_STORAGE_POOL_VITASTOR,
 +     .poolOptions = {
 +         .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
 +                   VIR_STORAGE_POOL_SOURCE_NETWORK |
 +                   VIR_STORAGE_POOL_SOURCE_NAME),
 +      },
 +      .volOptions = {
 +          .defaultFormat = VIR_STORAGE_FILE_RAW,
 +          .formatFromString = virStorageVolumeFormatFromString,
 +          .formatToString = virStorageFileFormatTypeToString,
 +      }
 +    },
     {.poolType = VIR_STORAGE_POOL_SHEEPDOG,
      .poolOptions = {
          .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -551,6 +563,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
                        _("element 'name' is mandatory for RBD pool"));
         goto cleanup;
     }
 +    if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
 +        virReportError(VIR_ERR_XML_ERROR, "%s",
 +                       _("element 'name' is mandatory for Vitastor pool"));
 +        goto cleanup;
 +    }
     if (options->formatFromString) {
         g_autofree char *format = NULL;
@@ -1217,6 +1234,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
     /* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
      * files, so they don't have a target */
     if (def->type != VIR_STORAGE_POOL_RBD &&
 +        def->type != VIR_STORAGE_POOL_VITASTOR &&
         def->type != VIR_STORAGE_POOL_SHEEPDOG &&
         def->type != VIR_STORAGE_POOL_GLUSTER &&
         def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
 diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
 index ffd406e..8868a05 100644
 --- a/src/conf/storage_conf.h
 +++ b/src/conf/storage_conf.h
@@ -110,6 +110,7 @@ typedef enum {
     VIR_STORAGE_POOL_GLUSTER,  /* Gluster device */
     VIR_STORAGE_POOL_ZFS,      /* ZFS */
     VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
 +    VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
     VIR_STORAGE_POOL_LAST,
 } virStoragePoolType;
@@ -474,6 +475,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
                  VIR_CONNECT_LIST_STORAGE_POOLS_SCSI     | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_MPATH    | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_RBD      | \
 +                 VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER  | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_ZFS      | \
 diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
 index 9fe8b3f..bf595b0 100644
 --- a/src/conf/virstorageobj.c
 +++ b/src/conf/virstorageobj.c
@@ -1491,6 +1491,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
             return 1;
         break;
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_RBD:
     case VIR_STORAGE_POOL_LAST:
         break;
@@ -1990,6 +1991,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
                (obj->def->type == VIR_STORAGE_POOL_MPATH))   ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
                (obj->def->type == VIR_STORAGE_POOL_RBD))     ||
 +              (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
 +               (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
                (obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
 diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
 index 2a7cdca..f756be1 100644
 --- a/src/libvirt-storage.c
 +++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
  * VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
  * VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
  * VIR_CONNECT_LIST_STORAGE_POOLS_RBD
 + * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
  * VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
  * VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
  * VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
 diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
 index 6a8ae27..a735bc6 100644
 --- a/src/libxl/libxl_conf.c
 +++ b/src/libxl/libxl_conf.c
@@ -942,6 +942,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
     case VIR_STORAGE_NET_PROTOCOL_NFS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
 index 17b93d0..c5a0084 100644
 --- a/src/libxl/xen_xl.c
 +++ b/src/libxl/xen_xl.c
@@ -1601,6 +1601,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
     case VIR_STORAGE_NET_PROTOCOL_NFS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
 index f9c6da2..922dde5 100644
 --- a/src/qemu/qemu_block.c
 +++ b/src/qemu/qemu_block.c
@@ -938,6 +938,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src,
 }
 +static virJSONValuePtr
 +qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
 +{
 +    virJSONValuePtr ret = NULL;
 +    virStorageNetHostDefPtr host;
 +    size_t i;
 +    g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
 +    g_autofree char *etcd = NULL;
 +
 +    for (i = 0; i < src->nhosts; i++) {
 +        host = src->hosts + i;
 +        if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
 +            return NULL;
 +        }
 +        virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
 +    }
 +    if (src->nhosts > 0) {
 +        etcd = virBufferContentAndReset(&buf);
 +    }
 +
 +    if (virJSONValueObjectCreate(&ret,
 +                                 "S:etcd_host", etcd,
 +                                 "S:etcd_prefix", src->query,
 +                                 "S:config_path", src->configFile,
 +                                 "s:image", src->path,
 +                                 NULL) < 0)
 +        return NULL;
 +
 +    return ret;
 +}
 +
 +
 static virJSONValuePtr
 qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
 {
@@ -1224,6 +1256,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
                 return NULL;
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            driver = "vitastor";
 +            if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
 +                return NULL;
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             driver = "sheepdog";
             if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2183,6 +2221,7 @@ qemuBlockGetBackingStoreString(virStorageSourcePtr src,
             case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             case VIR_STORAGE_NET_PROTOCOL_RBD:
 +            case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
             case VIR_STORAGE_NET_PROTOCOL_VXHS:
             case VIR_STORAGE_NET_PROTOCOL_NFS:
             case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2560,6 +2599,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSourcePtr src,
                 return -1;
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            driver = "vitastor";
 +            if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
 +                return -1;
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             driver = "sheepdog";
             if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
 diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
 index 6f970a3..10b39ca 100644
 --- a/src/qemu/qemu_command.c
 +++ b/src/qemu/qemu_command.c
@@ -1034,6 +1034,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
             ret = virBufferContentAndReset(&buf);
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            if (strchr(src->path, ':')) {
 +                virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
 +                               _("':' not allowed in Vitastor source volume name '%s'"),
 +                               src->path);
 +                return NULL;
 +            }
 +
 +            virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
 +
 +            if (src->nhosts > 0) {
 +                virBufferAddLit(&buf, ":etcd_host=");
 +                for (i = 0; i < src->nhosts; i++) {
 +                    if (i)
 +                        virBufferAddLit(&buf, ",");
 +
 +                    /* assume host containing : is ipv6 */
 +                    if (strchr(src->hosts[i].name, ':'))
 +                        virBufferEscape(&buf, '\\', ":", "[%s]",
 +                                        src->hosts[i].name);
 +                    else
 +                        virBufferAsprintf(&buf, "%s", src->hosts[i].name);
 +
 +                    if (src->hosts[i].port)
 +                        virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
 +                }
 +            }
 +
 +            if (src->configFile)
 +                virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
 +
 +            if (src->query)
 +                virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
 +
 +            ret = virBufferContentAndReset(&buf);
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_VXHS:
             virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                            _("VxHS protocol does not support URI syntax"));
 diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
 index 0765dc7..4cff344 100644
 --- a/src/qemu/qemu_domain.c
 +++ b/src/qemu/qemu_domain.c
@@ -4610,7 +4610,8 @@ qemuDomainValidateStorageSource(virStorageSourcePtr src,
     if (src->query &&
         (actualType != VIR_STORAGE_TYPE_NETWORK ||
          (src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
 -          src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
 +          src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
 +          src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
                        _("query is supported only with HTTP(S) protocols"));
         return -1;
@@ -9704,6 +9705,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
         break;
     case VIR_STORAGE_NET_PROTOCOL_RBD:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
     case VIR_STORAGE_NET_PROTOCOL_ISCSI:
 diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
 index ee333c3..674aa58 100644
 --- a/src/qemu/qemu_snapshot.c
 +++ b/src/qemu/qemu_snapshot.c
@@ -403,6 +403,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdisk,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -493,6 +494,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObjPtr vm,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
         case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -623,6 +625,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
 diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
 index 16bc53a..1e5d820 100644
 --- a/src/storage/storage_driver.c
 +++ b/src/storage/storage_driver.c
@@ -1645,6 +1645,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
         case VIR_STORAGE_POOL_GLUSTER:
         case VIR_STORAGE_POOL_RBD:
 +        case VIR_STORAGE_POOL_VITASTOR:
         case VIR_STORAGE_POOL_SHEEPDOG:
         case VIR_STORAGE_POOL_ZFS:
         case VIR_STORAGE_POOL_LAST:
 diff --git a/src/test/test_driver.c b/src/test/test_driver.c
 index 29c4c86..a27ad94 100644
 --- a/src/test/test_driver.c
 +++ b/src/test/test_driver.c
@@ -7096,6 +7096,7 @@ testStorageVolumeTypeForPool(int pooltype)
     case VIR_STORAGE_POOL_ISCSI_DIRECT:
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_RBD:
 +    case VIR_STORAGE_POOL_VITASTOR:
         return VIR_STORAGE_VOL_NETWORK;
     case VIR_STORAGE_POOL_LOGICAL:
     case VIR_STORAGE_POOL_DISK:
 diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
 index 0d3c2af..36e3afc 100644
 --- a/src/util/virstoragefile.c
 +++ b/src/util/virstoragefile.c
@@ -91,6 +91,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
               "ssh",
               "vxhs",
               "nfs",
 +              "vitastor",
 );
 VIR_ENUM_IMPL(virStorageNetHostTransport,
@@ -2880,6 +2881,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
 }
 +static int
 +virStorageSourceParseVitastorColonString(const char *colonstr,
 +                                         virStorageSourcePtr src)
 +{
 +    char *p, *e, *next;
 +    g_autofree char *options = NULL;
 +
 +    /* optionally skip the "vitastor:" prefix if provided */
 +    if (STRPREFIX(colonstr, "vitastor:"))
 +        colonstr += strlen("vitastor:");
 +
 +    options = g_strdup(colonstr);
 +
 +    p = options;
 +    while (*p) {
 +        /* find : delimiter or end of string */
 +        for (e = p; *e && *e != ':'; ++e) {
 +            if (*e == '\\') {
 +                e++;
 +                if (*e == '\0')
 +                    break;
 +            }
 +        }
 +        if (*e == '\0') {
 +            next = e;    /* last kv pair */
 +        } else {
 +            next = e + 1;
 +            *e = '\0';
 +        }
 +
 +        if (STRPREFIX(p, "image=")) {
 +            src->path = g_strdup(p + strlen("image="));
 +        } else if (STRPREFIX(p, "etcd_prefix=")) {
 +            src->query = g_strdup(p + strlen("etcd_prefix="));
 +        } else if (STRPREFIX(p, "config_path=")) {
 +            src->configFile = g_strdup(p + strlen("config_path="));
 +        } else if (STRPREFIX(p, "etcd_host=")) {
 +            char *h, *sep;
 +
 +            h = p + strlen("etcd_host=");
 +            while (h < e) {
 +                for (sep = h; sep < e; ++sep) {
 +                    if (*sep == '\\' && (sep[1] == ',' ||
 +                                         sep[1] == ';' ||
 +                                         sep[1] == ' ')) {
 +                        *sep = '\0';
 +                        sep += 2;
 +                        break;
 +                    }
 +                }
 +
 +                if (virStorageSourceRBDAddHost(src, h) < 0)
 +                    return -1;
 +
 +                h = sep;
 +            }
 +        }
 +
 +        p = next;
 +    }
 +
 +    if (!src->path) {
 +        return -1;
 +    }
 +
 +    return 0;
 +}
 +
 +
 static int
 virStorageSourceParseNBDColonString(const char *nbdstr,
                                     virStorageSourcePtr src)
@@ -2992,6 +3062,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
             return -1;
         break;
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +        if (virStorageSourceParseVitastorColonString(path, src) < 0)
 +            return -1;
 +        break;
 +
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3581,6 +3656,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
     return 0;
 }
 +static int
 +virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
 +                                         virJSONValuePtr json,
 +                                         const char *jsonstr G_GNUC_UNUSED,
 +                                         int opaque G_GNUC_UNUSED)
 +{
 +    const char *filename;
 +    const char *image = virJSONValueObjectGetString(json, "image");
 +    const char *conf = virJSONValueObjectGetString(json, "config_path");
 +    const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
 +    virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
 +    size_t nservers;
 +    size_t i;
 +
 +    src->type = VIR_STORAGE_TYPE_NETWORK;
 +    src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
 +
 +    /* legacy syntax passed via 'filename' option */
 +    if ((filename = virJSONValueObjectGetString(json, "filename")))
 +        return virStorageSourceParseVitastorColonString(filename, src);
 +
 +    if (!image) {
 +        virReportError(VIR_ERR_INVALID_ARG, "%s",
 +                       _("missing image name in Vitastor backing volume "
 +                         "JSON specification"));
 +        return -1;
 +    }
 +
 +    src->path = g_strdup(image);
 +    src->configFile = g_strdup(conf);
 +    src->query = g_strdup(etcd_prefix);
 +
 +    if (servers) {
 +        nservers = virJSONValueArraySize(servers);
 +
 +        src->hosts = g_new0(virStorageNetHostDef, nservers);
 +        src->nhosts = nservers;
 +
 +        for (i = 0; i < nservers; i++) {
 +            if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
 +                                                                  virJSONValueArrayGet(servers, i)) < 0)
 +                return -1;
 +        }
 +    }
 +
 +    return 0;
 +}
 +
 static int
 virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
                                     virJSONValuePtr json,
@@ -3759,6 +3882,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
     {"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
     {"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
     {"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
 +    {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
     {"raw", true, virStorageSourceParseBackingJSONRaw, 0},
     {"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
     {"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
@@ -4503,6 +4627,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
             return 24007;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
             /* we don't provide a default for RBD */
             return 0;
 diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
 index 5689c39..3eb4e3c 100644
 --- a/src/util/virstoragefile.h
 +++ b/src/util/virstoragefile.h
@@ -136,6 +136,7 @@ typedef enum {
     VIR_STORAGE_NET_PROTOCOL_SSH,
     VIR_STORAGE_NET_PROTOCOL_VXHS,
     VIR_STORAGE_NET_PROTOCOL_NFS,
 +    VIR_STORAGE_NET_PROTOCOL_VITASTOR,
     VIR_STORAGE_NET_PROTOCOL_LAST
 } virStorageNetProtocol;
 diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
 index eee75af..8bd0a57 100644
 --- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
 +++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
       </enum>
     </volOptions>
   </pool>
 +  <pool type='vitastor' supported='no'>
 +    <volOptions>
 +      <defaultFormat type='raw'/>
 +      <enum name='targetFormatType'>
 +      </enum>
 +    </volOptions>
 +  </pool>
 </storagepoolCapabilities>
 diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
 index 805950a..852df0d 100644
 --- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
 +++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
       </enum>
     </volOptions>
   </pool>
 +  <pool type='vitastor' supported='yes'>
 +    <volOptions>
 +      <defaultFormat type='raw'/>
 +      <enum name='targetFormatType'>
 +      </enum>
 +    </volOptions>
 +  </pool>
 </storagepoolCapabilities>
 diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
 index 967d1f2..1e8ff7a 100644
 --- a/tests/storagepoolxml2argvtest.c
 +++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_ZFS:
     case VIR_STORAGE_POOL_VSTORAGE:
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_LAST:
     default:
         VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
 diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
 index 7835fa6..8841fcf 100644
 --- a/tools/virsh-pool.c
 +++ b/tools/virsh-pool.c
@@ -1237,6 +1237,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
             case VIR_STORAGE_POOL_VSTORAGE:
                 flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
                 break;
 +            case VIR_STORAGE_POOL_VITASTOR:
 +                flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
 +                break;
             case VIR_STORAGE_POOL_LAST:
                 break;
             }
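The legacy colon syntax used by the patch above ("vitastor:image=...:etcd_host=...:etcd_prefix=...") separates options with ':' and escapes a literal ':' inside a value with a backslash. Below is a minimal standalone sketch of that tokenizing rule, assuming plain libc instead of libvirt's helpers. The names next_option and get_option are hypothetical and do not exist in libvirt. Unlike virStorageSourceParseVitastorColonString, this sketch also removes the escaping backslashes from the returned value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libvirt): copy the next ':'-delimited
 * option from *cursor into out, honoring backslash escapes. An escaped
 * character is never treated as a delimiter and the backslash is dropped.
 * Output longer than outlen - 1 bytes is truncated.
 * Returns 1 if an option was produced, 0 at the end of the string. */
static int next_option(const char **cursor, char *out, size_t outlen)
{
    const char *p = *cursor;
    size_t n = 0;

    assert(out != NULL && outlen > 0);

    if (*p == '\0')
        return 0;

    while (*p && *p != ':') {
        if (*p == '\\' && p[1] != '\0')
            p++;                      /* skip the backslash, keep next char */
        if (n + 1 < outlen)
            out[n++] = *p;
        p++;
    }
    out[n] = '\0';

    /* step over the delimiter, or stay on the terminating NUL */
    *cursor = (*p == ':') ? p + 1 : p;
    return 1;
}

/* Hypothetical helper (not part of libvirt): find "key=" among the options
 * of a colon string and copy its unescaped value into out.
 * Returns 0 on success, -1 if the key is absent. */
static int get_option(const char *colonstr, const char *key,
                      char *out, size_t outlen)
{
    char opt[256];
    size_t klen = strlen(key);

    /* the "vitastor:" prefix is optional, as in the patch */
    if (strncmp(colonstr, "vitastor:", strlen("vitastor:")) == 0)
        colonstr += strlen("vitastor:");

    while (next_option(&colonstr, opt, sizeof(opt))) {
        if (strncmp(opt, key, klen) == 0 && opt[klen] == '=') {
            snprintf(out, outlen, "%s", opt + klen + 1);
            return 0;
        }
    }
    return -1;
}
```

Calling get_option(str, "etcd_host", buf, sizeof(buf)) on a string shaped like the one qemuBuildNetworkDriveStr emits would return the comma-separated host list with the "\:" port separators unescaped. Splitting that list into individual hosts is left out of the sketch.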
--- a/patches/libvirt-7.5-vitastor.diff
+++ b/patches/libvirt-7.5-vitastor.diff
@ -0,0 +1,661 @@
 commit c6e1958a1b4974828e8e5852beb252ce6594e670
 Author: Vitaliy Filippov <vitalif@yourcmc.ru>
 Date:   Mon Jun 28 01:20:19 2021 +0300
    Add Vitastor support
 diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
 index 5ea14b6..a9df168 100644
 --- a/docs/schemas/domaincommon.rng
 +++ b/docs/schemas/domaincommon.rng
@@ -1859,6 +1859,35 @@
     </element>
   </define>
 +  <define name="diskSourceNetworkProtocolVitastor">
 +    <element name="source">
 +      <interleave>
 +        <attribute name="protocol">
 +          <value>vitastor</value>
 +        </attribute>
 +        <ref name="diskSourceCommon"/>
 +        <optional>
 +          <attribute name="name"/>
 +        </optional>
 +        <optional>
 +          <attribute name="query"/>
 +        </optional>
 +        <zeroOrMore>
 +          <ref name="diskSourceNetworkHost"/>
 +        </zeroOrMore>
 +        <optional>
 +          <element name="config">
 +            <attribute name="file">
 +              <ref name="absFilePath"/>
 +            </attribute>
 +            <empty/>
 +          </element>
 +        </optional>
 +        <empty/>
 +      </interleave>
 +    </element>
 +  </define>
 +
   <define name="diskSourceNetworkProtocolISCSI">
     <element name="source">
       <attribute name="protocol">
@@ -2115,6 +2144,7 @@
       <ref name="diskSourceNetworkProtocolSimple"/>
       <ref name="diskSourceNetworkProtocolVxHS"/>
       <ref name="diskSourceNetworkProtocolNFS"/>
 +      <ref name="diskSourceNetworkProtocolVitastor"/>
     </choice>
   </define>
 diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
 index 089e1e0..d7e7ef4 100644
 --- a/include/libvirt/libvirt-storage.h
 +++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
     VIR_CONNECT_LIST_STORAGE_POOLS_ZFS           = 1 << 17,
     VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE      = 1 << 18,
     VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT  = 1 << 19,
 +    VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR      = 1 << 20,
 } virConnectListAllStoragePoolsFlags;
 int                     virConnectListAllStoragePools(virConnectPtr conn,
 diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
 index d78f846..f7222e3 100644
 --- a/src/conf/domain_conf.c
 +++ b/src/conf/domain_conf.c
@@ -8251,7 +8251,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
     src->configFile = virXPathString("string(./config/@file)", ctxt);
     if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
 -        src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
 +        src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
 +        src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
         src->query = virXMLPropString(node, "query");
     if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30775,6 +30776,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
     case VIR_STORAGE_POOL_MPATH:
     case VIR_STORAGE_POOL_RBD:
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_SHEEPDOG:
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_LAST:
 diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
 index 2aa9a3d..166ca1f 100644
 --- a/src/conf/storage_conf.c
 +++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
               "logical", "disk", "iscsi",
               "iscsi-direct", "scsi", "mpath",
               "rbd", "sheepdog", "gluster",
 -              "zfs", "vstorage",
 +              "zfs", "vstorage", "vitastor",
 );
 VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -246,6 +246,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
           .formatToString = virStorageFileFormatTypeToString,
       }
     },
 +    {.poolType = VIR_STORAGE_POOL_VITASTOR,
 +     .poolOptions = {
 +         .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
 +                   VIR_STORAGE_POOL_SOURCE_NETWORK |
 +                   VIR_STORAGE_POOL_SOURCE_NAME),
 +      },
 +      .volOptions = {
 +          .defaultFormat = VIR_STORAGE_FILE_RAW,
 +          .formatFromString = virStorageVolumeFormatFromString,
 +          .formatToString = virStorageFileFormatTypeToString,
 +      }
 +    },
     {.poolType = VIR_STORAGE_POOL_SHEEPDOG,
      .poolOptions = {
          .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -546,6 +558,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
                        _("element 'name' is mandatory for RBD pool"));
         return -1;
     }
 +    if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
 +        virReportError(VIR_ERR_XML_ERROR, "%s",
 +                       _("element 'name' is mandatory for Vitastor pool"));
 +        return -1;
 +    }
     if (options->formatFromString) {
         g_autofree char *format = NULL;
@@ -1182,6 +1199,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
     /* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
      * files, so they don't have a target */
     if (def->type != VIR_STORAGE_POOL_RBD &&
 +        def->type != VIR_STORAGE_POOL_VITASTOR &&
         def->type != VIR_STORAGE_POOL_SHEEPDOG &&
         def->type != VIR_STORAGE_POOL_GLUSTER &&
         def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
 diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
 index 76efaac..928149a 100644
 --- a/src/conf/storage_conf.h
 +++ b/src/conf/storage_conf.h
@@ -106,6 +106,7 @@ typedef enum {
     VIR_STORAGE_POOL_GLUSTER,  /* Gluster device */
     VIR_STORAGE_POOL_ZFS,      /* ZFS */
     VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
 +    VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
     VIR_STORAGE_POOL_LAST,
 } virStoragePoolType;
@@ -465,6 +466,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
                  VIR_CONNECT_LIST_STORAGE_POOLS_SCSI     | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_MPATH    | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_RBD      | \
 +                 VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER  | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_ZFS      | \
 diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
 index 5ca06fa..05ded49 100644
 --- a/src/conf/storage_source_conf.c
 +++ b/src/conf/storage_source_conf.c
@@ -85,6 +85,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
               "ssh",
               "vxhs",
               "nfs",
 +              "vitastor",
 );
@@ -1262,6 +1263,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
             return 24007;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
             /* we don't provide a default for RBD */
             return 0;
 diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
 index 389c7b5..dbf02e3 100644
 --- a/src/conf/storage_source_conf.h
 +++ b/src/conf/storage_source_conf.h
@@ -127,6 +127,7 @@ typedef enum {
     VIR_STORAGE_NET_PROTOCOL_SSH,
     VIR_STORAGE_NET_PROTOCOL_VXHS,
     VIR_STORAGE_NET_PROTOCOL_NFS,
 +    VIR_STORAGE_NET_PROTOCOL_VITASTOR,
     VIR_STORAGE_NET_PROTOCOL_LAST
 } virStorageNetProtocol;
 diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
 index 24957d6..4520a73 100644
 --- a/src/conf/virstorageobj.c
 +++ b/src/conf/virstorageobj.c
@@ -1487,6 +1487,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
             return 1;
         break;
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_RBD:
     case VIR_STORAGE_POOL_LAST:
         break;
@@ -1986,6 +1987,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
                (obj->def->type == VIR_STORAGE_POOL_MPATH))   ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
                (obj->def->type == VIR_STORAGE_POOL_RBD))     ||
 +              (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
 +               (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
                (obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
 diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
 index 2a7cdca..f756be1 100644
 --- a/src/libvirt-storage.c
 +++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
  * VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
  * VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
  * VIR_CONNECT_LIST_STORAGE_POOLS_RBD
 + * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
  * VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
  * VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
  * VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
 diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
 index 56cb9ab..dfb31b9 100644
 --- a/src/libxl/libxl_conf.c
 +++ b/src/libxl/libxl_conf.c
@@ -972,6 +972,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
     case VIR_STORAGE_NET_PROTOCOL_NFS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
 index c0905b0..c172378 100644
 --- a/src/libxl/xen_xl.c
 +++ b/src/libxl/xen_xl.c
@@ -1540,6 +1540,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
     case VIR_STORAGE_NET_PROTOCOL_NFS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
 index 6627d04..c33f428 100644
 --- a/src/qemu/qemu_block.c
 +++ b/src/qemu/qemu_block.c
@@ -928,6 +928,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
 }
 +static virJSONValue *
 +qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
 +{
 +    virJSONValue *ret = NULL;
 +    virStorageNetHostDef *host;
 +    size_t i;
 +    g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
 +    g_autofree char *etcd = NULL;
 +
 +    for (i = 0; i < src->nhosts; i++) {
 +        host = src->hosts + i;
 +        if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
 +            return NULL;
 +        }
 +        virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
 +    }
 +    if (src->nhosts > 0) {
 +        etcd = virBufferContentAndReset(&buf);
 +    }
 +
 +    if (virJSONValueObjectCreate(&ret,
 +                                 "S:etcd_host", etcd,
 +                                 "S:etcd_prefix", src->query,
 +                                 "S:config_path", src->configFile,
 +                                 "s:image", src->path,
 +                                 NULL) < 0)
 +        return NULL;
 +
 +    return ret;
 +}
 +
 +
 static virJSONValue *
 qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
 {
@@ -1218,6 +1250,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
                 return NULL;
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            driver = "vitastor";
 +            if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
 +                return NULL;
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             driver = "sheepdog";
             if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2231,6 +2269,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
             case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             case VIR_STORAGE_NET_PROTOCOL_RBD:
 +            case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
             case VIR_STORAGE_NET_PROTOCOL_VXHS:
             case VIR_STORAGE_NET_PROTOCOL_NFS:
             case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2608,6 +2647,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
                 return -1;
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            driver = "vitastor";
 +            if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
 +                return -1;
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             driver = "sheepdog";
             if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
 diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
 index ea51369..8258632 100644
 --- a/src/qemu/qemu_command.c
 +++ b/src/qemu/qemu_command.c
@@ -1074,6 +1074,43 @@ qemuBuildNetworkDriveStr(virStorageSource *src,
             ret = virBufferContentAndReset(&buf);
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            if (strchr(src->path, ':')) {
 +                virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
 +                               _("':' not allowed in Vitastor source volume name '%s'"),
 +                               src->path);
 +                return NULL;
 +            }
 +
 +            virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
 +
 +            if (src->nhosts > 0) {
 +                virBufferAddLit(&buf, ":etcd_host=");
 +                for (i = 0; i < src->nhosts; i++) {
 +                    if (i)
 +                        virBufferAddLit(&buf, ",");
 +
 +                    /* assume host containing : is ipv6 */
 +                    if (strchr(src->hosts[i].name, ':'))
 +                        virBufferEscape(&buf, '\\', ":", "[%s]",
 +                                        src->hosts[i].name);
 +                    else
 +                        virBufferAsprintf(&buf, "%s", src->hosts[i].name);
 +
 +                    if (src->hosts[i].port)
 +                        virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
 +                }
 +            }
 +
 +            if (src->configFile)
 +                virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
 +
 +            if (src->query)
 +                virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
 +
 +            ret = virBufferContentAndReset(&buf);
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_VXHS:
             virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                            _("VxHS protocol does not support URI syntax"));
 diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
 index fc60e15..5ab410d 100644
 --- a/src/qemu/qemu_domain.c
 +++ b/src/qemu/qemu_domain.c
@@ -4829,7 +4829,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
     if (src->query &&
         (actualType != VIR_STORAGE_TYPE_NETWORK ||
          (src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
 -          src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
 +          src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
 +          src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
                        _("query is supported only with HTTP(S) protocols"));
         return -1;
@@ -10027,6 +10028,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
         break;
     case VIR_STORAGE_NET_PROTOCOL_RBD:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
     case VIR_STORAGE_NET_PROTOCOL_ISCSI:
 diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
 index 4e74ddd..14e5f2e 100644
 --- a/src/qemu/qemu_snapshot.c
 +++ b/src/qemu/qemu_snapshot.c
@@ -402,6 +402,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -494,6 +495,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObj *vm,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
         case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -647,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
 diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
 index c2ff4b8..70d0689 100644
 --- a/src/storage/storage_driver.c
 +++ b/src/storage/storage_driver.c
@@ -1644,6 +1644,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
         case VIR_STORAGE_POOL_GLUSTER:
         case VIR_STORAGE_POOL_RBD:
 +        case VIR_STORAGE_POOL_VITASTOR:
         case VIR_STORAGE_POOL_SHEEPDOG:
         case VIR_STORAGE_POOL_ZFS:
         case VIR_STORAGE_POOL_LAST:
 diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
 index e48ae72..d7a9b72 100644
 --- a/src/storage_file/storage_source_backingstore.c
 +++ b/src/storage_file/storage_source_backingstore.c
@@ -284,6 +284,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
 }
 +static int
 +virStorageSourceParseVitastorColonString(const char *colonstr,
 +                                         virStorageSource *src)
 +{
 +    char *p, *e, *next;
 +    g_autofree char *options = NULL;
 +
 +    /* optionally skip the "vitastor:" prefix if provided */
 +    if (STRPREFIX(colonstr, "vitastor:"))
 +        colonstr += strlen("vitastor:");
 +
 +    options = g_strdup(colonstr);
 +
 +    p = options;
 +    while (*p) {
 +        /* find : delimiter or end of string */
 +        for (e = p; *e && *e != ':'; ++e) {
 +            if (*e == '\\') {
 +                e++;
 +                if (*e == '\0')
 +                    break;
 +            }
 +        }
 +        if (*e == '\0') {
 +            next = e;    /* last kv pair */
 +        } else {
 +            next = e + 1;
 +            *e = '\0';
 +        }
 +
 +        if (STRPREFIX(p, "image=")) {
 +            src->path = g_strdup(p + strlen("image="));
 +        } else if (STRPREFIX(p, "etcd_prefix=")) {
 +            src->query = g_strdup(p + strlen("etcd_prefix="));
 +        } else if (STRPREFIX(p, "config_path=")) {
 +            src->configFile = g_strdup(p + strlen("config_path="));
 +        } else if (STRPREFIX(p, "etcd_host=")) {
 +            char *h, *sep;
 +
 +            h = p + strlen("etcd_host=");
 +            while (h < e) {
 +                for (sep = h; sep < e; ++sep) {
 +                    if (*sep == '\\' && (sep[1] == ',' ||
 +                                         sep[1] == ';' ||
 +                                         sep[1] == ' ')) {
 +                        *sep = '\0';
 +                        sep += 2;
 +                        break;
 +                    }
 +                }
 +
 +                if (virStorageSourceRBDAddHost(src, h) < 0)
 +                    return -1;
 +
 +                h = sep;
 +            }
 +        }
 +
 +        p = next;
 +    }
 +
 +    if (!src->path) {
 +        return -1;
 +    }
 +
 +    return 0;
 +}
 +
 +
 static int
 virStorageSourceParseNBDColonString(const char *nbdstr,
                                     virStorageSource *src)
@@ -396,6 +465,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
             return -1;
         break;
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +        if (virStorageSourceParseVitastorColonString(path, src) < 0)
 +            return -1;
 +        break;
 +
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -984,6 +1058,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
     return 0;
 }
 +static int
 +virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
 +                                         virJSONValue *json,
 +                                         const char *jsonstr G_GNUC_UNUSED,
 +                                         int opaque G_GNUC_UNUSED)
 +{
 +    const char *filename;
 +    const char *image = virJSONValueObjectGetString(json, "image");
 +    const char *conf = virJSONValueObjectGetString(json, "config_path");
 +    const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
 +    virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
 +    size_t nservers;
 +    size_t i;
 +
 +    src->type = VIR_STORAGE_TYPE_NETWORK;
 +    src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
 +
 +    /* legacy syntax passed via 'filename' option */
 +    if ((filename = virJSONValueObjectGetString(json, "filename")))
 +        return virStorageSourceParseVitastorColonString(filename, src);
 +
 +    if (!image) {
 +        virReportError(VIR_ERR_INVALID_ARG, "%s",
 +                       _("missing image name in Vitastor backing volume "
 +                         "JSON specification"));
 +        return -1;
 +    }
 +
 +    src->path = g_strdup(image);
 +    src->configFile = g_strdup(conf);
 +    src->query = g_strdup(etcd_prefix);
 +
 +    if (servers) {
 +        nservers = virJSONValueArraySize(servers);
 +
 +        src->hosts = g_new0(virStorageNetHostDef, nservers);
 +        src->nhosts = nservers;
 +
 +        for (i = 0; i < nservers; i++) {
 +            if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
 +                                                                  virJSONValueArrayGet(servers, i)) < 0)
 +                return -1;
 +        }
 +    }
 +
 +    return 0;
 +}
 +
 static int
 virStorageSourceParseBackingJSONRaw(virStorageSource *src,
                                     virJSONValue *json,
@@ -1162,6 +1284,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
     {"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
     {"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
     {"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
 +    {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
     {"raw", true, virStorageSourceParseBackingJSONRaw, 0},
     {"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
     {"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
 diff --git a/src/test/test_driver.c b/src/test/test_driver.c
 index ef0ddab..2173dc3 100644
 --- a/src/test/test_driver.c
 +++ b/src/test/test_driver.c
@@ -7131,6 +7131,7 @@ testStorageVolumeTypeForPool(int pooltype)
     case VIR_STORAGE_POOL_ISCSI_DIRECT:
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_RBD:
 +    case VIR_STORAGE_POOL_VITASTOR:
         return VIR_STORAGE_VOL_NETWORK;
     case VIR_STORAGE_POOL_LOGICAL:
     case VIR_STORAGE_POOL_DISK:
 diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
 index eee75af..8bd0a57 100644
 --- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
 +++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
       </enum>
     </volOptions>
   </pool>
 +  <pool type='vitastor' supported='no'>
 +    <volOptions>
 +      <defaultFormat type='raw'/>
 +      <enum name='targetFormatType'>
 +      </enum>
 +    </volOptions>
 +  </pool>
 </storagepoolCapabilities>
 diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
 index 805950a..852df0d 100644
 --- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
 +++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
       </enum>
     </volOptions>
   </pool>
 +  <pool type='vitastor' supported='yes'>
 +    <volOptions>
 +      <defaultFormat type='raw'/>
 +      <enum name='targetFormatType'>
 +      </enum>
 +    </volOptions>
 +  </pool>
 </storagepoolCapabilities>
 diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
 index 449b745..7f95cc8 100644
 --- a/tests/storagepoolxml2argvtest.c
 +++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_ZFS:
     case VIR_STORAGE_POOL_VSTORAGE:
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_LAST:
     default:
         VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
 diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
 index 18f3839..c8e1436 100644
 --- a/tools/virsh-pool.c
 +++ b/tools/virsh-pool.c
@@ -1231,6 +1231,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
             case VIR_STORAGE_POOL_VSTORAGE:
                 flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
                 break;
 +            case VIR_STORAGE_POOL_VITASTOR:
 +                flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
 +                break;
             case VIR_STORAGE_POOL_LAST:
                 break;
             }
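The `vitastor:` drive string emitted by qemuBuildNetworkDriveStr is a list of `key=value` pairs separated by `:`, with literal colons inside a value escaped as `\:`. The following is a minimal Python sketch of that format, not code from libvirt: the function names are hypothetical, the parser returns values in their escaped form without interpreting them, and none of it has been checked against a patched libvirt.

```python
def build_vitastor_drive_str(image, hosts=(), config_path=None, etcd_prefix=None):
    """Build a "vitastor:image=...:etcd_host=..." string (hypothetical helper)."""
    if ":" in image:
        raise ValueError("':' not allowed in Vitastor source volume name %r" % image)
    out = "vitastor:image=" + image
    if hosts:
        parts = []
        for name, port in hosts:
            if ":" in name:
                # a host name containing ':' is assumed to be IPv6
                part = "[" + name.replace(":", "\\:") + "]"
            else:
                part = name
            if port:
                part += "\\:%d" % port
            parts.append(part)
        out += ":etcd_host=" + ",".join(parts)
    if config_path:
        out += ":config_path=" + config_path.replace(":", "\\:")
    if etcd_prefix:
        out += ":etcd_prefix=" + etcd_prefix.replace(":", "\\:")
    return out


def split_unescaped(s, delim=":"):
    """Split s on delim, treating backslash + any character as one literal unit."""
    fields, cur, i = [], [], 0
    while i < len(s):
        ch = s[i]
        if ch == "\\" and i + 1 < len(s):
            cur.append(s[i:i + 2])  # keep the escape sequence as-is
            i += 2
            continue
        if ch == delim:
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
        i += 1
    fields.append("".join(cur))
    return fields


def parse_vitastor_colon_string(s):
    """Return the key=value pairs of a colon string as a dict of raw values."""
    if s.startswith("vitastor:"):
        s = s[len("vitastor:"):]
    opts = {}
    for field in split_unescaped(s):
        key, sep, value = field.partition("=")
        if sep:
            opts[key] = value
    if "image" not in opts:
        raise ValueError("missing image name")
    return opts


s = build_vitastor_drive_str("debian9", [("192.168.7.2", 2379)],
                             config_path="/etc/vitastor/vitastor.conf",
                             etcd_prefix="/vitastor")
print(s)
print(parse_vitastor_colon_string(s))
```

Splitting on unescaped colons first and only then looking at keys keeps the escaping rule in one place, so a host entry such as `192.168.7.2\:2379` stays a single field.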
--- a/patches/libvirt-example.xml
+++ b/patches/libvirt-example.xml
@ -0,0 +1,32 @@
 <!-- Example libvirt VM configuration with Vitastor disk -->
 <domain type='kvm'>
  <name>debian9</name>
  <uuid>96f277fb-fd9c-49da-bf21-a5cfd54eb162</uuid>
  <memory unit="KiB">524288</memory>
  <currentMemory>524288</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64'>hvm</type>
    <boot dev='hd' />
  </os>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='network' device='disk'>
      <target dev='vda' bus='virtio' />
      <driver name='qemu' type='raw' />
      <!-- name is Vitastor image name -->
      <!-- config (optional) is the path to Vitastor's configuration file -->
      <!-- query (optional) is Vitastor's etcd_prefix -->
      <source protocol='vitastor' name='debian9' query='/vitastor' config='/etc/vitastor/vitastor.conf'>
        <!-- hosts = etcd addresses -->
        <host name='192.168.7.2' port='2379' />
      </source>
      <!-- required because Vitastor only supports 4k physical sectors -->
      <blockio physical_block_size="4096" logical_block_size="512" />
    </disk>
    <interface type='network'>
      <source network='default' />
    </interface>
    <graphics type='vnc' port='-1' />
  </devices>
 </domain>
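The `<source>` element of the example above can also be generated programmatically. Below is a minimal sketch using Python's standard `xml.etree.ElementTree` (Nova itself uses lxml); `vitastor_disk_source` is a hypothetical helper, and the optional `query` and `config` attributes are only set when a value is given.

```python
import xml.etree.ElementTree as ET


def vitastor_disk_source(name, hosts, query=None, config=None):
    """Build <source protocol='vitastor' ...> with one <host> per etcd address."""
    source = ET.Element("source", protocol="vitastor", name=name)
    if query is not None:
        source.set("query", query)    # Vitastor etcd_prefix
    if config is not None:
        source.set("config", config)  # path to the Vitastor configuration file
    for host, port in hosts:
        ET.SubElement(source, "host", name=host, port=str(port))
    return source


elem = vitastor_disk_source("debian9", [("192.168.7.2", 2379)],
                            query="/vitastor",
                            config="/etc/vitastor/vitastor.conf")
print(ET.tostring(elem, encoding="unicode"))
```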
--- a/patches/nova-20.diff
+++ b/patches/nova-20.diff
@ -0,0 +1,287 @@
 diff --git a/nova/virt/image/model.py b/nova/virt/image/model.py
 index 971f7e9c07..70ed70d5e2 100644
 --- a/nova/virt/image/model.py
 +++ b/nova/virt/image/model.py
@@ -129,3 +129,22 @@ class RBDImage(Image):
         self.user = user
         self.password = password
         self.servers = servers
 +
 +
 +class VitastorImage(Image):
 +    """Class for images in a remote Vitastor cluster"""
 +
 +    def __init__(self, name, etcd_address = None, etcd_prefix = None, config_path = None):
 +        """Create a new Vitastor image object
 +
 +        :param name: name of the image
 +        :param etcd_address: etcd URL(s) (optional)
 +        :param etcd_prefix: etcd prefix (optional)
 +        :param config_path: path to the configuration (optional)
 +        """
 +        super(VitastorImage, self).__init__(FORMAT_RAW)
 +
 +        self.name = name
 +        self.etcd_address = etcd_address
 +        self.etcd_prefix = etcd_prefix
 +        self.config_path = config_path
 diff --git a/nova/virt/images.py b/nova/virt/images.py
 index 5358f3766a..ebe3d6effb 100644
 --- a/nova/virt/images.py
 +++ b/nova/virt/images.py
@@ -41,7 +41,7 @@ IMAGE_API = glance.API()
 def qemu_img_info(path, format=None):
     """Return an object containing the parsed output from qemu-img info."""
 -    if not os.path.exists(path) and not path.startswith('rbd:'):
 +    if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
         raise exception.DiskNotFound(location=path)
     info = nova.privsep.qemu.unprivileged_qemu_img_info(path, format=format)
@@ -50,7 +50,7 @@ def qemu_img_info(path, format=None):
 def privileged_qemu_img_info(path, format=None, output_format='json'):
     """Return an object containing the parsed output from qemu-img info."""
 -    if not os.path.exists(path) and not path.startswith('rbd:'):
 +    if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
         raise exception.DiskNotFound(location=path)
     info = nova.privsep.qemu.privileged_qemu_img_info(path, format=format)
 diff --git a/nova/virt/libvirt/config.py b/nova/virt/libvirt/config.py
 index f9475776b3..51573fe41d 100644
 --- a/nova/virt/libvirt/config.py
 +++ b/nova/virt/libvirt/config.py
@@ -1060,6 +1060,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
         self.driver_iommu = False
         self.source_path = None
         self.source_protocol = None
 +        self.source_query = None
 +        self.source_config = None
         self.source_name = None
         self.source_hosts = []
         self.source_ports = []
@@ -1186,7 +1188,11 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
         elif self.source_type == "mount":
             dev.append(etree.Element("source", dir=self.source_path))
         elif self.source_type == "network" and self.source_protocol:
             source = etree.Element("source", protocol=self.source_protocol)
             if self.source_name is not None:
                 source.set('name', self.source_name)
 +            if self.source_query is not None:
 +                source.set('query', self.source_query)
 +            if self.source_config is not None:
 +                source.set('config', self.source_config)
             hosts_info = zip(self.source_hosts, self.source_ports)
 diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
 index 391231c527..34dc60dcdd 100644
 --- a/nova/virt/libvirt/driver.py
 +++ b/nova/virt/libvirt/driver.py
@@ -179,6 +179,7 @@ VOLUME_DRIVERS = {
     'local': 'nova.virt.libvirt.volume.volume.LibvirtVolumeDriver',
     'fake': 'nova.virt.libvirt.volume.volume.LibvirtFakeVolumeDriver',
     'rbd': 'nova.virt.libvirt.volume.net.LibvirtNetVolumeDriver',
 +    'vitastor': 'nova.virt.libvirt.volume.vitastor.LibvirtVitastorVolumeDriver',
     'nfs': 'nova.virt.libvirt.volume.nfs.LibvirtNFSVolumeDriver',
     'smbfs': 'nova.virt.libvirt.volume.smbfs.LibvirtSMBFSVolumeDriver',
     'fibre_channel': 'nova.virt.libvirt.volume.fibrechannel.LibvirtFibreChannelVolumeDriver',  # noqa:E501
@@ -385,10 +386,10 @@ class LibvirtDriver(driver.ComputeDriver):
         # This prevents the risk of one test setting a capability
         # which bleeds over into other tests.
 -        # LVM and RBD require raw images. If we are not configured to
 +        # LVM, RBD and Vitastor require raw images. If we are not configured to
         # force convert images into raw format, then we _require_ raw
         # images only.
 -        raw_only = ('rbd', 'lvm')
 +        raw_only = ('rbd', 'lvm', 'vitastor')
         requires_raw_image = (CONF.libvirt.images_type in raw_only and
                               not CONF.force_raw_images)
         requires_ploop_image = CONF.libvirt.virt_type == 'parallels'
@@ -775,12 +776,12 @@ class LibvirtDriver(driver.ComputeDriver):
         # Some imagebackends are only able to import raw disk images,
         # and will fail if given any other format. See the bug
         # https://bugs.launchpad.net/nova/+bug/1816686 for more details.
 -        if CONF.libvirt.images_type in ('rbd',):
 +        if CONF.libvirt.images_type in ('rbd', 'vitastor'):
             if not CONF.force_raw_images:
                 msg = _("'[DEFAULT]/force_raw_images = False' is not "
 -                        "allowed with '[libvirt]/images_type = rbd'. "
 +                        "allowed with '[libvirt]/images_type = rbd' or 'vitastor'. "
                         "Please check the two configs and if you really "
 -                        "do want to use rbd as images_type, set "
 +                        "do want to use rbd or vitastor as images_type, set "
                         "force_raw_images to True.")
                 raise exception.InvalidConfiguration(msg)
@@ -2603,6 +2604,16 @@ class LibvirtDriver(driver.ComputeDriver):
                     if connection_info['data'].get('auth_enabled'):
                         username = connection_info['data']['auth_username']
                         path = f"rbd:{volume_name}:id={username}"
 +                elif connection_info['driver_volume_type'] == 'vitastor':
 +                    volume_name = connection_info['data']['name']
 +                    path = 'vitastor:image='+volume_name.replace(':', '\\:')
 +                    for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
 +                        if k in connection_info['data']:
 +                            kk = k
 +                            if kk == 'etcd_address':
 +                                # FIXME use etcd_address in qemu driver
 +                                kk = 'etcd_host'
 +                            path += ":"+kk+"="+connection_info['data'][k].replace(':', '\\:')
                 else:
                     path = 'unknown'
                     raise exception.DiskNotFound(location='unknown')
@@ -2827,8 +2838,8 @@ class LibvirtDriver(driver.ComputeDriver):
         image_format = CONF.libvirt.snapshot_image_format or source_type
 -        # NOTE(bfilippov): save lvm and rbd as raw
 -        if image_format == 'lvm' or image_format == 'rbd':
 +        # NOTE(bfilippov): save lvm, rbd and vitastor as raw
 +        if image_format == 'lvm' or image_format == 'rbd' or image_format == 'vitastor':
             image_format = 'raw'
         metadata = self._create_snapshot_metadata(instance.image_meta,
@@ -2899,7 +2910,7 @@ class LibvirtDriver(driver.ComputeDriver):
                               expected_state=task_states.IMAGE_UPLOADING)
             # TODO(nic): possibly abstract this out to the root_disk
 -            if source_type == 'rbd' and live_snapshot:
 +            if (source_type == 'rbd' or source_type == 'vitastor') and live_snapshot:
                 # Standard snapshot uses qemu-img convert from RBD which is
                 # not safe to run with live_snapshot.
                 live_snapshot = False
@@ -4099,7 +4110,7 @@ class LibvirtDriver(driver.ComputeDriver):
         # cleanup rescue volume
         lvm.remove_volumes([lvmdisk for lvmdisk in self._lvm_disks(instance)
                                 if lvmdisk.endswith('.rescue')])
 -        if CONF.libvirt.images_type == 'rbd':
 +        if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
             filter_fn = lambda disk: (disk.startswith(instance.uuid) and
                                       disk.endswith('.rescue'))
             rbd_utils.RBDDriver().cleanup_volumes(filter_fn)
@@ -4356,6 +4367,8 @@ class LibvirtDriver(driver.ComputeDriver):
         # TODO(mikal): there is a bug here if images_type has
         # changed since creation of the instance, but I am pretty
         # sure that this bug already exists.
 +        if CONF.libvirt.images_type == 'vitastor':
 +            return 'vitastor'
         return 'rbd' if CONF.libvirt.images_type == 'rbd' else 'raw'
     @staticmethod
@@ -4764,10 +4777,10 @@ class LibvirtDriver(driver.ComputeDriver):
                 finally:
                     # NOTE(mikal): if the config drive was imported into RBD,
                     # then we no longer need the local copy
 -                    if CONF.libvirt.images_type == 'rbd':
 +                    if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
                         LOG.info('Deleting local config drive %(path)s '
 -                                 'because it was imported into RBD.',
 -                                 {'path': config_disk_local_path},
 +                                 'because it was imported into %(type)s.',
 +                                 {'path': config_disk_local_path, 'type': CONF.libvirt.images_type},
                                  instance=instance)
                         os.unlink(config_disk_local_path)
 diff --git a/nova/virt/libvirt/utils.py b/nova/virt/libvirt/utils.py
 index da2a6e8b8a..52c02e72f1 100644
 --- a/nova/virt/libvirt/utils.py
 +++ b/nova/virt/libvirt/utils.py
@@ -340,6 +340,10 @@ def find_disk(guest: libvirt_guest.Guest) -> ty.Tuple[str, ty.Optional[str]]:
             disk_path = disk.source_name
             if disk_path:
                 disk_path = 'rbd:' + disk_path
 +        elif not disk_path and disk.source_protocol == 'vitastor':
 +            disk_path = disk.source_name
 +            if disk_path:
 +                disk_path = 'vitastor:' + disk_path
     if not disk_path:
         raise RuntimeError(_("Can't retrieve root device path "
@@ -354,6 +358,8 @@ def get_disk_type_from_path(path: str) -> ty.Optional[str]:
         return 'lvm'
     elif path.startswith('rbd:'):
         return 'rbd'
 +    elif path.startswith('vitastor:'):
 +        return 'vitastor'
     elif (os.path.isdir(path) and
           os.path.exists(os.path.join(path, "DiskDescriptor.xml"))):
         return 'ploop'
 diff --git a/nova/virt/libvirt/volume/vitastor.py b/nova/virt/libvirt/volume/vitastor.py
 new file mode 100644
 index 0000000000..0256df62c1
 --- /dev/null
 +++ b/nova/virt/libvirt/volume/vitastor.py
@@ -0,0 +1,75 @@
 +# Copyright (c) 2021+, Vitaliy Filippov <vitalif@yourcmc.ru>
 +#
 +#    Licensed under the Apache License, Version 2.0 (the "License"); you may
 +#    not use this file except in compliance with the License. You may obtain
 +#    a copy of the License at
 +#
 +#         http://www.apache.org/licenses/LICENSE-2.0
 +#
 +#    Unless required by applicable law or agreed to in writing, software
 +#    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 +#    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 +#    License for the specific language governing permissions and limitations
 +#    under the License.
 +
 +from os_brick import exception as os_brick_exception
 +from os_brick import initiator
 +from os_brick.initiator import connector
 +from oslo_log import log as logging
 +
 +import nova.conf
 +from nova import utils
 +from nova.virt.libvirt.volume import volume as libvirt_volume
 +
 +
 +CONF = nova.conf.CONF
 +LOG = logging.getLogger(__name__)
 +
 +
 +class LibvirtVitastorVolumeDriver(libvirt_volume.LibvirtBaseVolumeDriver):
 +    """Driver to attach Vitastor volumes to libvirt."""
 +    def __init__(self, host):
 +        super(LibvirtVitastorVolumeDriver, self).__init__(host, is_block_dev=False)
 +
 +    def connect_volume(self, connection_info, instance):
 +        pass
 +
 +    def disconnect_volume(self, connection_info, instance):
 +        pass
 +
 +    def get_config(self, connection_info, disk_info):
 +        """Returns xml for libvirt."""
 +        conf = super(LibvirtVitastorVolumeDriver, self).get_config(connection_info, disk_info)
 +        conf.source_type = 'network'
 +        conf.source_protocol = 'vitastor'
 +        conf.source_name = connection_info['data'].get('name')
 +        conf.source_query = connection_info['data'].get('etcd_prefix') or None
 +        conf.source_config = connection_info['data'].get('config_path') or None
 +        conf.source_hosts = []
 +        conf.source_ports = []
 +        addresses = connection_info['data'].get('etcd_address', '')
 +        if addresses:
 +            if not isinstance(addresses, list):
 +                addresses = addresses.split(',')
 +            for addr in addresses:
 +                if addr.startswith('https://'):
 +                    raise NotImplementedError('Vitastor block driver does not support SSL for etcd communication yet')
 +                if addr.startswith('http://'):
 +                    addr = addr[7:]
 +                addr = addr.rstrip('/')
 +                if addr.endswith('/v3'):
 +                    addr = addr[0:-3]
 +                p = addr.find('/')
 +                if p > 0:
 +                    raise NotImplementedError('libvirt does not support custom URL paths for Vitastor etcd yet. Use /etc/vitastor/vitastor.conf')
 +                p = addr.find(':')
 +                port = '2379'
 +                if p > 0:
 +                    port = addr[p+1:]
 +                    addr = addr[0:p]
 +                conf.source_hosts.append(addr)
 +                conf.source_ports.append(port)
 +        return conf
 +
 +    def extend_volume(self, connection_info, instance, requested_size):
 +        raise NotImplementedError
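The address handling in `get_config` above turns each etcd URL into a host/port pair for libvirt. The standalone sketch below loosely follows the same steps so they can be tried in isolation; `split_etcd_addresses` is a hypothetical name, not part of Nova, and like the driver code it does not handle IPv6 literals.

```python
def split_etcd_addresses(addresses, default_port="2379"):
    """Return a list of (host, port) string pairs from etcd address(es)."""
    if not addresses:
        return []
    if not isinstance(addresses, list):
        addresses = addresses.split(",")
    result = []
    for addr in addresses:
        if addr.startswith("https://"):
            raise NotImplementedError("SSL for etcd is not supported")
        if addr.startswith("http://"):
            addr = addr[len("http://"):]
        addr = addr.rstrip("/")
        if addr.endswith("/v3"):
            addr = addr[:-3]  # drop the etcd API version suffix
        if "/" in addr:
            raise NotImplementedError("custom URL paths are not supported")
        host, sep, port = addr.partition(":")
        result.append((host, port if sep else default_port))
    return result


print(split_etcd_addresses("http://192.168.7.2:2379/v3,10.0.0.5"))
```

Returning pairs instead of appending to two parallel lists keeps each host next to its port; the caller can still split them into `source_hosts` and `source_ports` if needed.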
--- a/patches/qemu-3.1-vitastor.patch
+++ b/patches/qemu-3.1-vitastor.patch
@ -0,0 +1,88 @@
 Index: qemu-3.1+dfsg/qapi/block-core.json
 ===================================================================
 --- qemu-3.1+dfsg.orig/qapi/block-core.json
 +++ qemu-3.1+dfsg/qapi/block-core.json
@@ -2617,7 +2617,7 @@
 ##
 { 'enum': 'BlockdevDriver',
   'data': [ 'blkdebug', 'blklogwrites', 'blkverify', 'bochs', 'cloop',
 -            'copy-on-read', 'dmg', 'file', 'ftp', 'ftps', 'gluster',
 +            'copy-on-read', 'dmg', 'file', 'ftp', 'ftps', 'gluster', 'vitastor',
             'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks',
             'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
             'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog',
@@ -3367,6 +3367,28 @@
             '*tag': 'str' } }
 ##
 +# @BlockdevOptionsVitastor:
 +#
 +# Driver specific block device options for vitastor
 +#
 +# @image:       Image name
 +# @inode:       Inode number
 +# @pool:        Pool ID
 +# @size:        Desired image size in bytes
 +# @config_path: Path to Vitastor configuration
 +# @etcd_host:   etcd connection address(es)
 +# @etcd_prefix: etcd key/value prefix
 +##
 +{ 'struct': 'BlockdevOptionsVitastor',
 +  'data': { '*inode': 'uint64',
 +            '*pool': 'uint64',
 +            '*size': 'uint64',
 +            '*image': 'str',
 +            '*config_path': 'str',
 +            '*etcd_host': 'str',
 +            '*etcd_prefix': 'str' } }
 +
 +##
 # @ReplicationMode:
 #
 # An enumeration of replication modes.
@@ -3713,6 +3731,7 @@
       'rbd':        'BlockdevOptionsRbd',
       'replication':'BlockdevOptionsReplication',
       'sheepdog':   'BlockdevOptionsSheepdog',
 +      'vitastor':   'BlockdevOptionsVitastor',
       'ssh':        'BlockdevOptionsSsh',
       'throttle':   'BlockdevOptionsThrottle',
       'vdi':        'BlockdevOptionsGenericFormat',
@@ -4158,6 +4177,17 @@
             '*block-state-zero':    'bool' } }
 ##
 +# @BlockdevCreateOptionsVitastor:
 +#
 +# Driver specific image creation options for Vitastor.
 +#
 +# @size: Size of the virtual disk in bytes
 +##
 +{ 'struct': 'BlockdevCreateOptionsVitastor',
 +  'data': { 'location':         'BlockdevOptionsVitastor',
 +            'size':             'size' } }
 +
 +##
 # @BlockdevVpcSubformat:
 #
 # @dynamic: Growing image file
@@ -4212,6 +4242,7 @@
       'qed':            'BlockdevCreateOptionsQed',
       'rbd':            'BlockdevCreateOptionsRbd',
       'sheepdog':       'BlockdevCreateOptionsSheepdog',
 +      'vitastor':       'BlockdevCreateOptionsVitastor',
       'ssh':            'BlockdevCreateOptionsSsh',
       'vdi':            'BlockdevCreateOptionsVdi',
       'vhdx':           'BlockdevCreateOptionsVhdx',
 Index: qemu-3.1+dfsg/scripts/modules/module_block.py
 ===================================================================
 --- qemu-3.1+dfsg.orig/scripts/modules/module_block.py
 +++ qemu-3.1+dfsg/scripts/modules/module_block.py
@@ -88,6 +88,7 @@ def print_bottom(fheader):
 output_file = sys.argv[1]
 with open(output_file, 'w') as fheader:
     print_top(fheader)
 +    add_module(fheader, "vitastor", "vitastor", "vitastor")
     for filename in sys.argv[2:]:
         if os.path.isfile(filename):
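With `vitastor` added to `BlockdevDriver` and `BlockdevOptionsVitastor` defined as above, a QMP `blockdev-add` request for a Vitastor image would presumably carry those fields next to the usual `driver` and `node-name` members. A minimal sketch of building such a request in Python follows; the helper name is hypothetical and the resulting JSON has not been tested against a patched QEMU.

```python
import json


def vitastor_blockdev_add(node_name, image, config_path=None,
                          etcd_host=None, etcd_prefix=None):
    """Build a QMP blockdev-add command dict for a Vitastor image (sketch)."""
    args = {"driver": "vitastor", "node-name": node_name, "image": image}
    optional = (("config_path", config_path),
                ("etcd_host", etcd_host),
                ("etcd_prefix", etcd_prefix))
    for key, value in optional:
        if value is not None:
            args[key] = value  # all options are optional in the schema
    return {"execute": "blockdev-add", "arguments": args}


cmd = vitastor_blockdev_add("drive0", "debian9", etcd_host="192.168.7.2:2379")
print(json.dumps(cmd, indent=2))
```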
--- a/patches/qemu-4.2-vitastor.patch
+++ b/patches/qemu-4.2-vitastor.patch
@ -0,0 +1,88 @@
 Index: qemu/qapi/block-core.json
 ===================================================================
 --- qemu.orig/qapi/block-core.json	2020-11-07 22:57:38.932613674 +0000
 +++ qemu/qapi/block-core.json	2020-11-07 22:59:49.890722862 +0000
@@ -2907,7 +2907,7 @@
             'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
             'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
 -            'sheepdog',
 +            'sheepdog', 'vitastor',
             'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
 ##
@@ -3725,6 +3725,28 @@
             '*tag': 'str' } }
 ##
 +# @BlockdevOptionsVitastor:
 +#
 +# Driver specific block device options for Vitastor
 +#
 +# @image:       Image name
 +# @inode:       Inode number
 +# @pool:        Pool ID
 +# @size:        Desired image size in bytes
 +# @config_path: Path to Vitastor configuration
 +# @etcd_host:   etcd connection address(es)
 +# @etcd_prefix: etcd key/value prefix
 +##
 +{ 'struct': 'BlockdevOptionsVitastor',
 +  'data': { '*inode': 'uint64',
 +            '*pool': 'uint64',
 +            '*size': 'uint64',
 +            '*image': 'str',
 +            '*config_path': 'str',
 +            '*etcd_host': 'str',
 +            '*etcd_prefix': 'str' } }
 +
 +##
 # @ReplicationMode:
 #
 # An enumeration of replication modes.
@@ -4084,6 +4102,7 @@
       'replication': { 'type': 'BlockdevOptionsReplication',
                        'if': 'defined(CONFIG_REPLICATION)' },
       'sheepdog':   'BlockdevOptionsSheepdog',
 +      'vitastor':   'BlockdevOptionsVitastor',
       'ssh':        'BlockdevOptionsSsh',
       'throttle':   'BlockdevOptionsThrottle',
       'vdi':        'BlockdevOptionsGenericFormat',
@@ -4461,6 +4480,17 @@
             '*cluster-size' :   'size' } }
 ##
 +# @BlockdevCreateOptionsVitastor:
 +#
 +# Driver specific image creation options for Vitastor.
 +#
 +# @size: Size of the virtual disk in bytes
 +##
 +{ 'struct': 'BlockdevCreateOptionsVitastor',
 +  'data': { 'location':         'BlockdevOptionsVitastor',
 +            'size':             'size' } }
 +
 +##
 # @BlockdevVmdkSubformat:
 #
 # Subformat options for VMDK images
@@ -4722,6 +4752,7 @@
       'qed':            'BlockdevCreateOptionsQed',
       'rbd':            'BlockdevCreateOptionsRbd',
       'sheepdog':       'BlockdevCreateOptionsSheepdog',
 +      'vitastor':       'BlockdevCreateOptionsVitastor',
       'ssh':            'BlockdevCreateOptionsSsh',
       'vdi':            'BlockdevCreateOptionsVdi',
       'vhdx':           'BlockdevCreateOptionsVhdx',
 Index: qemu/scripts/modules/module_block.py
 ===================================================================
 --- qemu.orig/scripts/modules/module_block.py	2020-11-07 22:57:38.936613739 +0000
 +++ qemu/scripts/modules/module_block.py	2020-11-07 22:59:49.890722862 +0000
@@ -86,6 +86,7 @@ def print_bottom(fheader):
 output_file = sys.argv[1]
 with open(output_file, 'w') as fheader:
     print_top(fheader)
 +    add_module(fheader, "vitastor", "vitastor", "vitastor")
     for filename in sys.argv[2:]:
         if os.path.isfile(filename):
--- a/patches/qemu-5.0-vitastor.patch
+++ b/patches/qemu-5.0-vitastor.patch
@ -0,0 +1,88 @@
 Index: qemu/qapi/block-core.json
 ===================================================================
 --- qemu.orig/qapi/block-core.json
 +++ qemu/qapi/block-core.json
@@ -2798,7 +2798,7 @@
             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
             'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
 -            'sheepdog',
 +            'sheepdog', 'vitastor',
             'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
 ##
@@ -3635,6 +3635,28 @@
             '*tag': 'str' } }
 ##
 +# @BlockdevOptionsVitastor:
 +#
 +# Driver specific block device options for Vitastor
 +#
 +# @image:       Image name
 +# @inode:       Inode number
 +# @pool:        Pool ID
 +# @size:        Desired image size in bytes
 +# @config_path: Path to Vitastor configuration
 +# @etcd_host:   etcd connection address(es)
 +# @etcd_prefix: etcd key/value prefix
 +##
 +{ 'struct': 'BlockdevOptionsVitastor',
 +  'data': { '*inode': 'uint64',
 +            '*pool': 'uint64',
 +            '*size': 'uint64',
 +            '*image': 'str',
 +            '*config_path': 'str',
 +            '*etcd_host': 'str',
 +            '*etcd_prefix': 'str' } }
 +
 +##
 # @ReplicationMode:
 #
 # An enumeration of replication modes.
@@ -3995,6 +4013,7 @@
       'replication': { 'type': 'BlockdevOptionsReplication',
                        'if': 'defined(CONFIG_REPLICATION)' },
       'sheepdog':   'BlockdevOptionsSheepdog',
 +      'vitastor':   'BlockdevOptionsVitastor',
       'ssh':        'BlockdevOptionsSsh',
       'throttle':   'BlockdevOptionsThrottle',
       'vdi':        'BlockdevOptionsGenericFormat',
@@ -4365,6 +4384,17 @@
             '*cluster-size' :   'size' } }
 ##
 +# @BlockdevCreateOptionsVitastor:
 +#
 +# Driver specific image creation options for Vitastor.
 +#
 +# @size: Size of the virtual disk in bytes
 +##
 +{ 'struct': 'BlockdevCreateOptionsVitastor',
 +  'data': { 'location':         'BlockdevOptionsVitastor',
 +            'size':             'size' } }
 +
 +##
 # @BlockdevVmdkSubformat:
 #
 # Subformat options for VMDK images
@@ -4626,6 +4656,7 @@
       'qed':            'BlockdevCreateOptionsQed',
       'rbd':            'BlockdevCreateOptionsRbd',
       'sheepdog':       'BlockdevCreateOptionsSheepdog',
 +      'vitastor':       'BlockdevCreateOptionsVitastor',
       'ssh':            'BlockdevCreateOptionsSsh',
       'vdi':            'BlockdevCreateOptionsVdi',
       'vhdx':           'BlockdevCreateOptionsVhdx',
 Index: qemu/scripts/modules/module_block.py
 ===================================================================
 --- qemu.orig/scripts/modules/module_block.py
 +++ qemu/scripts/modules/module_block.py
@@ -85,6 +85,7 @@ def print_bottom(fheader):
 output_file = sys.argv[1]
 with open(output_file, 'w') as fheader:
     print_top(fheader)
 +    add_module(fheader, "vitastor", "vitastor", "vitastor")
     for filename in sys.argv[2:]:
         if os.path.isfile(filename):
--- a/patches/qemu-5.1-vitastor.patch
+++ b/patches/qemu-5.1-vitastor.patch
@ -0,0 +1,88 @@
 Index: qemu-5.1+dfsg/qapi/block-core.json
 ===================================================================
 --- qemu-5.1+dfsg.orig/qapi/block-core.json
 +++ qemu-5.1+dfsg/qapi/block-core.json
@@ -2807,7 +2807,7 @@
             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
             'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
 -            'sheepdog',
 +            'sheepdog', 'vitastor',
             'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 ##
@@ -3644,6 +3644,28 @@
             '*tag': 'str' } }
 ##
 +# @BlockdevOptionsVitastor:
 +#
 +# Driver specific block device options for Vitastor
 +#
 +# @image:       Image name
 +# @inode:       Inode number
 +# @pool:        Pool ID
 +# @size:        Desired image size in bytes
 +# @config_path: Path to Vitastor configuration
 +# @etcd_host:   etcd connection address(es)
 +# @etcd_prefix: etcd key/value prefix
 +##
 +{ 'struct': 'BlockdevOptionsVitastor',
 +  'data': { '*inode': 'uint64',
 +            '*pool': 'uint64',
 +            '*size': 'uint64',
 +            '*image': 'str',
 +            '*config_path': 'str',
 +            '*etcd_host': 'str',
 +            '*etcd_prefix': 'str' } }
 +
 +##
 # @ReplicationMode:
 #
 # An enumeration of replication modes.
@@ -3988,6 +4006,7 @@
       'replication': { 'type': 'BlockdevOptionsReplication',
                        'if': 'defined(CONFIG_REPLICATION)' },
       'sheepdog':   'BlockdevOptionsSheepdog',
 +      'vitastor':   'BlockdevOptionsVitastor',
       'ssh':        'BlockdevOptionsSsh',
       'throttle':   'BlockdevOptionsThrottle',
       'vdi':        'BlockdevOptionsGenericFormat',
@@ -4376,6 +4395,17 @@
             '*cluster-size' :   'size' } }
 ##
 +# @BlockdevCreateOptionsVitastor:
 +#
 +# Driver specific image creation options for Vitastor.
 +#
 +# @size: Size of the virtual disk in bytes
 +##
 +{ 'struct': 'BlockdevCreateOptionsVitastor',
 +  'data': { 'location':         'BlockdevOptionsVitastor',
 +            'size':             'size' } }
 +
 +##
 # @BlockdevVmdkSubformat:
 #
 # Subformat options for VMDK images
@@ -4637,6 +4667,7 @@
       'qed':            'BlockdevCreateOptionsQed',
       'rbd':            'BlockdevCreateOptionsRbd',
       'sheepdog':       'BlockdevCreateOptionsSheepdog',
 +      'vitastor':       'BlockdevCreateOptionsVitastor',
       'ssh':            'BlockdevCreateOptionsSsh',
       'vdi':            'BlockdevCreateOptionsVdi',
       'vhdx':           'BlockdevCreateOptionsVhdx',
 Index: qemu-5.1+dfsg/scripts/modules/module_block.py
 ===================================================================
 --- qemu-5.1+dfsg.orig/scripts/modules/module_block.py
 +++ qemu-5.1+dfsg/scripts/modules/module_block.py
@@ -86,6 +86,7 @@ if __name__ == '__main__':
     output_file = sys.argv[1]
     with open(output_file, 'w') as fheader:
         print_top(fheader)
 +        add_module(fheader, "vitastor", "vitastor", "vitastor")
         for filename in sys.argv[2:]:
             if os.path.isfile(filename):
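The `BlockdevOptionsVitastor` struct added by the QEMU patches above is what a QMP client would fill in when attaching a Vitastor image. As a rough sketch, assuming the patched QEMU accepts these fields directly in `blockdev-add`, the Python below builds such a command. The helper name `vitastor_blockdev_add`, the node name and all sample values are hypothetical; only the field names are taken from the schema.

```python
import json

# Optional fields declared in BlockdevOptionsVitastor (copied from the schema).
VITASTOR_FIELDS = ("image", "inode", "pool", "size",
                   "config_path", "etcd_host", "etcd_prefix")


def vitastor_blockdev_add(node_name, **options):
    """Build a QMP blockdev-add command for the 'vitastor' driver.

    Hypothetical helper: it only checks field names against the schema
    and does not talk to QEMU.
    """
    unknown = sorted(set(options) - set(VITASTOR_FIELDS))
    if unknown:
        raise ValueError("not in BlockdevOptionsVitastor: " + ", ".join(unknown))
    arguments = {"driver": "vitastor", "node-name": node_name}
    arguments.update(options)
    return {"execute": "blockdev-add", "arguments": arguments}


# Sample values are placeholders, not defaults taken from Vitastor.
command = vitastor_blockdev_add(
    "disk0",
    image="testimg",
    etcd_host="10.0.0.1:2379",
    etcd_prefix="/vitastor",
)
print(json.dumps(command, indent=2))
```

Every field in the struct is optional at the schema level, so the sketch does not enforce any particular combination (for example `image` versus `inode`/`pool`/`size`); that validation, if any, is left to the driver.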
--- a/rpm/build-tarball.sh
+++ b/rpm/build-tarball.sh
@ -0,0 +1,51 @@
 #!/bin/bash
 # Vitastor depends on QEMU and FIO headers, but QEMU and FIO don't have -devel packages
 # So we have to copy their headers into the source tarball
 set -e
 VITASTOR=$(dirname "$0")
 VITASTOR=$(realpath "$VITASTOR/..")
 if [ -d /opt/rh/gcc-toolset-9 ]; then
    # CentOS 8
    EL=8
    . /opt/rh/gcc-toolset-9/enable
 else
    # CentOS 7
    EL=7
    . /opt/rh/devtoolset-9/enable
 fi
 cd ~/rpmbuild/SPECS
 rpmbuild -bp fio.spec
 perl -i -pe 's/^make V=1/exit 0; make V=1/' qemu*.spec
 rpmbuild -bc qemu*.spec
 perl -i -pe 's/^exit 0; make V=1/make V=1/' qemu*.spec
 cd ~/rpmbuild/BUILD/qemu*/
 rm -rf $VITASTOR/qemu $VITASTOR/fio
 mkdir -p $VITASTOR/qemu/b/qemu
 make -j8 config-host.h
 cp config-host.h $VITASTOR/qemu/b/qemu
 cp -r include $VITASTOR/qemu
 if [ -f qapi-schema.json ]; then
    # QEMU 2.0
    make qapi-types.h
    cp qapi-types.h $VITASTOR/qemu/b/qemu
 else
    # QEMU 3.0+
    make qapi
    cp -r qapi $VITASTOR/qemu/b/qemu
 fi
 cd $VITASTOR
 sh copy-qemu-includes.sh
 rm -rf qemu
 mv qemu-copy qemu
 ln -s ~/rpmbuild/BUILD/fio*/ fio
 sh copy-fio-includes.sh
 rm fio
 mv fio-copy fio
 FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
 perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
 tar --transform 's#^#vitastor-0.6.5/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.5$(rpm --eval '%dist').tar.gz *
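The two Perl one-liners near the end of `build-tarball.sh` read `rpm -qi` output, assemble an `epoch:version-release` string, and pin the `Requires:` lines of the spec to it. The following Python is a minimal sketch of the same logic for readers who find the one-liners hard to follow. The function names `package_evr` and `pin_requires` and the sample text are illustrative assumptions, not part of the repository.

```python
import re


def package_evr(rpm_info):
    """Return '[epoch:]version-release' from `rpm -qi`-style text."""
    parts = []
    for line in rpm_info.splitlines():
        m = re.match(r"Epoch[\s:]+(\S+)", line)
        if m:
            parts.append(m.group(1) + ":")
        m = re.match(r"Version[\s:]+(\S+)", line)
        if m:
            parts.append(m.group(1))
        m = re.match(r"Release[\s:]+(\S+)", line)
        if m:
            parts.append("-" + m.group(1))
    return "".join(parts)


def pin_requires(spec_text, package, evr):
    """Rewrite 'Requires: <package>...' to 'Requires: <package> = <evr>'.

    Like the Perl substitution, the pattern is not anchored to the start
    of the line.
    """
    pattern = re.compile(r"(Requires:\s*" + re.escape(package) + r")[^\n]*")
    return pattern.sub(lambda m: m.group(1) + " = " + evr, spec_text)


# Made-up sample input in the layout printed by `rpm -qi`.
sample_info = """Name        : qemu-kvm
Epoch       : 15
Version     : 4.2.0
Release     : 29.vitastor.el8.6
"""
evr = package_evr(sample_info)
spec = "Requires: qemu-kvm = 0\nRequires: liburing\n"
print(pin_requires(spec, "qemu-kvm", evr))
```

Pinning to the exact build matters here because the tarball ships headers copied from that specific QEMU and fio build.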
--- a/rpm/qemu-el8.Dockerfile
+++ b/rpm/qemu-el8.Dockerfile
@ -0,0 +1,31 @@
 # Build packages for CentOS 8 inside a container
 # cd ..; podman build -t qemu-el8 -v `pwd`/packages:/root/packages -f rpm/qemu-el8.Dockerfile .
 FROM centos:8
 WORKDIR /root
 RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
 RUN dnf -y install centos-release-advanced-virtualization epel-release dnf-plugins-core rpm-build
 RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='centos-advanced-virtualization-source' --source qemu-kvm
 RUN rpm --nomd5 -i qemu*.src.rpm
 RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec
 ADD patches/qemu-*-vitastor.patch /root/vitastor/patches/
 RUN set -e; \
    mkdir -p /root/packages/qemu-el8; \
    rm -rf /root/packages/qemu-el8/*; \
    rpm --nomd5 -i /root/qemu*.src.rpm; \
    cd ~/rpmbuild/SPECS; \
    PN=$(grep ^Patch qemu-kvm.spec | tail -n1 | perl -pe 's/Patch(\d+).*/$1/'); \
    csplit qemu-kvm.spec "/^Patch$PN/"; \
    cat xx00 > qemu-kvm.spec; \
    head -n 1 xx01 >> qemu-kvm.spec; \
    echo "Patch$((PN+1)): qemu-4.2-vitastor.patch" >> qemu-kvm.spec; \
    tail -n +2 xx01 >> qemu-kvm.spec; \
    perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
    cp /root/vitastor/patches/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
    rpmbuild --nocheck -ba qemu-kvm.spec; \
    cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
    cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/
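The `RUN` step above edits `qemu-kvm.spec` in place: it finds the last `PatchN` entry, appends the Vitastor patch as `PatchN+1` right after it, and tags the release with `.vitastor`. The sketch below restates that transformation in Python under stated assumptions. The function name `add_vitastor_patch` and the sample spec are hypothetical, and the patch line is inserted by list index rather than with `csplit`, so it is an equivalent for the common case, not a copy of the shell logic.

```python
import re


def add_vitastor_patch(spec_text, patch_file="qemu-4.2-vitastor.patch"):
    """Append a PatchN+1 entry after the last Patch line and tag the release."""
    lines = spec_text.splitlines()
    # Index and number of the last "PatchN" line, in file order.
    last_idx, last_num = None, None
    for i, line in enumerate(lines):
        m = re.match(r"Patch(\d+)", line)
        if m:
            last_idx, last_num = i, int(m.group(1))
    if last_idx is None:
        raise ValueError("spec has no Patch lines")
    lines.insert(last_idx + 1, "Patch%d: %s" % (last_num + 1, patch_file))
    # "Release: 29%{?dist}.6" becomes "Release: 29.vitastor%{?dist}.6".
    lines = [re.sub(r"^(Release:\s*\d+)", r"\1.vitastor", l) for l in lines]
    return "\n".join(lines) + "\n"


# Made-up, heavily shortened spec used only to show the effect.
sample_spec = (
    "Release: 29%{?dist}.6\n"
    "Patch334: kvm-virtiofsd-avoid-proc-self-fd-tempdir.patch\n"
    "BuildRequires: wget\n"
)
print(add_vitastor_patch(sample_spec))
```

One difference worth noting: `csplit "/^Patch$PN/"` splits at the first line that starts with `Patch$PN`, which is a prefix match, whereas the sketch always uses the position of the last `Patch` line.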
--- a/rpm/qemu-kvm-el7.spec.patch
+++ b/rpm/qemu-kvm-el7.spec.patch
@ -0,0 +1,257 @@
 --- qemu-kvm.spec.orig	2020-11-09 23:41:03.000000000 +0000
 +++ qemu-kvm.spec	2020-12-06 10:44:24.207640963 +0000
@@ -2,7 +2,7 @@
 %global SLOF_gittagcommit 899d9883
 %global have_usbredir 1
 -%global have_spice    1
 +%global have_spice    0
 %global have_opengl   1
 %global have_fdt      0
 %global have_gluster  1
@@ -56,7 +56,7 @@ Requires: %{name}-block-curl = %{epoch}:
 Requires: %{name}-block-gluster = %{epoch}:%{version}-%{release} \
 %endif                                                           \
 Requires: %{name}-block-iscsi = %{epoch}:%{version}-%{release}   \
 -Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release}     \
 +#Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release}     \
 Requires: %{name}-block-ssh = %{epoch}:%{version}-%{release}
 # Macro to properly setup RHEL/RHEV conflict handling
@@ -67,7 +67,7 @@ Obsoletes: %1-rhev
 Summary: QEMU is a machine emulator and virtualizer
 Name: qemu-kvm
 Version: 4.2.0
 -Release: 29.vitastor%{?dist}.6
 +Release: 30.vitastor%{?dist}.6
 # Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
 Epoch: 15
 License: GPLv2 and GPLv2+ and CC-BY
@@ -99,8 +99,8 @@ Source30: kvm-s390x.conf
 Source31: kvm-x86.conf
 Source32: qemu-pr-helper.service
 Source33: qemu-pr-helper.socket
 -Source34: 81-kvm-rhel.rules
 -Source35: udev-kvm-check.c
 +#Source34: 81-kvm-rhel.rules
 +#Source35: udev-kvm-check.c
 Source36: README.tests
@@ -825,7 +825,9 @@ Patch331: kvm-Drop-bogus-IPv6-messages.p
 Patch333: kvm-virtiofsd-Whitelist-fchmod.patch
 # For bz#1883869 - virtiofsd core dump in KATA Container [rhel-8.2.1.z]
 Patch334: kvm-virtiofsd-avoid-proc-self-fd-tempdir.patch
 -Patch335: qemu-4.2-vitastor.patch
 +Patch335: qemu-use-sphinx-1.2.patch
 +Patch336: qemu-config-tcmalloc-warning.patch
 +Patch337: qemu-4.2-vitastor.patch
 BuildRequires: wget
 BuildRequires: rpm-build
@@ -842,7 +844,8 @@ BuildRequires: pciutils-devel
 BuildRequires: libiscsi-devel
 BuildRequires: ncurses-devel
 BuildRequires: libattr-devel
 -BuildRequires: libusbx-devel >= 1.0.22
 +BuildRequires: gperftools-devel
 +BuildRequires: libusbx-devel >= 1.0.21
 %if %{have_usbredir}
 BuildRequires: usbredir-devel >= 0.7.1
 %endif
@@ -856,12 +859,12 @@ BuildRequires: virglrenderer-devel
 # For smartcard NSS support
 BuildRequires: nss-devel
 %endif
 -BuildRequires: libseccomp-devel >= 2.4.0
 +#Requires: libseccomp >= 2.4.0
 # For network block driver
 BuildRequires: libcurl-devel
 BuildRequires: libssh-devel
 -BuildRequires: librados-devel
 -BuildRequires: librbd-devel
 +#BuildRequires: librados-devel
 +#BuildRequires: librbd-devel
 %if %{have_gluster}
 # For gluster block driver
 BuildRequires: glusterfs-api-devel
@@ -955,25 +958,25 @@ hardware for a full system such as a PC
 %package -n qemu-kvm-core
 Summary: qemu-kvm core components
 +Requires: gperftools-libs
 Requires: qemu-img = %{epoch}:%{version}-%{release}
 %ifarch %{ix86} x86_64
 Requires: seabios-bin >= 1.10.2-1
 Requires: sgabios-bin
 -Requires: edk2-ovmf
 %endif
 %ifarch aarch64
 Requires: edk2-aarch64
 %endif
 %ifnarch aarch64 s390x
 -Requires: seavgabios-bin >= 1.12.0-3
 -Requires: ipxe-roms-qemu >= 20170123-1
 +Requires: seavgabios-bin >= 1.11.0-1
 +Requires: ipxe-roms-qemu >= 20181214-1
 +Requires: /usr/share/ipxe.efi
 %endif
 %ifarch %{power64}
 Requires: SLOF >= %{SLOF_gittagdate}-1.git%{SLOF_gittagcommit}
 %endif
 Requires: %{name}-common = %{epoch}:%{version}-%{release}
 -Requires: libseccomp >= 2.4.0
 # For compressed guest memory dumps
 Requires: lzo snappy
 %if %{have_kvm_setup}
@@ -1085,15 +1088,15 @@ This package provides the additional iSC
 Install this package if you want to access iSCSI volumes.
 -%package  block-rbd
 -Summary: QEMU Ceph/RBD block driver
 -Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
 -
 -%description block-rbd
 -This package provides the additional Ceph/RBD block driver for QEMU.
 -
 -Install this package if you want to access remote Ceph volumes
 -using the rbd protocol.
 +#%package  block-rbd
 +#Summary: QEMU Ceph/RBD block driver
 +#Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
 +#
 +#%description block-rbd
 +#This package provides the additional Ceph/RBD block driver for QEMU.
 +#
 +#Install this package if you want to access remote Ceph volumes
 +#using the rbd protocol.
 %package  block-ssh
@@ -1117,12 +1120,14 @@ the Secure Shell (SSH) protocol.
 # --build-id option is used for giving info to the debug packages.
 buildldflags="VL_LDFLAGS=-Wl,--build-id"
 -%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
 +#%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
 +%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,blkdebug,luks,null-co,nvme,copy-on-read,throttle
 %if 0%{have_gluster}
     %global block_drivers_list %{block_drivers_list},gluster
 %endif
 +[ -e /usr/bin/sphinx-build ] || ln -s sphinx-build-3 /usr/bin/sphinx-build
 ./configure  \
  --prefix="%{_prefix}" \
  --libdir="%{_libdir}" \
@@ -1152,15 +1157,15 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
 %else
   --disable-numa \
 %endif
 -  --enable-rbd \
 +  --disable-rbd \
 %if 0%{have_librdma}
   --enable-rdma \
 %else
   --disable-rdma \
 %endif
   --disable-pvrdma \
 -  --enable-seccomp \
 -%if 0%{have_spice}
 +  --disable-seccomp \
 +%if %{have_spice}
   --enable-spice \
   --enable-smartcard \
   --enable-virglrenderer \
@@ -1179,7 +1184,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
 %else
   --disable-usb-redir \
 %endif
 -  --disable-tcmalloc \
 +  --enable-tcmalloc \
 %ifarch x86_64
   --enable-libpmem \
 %else
@@ -1193,9 +1198,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
 %endif
   --python=%{__python3} \
   --target-list="%{buildarch}" \
 -  --block-drv-rw-whitelist=%{block_drivers_list} \
   --audio-drv-list= \
 -  --block-drv-ro-whitelist=vmdk,vhdx,vpc,https,ssh \
   --with-coroutine=ucontext \
   --tls-priority=NORMAL \
   --disable-bluez \
@@ -1262,7 +1265,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
   --disable-sanitizers \
   --disable-hvf \
   --disable-whpx \
 -  --enable-malloc-trim \
 +  --disable-malloc-trim \
   --disable-membarrier \
   --disable-vhost-crypto \
   --disable-libxml2 \
@@ -1308,7 +1311,7 @@ make V=1 %{?_smp_mflags} $buildldflags
 cp -a %{kvm_target}-softmmu/qemu-system-%{kvm_target} qemu-kvm
 gcc %{SOURCE6} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o ksmctl
 -gcc %{SOURCE35} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o udev-kvm-check
 +#gcc %{SOURCE35} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o udev-kvm-check
 %install
 %define _udevdir %(pkg-config --variable=udevdir udev)
@@ -1343,8 +1346,8 @@ mkdir -p $RPM_BUILD_ROOT%{testsdir}/test
 mkdir -p $RPM_BUILD_ROOT%{testsdir}/tests/qemu-iotests
 mkdir -p $RPM_BUILD_ROOT%{testsdir}/scripts/qmp
 -install -p -m 0755 udev-kvm-check $RPM_BUILD_ROOT%{_udevdir}
 -install -p -m 0644 %{SOURCE34} $RPM_BUILD_ROOT%{_udevrulesdir}
 +#install -p -m 0755 udev-kvm-check $RPM_BUILD_ROOT%{_udevdir}
 +#install -p -m 0644 %{SOURCE34} $RPM_BUILD_ROOT%{_udevrulesdir}
 install -m 0644 scripts/dump-guest-memory.py \
                 $RPM_BUILD_ROOT%{_datadir}/%{name}
@@ -1562,6 +1565,8 @@ rm -rf $RPM_BUILD_ROOT%{qemudocdir}/inte
 # Remove spec
 rm -rf $RPM_BUILD_ROOT%{qemudocdir}/specs
 +%global __os_install_post %(echo '%{__os_install_post}' | sed -e 's!/usr/lib[^[:space:]]*/brp-python-bytecompile[[:space:]].*$!!g')
 +
 %check
 export DIFF=diff; make check V=1
@@ -1645,8 +1650,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
 %config(noreplace) %{_sysconfdir}/sysconfig/ksm
 %{_unitdir}/ksmtuned.service
 %{_sbindir}/ksmtuned
 -%{_udevdir}/udev-kvm-check
 -%{_udevrulesdir}/81-kvm-rhel.rules
 +#%{_udevdir}/udev-kvm-check
 +#%{_udevrulesdir}/81-kvm-rhel.rules
 %ghost %{_sysconfdir}/kvm
 %config(noreplace) %{_sysconfdir}/ksmtuned.conf
 %dir %{_sysconfdir}/%{name}
@@ -1711,8 +1716,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
 %{_libexecdir}/vhost-user-gpu
 %{_datadir}/%{name}/vhost-user/50-qemu-gpu.json
 %endif
 -%{_libexecdir}/virtiofsd
 -%{_datadir}/%{name}/vhost-user/50-qemu-virtiofsd.json
 +#%{_libexecdir}/virtiofsd
 +#%{_datadir}/%{name}/vhost-user/50-qemu-virtiofsd.json
 %files -n qemu-img
 %defattr(-,root,root)
@@ -1748,8 +1753,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
 %files block-iscsi
 %{_libdir}/qemu-kvm/block-iscsi.so
 -%files block-rbd
 -%{_libdir}/qemu-kvm/block-rbd.so
 +#%files block-rbd
 +#%{_libdir}/qemu-kvm/block-rbd.so
 %files block-ssh
 %{_libdir}/qemu-kvm/block-ssh.so
--- a/rpm/qemu-kvm.spec.patch
+++ b/rpm/qemu-kvm.spec.patch
@ -0,0 +1,29 @@
 --- qemu-kvm.spec	2020-12-05 13:13:54.388623517 +0000
 +++ qemu-kvm.spec	2020-12-05 13:13:58.728696598 +0000
@@ -67,7 +67,7 @@ Obsoletes: %1-rhev
 Summary: QEMU is a machine emulator and virtualizer
 Name: qemu-kvm
 Version: 4.2.0
 -Release: 29%{?dist}.6
 +Release: 29.vitastor%{?dist}.6
 # Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
 Epoch: 15
 License: GPLv2 and GPLv2+ and CC-BY
@@ -825,6 +825,7 @@ Patch331: kvm-Drop-bogus-IPv6-messages.p
 Patch333: kvm-virtiofsd-Whitelist-fchmod.patch
 # For bz#1883869 - virtiofsd core dump in KATA Container [rhel-8.2.1.z]
 Patch334: kvm-virtiofsd-avoid-proc-self-fd-tempdir.patch
 +Patch335: qemu-4.2-vitastor.patch
 BuildRequires: wget
 BuildRequires: rpm-build
@@ -1192,9 +1193,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
 %endif
   --python=%{__python3} \
   --target-list="%{buildarch}" \
 -  --block-drv-rw-whitelist=%{block_drivers_list} \
   --audio-drv-list= \
 -  --block-drv-ro-whitelist=vmdk,vhdx,vpc,https,ssh \
   --with-coroutine=ucontext \
   --tls-priority=NORMAL \
   --disable-bluez \
--- a/rpm/vitastor-el7.Dockerfile
+++ b/rpm/vitastor-el7.Dockerfile
@ -0,0 +1,48 @@
 # Build packages for CentOS 7 inside a container
 # cd ..; podman build -t vitastor-el7 -v `pwd`/packages:/root/packages -f rpm/vitastor-el7.Dockerfile .
 # localedef -i ru_RU -f UTF-8 ru_RU.UTF-8
 FROM centos:7
 WORKDIR /root
 RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
 RUN yum -y --enablerepo=extras install centos-release-scl epel-release yum-utils rpm-build
 RUN yum -y install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm
 RUN yum -y install devtoolset-9-gcc-c++ devtoolset-9-libatomic-devel gperftools-devel qemu-kvm fio rh-nodejs12 jerasure-devel gf-complete-devel
 RUN yumdownloader --disablerepo=centos-sclo-rh --source qemu-kvm
 RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
 RUN rpm --nomd5 -i qemu*.src.rpm
 RUN rpm --nomd5 -i fio*.src.rpm
 RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
 RUN cd ~/rpmbuild/SPECS && yum-builddep -y qemu-kvm.spec
 RUN cd ~/rpmbuild/SPECS && yum-builddep -y fio.spec
 RUN yum -y install rdma-core-devel
 ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
 RUN set -e; \
    rpm -i liburing*.src.rpm; \
    cd ~/rpmbuild/SPECS/; \
    . /opt/rh/devtoolset-9/enable; \
    rpmbuild -ba liburing.spec; \
    mkdir -p /root/packages/liburing-el7; \
    rm -rf /root/packages/liburing-el7/*; \
    cp ~/rpmbuild/RPMS/*/liburing* /root/packages/liburing-el7/; \
    cp ~/rpmbuild/SRPMS/liburing* /root/packages/liburing-el7/
 RUN rpm -i `ls /root/packages/liburing-el7/liburing-*.x86_64.rpm | grep -v debug`
 ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
    cp /root/vitastor-0.6.5.el7.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
    mkdir -p /root/packages/vitastor-el7; \
    rm -rf /root/packages/vitastor-el7/*; \
    cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el7/; \
    cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el7/
@ -0,0 +1 @@
 Subproject commit 5dc108754ad40d3b1d024f9bd7cca0595ef1a1db
@ -0,0 +1,2 @@
 dep:fio=3.16-1
 dep:qemu=1:5.1+dfsg-4+vitastor1
@ -0,0 +1 @@
 Subproject commit 97f06cb20c1e136fd37d58fb40f57dd8f8a3a4a7