Make pg_stripe_size a per-pool config

More fixes to the failure model (why am I doing this?..)
More correct failure model (I hope so)
2020-10-01 18:51:49 +03:00 · 2020-10-01 18:38:30 +03:00 · 2020-10-01 02:33:48 +03:00 · 2020-09-29 02:06:19 +03:00 · 2020-09-27 19:42:42 +03:00 · 2020-09-26 00:11:55 +03:00
101 changed files with 7405 additions and 2320 deletions
--- a/.gitmodules
+++ b/.gitmodules
@@ -0,0 +1,6 @@
+[submodule "cpp-btree"]
+	path = cpp-btree
+	url = ../cpp-btree.git
+[submodule "json11"]
+	path = json11
+	url = ../json11.git
--- a/GPL-2.0.txt
+++ b/GPL-2.0.txt
@@ -0,0 +1,339 @@
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+                            NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
--- a/96
+++ b/96
@@ -2,7 +2,7 @@ BLOCKSTORE_OBJS := allocator.o blockstore.o blockstore_impl.o blockstore_init.o
 	blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o
 # -fsanitize=address
 CXXFLAGS := -g -O3 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fPIC -fdiagnostics-color=always
-all: libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal
+all: libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal qemu_driver.so nbd_proxy
 clean:
 	rm -f *.o

@@ -15,10 +15,10 @@ libfio_blockstore.so: ./libblockstore.so fio_engine.o json11.o
 	g++ $(CXXFLAGS) -shared -o $@ fio_engine.o json11.o ./libblockstore.so -ltcmalloc_minimal -luring

 OSD_OBJS := osd.o osd_secondary.o msgr_receive.o msgr_send.o osd_peering.o osd_flush.o osd_peering_pg.o \
-	osd_primary.o osd_primary_subops.o etcd_state_client.o messenger.o osd_cluster.o http_client.o pg_states.o \
-	osd_rmw.o json11.o base64.o timerfd_manager.o
+	osd_primary.o osd_primary_subops.o etcd_state_client.o messenger.o osd_cluster.o http_client.o osd_ops.o pg_states.o \
+	osd_rmw.o json11.o base64.o timerfd_manager.o epoll_manager.o
 osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
-	g++ $(CXXFLAGS) -o $@ osd_main.cpp $(OSD_OBJS) ./libblockstore.so -ltcmalloc_minimal -luring -lpthread
+	g++ $(CXXFLAGS) -o $@ osd_main.cpp $(OSD_OBJS) ./libblockstore.so -ltcmalloc_minimal -luring

 stub_osd: stub_osd.o rw_blocking.o
 	g++ $(CXXFLAGS) -o $@ stub_osd.o rw_blocking.o -ltcmalloc_minimal
@@ -36,10 +36,20 @@ osd_peering_pg_test: osd_peering_pg_test.cpp osd_peering_pg.o
 libfio_sec_osd.so: fio_sec_osd.o rw_blocking.o
 	g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ fio_sec_osd.o rw_blocking.o

-FIO_CLUSTER_OBJS := fio_cluster.o cluster_client.o epoll_manager.o etcd_state_client.o \
-	messenger.o msgr_send.o msgr_receive.o ringloop.o json11.o http_client.o pg_states.o timerfd_manager.o base64.o
-libfio_cluster.so: $(FIO_CLUSTER_OBJS)
-	g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $(FIO_CLUSTER_OBJS) -luring
+FIO_CLUSTER_OBJS := cluster_client.o epoll_manager.o etcd_state_client.o \
+	messenger.o msgr_send.o msgr_receive.o ringloop.o json11.o http_client.o osd_ops.o pg_states.o timerfd_manager.o base64.o
+libfio_cluster.so: fio_cluster.o $(FIO_CLUSTER_OBJS)
+	g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $< $(FIO_CLUSTER_OBJS) -luring
+
+nbd_proxy: nbd_proxy.o $(FIO_CLUSTER_OBJS)
+	g++ $(CXXFLAGS) -ltcmalloc_minimal -o $@ $< $(FIO_CLUSTER_OBJS) -luring
+
+qemu_driver.o: qemu_driver.c qemu_proxy.h
+	gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
+		-I qemu/include $(CXXFLAGS) -c -o $@ $<
+
+qemu_driver.so: qemu_driver.o qemu_proxy.o $(FIO_CLUSTER_OBJS)
+	g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $(FIO_CLUSTER_OBJS) qemu_driver.o qemu_proxy.o -luring

 test_blockstore: ./libblockstore.so test_blockstore.cpp timerfd_interval.o
 	g++ $(CXXFLAGS) -o test_blockstore test_blockstore.cpp timerfd_interval.o ./libblockstore.so -ltcmalloc_minimal -luring
@@ -59,78 +69,84 @@ allocator.o: allocator.cpp allocator.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 base64.o: base64.cpp base64.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore.o: blockstore.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore.o: blockstore.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_flush.o: blockstore_flush.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_flush.o: blockstore_flush.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_impl.o: blockstore_impl.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_impl.o: blockstore_impl.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_init.o: blockstore_init.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_init.o: blockstore_init.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_journal.o: blockstore_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_journal.o: blockstore_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_open.o: blockstore_open.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_open.o: blockstore_open.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_read.o: blockstore_read.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_read.o: blockstore_read.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_rollback.o: blockstore_rollback.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_rollback.o: blockstore_rollback.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_stable.o: blockstore_stable.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_stable.o: blockstore_stable.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_sync.o: blockstore_sync.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_sync.o: blockstore_sync.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-blockstore_write.o: blockstore_write.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+blockstore_write.o: blockstore_write.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-cluster_client.o: cluster_client.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
+cluster_client.o: cluster_client.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-dump_journal.o: dump_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h ringloop.h
+dump_journal.o: dump_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 epoll_manager.o: epoll_manager.cpp epoll_manager.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-etcd_state_client.o: etcd_state_client.cpp base64.h etcd_state_client.h http_client.h json11/json11.hpp object_id.h osd_id.h osd_ops.h pg_states.h ringloop.h timerfd_manager.h
+etcd_state_client.o: etcd_state_client.cpp base64.h etcd_state_client.h http_client.h json11/json11.hpp object_id.h osd_id.h osd_ops.h pg_states.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-fio_cluster.o: fio_cluster.cpp cluster_client.h epoll_manager.h etcd_state_client.h fio/fio.h fio/optgroup.h http_client.h json11/json11.hpp messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
+fio_cluster.o: fio_cluster.cpp cluster_client.h epoll_manager.h etcd_state_client.h fio/fio.h fio/optgroup.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 fio_engine.o: fio_engine.cpp blockstore.h fio/fio.h fio/optgroup.h json11/json11.hpp object_id.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 fio_sec_osd.o: fio_sec_osd.cpp fio/fio.h fio/optgroup.h object_id.h osd_id.h osd_ops.h rw_blocking.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-http_client.o: http_client.cpp http_client.h json11/json11.hpp ringloop.h timerfd_manager.h
+http_client.o: http_client.cpp http_client.h json11/json11.hpp timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-messenger.o: messenger.cpp json11/json11.hpp messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
+messenger.o: messenger.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-msgr_receive.o: msgr_receive.cpp json11/json11.hpp messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
+msgr_receive.o: msgr_receive.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-msgr_send.o: msgr_send.cpp json11/json11.hpp messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
+msgr_send.o: msgr_send.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd.o: osd.cpp blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
+nbd_proxy.o: nbd_proxy.cpp cluster_client.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_cluster.o: osd_cluster.cpp base64.h blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
+osd.o: osd.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_flush.o: osd_flush.cpp blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
+osd_cluster.o: osd_cluster.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_main.o: osd_main.cpp blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
+osd_flush.o: osd_flush.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_peering.o: osd_peering.cpp base64.h blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
+osd_main.o: osd_main.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
+	g++ $(CXXFLAGS) -c -o $@ $<
+osd_ops.o: osd_ops.cpp object_id.h osd_id.h osd_ops.h
+	g++ $(CXXFLAGS) -c -o $@ $<
+osd_peering.o: osd_peering.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_peering_pg.o: osd_peering_pg.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_peering_pg_test.o: osd_peering_pg_test.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_primary.o: osd_primary.cpp blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
+osd_primary.o: osd_primary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_primary_subops.o: osd_primary_subops.cpp blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
+osd_primary_subops.o: osd_primary_subops.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_rmw.o: osd_rmw.cpp object_id.h osd_id.h osd_rmw.h xor.h
+osd_rmw.o: osd_rmw.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.h xor.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_rmw_test.o: osd_rmw_test.cpp object_id.h osd_id.h osd_rmw.cpp osd_rmw.h test_pattern.h xor.h
+osd_rmw_test.o: osd_rmw_test.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.cpp osd_rmw.h test_pattern.h xor.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-osd_secondary.o: osd_secondary.cpp blockstore.h cpp-btree/btree_map.h etcd_state_client.h http_client.h json11/json11.hpp messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
+osd_secondary.o: osd_secondary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 osd_test.o: osd_test.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h test_pattern.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 pg_states.o: pg_states.cpp pg_states.h
 	g++ $(CXXFLAGS) -c -o $@ $<
+qemu_proxy.o: qemu_proxy.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h qemu_proxy.h ringloop.h timerfd_manager.h
+	g++ $(CXXFLAGS) -c -o $@ $<
 ringloop.o: ringloop.cpp ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 rw_blocking.o: rw_blocking.cpp rw_blocking.h
@ -139,9 +155,9 @@ stub_bench.o: stub_bench.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 stub_osd.o: stub_osd.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-stub_uring_osd.o: stub_uring_osd.cpp epoll_manager.h json11/json11.hpp messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
+stub_uring_osd.o: stub_uring_osd.cpp epoll_manager.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-test.o: test.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h
+test.o: test.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h
 	g++ $(CXXFLAGS) -c -o $@ $<
 test_allocator.o: test_allocator.cpp allocator.h
 	g++ $(CXXFLAGS) -c -o $@ $<
@ -149,5 +165,5 @@ test_blockstore.o: test_blockstore.cpp blockstore.h object_id.h ringloop.h timer
 	g++ $(CXXFLAGS) -c -o $@ $<
 timerfd_interval.o: timerfd_interval.cpp ringloop.h timerfd_interval.h
 	g++ $(CXXFLAGS) -c -o $@ $<
-timerfd_manager.o: timerfd_manager.cpp ringloop.h timerfd_manager.h
+timerfd_manager.o: timerfd_manager.cpp timerfd_manager.h
 	g++ $(CXXFLAGS) -c -o $@ $<
--- a/README.md
+++ b/README.md
@ -0,0 +1,387 @@
+## Vitastor
+
+## The Idea
+
+Make Software-Defined Block Storage Great Again.
+
+Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
+architecturally similar to Ceph, which means strong consistency, primary-replication, symmetric
+clustering and automatic data distribution over any number of drives of any size
+with configurable redundancy (replication or erasure codes/XOR).
+
+## Features
+
+Vitastor is currently a pre-release: a lot of features are missing and you can still expect
+breaking changes in the future. However, the following is implemented:
+
+- Basic part: highly-available block storage with symmetric clustering and no SPOF
+- Performance ;-D
+- Two redundancy schemes: Replication and XOR n+1 (simplest case of EC)
+- Configuration via simple JSON data structures in etcd
+- Automatic data distribution over OSDs, with support for:
+  - Mathematical optimization for better uniformity and less data movement
+  - Multiple pools
+  - Placement tree
+  - Configurable failure domains
+- Recovery of degraded blocks
+- Rebalancing (data movement between OSDs)
+- Lazy fsync support
+- I/O statistics reporting to etcd
+- Generic user-space client library
+- QEMU driver (built out-of-tree)
+- Loadable fio engine for benchmarks (also built out-of-tree)
+- NBD proxy for kernel mounts
+
+## Roadmap
+
+- Packaging for Debian and, probably, CentOS too
+- OSD creation tool (OSDs currently have to be created by hand)
+- Inode deletion tool (currently you can't delete anything :))
+- Other administrative tools
+- Per-inode I/O and space usage statistics
+- jerasure EC support with any number of data and parity drives in a group
+- Parallel usage of multiple network interfaces
+- Proxmox and OpenNebula plugins
+- iSCSI proxy
+- Inode metadata storage in etcd
+- Snapshots and copy-on-write image clones
+- Operation timeouts and better failure detection
+- Checksums
+- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
+- RDMA and NVDIMM support
+- Compression (possibly)
+- Read caching using system page cache (possibly)
+
+## Architecture
+
+Similarities:
+
+- Just like Ceph, Vitastor has Pools, PGs, OSDs, Monitors, Failure Domains, Placement Tree.
+- Just like Ceph, Vitastor is transactional (even though there's a "lazy fsync mode" which
+  doesn't implicitly flush every operation to disks).
+- OSDs also have journal and metadata and they can also be put on separate drives.
+- Just like in Ceph, client library attempts to recover from any cluster failure so
+  you can basically reboot the whole cluster and only pause, but not crash, your clients
+  (I consider this a bug if the client crashes in that case).
+
+Some basic terms for people not familiar with Ceph:
+
+- OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
+- PG (Placement Group) is a container for data that (normally) shares the same replicas.
+- Pool is a container for data that has the same redundancy scheme and placement rules.
+- Monitor is a separate daemon that watches cluster state and handles failures.
+- Failure Domain is a group of OSDs that you allow to fail. It's "host" by default.
+- Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains.
+
+Architectural differences from Ceph:
+
+- Vitastor's primary focus is on SSDs. Proper SSD+HDD optimizations may be added in the future, though.
+- Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core
+  per drive you should run multiple OSDs each on a different partition of the drive.
+  Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
+- Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
+  and data store block size which is 128 KB by default. With 128 KB blocks, metadata should occupy
+  around 512 MB per 1 TB (which is still less than Ceph wants). Journal doesn't have to be big:
+  the example test below was conducted with only a 16 MB journal. A big journal is probably even
+  harmful as dirty write metadata also takes some memory.
+- Vitastor storage layer doesn't have internal copy-on-write or redirect-write. I know that maybe
+  it's possible to create a good copy-on-write storage, but it's much harder and makes performance
+  less deterministic, so CoW isn't used in Vitastor.
+- The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with
+  rich semantics like in Ceph (RADOS).
+- There's a "lazy fsync" mode which allows batching writes before flushing them to the disk.
+  This makes it possible to use Vitastor with desktop SSDs, but still lowers performance due to additional
+  network roundtrips, so use server SSDs with capacitor-based power loss protection
+  ("Advanced Power Loss Protection") for best performance.
+- PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory
+  while OSDs are running.
+- Recovery process is per-object (per-block), not per-PG. Also there are no PGLOGs.
+- Monitors don't store data. Cluster configuration and state are stored in etcd in simple human-readable
+  JSON structures. Monitors only watch cluster state and handle data movement.
+  Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's Manager.
+  Vitastor's Monitor is implemented in node.js.
+- PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd.
+  Rebalancing PGs between OSDs is done by mathematical optimization - data distribution problem
+  is reduced to a linear programming problem and solved by lp_solve. This allows for almost
+  perfect (96-99% uniformity compared to Ceph's 80-90%) data distribution in most cases, ability
+  to map PGs by hand without breaking rebalancing logic, reduced OSD peer-to-peer communication
+  (on average, OSDs have fewer peers) and less data movement. It also probably has a drawback -
+  this method may fail in very large clusters, but up to several hundreds of OSDs it's perfectly fine.
+  It's also easy to add consistent hashes in the future if something proves their necessity.
+- There's no separate CRUSH layer. You select pool redundancy scheme, placement root, failure domain
+  and so on directly in pool configuration.
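
The in-memory metadata estimate from the list above (around 512 MB per 1 TB with 128 KB blocks) can be turned into a rough sizing helper. This is only a sketch: the 64 bytes per block figure is not taken from Vitastor's code, it is simply what the two README numbers imply when binary units are assumed, and the function name is hypothetical.

```python
# Rough OSD metadata memory estimate, derived only from the README figures.
# ASSUMPTION: 64 bytes per block is back-calculated from "512 MB per 1 TB
# with 128 KB blocks" (binary units); it is not read from the Vitastor source.
BYTES_PER_BLOCK_ASSUMED = 64


def metadata_bytes(capacity_bytes, block_size=128 * 1024,
                   bytes_per_block=BYTES_PER_BLOCK_ASSUMED):
    """Metadata grows linearly with the number of data blocks on the drive."""
    blocks = capacity_bytes // block_size
    return blocks * bytes_per_block


if __name__ == "__main__":
    tib = 2 ** 40
    for size_tib in (1, 2, 4):
        mib = metadata_bytes(size_tib * tib) / 2 ** 20
        print(f"{size_tib} TiB drive: about {mib:.0f} MiB of metadata in RAM")
```

Halving the block size doubles the estimate, which is the linear dependence on block size mentioned above.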
+
+## Understanding Storage Performance
+
+The most important thing for fast storage is latency, not parallel iops.
+
+The best possible latency is achieved with one thread and queue depth of 1 which basically means
+"client load as low as possible". In this case IOPS = 1/latency, and this number doesn't
+scale with number of servers, drives, server processes or threads and so on.
+Single-threaded IOPS and latency numbers only depend on *how fast a single daemon is*.
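
To make the IOPS = 1/latency relation concrete, here is a minimal sketch. The function names are hypothetical, and the sample figures are the raw drive write latency and the Vitastor T1Q1 write iops quoted further down in this README.

```python
def qd1_iops(latency_seconds):
    """IOPS of one thread at queue depth 1: each request waits for the previous one."""
    return 1.0 / latency_seconds


def qd1_latency_ms(iops):
    """Inverse relation: average per-request latency in milliseconds."""
    return 1000.0 / iops


if __name__ == "__main__":
    # ~0.037 ms raw drive write latency
    print(round(qd1_iops(0.037e-3)))
    # 7087 iops measured for Vitastor T1Q1 writes
    print(round(qd1_latency_ms(7087), 3))
```

Adding servers or drives changes neither function's input, which is why these numbers don't scale with cluster size.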
+
+Why is it important? It's important because some of the applications *can't* use
+queue depth greater than 1 because their task isn't parallelizable. A notable example
+is any ACID DBMS because all of them write their WALs sequentially with fsync()s.
+
+fsync, by the way, is another important thing often missing in benchmarks. The point is
+that drives have cache buffers and don't guarantee that your data is actually persisted
+until you call fsync() which is translated to a FLUSH CACHE command by the OS.
+
+Desktop SSDs are very fast without fsync - NVMes, for example, can process ~80000 write
+operations per second with queue depth of 1 without fsync - but they're really slow with
+fsync because they have to actually write data to flash chips when you call fsync. A typical
+number is around 1000-2000 iops with fsync.
+
+Server SSDs often have supercapacitors that act as a built-in UPS and allow the drive
+to flush its DRAM cache to the persistent flash storage when a power loss occurs.
+This makes them perform equally well with and without fsync. This feature is called
+"Advanced Power Loss Protection" by Intel; other vendors either call it similarly
+or directly as "Full Capacitor-Based Power Loss Protection".
+
+All software-defined storage systems that I currently know of are slow in terms of latency.
+Notable examples are Ceph and internal SDSes used by cloud providers like Amazon, Google,
+Yandex and so on. They're all slow and can only reach ~0.3ms read and ~0.6ms 4 KB write latency
+with best-in-slot hardware.
+
+And that's in the SSD era when you can buy an SSD that has ~0.04ms latency for $100.
+
+I use the following 6 commands with small variations to benchmark any storage:
+
+- Linear write:
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
+- Linear read:
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
+- Random write latency (this hurts storages the most):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
+- Random read latency:
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
+- Parallel write iops (use numjobs if a single CPU core is insufficient to saturate the load):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
+- Parallel read iops (use numjobs if a single CPU core is insufficient to saturate the load):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
+
+## Vitastor's Theoretical Maximum Random Access Performance
+
+Replicated setups:
+- Single-threaded (T1Q1) read latency: 1 network roundtrip + 1 disk read.
+- Single-threaded write+fsync latency:
+  - With immediate commit: 2 network roundtrips + 1 disk write.
+  - With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
+- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
+- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
+
+EC/XOR setups:
+- Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
+- Single-threaded write+fsync latency:
+  - With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
+  - With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
+  - 0.5 is actually (k-1)/k, which means that an additional roundtrip doesn't happen when
+    the read sub-operation can be served locally.
+- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
+- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
+  In fact, you should put disk write iops under the condition of ~10% reads / ~90% writes in this formula.
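
The latency and throughput bounds above can be sketched as a small calculator. Everything here is illustrative: the function names are hypothetical, the timings in the usage part are made-up inputs rather than measurements, "network bandwidth" is assumed to be already expressed in operations per second, and adding (k-1)/k to the whole-roundtrip counts is my reading of the "0.5" note.

```python
def replicated_latency(rtt, disk_read, disk_write, disk_flush=0.0, lazy=False):
    """Return (read, write+fsync) T1Q1 latency in seconds for a replicated pool."""
    read = rtt + disk_read
    if lazy:
        write = 4 * rtt + disk_write + disk_flush
    else:
        write = 2 * rtt + disk_write
    return read, write


def ec_latency(rtt, disk_read, disk_write, disk_flush=0.0, k=2, lazy=False):
    """Return (read, write+fsync) T1Q1 latency in seconds for an EC/XOR pool.

    k is the number of data drives; (k-1)/k is the chance that the read
    sub-operation needs an extra roundtrip (0.5 for k=2).
    """
    extra = (k - 1) / k
    read = (1 + extra) * rtt + disk_read
    if lazy:
        write = (5 + extra) * rtt + disk_read + 2 * disk_write + 2 * disk_flush
    else:
        write = (3 + extra) * rtt + disk_read + 2 * disk_write
    return read, write


def replicated_write_iops(network_iops, disk_write_iops, replicas, wa):
    """Saturated parallel write iops bound for a replicated pool."""
    return min(network_iops, sum(disk_write_iops) / replicas / wa)


if __name__ == "__main__":
    # Hypothetical timings: 50 us roundtrip, 100 us disk read, 40 us disk write.
    print(replicated_latency(50e-6, 100e-6, 40e-6))
    print(ec_latency(50e-6, 100e-6, 40e-6, k=2))
    print(replicated_write_iops(10**6, [60000] * 24, replicas=2, wa=4))
```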
+
+Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
+1. Journal block write
+2. Journal data write
+3. Metadata block write
+4. Another journal block write for EC/XOR setups
+5. Data block write
+
+If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
+lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.
+
+Lazy fsync also reduces WA for parallel workloads because journal blocks are only
+written when they fill up or fsync is requested.
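
The write amplification arithmetic above can be checked with a few lines. This is a simplified sketch that assumes every listed write happens exactly once per client write, so it gives the upper end of the range and ignores the batching that lazy fsync provides. The function and parameter names are hypothetical.

```python
def write_amplification(data_size=4096, journal_block=4096, meta_block=4096, ec=True):
    """Bytes written to the drive per byte of client data, without batching.

    Counts the writes listed above: journal block, journal data,
    metadata block, data block, plus one more journal block for EC/XOR.
    """
    written = journal_block + data_size + meta_block + data_size
    if ec:
        written += journal_block
    return written / data_size


if __name__ == "__main__":
    print(write_amplification())                                    # 4 KB blocks, EC/XOR
    print(write_amplification(ec=False))                            # 4 KB blocks, replicated
    print(write_amplification(journal_block=512, meta_block=512))   # 512-byte blocks, EC/XOR
```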
+
+## Example Comparison with Ceph
+
+Hardware configuration: 4 nodes, each with:
+- 6x SATA SSD Intel D3-4510 3.84 TB
+- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
+- 384 GB RAM
+- 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch
+
+CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.
+
+All of the results below apply to 4 KB blocks.
+
+Raw drive performance:
+- T1Q1 write ~27000 iops (~0.037ms latency)
+- T1Q1 read ~9800 iops (~0.101ms latency)
+- T1Q32 write ~60000 iops
+- T1Q32 read ~81700 iops
+
+Ceph 15.2.4 (Bluestore):
+- T1Q1 write ~1000 iops (~1ms latency)
+- T1Q1 read ~1750 iops (~0.57ms latency)
+- T8Q64 write ~100000 iops, total CPU usage by OSDs about 40 virtual cores on each node
+- T8Q64 read ~480000 iops, total CPU usage by OSDs about 40 virtual cores on each node
+
+T8Q64 tests were conducted over 8 400GB RBD images from all hosts (every host was running 2 instances of fio).
+This is because Ceph has performance penalties related to running multiple clients over a single RBD image.
+
+cephx_sign_messages was set to false during tests, RocksDB and Bluestore settings were left at defaults.
+
+In fact, not that bad for Ceph. These servers are an example of well-balanced Ceph nodes.
+However, CPU usage and I/O latency were through the roof, as usual.
+
+Vitastor:
+- T1Q1 write: 7087 iops (0.14ms latency)
+- T1Q1 read: 6838 iops (0.145ms latency)
+- T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
+- T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
+
+T8Q64 read test was conducted over 1 larger inode (3.2T) from all hosts (every host was running 2 instances of fio).
+Vitastor has no performance penalties related to running multiple clients over a single inode.
+If conducted from one node with all primary OSDs moved to other nodes, the result was slightly lower (689000 iops)
+because all operations resulted in network roundtrips between the client and the primary OSD.
+When fio was colocated with OSDs (like in Ceph benchmarks above), 1/4 of the read workload actually
+used the loopback network.
+
+Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8
+  --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
+  --journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
+  --journal_size 16777216`.
+
+### NBD
+
+NBD is currently required to mount Vitastor via kernel, but it imposes additional overhead
+due to additional copying between the kernel and userspace. This mostly hurts linear
+bandwidth, not iops.
+
+Vitastor with single-thread NBD on the same hardware:
+- T1Q1 write: 6000 iops (0.166ms latency)
+- T1Q1 read: 5518 iops (0.18ms latency)
+- T1Q128 write: 94400 iops
+- T1Q128 read: 103000 iops
+- Linear write (4M T1Q128): 1266 MB/s (compared to 2600 MB/s via fio)
+- Linear read (4M T1Q128): 975 MB/s (compared to 1400 MB/s via fio)
+
+## Building
+
+- Install Linux kernel 5.4 or newer for io_uring support.
+- Install liburing 0.4 or newer and its headers.
+- Install lp_solve.
+- Install etcd.
+- Install node.js 12 or newer.
+- Install gcc and g++ 9.x.
+- Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
+- Install QEMU 4.x or 5.x, get its source, begin to build it, stop the build and copy headers:
+   - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
+   - Debian:
+      * Use qemu packages from the main repository
+      * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
+      * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
+   - CentOS 8:
+      * Use qemu packages from the Advanced-Virtualization repository. To enable it, run
+        `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
+      * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
+      * `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
+   - `config-host.h` and `qapi` are required because they contain generated headers
+- Install fio 3.16, get its source and symlink it into `<vitastor>/fio`. It doesn't currently
+  build with fio 3.20 or newer due to the conflicts between g++ and gcc's atomics. This will
+  be fixed in the future.
+- Build Vitastor with `make -j8`.
+- Copy binaries somewhere.
+
+## Running
+
+Please note that the startup procedure isn't currently simple - you specify configuration
+and calculate disk offsets almost by hand. This will be fixed in the near future.
+
+- Get some SATA or NVMe SSDs with capacitors (server-grade drives). You can use desktop SSDs
+  with lazy fsync, but prepare for inferior single-thread latency.
+- Get a fast network (at least 10 Gbit/s).
+- Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
+- Install etcd with `--max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision` options.
+- Create global configuration in etcd: `etcdctl put /vitastor/config/global '{"immediate_commit":"all"}'`
+  (if all your drives have capacitors).
+- Create pool configuration in etcd: `etcdctl put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
+- Calculate offsets for your drives with `node ./mon/simple-offsets.js /dev/sdX`.
+- Make systemd units for your OSDs. Look at `./mon/make-units.sh` for example.
+  Notable configuration variables from the example:
+  - `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
+  - `immediate_commit all` - use this if all your drives are server-grade.
+  - `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
+  - `flusher_count 16` - flusher is a micro-thread that removes old data from the journal.
+    More flushers mean more aggressive journal flushing which allows for more throughput
+    but slightly hurts latency under less load. Flushing will probably be improved in the future
+    because currently high queue depths sometimes lead to performance degradation.
+  - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
+    block size of your SSDs which is 4096 on most drives.
+  - `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
+    Some SSDs (like Intel D3-4510) don't like such overwrites so they benefit from this setting.
+    When this setting is set, it is also required to raise `journal_sector_buffer_count` setting,
+    which is the number of dirty journal sectors that may be written to at the same time.
+- `systemctl start vitastor.target` everywhere.
+- Start any number of monitors: `cd mon; node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5`.
+- At this point, one of the monitors will configure PGs and OSDs will start them.
+- You can check PG states with `etcdctl get --prefix /vitastor/pg/state`. All PGs should become 'active'.
+- Run tests with (for example): `fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
+- Upload VM disk image with qemu-img (for example):
+  ```
+  LD_PRELOAD=./qemu_driver.so qemu-img convert -f qcow2 debian10.qcow2 -p
+    -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
+  ```
+- Run QEMU with (for example):
+  ```
+  LD_PRELOAD=./qemu_driver.so qemu-system-x86_64 -enable-kvm -m 1024
+    -drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
+    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
+    -vnc 0.0.0.0:0
+  ```
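
The pool configuration from the steps above is easy to mistype by hand, so here is a hedged sketch that generates the same `etcdctl put` command from a Python dict. The helper name is hypothetical; the key names and values are copied from the example above, and nothing here talks to etcd.

```python
import json
import shlex


def pool_config_command(pools, prefix="/vitastor"):
    """Build the etcdctl command line that stores the pool configuration."""
    # Compact separators reproduce the whitespace-free JSON used in the README.
    payload = json.dumps(pools, separators=(",", ":"))
    return "etcdctl put " + prefix + "/config/pools " + shlex.quote(payload)


if __name__ == "__main__":
    pools = {
        "1": {
            "name": "testpool",
            "scheme": "replicated",
            "pg_size": 2,
            "pg_minsize": 1,
            "pg_count": 256,
            "failure_domain": "host",
        }
    }
    print(pool_config_command(pools))
```

Generating the JSON instead of typing it keeps the quoting valid when more pools are added to the dict.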
+
+## Known Problems
+
+- OSDs may currently crash with "can't get SQE, will fall out of sync with EPOLLET"
+  if you try to load them with very high iodepths because the io_uring queue (ring) is limited
+  and OSDs don't check if it fills up.
+- Object deletion requests may currently lead to unfound objects on crashes because
+  proper handling of deletions in a cluster requires a "three-phase cleanup process"
+  and it's currently not implemented. In fact, even though deletion requests are
+  implemented, there's no user tool to delete anything from the cluster yet :).
+  Of course I'll create such a tool, but its first implementation will be vulnerable to this issue.
+  It's not a big deal though, because you'll be able to just repeat the deletion request
+  in this case.
+
+## Implementation Principles
+
+- I like simple and stupid solutions, so expect Vitastor to stay simple.
+- I also like reinventing the wheel to some extent, like writing my own HTTP client
+  for etcd interaction instead of using prebuilt libraries, because in this case
+  I'm confident about what my code does and what it doesn't do.
+- I don't care about C++ "best practices" like RAII or proper inheritance or usage of
+  smart pointers or whatever and I don't intend to change my mind, so if you're here
+  looking for ideal reference C++ code, this probably isn't the right place.
+- I like node.js better than any other dynamically-typed language interpreter
+  because it's faster than any other interpreter in the world, has neutral C-like
+  syntax and built-in event loop. That's why Monitor is implemented in node.js.
+
+## Author and License
+
+Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
+
+You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
+
+All server-side code (OSD, Monitor and so on) is licensed under the terms of
+Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on
+GNU GPLv3.0 with the additional "Network Interaction" clause which requires
+opensourcing all programs directly or indirectly interacting with Vitastor
+through a computer network ("Proxy Programs"). Proxy Programs may be made public
+not only under the terms of the same license, but also under the terms of any
+GPL-Compatible Free Software License, as listed by the Free Software Foundation.
+This is a stricter copyleft license than the Affero GPL.
+
+Basically, you can't use the software in a proprietary environment to provide
+its functionality to users without opensourcing all intermediary components
+standing between the user and Vitastor or purchasing a commercial license
+from the author 😀.
+
+Client libraries (cluster_client and so on) are dual-licensed under the same
+VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
+software like QEMU and fio.
+
+You can find the full text of VNPL-1.0 in the file [VNPL-1.0.txt](VNPL-1.0.txt).
+GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).
--- a/VNPL-1.0.txt
+++ b/VNPL-1.0.txt
@ -0,0 +1,645 @@
+                     VITASTOR NETWORK PUBLIC LICENSE
+                       Version 1, 17 September 2020
+
+ Copyright (C) 2020 Vitaliy Filippov <vitalif@yourcmc.ru>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The Vitastor Network Public License is a free, copyleft license for
+software and other kinds of works, specifically designed to ensure
+cooperation with the community in the case of network server software.
+
+  The licenses for most software and other practical works are designed
+to take away your freedom to share and change the works.  By contrast,
+GNU General Public Licenses and Vitastor Network Public License are
+intended to guarantee your freedom to share and change all versions
+of a program--to make sure it remains free software for all its users.
+
+  When we speak of free software, we are referring to freedom, not
+price.  GNU General Public Licenses and Vitastor Network Public License
+are designed to make sure that you have the freedom to distribute copies
+of free software (and charge for them if you wish), that you receive
+source code or can get it if you want it, that you can change the software
+or use pieces of it in new free programs, and that you know you can do these
+things.
+
+  Developers that use GNU General Public Licenses and Vitastor
+Network Public License protect your rights with two steps:
+(1) assert copyright on the software, and (2) offer
+you this License which gives you legal permission to copy, distribute
+and/or modify the software.
+
+  A secondary benefit of defending all users' freedom is that
+improvements made in alternate versions of the program, if they
+receive widespread use, become available for other developers to
+incorporate.  Many developers of free software are heartened and
+encouraged by the resulting cooperation.  However, in the case of
+software used on network servers, this result may fail to come about.
+The GNU General Public License permits making a modified version and
+letting the public access it on a server without ever releasing its
+source code to the public. Even the GNU Affero General Public License
+permits running a modified version in a closed environment where
+public users only interact with it through a closed-source proxy, again,
+without making the program and the proxy available to the public
+for free.
+
+  The Vitastor Network Public License is designed specifically to
+ensure that, in such cases, the modified program and the proxy stays
+available to the community. It requires the operator of a network server to
+provide the source code of the original program and all other programs
+communicating with it running there to the users of that server.
+Therefore, public use of a modified version, on a server accessible
+directly or indirectly to the public, gives the public access to the source
+code of the modified version.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                       TERMS AND CONDITIONS
+
+  0. Definitions.
+
+  "This License" refers to version 1 of the Vitastor Network Public License.
+
+  "Copyright" also means copyright-like laws that apply to other kinds of
+works, such as semiconductor masks.
+
+  "The Program" refers to any copyrightable work licensed under this
+License.  Each licensee is addressed as "you".  "Licensees" and
+"recipients" may be individuals or organizations.
+
+  To "modify" a work means to copy from or adapt all or part of the work
+in a fashion requiring copyright permission, other than the making of an
+exact copy.  The resulting work is called a "modified version" of the
+earlier work or a work "based on" the earlier work.
+
+  A "covered work" means either the unmodified Program or a work based
+on the Program.
+
+  To "propagate" a work means to do anything with it that, without
+permission, would make you directly or secondarily liable for
+infringement under applicable copyright law, except executing it on a
+computer or modifying a private copy.  Propagation includes copying,
+distribution (with or without modification), making available to the
+public, and in some countries other activities as well.
+
+  To "convey" a work means any kind of propagation that enables other
+parties to make or receive copies.  Mere interaction with a user through
+a computer network, with no transfer of a copy, is not conveying.
+
+  An interactive user interface displays "Appropriate Legal Notices"
+to the extent that it includes a convenient and prominently visible
+feature that (1) displays an appropriate copyright notice, and (2)
+tells the user that there is no warranty for the work (except to the
+extent that warranties are provided), that licensees may convey the
+work under this License, and how to view a copy of this License.  If
+the interface presents a list of user commands or options, such as a
+menu, a prominent item in the list meets this criterion.
+
+  1. Source Code.
+
+  The "source code" for a work means the preferred form of the work
+for making modifications to it.  "Object code" means any non-source
+form of a work.
+
+  A "Standard Interface" means an interface that either is an official
+standard defined by a recognized standards body, or, in the case of
+interfaces specified for a particular programming language, one that
+is widely used among developers working in that language.
+
+  The "System Libraries" of an executable work include anything, other
+than the work as a whole, that (a) is included in the normal form of
+packaging a Major Component, but which is not part of that Major
+Component, and (b) serves only to enable use of the work with that
+Major Component, or to implement a Standard Interface for which an
+implementation is available to the public in source code form.  A
+"Major Component", in this context, means a major essential component
+(kernel, window system, and so on) of the specific operating system
+(if any) on which the executable work runs, or a compiler used to
+produce the work, or an object code interpreter used to run it.
+
+  The "Corresponding Source" for a work in object code form means all
+the source code needed to generate, install, and (for an executable
+work) run the object code and to modify the work, including scripts to
+control those activities.  However, it does not include the work's
+System Libraries, or general-purpose tools or generally available free
+programs which are used unmodified in performing those activities but
+which are not part of the work.  For example, Corresponding Source
+includes interface definition files associated with source files for
+the work, and the source code for shared libraries and dynamically
+linked subprograms that the work is specifically designed to require,
+such as by intimate data communication or control flow between those
+subprograms and other parts of the work.
+
+  The Corresponding Source need not include anything that users
+can regenerate automatically from other parts of the Corresponding
+Source.
+
+  The Corresponding Source for a work in source code form is that
+same work.
+
+  2. Basic Permissions.
+
+  All rights granted under this License are granted for the term of
+copyright on the Program, and are irrevocable provided the stated
+conditions are met.  This License explicitly affirms your unlimited
+permission to run the unmodified Program.  The output from running a
+covered work is covered by this License only if the output, given its
+content, constitutes a covered work.  This License acknowledges your
+rights of fair use or other equivalent, as provided by copyright law.
+
+  You may make, run and propagate covered works that you do not
+convey, without conditions so long as your license otherwise remains
+in force.  You may convey covered works to others for the sole purpose
+of having them make modifications exclusively for you, or provide you
+with facilities for running those works, provided that you comply with
+the terms of this License in conveying all material for which you do
+not control copyright.  Those thus making or running the covered works
+for you must do so exclusively on your behalf, under your direction
+and control, on terms that prohibit them from making any copies of
+your copyrighted material outside their relationship with you.
+
+  Conveying under any other circumstances is permitted solely under
+the conditions stated below.  Sublicensing is not allowed; section 10
+makes it unnecessary.
+
+  3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+  No covered work shall be deemed part of an effective technological
+measure under any applicable law fulfilling obligations under article
+11 of the WIPO copyright treaty adopted on 20 December 1996, or
+similar laws prohibiting or restricting circumvention of such
+measures.
+
+  When you convey a covered work, you waive any legal power to forbid
+circumvention of technological measures to the extent such circumvention
+is effected by exercising rights under this License with respect to
+the covered work, and you disclaim any intention to limit operation or
+modification of the work as a means of enforcing, against the work's
+users, your or third parties' legal rights to forbid circumvention of
+technological measures.
+
+  4. Conveying Verbatim Copies.
+
+  You may convey verbatim copies of the Program's source code as you
+receive it, in any medium, provided that you conspicuously and
+appropriately publish on each copy an appropriate copyright notice;
+keep intact all notices stating that this License and any
+non-permissive terms added in accord with section 7 apply to the code;
+keep intact all notices of the absence of any warranty; and give all
+recipients a copy of this License along with the Program.
+
+  You may charge any price or no price for each copy that you convey,
+and you may offer support or warranty protection for a fee.
+
+  5. Conveying Modified Source Versions.
+
+  You may convey a work based on the Program, or the modifications to
+produce it from the Program, in the form of source code under the
+terms of section 4, provided that you also meet all of these conditions:
+
+    a) The work must carry prominent notices stating that you modified
+    it, and giving a relevant date.
+
+    b) The work must carry prominent notices stating that it is
+    released under this License and any conditions added under section
+    7.  This requirement modifies the requirement in section 4 to
+    "keep intact all notices".
+
+    c) You must license the entire work, as a whole, under this
+    License to anyone who comes into possession of a copy.  This
+    License will therefore apply, along with any applicable section 7
+    additional terms, to the whole of the work, and all its parts,
+    regardless of how they are packaged.  This License gives no
+    permission to license the work in any other way, but it does not
+    invalidate such permission if you have separately received it.
+
+    d) If the work has interactive user interfaces, each must display
+    Appropriate Legal Notices; however, if the Program has interactive
+    interfaces that do not display Appropriate Legal Notices, your
+    work need not make them do so.
+
+  A compilation of a covered work with other separate and independent
+works, which are not by their nature extensions of the covered work,
+and which are not combined with it such as to form a larger program,
+in or on a volume of a storage or distribution medium, is called an
+"aggregate" if the compilation and its resulting copyright are not
+used to limit the access or legal rights of the compilation's users
+beyond what the individual works permit.  Inclusion of a covered work
+in an aggregate does not cause this License to apply to the other
+parts of the aggregate.
+
+  6. Conveying Non-Source Forms.
+
+  You may convey a covered work in object code form under the terms
+of sections 4 and 5, provided that you also convey the
+machine-readable Corresponding Source under the terms of this License,
+in one of these ways:
+
+    a) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by the
+    Corresponding Source fixed on a durable physical medium
+    customarily used for software interchange.
+
+    b) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by a
+    written offer, valid for at least three years and valid for as
+    long as you offer spare parts or customer support for that product
+    model, to give anyone who possesses the object code either (1) a
+    copy of the Corresponding Source for all the software in the
+    product that is covered by this License, on a durable physical
+    medium customarily used for software interchange, for a price no
+    more than your reasonable cost of physically performing this
+    conveying of source, or (2) access to copy the
+    Corresponding Source from a network server at no charge.
+
+    c) Convey individual copies of the object code with a copy of the
+    written offer to provide the Corresponding Source.  This
+    alternative is allowed only occasionally and noncommercially, and
+    only if you received the object code with such an offer, in accord
+    with subsection 6b.
+
+    d) Convey the object code by offering access from a designated
+    place (gratis or for a charge), and offer equivalent access to the
+    Corresponding Source in the same way through the same place at no
+    further charge.  You need not require recipients to copy the
+    Corresponding Source along with the object code.  If the place to
+    copy the object code is a network server, the Corresponding Source
+    may be on a different server (operated by you or a third party)
+    that supports equivalent copying facilities, provided you maintain
+    clear directions next to the object code saying where to find the
+    Corresponding Source.  Regardless of what server hosts the
+    Corresponding Source, you remain obligated to ensure that it is
+    available for as long as needed to satisfy these requirements.
+
+    e) Convey the object code using peer-to-peer transmission, provided
+    you inform other peers where the object code and Corresponding
+    Source of the work are being offered to the general public at no
+    charge under subsection 6d.
+
+  A separable portion of the object code, whose source code is excluded
+from the Corresponding Source as a System Library, need not be
+included in conveying the object code work.
+
+  A "User Product" is either (1) a "consumer product", which means any
+tangible personal property which is normally used for personal, family,
+or household purposes, or (2) anything designed or sold for incorporation
+into a dwelling.  In determining whether a product is a consumer product,
+doubtful cases shall be resolved in favor of coverage.  For a particular
+product received by a particular user, "normally used" refers to a
+typical or common use of that class of product, regardless of the status
+of the particular user or of the way in which the particular user
+actually uses, or expects or is expected to use, the product.  A product
+is a consumer product regardless of whether the product has substantial
+commercial, industrial or non-consumer uses, unless such uses represent
+the only significant mode of use of the product.
+
+  "Installation Information" for a User Product means any methods,
+procedures, authorization keys, or other information required to install
+and execute modified versions of a covered work in that User Product from
+a modified version of its Corresponding Source.  The information must
+suffice to ensure that the continued functioning of the modified object
+code is in no case prevented or interfered with solely because
+modification has been made.
+
+  If you convey an object code work under this section in, or with, or
+specifically for use in, a User Product, and the conveying occurs as
+part of a transaction in which the right of possession and use of the
+User Product is transferred to the recipient in perpetuity or for a
+fixed term (regardless of how the transaction is characterized), the
+Corresponding Source conveyed under this section must be accompanied
+by the Installation Information.  But this requirement does not apply
+if neither you nor any third party retains the ability to install
+modified object code on the User Product (for example, the work has
+been installed in ROM).
+
+  The requirement to provide Installation Information does not include a
+requirement to continue to provide support service, warranty, or updates
+for a work that has been modified or installed by the recipient, or for
+the User Product in which it has been modified or installed.  Access to a
+network may be denied when the modification itself materially and
+adversely affects the operation of the network or violates the rules and
+protocols for communication across the network.
+
+  Corresponding Source conveyed, and Installation Information provided,
+in accord with this section must be in a format that is publicly
+documented (and with an implementation available to the public in
+source code form), and must require no special password or key for
+unpacking, reading or copying.
+
+  7. Additional Terms.
+
+  "Additional permissions" are terms that supplement the terms of this
+License by making exceptions from one or more of its conditions.
+Additional permissions that are applicable to the entire Program shall
+be treated as though they were included in this License, to the extent
+that they are valid under applicable law.  If additional permissions
+apply only to part of the Program, that part may be used separately
+under those permissions, but the entire Program remains governed by
+this License without regard to the additional permissions.
+
+  When you convey a copy of a covered work, you may at your option
+remove any additional permissions from that copy, or from any part of
+it.  (Additional permissions may be written to require their own
+removal in certain cases when you modify the work.)  You may place
+additional permissions on material, added by you to a covered work,
+for which you have or can give appropriate copyright permission.
+
+  Notwithstanding any other provision of this License, for material you
+add to a covered work, you may (if authorized by the copyright holders of
+that material) supplement the terms of this License with terms:
+
+    a) Disclaiming warranty or limiting liability differently from the
+    terms of sections 15 and 16 of this License; or
+
+    b) Requiring preservation of specified reasonable legal notices or
+    author attributions in that material or in the Appropriate Legal
+    Notices displayed by works containing it; or
+
+    c) Prohibiting misrepresentation of the origin of that material, or
+    requiring that modified versions of such material be marked in
+    reasonable ways as different from the original version; or
+
+    d) Limiting the use for publicity purposes of names of licensors or
+    authors of the material; or
+
+    e) Declining to grant rights under trademark law for use of some
+    trade names, trademarks, or service marks; or
+
+    f) Requiring indemnification of licensors and authors of that
+    material by anyone who conveys the material (or modified versions of
+    it) with contractual assumptions of liability to the recipient, for
+    any liability that these contractual assumptions directly impose on
+    those licensors and authors.
+
+  All other non-permissive additional terms are considered "further
+restrictions" within the meaning of section 10.  If the Program as you
+received it, or any part of it, contains a notice stating that it is
+governed by this License along with a term that is a further
+restriction, you may remove that term.  If a license document contains
+a further restriction but permits relicensing or conveying under this
+License, you may add to a covered work material governed by the terms
+of that license document, provided that the further restriction does
+not survive such relicensing or conveying.
+
+  If you add terms to a covered work in accord with this section, you
+must place, in the relevant source files, a statement of the
+additional terms that apply to those files, or a notice indicating
+where to find the applicable terms.
+
+  Additional terms, permissive or non-permissive, may be stated in the
+form of a separately written license, or stated as exceptions;
+the above requirements apply either way.
+
+  8. Termination.
+
+  You may not propagate or modify a covered work except as expressly
+provided under this License.  Any attempt otherwise to propagate or
+modify it is void, and will automatically terminate your rights under
+this License (including any patent licenses granted under the third
+paragraph of section 11).
+
+  However, if you cease all violation of this License, then your
+license from a particular copyright holder is reinstated (a)
+provisionally, unless and until the copyright holder explicitly and
+finally terminates your license, and (b) permanently, if the copyright
+holder fails to notify you of the violation by some reasonable means
+prior to 60 days after the cessation.
+
+  Moreover, your license from a particular copyright holder is
+reinstated permanently if the copyright holder notifies you of the
+violation by some reasonable means, this is the first time you have
+received notice of violation of this License (for any work) from that
+copyright holder, and you cure the violation prior to 30 days after
+your receipt of the notice.
+
+  Termination of your rights under this section does not terminate the
+licenses of parties who have received copies or rights from you under
+this License.  If your rights have been terminated and not permanently
+reinstated, you do not qualify to receive new licenses for the same
+material under section 10.
+
+  9. Acceptance Not Required for Having Copies.
+
+  You are not required to accept this License in order to receive or
+run a copy of the Program.  Ancillary propagation of a covered work
+occurring solely as a consequence of using peer-to-peer transmission
+to receive a copy likewise does not require acceptance.  However,
+nothing other than this License grants you permission to propagate or
+modify any covered work.  These actions infringe copyright if you do
+not accept this License.  Therefore, by modifying or propagating a
+covered work, you indicate your acceptance of this License to do so.
+
+  10. Automatic Licensing of Downstream Recipients.
+
+  Each time you convey a covered work, the recipient automatically
+receives a license from the original licensors, to run, modify and
+propagate that work, subject to this License.  You are not responsible
+for enforcing compliance by third parties with this License.
+
+  An "entity transaction" is a transaction transferring control of an
+organization, or substantially all assets of one, or subdividing an
+organization, or merging organizations.  If propagation of a covered
+work results from an entity transaction, each party to that
+transaction who receives a copy of the work also receives whatever
+licenses to the work the party's predecessor in interest had or could
+give under the previous paragraph, plus a right to possession of the
+Corresponding Source of the work from the predecessor in interest, if
+the predecessor has it or can get it with reasonable efforts.
+
+  You may not impose any further restrictions on the exercise of the
+rights granted or affirmed under this License.  For example, you may
+not impose a license fee, royalty, or other charge for exercise of
+rights granted under this License, and you may not initiate litigation
+(including a cross-claim or counterclaim in a lawsuit) alleging that
+any patent claim is infringed by making, using, selling, offering for
+sale, or importing the Program or any portion of it.
+
+  11. Patents.
+
+  A "contributor" is a copyright holder who authorizes use under this
+License of the Program or a work on which the Program is based.  The
+work thus licensed is called the contributor's "contributor version".
+
+  A contributor's "essential patent claims" are all patent claims
+owned or controlled by the contributor, whether already acquired or
+hereafter acquired, that would be infringed by some manner, permitted
+by this License, of making, using, or selling its contributor version,
+but do not include claims that would be infringed only as a
+consequence of further modification of the contributor version.  For
+purposes of this definition, "control" includes the right to grant
+patent sublicenses in a manner consistent with the requirements of
+this License.
+
+  Each contributor grants you a non-exclusive, worldwide, royalty-free
+patent license under the contributor's essential patent claims, to
+make, use, sell, offer for sale, import and otherwise run, modify and
+propagate the contents of its contributor version.
+
+  In the following three paragraphs, a "patent license" is any express
+agreement or commitment, however denominated, not to enforce a patent
+(such as an express permission to practice a patent or covenant not to
+sue for patent infringement).  To "grant" such a patent license to a
+party means to make such an agreement or commitment not to enforce a
+patent against the party.
+
+  If you convey a covered work, knowingly relying on a patent license,
+and the Corresponding Source of the work is not available for anyone
+to copy, free of charge and under the terms of this License, through a
+publicly available network server or other readily accessible means,
+then you must either (1) cause the Corresponding Source to be so
+available, or (2) arrange to deprive yourself of the benefit of the
+patent license for this particular work, or (3) arrange, in a manner
+consistent with the requirements of this License, to extend the patent
+license to downstream recipients.  "Knowingly relying" means you have
+actual knowledge that, but for the patent license, your conveying the
+covered work in a country, or your recipient's use of the covered work
+in a country, would infringe one or more identifiable patents in that
+country that you have reason to believe are valid.
+
+  If, pursuant to or in connection with a single transaction or
+arrangement, you convey, or propagate by procuring conveyance of, a
+covered work, and grant a patent license to some of the parties
+receiving the covered work authorizing them to use, propagate, modify
+or convey a specific copy of the covered work, then the patent license
+you grant is automatically extended to all recipients of the covered
+work and works based on it.
+
+  A patent license is "discriminatory" if it does not include within
+the scope of its coverage, prohibits the exercise of, or is
+conditioned on the non-exercise of one or more of the rights that are
+specifically granted under this License.  You may not convey a covered
+work if you are a party to an arrangement with a third party that is
+in the business of distributing software, under which you make payment
+to the third party based on the extent of your activity of conveying
+the work, and under which the third party grants, to any of the
+parties who would receive the covered work from you, a discriminatory
+patent license (a) in connection with copies of the covered work
+conveyed by you (or copies made from those copies), or (b) primarily
+for and in connection with specific products or compilations that
+contain the covered work, unless you entered into that arrangement,
+or that patent license was granted, prior to 28 March 2007.
+
+  Nothing in this License shall be construed as excluding or limiting
+any implied license or other defenses to infringement that may
+otherwise be available to you under applicable patent law.
+
+  12. No Surrender of Others' Freedom.
+
+  If conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot convey a
+covered work so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you may
+not convey it at all.  For example, if you agree to terms that obligate you
+to collect a royalty for further conveying from those to whom you convey
+the Program, the only way you could satisfy both those terms and this
+License would be to refrain entirely from conveying the Program.
+
+  13. Remote Network Interaction.
+
+  Notwithstanding any other provision of this License, if you provide
+any user an opportunity to interact with the covered work directly
+or indirectly through a computer network, an imitation of such network,
+or an additional program (hereinafter referred to as a "Proxy Program")
+that, in turn, interacts with the covered work through a computer network,
+an imitation of such network, or another Proxy Program itself,
+you must prominently offer that user an opportunity to receive the
+Corresponding Source of the covered work and all Proxy Programs from a
+network server at no charge, through some standard or customary means of
+facilitating copying of software. The Corresponding Source for the covered
+work must be made available under the conditions of this License, and
+the Corresponding Source for all Proxy Programs must be made available
+under the conditions of either this License or any GPL-Compatible
+Free Software License, as described by the Free Software Foundation
+in their "GPL-Compatible License List".
+
+  14. Revised Versions of this License.
+
+  Vitastor Author may publish revised and/or new versions of
+the Vitastor Network Public License from time to time.  Such new versions
+will be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+  Each version is given a distinguishing version number.  If the
+Program specifies that a certain numbered version of the Vitastor Network
+Public License "or any later version" applies to it, you have the
+option of following the terms and conditions either of that numbered
+version or of any later version. If the Program does not specify a version
+number of the Vitastor Network Public License, you may choose any version
+ever published.
+
+  Later license versions may give you additional or different
+permissions.  However, no additional obligations are imposed on any
+author or copyright holder as a result of your choosing to follow a
+later version.
+
+  15. Disclaimer of Warranty.
+
+  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+  16. Limitation of Liability.
+
+  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGES.
+
+  17. Interpretation of Sections 15 and 16.
+
+  If the disclaimer of warranty and limitation of liability provided
+above cannot be given local legal effect according to their terms,
+reviewing courts shall apply local law that most closely approximates
+an absolute waiver of all civil liability in connection with the
+Program, unless a warranty or assumption of liability accompanies a
+copy of the Program in return for a fee.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+state the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the Vitastor Network Public License as published by
+    the Vitastor Author, either version 1 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    Vitastor Network Public License for more details.
+
+Also add information on how to contact you by electronic and paper mail.
+
+  If your software can interact with users remotely through a computer
+network, you should also make sure that it provides a way for users to
+get its source.  For example, if your program is a web application, its
+interface could display a "Source" link that leads users to an archive
+of the code.  There are many ways you could offer source, and different
+solutions will be better for different programs; see section 13 for the
+specific requirements.
--- a/allocator.cpp
+++ b/allocator.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <stdexcept>
 #include "allocator.h"

--- a/allocator.h
+++ b/allocator.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #include <stdint.h>
--- a/base64.cpp
+++ b/base64.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "base64.h"

 std::string base64_encode(const std::string &in)
--- a/base64.h
+++ b/base64.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once
 #include <string>

--- a/blockstore.cpp
+++ b/blockstore.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
--- a/blockstore.h
+++ b/blockstore.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #ifndef _LARGEFILE64_SOURCE
@ -27,13 +30,14 @@
 #define BS_OP_MIN 1
 #define BS_OP_READ 1
 #define BS_OP_WRITE 2
-#define BS_OP_SYNC 3
-#define BS_OP_STABLE 4
-#define BS_OP_DELETE 5
-#define BS_OP_LIST 6
-#define BS_OP_ROLLBACK 7
-#define BS_OP_SYNC_STAB_ALL 8
-#define BS_OP_MAX 8
+#define BS_OP_WRITE_STABLE 3
+#define BS_OP_SYNC 4
+#define BS_OP_STABLE 5
+#define BS_OP_DELETE 6
+#define BS_OP_LIST 7
+#define BS_OP_ROLLBACK 8
+#define BS_OP_SYNC_STAB_ALL 9
+#define BS_OP_MAX 9

 #define BS_OP_PRIVATE_DATA_SIZE 256

@ -41,9 +45,9 @@

 Blockstore opcode documentation:

-## BS_OP_READ / BS_OP_WRITE
+## BS_OP_READ / BS_OP_WRITE / BS_OP_WRITE_STABLE

-Read or write object data.
+Read or write object data. BS_OP_WRITE_STABLE writes a version that is stable immediately, so it doesn't need a separate BS_OP_STABLE afterwards.

 Input:
 - oid = requested object
@ -113,6 +117,8 @@ Input:
 - oid.stripe = PG alignment
 - len = PG count or 0 to list all objects
 - offset = PG number
+- oid.inode = min inode number (inclusive)
+- version = max inode number (inclusive); if both are 0 or min > max, all inodes are listed

 Output:
 - retval = total obj_ver_id count
--- a/blockstore_flush.cpp
+++ b/blockstore_flush.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
@ -10,7 +13,7 @@ journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
    flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
    journal_trim_interval = flusher_start_threshold;
    journal_trim_counter = 0;
-    journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign(MEM_ALIGNMENT, bs->journal_block_size);
+    journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
    co = new journal_flusher_co[flusher_count];
    for (int i = 0; i < flusher_count; i++)
    {
@ -183,6 +186,23 @@ resume_0:
    dirty_end = bs->dirty_db.find(cur);
    if (dirty_end != bs->dirty_db.end())
    {
+        repeat_it = flusher->sync_to_repeat.find(cur.oid);
+        if (repeat_it != flusher->sync_to_repeat.end())
+        {
+#ifdef BLOCKSTORE_DEBUG
+            printf("Postpone %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
+#endif
+            // We don't flush different parts of history of the same object in parallel
+            // So we check if someone is already flushing this object
+            // In that case we set sync_to_repeat and pick another object
+            // Another coroutine will see it and re-queue the object after it finishes
+            if (repeat_it->second < cur.version)
+                repeat_it->second = cur.version;
+            wait_state = 0;
+            goto resume_0;
+        }
+        else
+            flusher->sync_to_repeat[cur.oid] = 0;
        if (dirty_end->second.journal_sector >= bs->journal.dirty_start &&
            (bs->journal.dirty_start >= bs->journal.used_start ||
            dirty_end->second.journal_sector < bs->journal.used_start))
@ -213,6 +233,7 @@ resume_0:
            if (!found)
            {
                // Try other objects
+                flusher->sync_to_repeat.erase(cur.oid);
                int search_left = flusher->flush_queue.size() - 1;
 #ifdef BLOCKSTORE_DEBUG
                printf("Flusher overran writers (dirty_start=%08lx) - searching for older flushes (%d left)\n", bs->journal.dirty_start, search_left);
@ -231,15 +252,20 @@ resume_0:
                            dirty_end->second.journal_sector < bs->journal.used_start))
                        {
 #ifdef BLOCKSTORE_DEBUG
-                            printf("Write %lu:%lu v%lu is too new: offset=%08lx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
+                            printf("Write %lx:%lx v%lu is too new: offset=%08lx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
 #endif
                            flusher->enqueue_flush(cur);
                        }
                        else
                        {
+                            repeat_it = flusher->sync_to_repeat.find(cur.oid);
+                            if (repeat_it == flusher->sync_to_repeat.end())
+                            {
+                                flusher->sync_to_repeat[cur.oid] = 0;
                                break;
                            }
                        }
+                    }
                    search_left--;
                }
                if (search_left <= 0)
@ -253,25 +279,8 @@ resume_0:
                }
            }
        }
-        repeat_it = flusher->sync_to_repeat.find(cur.oid);
-        if (repeat_it != flusher->sync_to_repeat.end())
-        {
 #ifdef BLOCKSTORE_DEBUG
-            printf("Postpone %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
-#endif
-            // We don't flush different parts of history of the same object in parallel
-            // So we check if someone is already flushing this object
-            // In that case we set sync_to_repeat and pick another object
-            // Another coroutine will see it and re-queue the object after it finishes
-            if (repeat_it->second < cur.version)
-                repeat_it->second = cur.version;
-            wait_state = 0;
-            goto resume_0;
-        }
-        else
-            flusher->sync_to_repeat[cur.oid] = 0;
-#ifdef BLOCKSTORE_DEBUG
-        printf("Flushing %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
+        printf("Flushing %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
 #endif
        flusher->active_flushers++;
 resume_1:
@ -299,7 +308,7 @@ resume_1:
                // Object not allocated. This is a bug.
                char err[1024];
                snprintf(
-                    err, 1024, "BUG: Object %lu:%lu v%lu that we are trying to flush is not allocated on the data device",
+                    err, 1024, "BUG: Object %lx:%lx v%lu that we are trying to flush is not allocated on the data device",
                    cur.oid.inode, cur.oid.stripe, cur.version
                );
                throw std::runtime_error(err);
@ -497,7 +506,8 @@ resume_1:
        }
        // All done
 #ifdef BLOCKSTORE_DEBUG
-        printf("Flushed %lu:%lu v%lu (%ld left)\n", cur.oid.inode, cur.oid.stripe, cur.version, flusher->flush_queue.size());
+        printf("Flushed %lx:%lx v%lu (%d copies, wr:%d, del:%d), %zu left\n", cur.oid.inode, cur.oid.stripe, cur.version,
+            copy_count, has_writes, has_delete, flusher->flush_queue.size());
 #endif
        flusher->active_flushers--;
        repeat_it = flusher->sync_to_repeat.find(cur.oid);
@ -530,7 +540,16 @@ bool journal_flusher_co::scan_dirty(int wait_base)
    clean_init_bitmap = false;
    while (1)
    {
-        if (dirty_it->second.state == ST_J_STABLE && !skip_copy)
+        if (!IS_STABLE(dirty_it->second.state))
+        {
+            char err[1024];
+            snprintf(
+                err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu state during flush: %d",
+                dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
+            );
+            throw std::runtime_error(err);
+        }
+        else if (IS_JOURNAL(dirty_it->second.state) && !skip_copy)
        {
            // First we submit all reads
            has_writes = true;
@ -548,18 +567,18 @@ bool journal_flusher_co::scan_dirty(int wait_base)
                    {
                        submit_offset = dirty_it->second.location + offset - dirty_it->second.offset;
                        submit_len = it == v.end() || it->offset >= end_offset ? end_offset-offset : it->offset-offset;
-                        it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len, .buf = memalign(MEM_ALIGNMENT, submit_len) });
+                        it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len, .buf = memalign_or_die(MEM_ALIGNMENT, submit_len) });
                        copy_count++;
                        if (bs->journal.inmemory)
                        {
                            // Take it from memory
-                            memcpy(v.back().buf, bs->journal.buffer + submit_offset, submit_len);
+                            memcpy(it->buf, bs->journal.buffer + submit_offset, submit_len);
                        }
                        else
                        {
                            // Read it from disk
                            await_sqe(0);
-                            data->iov = (struct iovec){ v.back().buf, (size_t)submit_len };
+                            data->iov = (struct iovec){ it->buf, (size_t)submit_len };
                            data->callback = simple_callback_r;
                            my_uring_prep_readv(
                                sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset + submit_offset
@ -573,7 +592,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
                }
            }
        }
-        else if (dirty_it->second.state == ST_D_STABLE && !skip_copy)
+        else if (IS_BIG_WRITE(dirty_it->second.state) && !skip_copy)
        {
            // There is an unflushed big write. Copy small writes in its position
            has_writes = true;
@ -583,21 +602,12 @@ bool journal_flusher_co::scan_dirty(int wait_base)
            clean_bitmap_len = dirty_it->second.len;
            skip_copy = true;
        }
-        else if (dirty_it->second.state == ST_DEL_STABLE && !skip_copy)
+        else if (IS_DELETE(dirty_it->second.state) && !skip_copy)
        {
            // There is an unflushed delete
            has_delete = true;
            skip_copy = true;
        }
-        else if (!IS_STABLE(dirty_it->second.state))
-        {
-            char err[1024];
-            snprintf(
-                err, 1024, "BUG: Unexpected dirty_entry %lu:%lu v%lu state during flush: %d",
-                dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
-            );
-            throw std::runtime_error(err);
-        }
        dirty_start = dirty_it;
        if (dirty_it == bs->dirty_db.begin())
        {
@ -633,7 +643,7 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
    if (wr.it == flusher->meta_sectors.end())
    {
        // Not in memory yet, read it
-        wr.buf = memalign(MEM_ALIGNMENT, bs->meta_block_size);
+        wr.buf = memalign_or_die(MEM_ALIGNMENT, bs->meta_block_size);
        wr.it = flusher->meta_sectors.emplace(wr.sector, (meta_sector_t){
            .offset = wr.sector,
            .len = bs->meta_block_size,
@ -663,7 +673,7 @@ void journal_flusher_co::update_clean_db()
    if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc)
    {
 #ifdef BLOCKSTORE_DEBUG
-        printf("Free block %lu\n", old_clean_loc >> bs->block_order);
+        printf("Free block %lu (new location is %lu)\n", old_clean_loc >> bs->block_order, clean_loc >> bs->block_order);
 #endif
        bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
    }
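The reordering in the `resume_0` hunks above moves the `sync_to_repeat` check ahead of the journal-position checks, and erases the claim again when the flusher decides to try other objects. A minimal sketch of that claim/postpone bookkeeping follows, assuming a plain `std::map` and a hypothetical `flusher_sketch_t` type. The `release()` step is an assumption based on the comment "Another coroutine will see it and re-queue the object after it finishes"; the real completion code is outside this diff.

```cpp
#include <cstdint>
#include <map>
#include <tuple>

struct object_id
{
    uint64_t inode, stripe;
};

inline bool operator<(const object_id &a, const object_id &b)
{
    return std::tie(a.inode, a.stripe) < std::tie(b.inode, b.stripe);
}

// Hypothetical stand-in for the flusher state, not the real journal_flusher_t
struct flusher_sketch_t
{
    // oid -> highest version postponed while someone else flushes the object (0 = none)
    std::map<object_id, uint64_t> sync_to_repeat;

    // Returns true if the caller may flush <oid> now.
    // Returns false if another flusher holds the object; the version is remembered for later.
    bool try_claim(const object_id &oid, uint64_t version)
    {
        auto it = sync_to_repeat.find(oid);
        if (it != sync_to_repeat.end())
        {
            if (it->second < version)
                it->second = version;
            return false;
        }
        sync_to_repeat[oid] = 0;
        return true;
    }

    // Assumed completion step: drop the claim and return the newest postponed
    // version that should be re-queued, or 0 if nothing was postponed
    uint64_t release(const object_id &oid)
    {
        uint64_t repeat = 0;
        auto it = sync_to_repeat.find(oid);
        if (it != sync_to_repeat.end())
        {
            repeat = it->second;
            sync_to_repeat.erase(it);
        }
        return repeat;
    }
};
```

In this sketch a flusher that gives up on an object without flushing it would also call `release()`, which mirrors the added `sync_to_repeat.erase(cur.oid)` line.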
--- a/blockstore_flush.h
+++ b/blockstore_flush.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 struct copy_buffer_t
 {
    uint64_t offset, len;
--- a/blockstore_impl.cpp
+++ b/blockstore_impl.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
@ -7,7 +10,7 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
    ring_consumer.loop = [this]() { loop(); };
    ringloop->register_consumer(&ring_consumer);
    initialized = 0;
-    zero_object = (uint8_t*)memalign(MEM_ALIGNMENT, block_size);
+    zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
    data_fd = meta_fd = journal.fd = -1;
    parse_config(config);
    try
@ -130,7 +133,7 @@ void blockstore_impl_t::loop()
                }
                else if (PRIV(op)->wait_for)
                {
-                    if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_DELETE)
+                    if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE || op->opcode == BS_OP_DELETE)
                    {
                        has_writes = 2;
                    }
@ -144,7 +147,7 @@ void blockstore_impl_t::loop()
            {
                dequeue_op = dequeue_read(op);
            }
-            else if (op->opcode == BS_OP_WRITE)
+            else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE)
            {
                if (has_writes == 2)
                {
@ -329,13 +332,13 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
 void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
 {
    if (op->opcode < BS_OP_MIN || op->opcode > BS_OP_MAX ||
-        ((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE) && (
+        ((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE) && (
            op->offset >= block_size ||
            op->len > block_size-op->offset ||
            (op->len % disk_alignment)
        )) ||
        readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST ||
-        first && op->opcode == BS_OP_WRITE)
+        first && (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE))
    {
        // Basic verification not passed
        op->retval = -EINVAL;
@ -380,7 +383,7 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
            }
        };
    }
-    if ((op->opcode == BS_OP_WRITE || op->opcode == BS_OP_DELETE) && !enqueue_write(op))
+    if ((op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE || op->opcode == BS_OP_DELETE) && !enqueue_write(op))
    {
        std::function<void (blockstore_op_t*)>(op->callback)(op);
        return;
@ -431,10 +434,12 @@ static bool replace_stable(object_id oid, uint64_t version, int search_start, in

 void blockstore_impl_t::process_list(blockstore_op_t *op)
 {
-    // Check PG
    uint32_t list_pg = op->offset;
    uint32_t pg_count = op->len;
    uint64_t pg_stripe_size = op->oid.stripe;
+    uint64_t min_inode = op->oid.inode;
+    uint64_t max_inode = op->version;
+    // Check PG
    if (pg_count != 0 && (pg_stripe_size < MIN_BLOCK_SIZE || list_pg >= pg_count))
    {
        op->retval = -EINVAL;
@ -450,9 +455,22 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
        FINISH_OP(op);
        return;
    }
-    for (auto it = clean_db.begin(); it != clean_db.end(); it++)
    {
-        if (!pg_count || ((it->first.inode + it->first.stripe / pg_stripe_size) % pg_count) == list_pg)
+        auto clean_it = clean_db.begin(), clean_end = clean_db.end();
+        if ((min_inode != 0 || max_inode != 0) && min_inode <= max_inode)
+        {
+            clean_it = clean_db.lower_bound({
+                .inode = min_inode,
+                .stripe = 0,
+            });
+            clean_end = clean_db.upper_bound({
+                .inode = max_inode,
+                .stripe = UINT64_MAX,
+            });
+        }
+        for (; clean_it != clean_end; clean_it++)
+        {
+            if (!pg_count || ((clean_it->first.inode + clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg)
            {
                if (stable_count >= stable_alloc)
                {
@ -466,36 +484,56 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
                    }
                }
                stable[stable_count++] = {
-                .oid = it->first,
-                .version = it->second.version,
+                    .oid = clean_it->first,
+                    .version = clean_it->second.version,
                };
            }
        }
+    }
    int clean_stable_count = stable_count;
    // Copy dirty_db entries (sorted, too)
    int unstable_count = 0, unstable_alloc = 0;
    obj_ver_id *unstable = NULL;
-    for (auto it = dirty_db.begin(); it != dirty_db.end(); it++)
    {
-        if (!pg_count || ((it->first.oid.inode + it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg)
+        auto dirty_it = dirty_db.begin(), dirty_end = dirty_db.end();
+        if ((min_inode != 0 || max_inode != 0) && min_inode <= max_inode)
        {
-            if (IS_DELETE(it->second.state))
+            dirty_it = dirty_db.lower_bound({
+                .oid = {
+                    .inode = min_inode,
+                    .stripe = 0,
+                },
+                .version = 0,
+            });
+            dirty_end = dirty_db.upper_bound({
+                .oid = {
+                    .inode = max_inode,
+                    .stripe = UINT64_MAX,
+                },
+                .version = UINT64_MAX,
+            });
+        }
+        for (; dirty_it != dirty_end; dirty_it++)
+        {
+            if (!pg_count || ((dirty_it->first.oid.inode + dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg)
+            {
+                if (IS_DELETE(dirty_it->second.state))
                {
                    // Deletions are always stable, so try to zero out two possible entries
-                if (!replace_stable(it->first.oid, 0, 0, clean_stable_count, stable))
+                    if (!replace_stable(dirty_it->first.oid, 0, 0, clean_stable_count, stable))
                    {
-                    replace_stable(it->first.oid, 0, clean_stable_count, stable_count, stable);
+                        replace_stable(dirty_it->first.oid, 0, clean_stable_count, stable_count, stable);
                    }
                }
-            else if (IS_STABLE(it->second.state))
+                else if (IS_STABLE(dirty_it->second.state))
                {
                    // First try to replace a clean stable version in the first part of the list
-                if (!replace_stable(it->first.oid, it->first.version, 0, clean_stable_count, stable))
+                    if (!replace_stable(dirty_it->first.oid, dirty_it->first.version, 0, clean_stable_count, stable))
                    {
                        // Then try to replace the last dirty stable version in the second part of the list
-                    if (stable[stable_count-1].oid == it->first.oid)
+                        if (stable[stable_count-1].oid == dirty_it->first.oid)
                        {
-                        stable[stable_count-1].version = it->first.version;
+                            stable[stable_count-1].version = dirty_it->first.version;
                        }
                        else
                        {
@ -512,7 +550,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
                                    return;
                                }
                            }
-                        stable[stable_count++] = it->first;
+                            stable[stable_count++] = dirty_it->first;
                        }
                    }
                }
@ -531,7 +569,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
                            return;
                        }
                    }
-                unstable[unstable_count++] = it->first;
+                    unstable[unstable_count++] = dirty_it->first;
+                }
            }
        }
    }
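The `process_list` change above narrows the scan of `clean_db` and `dirty_db` to an inode range using `lower_bound` / `upper_bound` on the ordered key. A small self-contained sketch of the same range selection follows, assuming `std::map` instead of the project's `btree_map` and a hypothetical `list_range()` helper that only collects keys.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct object_id
{
    uint64_t inode, stripe;
};

inline bool operator<(const object_id &a, const object_id &b)
{
    return std::tie(a.inode, a.stripe) < std::tie(b.inode, b.stripe);
}

// Hypothetical stand-in for clean_db: key -> version
typedef std::map<object_id, uint64_t> clean_db_t;

// Collect object IDs with min_inode <= inode <= max_inode.
// As in the diff, the range is only applied when at least one bound is
// non-zero and min_inode <= max_inode; otherwise everything is listed.
std::vector<object_id> list_range(const clean_db_t &db, uint64_t min_inode, uint64_t max_inode)
{
    auto it = db.begin(), end = db.end();
    if ((min_inode != 0 || max_inode != 0) && min_inode <= max_inode)
    {
        // First key of min_inode and one past the last key of max_inode
        it = db.lower_bound(object_id{ min_inode, 0 });
        end = db.upper_bound(object_id{ max_inode, UINT64_MAX });
    }
    std::vector<object_id> result;
    for (; it != end; it++)
        result.push_back(it->first);
    return result;
}
```

Because the keys are sorted by inode first, the filtered scan touches only the entries in the range instead of walking the whole map.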
--- a/blockstore_impl.h
+++ b/blockstore_impl.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #include "blockstore.h"
@ -7,7 +10,6 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <unistd.h>
-#include <malloc.h>
 #include <linux/fs.h>

 #include <vector>
@ -17,45 +19,38 @@

 #include "cpp-btree/btree_map.h"

+#include "malloc_or_die.h"
 #include "allocator.h"

 //#define BLOCKSTORE_DEBUG

 // States are not stored on disk. Instead, they're deduced from the journal
-// FIXME: Rename to BS_ST_*

-#define ST_J_WAIT_BIG 1
-#define ST_J_IN_FLIGHT 2
-#define ST_J_SUBMITTED 3
-#define ST_J_WRITTEN 4
-#define ST_J_SYNCED 5
-#define ST_J_STABLE 6
+#define BS_ST_SMALL_WRITE 0x01
+#define BS_ST_BIG_WRITE 0x02
+#define BS_ST_DELETE 0x03

-#define ST_D_IN_FLIGHT 15
-#define ST_D_SUBMITTED 16
-#define ST_D_WRITTEN 17
-#define ST_D_SYNCED 20
-#define ST_D_STABLE 21
+#define BS_ST_WAIT_BIG 0x10
+#define BS_ST_IN_FLIGHT 0x20
+#define BS_ST_SUBMITTED 0x30
+#define BS_ST_WRITTEN 0x40
+#define BS_ST_SYNCED 0x50
+#define BS_ST_STABLE 0x60

-#define ST_DEL_IN_FLIGHT 31
-#define ST_DEL_SUBMITTED 32
-#define ST_DEL_WRITTEN 33
-#define ST_DEL_SYNCED 34
-#define ST_DEL_STABLE 35
-
-#define ST_CURRENT 48
+#define BS_ST_INSTANT 0x100

 #define IMMEDIATE_NONE 0
 #define IMMEDIATE_SMALL 1
 #define IMMEDIATE_ALL 2

-#define IS_IN_FLIGHT(st) (st == ST_J_WAIT_BIG || st == ST_J_IN_FLIGHT || st == ST_D_IN_FLIGHT || st == ST_DEL_IN_FLIGHT || st == ST_J_SUBMITTED || st == ST_D_SUBMITTED || st == ST_DEL_SUBMITTED)
-#define IS_STABLE(st) (st == ST_J_STABLE || st == ST_D_STABLE || st == ST_DEL_STABLE || st == ST_CURRENT)
-#define IS_SYNCED(st) (IS_STABLE(st) || st == ST_J_SYNCED || st == ST_D_SYNCED || st == ST_DEL_SYNCED)
-#define IS_JOURNAL(st) (st >= ST_J_WAIT_BIG && st <= ST_J_STABLE)
-#define IS_BIG_WRITE(st) (st >= ST_D_IN_FLIGHT && st <= ST_D_STABLE)
-#define IS_DELETE(st) (st >= ST_DEL_IN_FLIGHT && st <= ST_DEL_STABLE)
-#define IS_UNSYNCED(st) (st >= ST_J_WAIT_BIG && st <= ST_J_WRITTEN || st >= ST_D_IN_FLIGHT && st <= ST_D_WRITTEN|| st >= ST_DEL_IN_FLIGHT && st <= ST_DEL_WRITTEN)
+#define BS_ST_TYPE_MASK 0x0F
+#define BS_ST_WORKFLOW_MASK 0xF0
+#define IS_IN_FLIGHT(st) (((st) & BS_ST_WORKFLOW_MASK) <= BS_ST_SUBMITTED)
+#define IS_STABLE(st) (((st) & BS_ST_WORKFLOW_MASK) == BS_ST_STABLE)
+#define IS_SYNCED(st) (((st) & BS_ST_WORKFLOW_MASK) >= BS_ST_SYNCED)
+#define IS_JOURNAL(st) (((st) & BS_ST_TYPE_MASK) == BS_ST_SMALL_WRITE)
+#define IS_BIG_WRITE(st) (((st) & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
+#define IS_DELETE(st) (((st) & BS_ST_TYPE_MASK) == BS_ST_DELETE)

 #define BS_SUBMIT_GET_SQE(sqe, data) \
    BS_SUBMIT_GET_ONLY_SQE(sqe); \
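The new state encoding in this header splits a state into a type nibble (`BS_ST_TYPE_MASK`) and a workflow nibble (`BS_ST_WORKFLOW_MASK`), replacing the old flat `ST_J_*` / `ST_D_*` / `ST_DEL_*` enumeration. The sketch below copies the constants from the hunk and shows how a composed state is tested. The `with_workflow()` helper is hypothetical and not part of the source, and keeping `BS_ST_INSTANT` as an extra bit on the state is an assumption, since its use is not visible in this part of the diff.

```cpp
#include <cstdint>

// Constants copied from the hunk above
#define BS_ST_SMALL_WRITE 0x01
#define BS_ST_BIG_WRITE 0x02
#define BS_ST_DELETE 0x03

#define BS_ST_WAIT_BIG 0x10
#define BS_ST_IN_FLIGHT 0x20
#define BS_ST_SUBMITTED 0x30
#define BS_ST_WRITTEN 0x40
#define BS_ST_SYNCED 0x50
#define BS_ST_STABLE 0x60

#define BS_ST_INSTANT 0x100

#define BS_ST_TYPE_MASK 0x0F
#define BS_ST_WORKFLOW_MASK 0xF0
#define IS_IN_FLIGHT(st) (((st) & 0xF0) <= BS_ST_SUBMITTED)
#define IS_STABLE(st) (((st) & 0xF0) == BS_ST_STABLE)
#define IS_SYNCED(st) (((st) & 0xF0) >= BS_ST_SYNCED)
#define IS_JOURNAL(st) (((st) & 0x0F) == BS_ST_SMALL_WRITE)
#define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE)
#define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)

// Hypothetical helper: move a state to another workflow stage
// while keeping its type bits and any higher flag bits untouched
static inline uint32_t with_workflow(uint32_t st, uint32_t workflow)
{
    return (st & ~static_cast<uint32_t>(BS_ST_WORKFLOW_MASK)) | workflow;
}
```

For example, `BS_ST_BIG_WRITE | BS_ST_SYNCED` (the state assigned during journal replay in `blockstore_init.cpp`) satisfies `IS_BIG_WRITE` and `IS_SYNCED` but not `IS_STABLE`; after `with_workflow(st, BS_ST_STABLE)` it satisfies `IS_STABLE` as well.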
--- a/blockstore_init.cpp
+++ b/blockstore_init.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
@ -108,13 +111,13 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
                {
                    // free the previous block
 #ifdef BLOCKSTORE_DEBUG
-                    printf("Free block %lu\n", clean_it->second.location >> bs->block_order);
+                    printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i);
 #endif
                    bs->data_alloc->set(clean_it->second.location >> block_order, false);
                }
                entries_loaded++;
 #ifdef BLOCKSTORE_DEBUG
-                printf("Allocate block (clean entry) %lu: %lu:%lu v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
+                printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
 #endif
                bs->data_alloc->set(done_cnt+i, true);
                bs->clean_db[entry->oid] = (struct clean_entry){
@ -125,7 +128,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
            else
            {
 #ifdef BLOCKSTORE_DEBUG
-                printf("Old clean entry %lu: %lu:%lu v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
+                printf("Old clean entry %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
 #endif
            }
        }
@ -202,11 +205,7 @@ int blockstore_init_journal::loop()
        goto resume_7;
    printf("Reading blockstore journal\n");
    if (!bs->journal.inmemory)
-    {
-        submitted_buf = memalign(MEM_ALIGNMENT, 2*bs->journal.block_size);
-        if (!submitted_buf)
-            throw std::bad_alloc();
-    }
+        submitted_buf = memalign_or_die(MEM_ALIGNMENT, 2*bs->journal.block_size);
    else
        submitted_buf = bs->journal.buffer;
    // Read first block of the journal
@ -317,7 +316,7 @@ resume_1:
                if (journal_pos < bs->journal.used_start)
                    end = bs->journal.used_start;
                if (!bs->journal.inmemory)
-                    submitted_buf = memalign(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE);
+                    submitted_buf = memalign_or_die(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE);
                else
                    submitted_buf = bs->journal.buffer + journal_pos;
                data->iov = {
@ -454,10 +453,15 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    break;
                }
            }
-            if (je->type == JE_SMALL_WRITE)
+            if (je->type == JE_SMALL_WRITE || je->type == JE_SMALL_WRITE_INSTANT)
            {
 #ifdef BLOCKSTORE_DEBUG
-                printf("je_small_write oid=%lu:%lu ver=%lu offset=%u len=%u\n", je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version, je->small_write.offset, je->small_write.len);
+                printf(
+                    "je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u\n",
+                    je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
+                    je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
+                    je->small_write.offset, je->small_write.len
+                );
 #endif
                // oid, version, offset, len
                uint64_t prev_free = next_free;
@ -528,7 +532,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .version = je->small_write.version,
                    };
                    bs->dirty_db.emplace(ov, (dirty_entry){
-                        .state = ST_J_SYNCED,
+                        .state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
                        .flags = 0,
                        .location = location,
                        .offset = je->small_write.offset,
@ -538,18 +542,26 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    bs->journal.used_sectors[proc_pos]++;
 #ifdef BLOCKSTORE_DEBUG
                    printf(
-                        "journal offset %08lx is used by %lu:%lu v%lu (%lu refs)\n",
+                        "journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
                        proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
                    );
 #endif
                    auto & unstab = bs->unstable_writes[ov.oid];
                    unstab = unstab < ov.version ? ov.version : unstab;
+                    if (je->type == JE_SMALL_WRITE_INSTANT)
+                    {
+                        bs->mark_stable(ov);
                    }
                }
-            else if (je->type == JE_BIG_WRITE)
+            }
+            else if (je->type == JE_BIG_WRITE || je->type == JE_BIG_WRITE_INSTANT)
            {
 #ifdef BLOCKSTORE_DEBUG
-                printf("je_big_write oid=%lu:%lu ver=%lu loc=%lu\n", je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location);
+                printf(
+                    "je_big_write%s oid=%lx:%lx ver=%lu loc=%lu\n",
+                    je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
+                    je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location
+                );
 #endif
                auto clean_it = bs->clean_db.find(je->big_write.oid);
                if (clean_it == bs->clean_db.end() ||
@ -561,7 +573,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .version = je->big_write.version,
                    };
                    bs->dirty_db.emplace(ov, (dirty_entry){
-                        .state = ST_D_SYNCED,
+                        .state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
                        .flags = 0,
                        .location = je->big_write.location,
                        .offset = je->big_write.offset,
@ -575,12 +587,16 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    bs->journal.used_sectors[proc_pos]++;
                    auto & unstab = bs->unstable_writes[ov.oid];
                    unstab = unstab < ov.version ? ov.version : unstab;
+                    if (je->type == JE_BIG_WRITE_INSTANT)
+                    {
+                        bs->mark_stable(ov);
+                    }
                }
            }
            else if (je->type == JE_STABLE)
            {
 #ifdef BLOCKSTORE_DEBUG
-                printf("je_stable oid=%lu:%lu ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
+                printf("je_stable oid=%lx:%lx ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
 #endif
                // oid, version
                obj_ver_id ov = {
@ -592,7 +608,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
            else if (je->type == JE_ROLLBACK)
            {
 #ifdef BLOCKSTORE_DEBUG
-                printf("je_rollback oid=%lu:%lu ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
+                printf("je_rollback oid=%lx:%lx ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
 #endif
                // rollback dirty writes of <oid> up to <version>
                obj_ver_id ov = {
@ -604,7 +620,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
            else if (je->type == JE_DELETE)
            {
 #ifdef BLOCKSTORE_DEBUG
-                printf("je_delete oid=%lu:%lu ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
+                printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
 #endif
                auto clean_it = bs->clean_db.find(je->del.oid);
                if (clean_it == bs->clean_db.end() ||
@ -616,7 +632,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .version = je->del.version,
                    };
                    bs->dirty_db.emplace(ov, (dirty_entry){
-                        .state = ST_DEL_SYNCED,
+                        .state = (BS_ST_DELETE | BS_ST_SYNCED),
                        .flags = 0,
                        .location = 0,
                        .offset = 0,
--- a/blockstore_init.h
+++ b/blockstore_init.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 class blockstore_init_meta
--- a/blockstore_journal.cpp
+++ b/blockstore_journal.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs)
@ -17,7 +20,9 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
    int required = entries_required;
    while (1)
    {
-        int fits = (bs->journal.block_size - next_in_pos) / size;
+        int fits = bs->journal.no_same_sector_overwrites && bs->journal.sector_info[next_sector].written
+            ? 0
+            : (bs->journal.block_size - next_in_pos) / size;
        if (fits > 0)
        {
            if (first_sector == -1)
@ -110,10 +115,12 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries

 journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size)
 {
-    if (journal.block_size - journal.in_sector_pos < size)
+    if (journal.block_size - journal.in_sector_pos < size ||
+        journal.no_same_sector_overwrites && journal.sector_info[journal.cur_sector].written)
    {
        assert(!journal.sector_info[journal.cur_sector].dirty);
        // Move to the next journal sector
+        journal.sector_info[journal.cur_sector].written = false;
        if (journal.sector_info[journal.cur_sector].usage_count > 0)
        {
            // Also select next sector buffer in memory
@ -148,6 +155,7 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
 void prepare_journal_sector_write(journal_t & journal, int cur_sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb)
 {
    journal.sector_info[cur_sector].dirty = false;
+    journal.sector_info[cur_sector].written = true;
    journal.sector_info[cur_sector].usage_count++;
    ring_data_t *data = ((ring_data_t*)sqe->user_data);
    data->iov = (struct iovec){
--- a/blockstore_journal.h
+++ b/blockstore_journal.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #include "crc32c.h"
@ -19,7 +22,9 @@
 #define JE_STABLE      0x04
 #define JE_DELETE      0x05
 #define JE_ROLLBACK    0x06
-#define JE_MAX         0x06
+#define JE_SMALL_WRITE_INSTANT 0x07
+#define JE_BIG_WRITE_INSTANT   0x08
+#define JE_MAX         0x08

 // crc32c comes first to ease calculation and is equal to crc32()
 struct __attribute__((__packed__)) journal_entry_start
@ -127,6 +132,7 @@ struct journal_sector_info_t
 {
    uint64_t offset;
    uint64_t usage_count;
+    bool written;
    bool dirty;
 };

@ -151,6 +157,7 @@ struct journal_t
    void *sector_buf = NULL;
    journal_sector_info_t *sector_info = NULL;
    uint64_t sector_count;
+    bool no_same_sector_overwrites = false;
    int cur_sector = 0;
    int in_sector_pos = 0;

--- a/blockstore_open.cpp
+++ b/blockstore_open.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <sys/file.h>
 #include "blockstore_impl.h"

@ -59,6 +62,8 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
    journal_device = config["journal_device"];
    journal.offset = strtoull(config["journal_offset"].c_str(), NULL, 10);
    journal.sector_count = strtoull(config["journal_sector_buffer_count"].c_str(), NULL, 10);
+    journal.no_same_sector_overwrites = config["journal_no_same_sector_overwrites"] == "true" ||
+        config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes";
    journal.inmemory = config["inmemory_journal"] != "false";
    disk_alignment = strtoull(config["disk_alignment"].c_str(), NULL, 10);
    journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
--- a/blockstore_read.cpp
+++ b/blockstore_read.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_t offset, uint64_t len,
@ -37,6 +40,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
    return 1;
 }

+// FIXME: I've seen a bug here, so this needs tests
 int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
    uint32_t item_state, uint64_t item_version, uint64_t item_location)
 {
@ -49,8 +53,20 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
        while (1)
        {
            for (; it != PRIV(read_op)->read_vec.end(); it++)
+            {
                if (it->offset >= cur_start)
+                {
                    break;
+                }
+                else if (it->offset + it->len > cur_start)
+                {
+                    cur_start = it->offset + it->len;
+                    if (cur_start >= item_end)
+                    {
+                        goto endwhile;
+                    }
+                }
+            }
            if (it == PRIV(read_op)->read_vec.end() || it->offset > cur_start)
            {
                fulfill_read_t el = {
@ -69,9 +85,12 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
            }
            cur_start = it->offset + it->len;
            if (it == PRIV(read_op)->read_vec.end() || cur_start >= item_end)
+            {
                break;
            }
        }
+    }
+endwhile:
    return 1;
 }

@ -141,7 +160,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
        {
            if (!clean_entry_bitmap_size)
            {
-                if (!fulfill_read(read_op, fulfilled, 0, block_size, ST_CURRENT, 0, clean_it->second.location))
+                if (!fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0, clean_it->second.location))
                {
                    // need to wait. undo added requests, don't dequeue op
                    PRIV(read_op)->read_vec.clear();
@ -173,7 +192,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
                    {
                        // fill with zeroes
                        fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
-                            bmp_end * bitmap_granularity, ST_DEL_STABLE, 0, 0);
+                            bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
                    }
                    bmp_start = bmp_end;
                    while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
@ -183,7 +202,8 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
                    if (bmp_end > bmp_start)
                    {
                        if (!fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
-                            bmp_end * bitmap_granularity, ST_CURRENT, 0, clean_it->second.location + bmp_start * bitmap_granularity))
+                            bmp_end * bitmap_granularity, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0,
+                            clean_it->second.location + bmp_start * bitmap_granularity))
                        {
                            // need to wait. undo added requests, don't dequeue op
                            PRIV(read_op)->read_vec.clear();
@ -198,7 +218,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
    else if (fulfilled < read_op->len)
    {
        // fill remaining parts with zeroes
-        fulfill_read(read_op, fulfilled, 0, block_size, ST_DEL_STABLE, 0, 0);
+        fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
    }
    assert(fulfilled == read_op->len);
    read_op->version = result_version;
--- a/blockstore_rollback.cpp
+++ b/blockstore_rollback.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
@ -230,7 +233,7 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
        int used = --journal.used_sectors[dirty_it->second.journal_sector];
 #ifdef BLOCKSTORE_DEBUG
        printf(
-            "remove usage of journal offset %08lx by %lu:%lu v%lu (%d refs)\n", dirty_it->second.journal_sector,
+            "remove usage of journal offset %08lx by %lx:%lx v%lu (%d refs)\n", dirty_it->second.journal_sector,
            dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, used
        );
 #endif
--- a/blockstore_stable.cpp
+++ b/blockstore_stable.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 // Stabilize small write:
@ -64,7 +67,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
                // Already stable
            }
        }
-        else if (IS_UNSYNCED(dirty_it->second.state))
+        else if (!IS_SYNCED(dirty_it->second.state))
        {
            // Object not synced yet. Caller must sync it first
            op->retval = -EBUSY;
@ -184,17 +187,9 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v)
    {
        while (1)
        {
-            if (dirty_it->second.state == ST_J_SYNCED)
+            if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
            {
-                dirty_it->second.state = ST_J_STABLE;
-            }
-            else if (dirty_it->second.state == ST_D_SYNCED)
-            {
-                dirty_it->second.state = ST_D_STABLE;
-            }
-            else if (dirty_it->second.state == ST_DEL_SYNCED)
-            {
-                dirty_it->second.state = ST_DEL_STABLE;
+                dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
            }
            else if (IS_STABLE(dirty_it->second.state))
            {
@ -211,7 +206,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v)
            }
        }
 #ifdef BLOCKSTORE_DEBUG
-        printf("enqueue_flush %lu:%lu v%lu\n", v.oid.inode, v.oid.stripe, v.version);
+        printf("enqueue_flush %lx:%lx v%lu\n", v.oid.inode, v.oid.stripe, v.version);
 #endif
        flusher->enqueue_flush(v);
    }
--- a/blockstore_sync.cpp
+++ b/blockstore_sync.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 #define SYNC_HAS_SMALL 1
@ -127,14 +130,16 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
        }
        while (it != PRIV(op)->sync_big_writes.end())
        {
-            journal_entry_big_write *je = (journal_entry_big_write*)
-                prefill_single_journal_entry(journal, JE_BIG_WRITE, sizeof(journal_entry_big_write));
+            journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
+                journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
+                sizeof(journal_entry_big_write)
+            );
            dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
            journal.sector_info[journal.cur_sector].dirty = false;
            journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
            printf(
-                "journal offset %08lx is used by %lu:%lu v%lu (%lu refs)\n",
+                "journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
                dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version,
                journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
            );
@ -252,18 +257,22 @@ void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
    for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
    {
 #ifdef BLOCKSTORE_DEBUG
-        printf("Ack sync big %lu:%lu v%lu\n", it->oid.inode, it->oid.stripe, it->version);
+        printf("Ack sync big %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
 #endif
        auto & unstab = unstable_writes[it->oid];
        unstab = unstab < it->version ? it->version : unstab;
        auto dirty_it = dirty_db.find(*it);
-        dirty_it->second.state = ST_D_SYNCED;
+        dirty_it->second.state = ((dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SYNCED);
+        if (dirty_it->second.state & BS_ST_INSTANT)
+        {
+            mark_stable(dirty_it->first);
+        }
        dirty_it++;
        while (dirty_it != dirty_db.end() && dirty_it->first.oid == it->oid)
        {
-            if (dirty_it->second.state == ST_J_WAIT_BIG)
+            if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
            {
-                dirty_it->second.state = ST_J_IN_FLIGHT;
+                dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
            }
            dirty_it++;
        }
@ -271,19 +280,23 @@ void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
    for (auto it = PRIV(op)->sync_small_writes.begin(); it != PRIV(op)->sync_small_writes.end(); it++)
    {
 #ifdef BLOCKSTORE_DEBUG
-        printf("Ack sync small %lu:%lu v%lu\n", it->oid.inode, it->oid.stripe, it->version);
+        printf("Ack sync small %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
 #endif
        auto & unstab = unstable_writes[it->oid];
        unstab = unstab < it->version ? it->version : unstab;
-        if (dirty_db[*it].state == ST_DEL_WRITTEN)
+        if (dirty_db[*it].state == (BS_ST_DELETE | BS_ST_WRITTEN))
        {
-            dirty_db[*it].state = ST_DEL_SYNCED;
+            dirty_db[*it].state = (BS_ST_DELETE | BS_ST_SYNCED);
            // Deletions are treated as immediately stable
            mark_stable(*it);
        }
-        else /* == ST_J_WRITTEN */
+        else /* (BS_ST_INSTANT?) | BS_ST_SMALL_WRITE | BS_ST_WRITTEN */
        {
-            dirty_db[*it].state = ST_J_SYNCED;
+            dirty_db[*it].state = (dirty_db[*it].state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SYNCED;
+            if (dirty_db[*it].state & BS_ST_INSTANT)
+            {
+                mark_stable(*it);
+            }
        }
    }
    in_progress_syncs.erase(PRIV(op)->in_progress_ptr);
--- a/blockstore_write.cpp
+++ b/blockstore_write.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "blockstore_impl.h"

 bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
@ -18,9 +21,9 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
            found = true;
            version = dirty_it->first.version + 1;
            deleted = IS_DELETE(dirty_it->second.state);
-            is_inflight_big = dirty_it->second.state >= ST_D_IN_FLIGHT &&
-                dirty_it->second.state < ST_D_SYNCED ||
-                dirty_it->second.state == ST_J_WAIT_BIG;
+            is_inflight_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
+                ? !IS_SYNCED(dirty_it->second.state)
+                : ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
        }
    }
    if (!found)
@ -65,9 +68,9 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
    }
 #ifdef BLOCKSTORE_DEBUG
    if (is_del)
-        printf("Delete %lu:%lu v%lu\n", op->oid.inode, op->oid.stripe, op->version);
+        printf("Delete %lx:%lx v%lu\n", op->oid.inode, op->oid.stripe, op->version);
    else
-        printf("Write %lu:%lu v%lu offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
+        printf("Write %lx:%lx v%lu offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
 #endif
    // No strict need to add it into dirty_db here, it's just left
    // from the previous implementation where reads waited for writes
@ -77,8 +80,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
    }, (dirty_entry){
        .state = (uint32_t)(
            is_del
-                ? ST_DEL_IN_FLIGHT
-                : (op->len == block_size || deleted ? ST_D_IN_FLIGHT : (is_inflight_big ? ST_J_WAIT_BIG : ST_J_IN_FLIGHT))
+                ? (BS_ST_DELETE | BS_ST_IN_FLIGHT)
+                : (op->opcode == BS_OP_WRITE_STABLE ? BS_ST_INSTANT : 0) | (op->len == block_size || deleted
+                    ? (BS_ST_BIG_WRITE | BS_ST_IN_FLIGHT)
+                    : (is_inflight_big ? (BS_ST_SMALL_WRITE | BS_ST_WAIT_BIG) : (BS_ST_SMALL_WRITE | BS_ST_IN_FLIGHT)))
        ),
        .flags = 0,
        .location = 0,
@ -101,11 +106,12 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        .version = op->version,
    });
    assert(dirty_it != dirty_db.end());
-    if (dirty_it->second.state == ST_J_WAIT_BIG)
+    if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
    {
+        // Don't dequeue
        return 0;
    }
-    else if (dirty_it->second.state == ST_D_IN_FLIGHT)
+    else if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
    {
        blockstore_journal_check_t space_check(this);
        if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
@ -129,7 +135,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        }
        BS_SUBMIT_GET_SQE(sqe, data);
        dirty_it->second.location = loc << block_order;
-        dirty_it->second.state = ST_D_SUBMITTED;
+        dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
 #ifdef BLOCKSTORE_DEBUG
        printf("Allocate block %lu\n", loc);
 #endif
@ -169,7 +175,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            PRIV(op)->op_state = 1;
        }
    }
-    else
+    else /* if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_SMALL_WRITE) */
    {
        // Small (journaled) write
        // First check if the journal has sufficient space
@ -209,13 +215,15 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            }
        }
        // Then pre-fill journal entry
-        journal_entry_small_write *je = (journal_entry_small_write*)
-            prefill_single_journal_entry(journal, JE_SMALL_WRITE, sizeof(journal_entry_small_write));
+        journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
+            journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
+            sizeof(journal_entry_small_write)
+        );
        dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
        journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
        printf(
-            "journal offset %08lx is used by %lu:%lu v%lu (%lu refs)\n",
+            "journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
            dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
            journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
        );
@ -257,7 +265,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            // Zero-length overwrite. Allowed to bump object version in EC placement groups without actually writing data
        }
        dirty_it->second.location = journal.next_free;
-        dirty_it->second.state = ST_J_SUBMITTED;
+        dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
        journal.next_free += op->len;
        if (journal.next_free >= journal.len)
        {
@ -307,13 +315,16 @@ resume_2:
    {
        return 0;
    }
-    je = (journal_entry_big_write*)prefill_single_journal_entry(journal, JE_BIG_WRITE, sizeof(journal_entry_big_write));
+    je = (journal_entry_big_write*)prefill_single_journal_entry(
+        journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
+        sizeof(journal_entry_big_write)
+    );
    dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
    journal.sector_info[journal.cur_sector].dirty = false;
    journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
    printf(
-        "journal offset %08lx is used by %lu:%lu v%lu (%lu refs)\n",
+        "journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
        journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
        journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
    );
@ -334,9 +345,9 @@ resume_2:
 resume_4:
    // Switch object state
 #ifdef BLOCKSTORE_DEBUG
-    printf("Ack write %lu:%lu v%lu = %d\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
+    printf("Ack write %lx:%lx v%lu = %d\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
 #endif
-    bool imm = dirty_it->second.state == ST_D_SUBMITTED
+    bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
        ? (immediate_commit == IMMEDIATE_ALL)
        : (immediate_commit != IMMEDIATE_NONE);
    if (imm)
@ -344,31 +355,21 @@ resume_4:
        auto & unstab = unstable_writes[op->oid];
        unstab = unstab < op->version ? op->version : unstab;
    }
-    if (dirty_it->second.state == ST_J_SUBMITTED)
-    {
-        dirty_it->second.state = imm ? ST_J_SYNCED : ST_J_WRITTEN;
-    }
-    else if (dirty_it->second.state == ST_D_SUBMITTED)
-    {
-        dirty_it->second.state = imm ? ST_D_SYNCED : ST_D_WRITTEN;
-    }
-    else if (dirty_it->second.state == ST_DEL_SUBMITTED)
-    {
-        dirty_it->second.state = imm ? ST_DEL_SYNCED : ST_DEL_WRITTEN;
-        if (imm)
+    dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
+        | (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
+    if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
    {
        // Deletions are treated as immediately stable
        mark_stable(dirty_it->first);
    }
-    }
    if (immediate_commit == IMMEDIATE_ALL)
    {
        dirty_it++;
        while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
        {
-            if (dirty_it->second.state == ST_J_WAIT_BIG)
+            if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
            {
-                dirty_it->second.state = ST_J_IN_FLIGHT;
+                dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
            }
            dirty_it++;
        }
@ -472,13 +473,14 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
        }
    }
    // Pre-fill journal entry
-    journal_entry_del *je = (journal_entry_del*)
-        prefill_single_journal_entry(journal, JE_DELETE, sizeof(struct journal_entry_del));
+    journal_entry_del *je = (journal_entry_del*)prefill_single_journal_entry(
+        journal, JE_DELETE, sizeof(struct journal_entry_del)
+    );
    dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
    journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
    printf(
-        "journal offset %08lx is used by %lu:%lu v%lu (%lu refs)\n",
+        "journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
        dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
        journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
    );
@ -487,7 +489,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
    je->version = op->version;
    je->crc32 = je_crc32((journal_entry*)je);
    journal.crc32_last = je->crc32;
-    dirty_it->second.state = ST_DEL_SUBMITTED;
+    dirty_it->second.state = BS_ST_DELETE | BS_ST_SUBMITTED;
    if (immediate_commit != IMMEDIATE_NONE)
    {
        prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
--- a/cluster_client.cpp
+++ b/cluster_client.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include "cluster_client.h"

 cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
@ -5,17 +8,63 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
    this->ringloop = ringloop;
    this->tfd = tfd;

+    msgr.osd_num = 0;
    msgr.tfd = tfd;
    msgr.ringloop = ringloop;
    msgr.repeer_pgs = [this](osd_num_t peer_osd)
    {
-        // peer_osd just connected or dropped connection
        if (msgr.osd_peer_fds.find(peer_osd) != msgr.osd_peer_fds.end())
        {
-            // really connected :)
+            // peer_osd just connected
+            continue_ops();
+        }
+        else if (unsynced_writes.size())
+        {
+            // peer_osd just dropped connection
+            for (auto op: syncing_writes)
+            {
+                for (auto & part: op->parts)
+                {
+                    if (part.osd_num == peer_osd && part.done)
+                    {
+                        // repeat this operation
+                        part.osd_num = 0;
+                        part.done = false;
+                        assert(!part.sent);
+                        op->done_count--;
+                    }
+                }
+            }
+            for (auto op: unsynced_writes)
+            {
+                for (auto & part: op->parts)
+                {
+                    if (part.osd_num == peer_osd && part.done)
+                    {
+                        // repeat this operation
+                        part.osd_num = 0;
+                        part.done = false;
+                        assert(!part.sent);
+                        op->done_count--;
+                    }
+                }
+                if (op->done_count < op->parts.size())
+                {
+                    cur_ops.insert(op);
+                }
+            }
            continue_ops();
        }
    };
+    msgr.exec_op = [this](osd_op_t *op)
+    {
+        // Garbage in
+        printf("Incoming garbage from peer %d\n", op->peer_fd);
+        msgr.stop_client(op->peer_fd);
+        delete op;
+    };
+    msgr.use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
+        config["use_sync_send_recv"].uint64_value();

    st_cli.tfd = tfd;
    st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@ -26,44 +75,51 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
    log_level = config["log_level"].int64_value();
    st_cli.parse_config(config);
    st_cli.load_global_config();
+
+    if (ringloop)
+    {
+        consumer.loop = [this]()
+        {
+            msgr.read_requests();
+            msgr.send_replies();
+            this->ringloop->submit();
+        };
+        ringloop->register_consumer(&consumer);
+    }
 }

-void cluster_client_t::continue_ops()
+cluster_client_t::~cluster_client_t()
 {
-    for (auto op_it = unsent_ops.begin(); op_it != unsent_ops.end(); )
+    if (ringloop)
    {
-        cluster_op_t *op = *op_it;
-        if (op->needs_reslice && !op->sent_count)
-        {
-            op->parts.clear();
-            op->done_count = 0;
-            op->needs_reslice = false;
+        ringloop->unregister_consumer(&consumer);
    }
-        if (!op->parts.size())
+}
+
+void cluster_client_t::stop()
+{
+    while (msgr.clients.size() > 0)
    {
-            unsent_ops.erase(op_it++);
-            execute(op);
-            continue;
+        msgr.stop_client(msgr.clients.begin()->first);
    }
-        if (!op->needs_reslice)
+}
+
+void cluster_client_t::continue_ops(bool up_retry)
+{
+    for (auto op_it = cur_ops.begin(); op_it != cur_ops.end(); )
    {
-            for (auto & op_part: op->parts)
+        if ((*op_it)->up_wait)
        {
-                if (!op_part.sent && !op_part.done)
+            if (up_retry)
            {
-                    try_send(op, &op_part);
-                }
-            }
-            if (op->sent_count == op->parts.size() - op->done_count)
-            {
-                unsent_ops.erase(op_it++);
-                sent_ops.insert(op);
+                (*op_it)->up_wait = false;
+                continue_rw(*op_it++);
            }
            else
                op_it++;
        }
        else
-            op_it++;
+            continue_rw(*op_it++);
    }
 }

@ -88,59 +144,103 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
    bs_disk_alignment = config["disk_alignment"].uint64_value();
    bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
    if (!bs_block_size)
-        bs_block_size = DEFAULT_BLOCK_SIZE;
-    if (!bs_disk_alignment)
-        bs_disk_alignment = DEFAULT_DISK_ALIGNMENT;
-    if (!bs_bitmap_granularity)
-        bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
    {
+        bs_block_size = DEFAULT_BLOCK_SIZE;
+    }
+    if (!bs_disk_alignment)
+    {
+        bs_disk_alignment = DEFAULT_DISK_ALIGNMENT;
+    }
+    if (!bs_bitmap_granularity)
+    {
+        bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
+    }
    uint32_t block_order;
    if ((block_order = is_power_of_two(bs_block_size)) >= 64 || bs_block_size < MIN_BLOCK_SIZE || bs_block_size >= MAX_BLOCK_SIZE)
-            throw std::runtime_error("Bad block size");
-    }
-    if (config.find("pg_stripe_size") != config.end())
    {
-        pg_stripe_size = config["pg_stripe_size"].uint64_value();
-        if (!pg_stripe_size)
-            pg_stripe_size = DEFAULT_PG_STRIPE_SIZE;
+        throw std::runtime_error("Bad block size");
    }
    if (config["immediate_commit"] == "all")
    {
        // Cluster-wide immediate_commit mode
        immediate_commit = true;
    }
+    else if (config.find("client_dirty_limit") != config.end())
+    {
+        client_dirty_limit = config["client_dirty_limit"].uint64_value();
+    }
+    if (!client_dirty_limit)
+    {
+        client_dirty_limit = DEFAULT_CLIENT_DIRTY_LIMIT;
+    }
+    up_wait_retry_interval = config["up_wait_retry_interval"].uint64_value();
+    if (!up_wait_retry_interval)
+    {
+        up_wait_retry_interval = 500;
+    }
+    else if (up_wait_retry_interval < 50)
+    {
+        up_wait_retry_interval = 50;
+    }
    msgr.peer_connect_interval = config["peer_connect_interval"].uint64_value();
    if (!msgr.peer_connect_interval)
+    {
        msgr.peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
+    }
    msgr.peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
    if (!msgr.peer_connect_timeout)
+    {
        msgr.peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
+    }
+    st_cli.start_etcd_watcher();
+    st_cli.load_pgs();
 }

 void cluster_client_t::on_load_pgs_hook(bool success)
 {
-    if (success)
+    for (auto pool_item: st_cli.pool_config)
    {
-        pg_count = st_cli.pg_config.size();
-        continue_ops();
+        pg_counts[pool_item.first] = pool_item.second.real_pg_count;
    }
+    for (auto op: offline_ops)
+    {
+        execute(op);
+    }
+    offline_ops.clear();
+    continue_ops();
 }

 void cluster_client_t::on_change_hook(json11::Json::object & changes)
 {
-    if (pg_count != st_cli.pg_config.size())
+    for (auto pool_item: st_cli.pool_config)
    {
-        // At this point, all operations should be suspended
-        // And they need to be resliced!
-        for (auto op: unsent_ops)
+        if (pg_counts[pool_item.first] != pool_item.second.real_pg_count)
+        {
+            // At this point, all pool operations should have been suspended
+            // And now they have to be resliced!
+            for (auto op: cur_ops)
+            {
+                if (INODE_POOL(op->inode) == pool_item.first)
                {
                    op->needs_reslice = true;
                }
-        for (auto op: sent_ops)
+            }
+            for (auto op: unsynced_writes)
+            {
+                if (INODE_POOL(op->inode) == pool_item.first)
                {
                    op->needs_reslice = true;
                }
-        pg_count = st_cli.pg_config.size();
+            }
+            for (auto op: syncing_writes)
+            {
+                if (INODE_POOL(op->inode) == pool_item.first)
+                {
+                    op->needs_reslice = true;
+                }
+            }
+            pg_counts[pool_item.first] = pool_item.second.real_pg_count;
+        }
    }
    continue_ops();
 }
@ -153,106 +253,256 @@ void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
    }
 }

-// FIXME: Implement OSD_OP_SYNC for immediate_commit == false
+/**
+ * How writes are synced when immediate_commit is false
+ *
+ * 1) accept up to <client_dirty_limit> write operations for execution,
+ *    queue all subsequent writes into <next_writes>
+ * 2) accept exactly one SYNC, queue all subsequent SYNCs into <next_writes>, too
+ * 3) "continue" all accepted writes
+ *
+ * "Continue" WRITE:
+ * 1) if the operation is not a copy yet - copy it (required for replay)
+ * 2) if the operation is not sliced yet - slice it
+ * 3) if the operation doesn't require reslice - try to connect & send all remaining parts
+ * 4) if any of them fail due to disconnected peers or PGs not up, repeat after reconnecting or after a small timeout
+ * 5) if any of them fail due to other errors, fail the operation and forget it from the current "unsynced batch"
+ * 6) if PG count changes before all parts are done, wait for all in-progress parts to finish,
+ *    throw all results away, reslice and resubmit op
+ * 7) when all parts are done, try to "continue" the current SYNC
+ * 8) if the operation succeeds, but then some OSDs drop their connections, repeat
+ *    parts from the current "unsynced batch" previously sent to those OSDs in any order
+ *
+ * "Continue" current SYNC:
+ * 1) take all unsynced operations from the current batch
+ * 2) check if all affected OSDs are still alive
+ * 3) if yes, send all SYNCs. Otherwise, leave the current SYNC as is.
+ * 4) if any of them fail due to disconnected peers, repeat SYNC after repeating all writes
+ * 5) if any of them fail due to other errors, fail the SYNC operation
+ */
+
 void cluster_client_t::execute(cluster_op_t *op)
 {
-    if (op->opcode == OSD_OP_SYNC && immediate_commit)
+    if (!bs_disk_alignment)
    {
-        // Syncs are not required in the immediate_commit mode
-        op->retval = 0;
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        // We're offline
+        offline_ops.push_back(op);
        return;
    }
-    if (op->opcode != OSD_OP_READ && op->opcode != OSD_OP_OUT || !op->inode || !op->len ||
-        op->offset % bs_disk_alignment || op->len % bs_disk_alignment)
+    op->retval = 0;
+    if (op->opcode != OSD_OP_SYNC && op->opcode != OSD_OP_READ && op->opcode != OSD_OP_WRITE ||
+        (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && (!op->inode || !op->len ||
+        op->offset % bs_disk_alignment || op->len % bs_disk_alignment))
    {
        op->retval = -EINVAL;
        std::function<void(cluster_op_t*)>(op->callback)(op);
        return;
    }
-    if (!pg_stripe_size)
+    if (op->opcode == OSD_OP_SYNC)
    {
-        // Config is not loaded yet
-        unsent_ops.insert(op);
+        execute_sync(op);
        return;
    }
    if (op->opcode == OSD_OP_WRITE && !immediate_commit)
    {
-        // Copy operation
+        if (next_writes.size() > 0)
+        {
+            assert(cur_sync);
+            next_writes.push_back(op);
+            return;
+        }
+        if (queued_bytes >= client_dirty_limit)
+        {
+            // Push an extra SYNC operation to flush previous writes
+            next_writes.push_back(op);
+            cluster_op_t *sync_op = new cluster_op_t;
+            sync_op->is_internal = true;
+            sync_op->opcode = OSD_OP_SYNC;
+            sync_op->callback = [](cluster_op_t* sync_op) {};
+            execute_sync(sync_op);
+            return;
+        }
+        queued_bytes += op->len;
+    }
+    cur_ops.insert(op);
+    continue_rw(op);
+}
+
+void cluster_client_t::continue_rw(cluster_op_t *op)
+{
+    pool_id_t pool_id = INODE_POOL(op->inode);
+    if (!pool_id)
+    {
+        op->retval = -EINVAL;
+        std::function<void(cluster_op_t*)>(op->callback)(op);
+        return;
+    }
+    if (st_cli.pool_config.find(pool_id) == st_cli.pool_config.end() ||
+        st_cli.pool_config[pool_id].real_pg_count == 0)
+    {
+        // Postpone operations to unknown pools
+        return;
+    }
+    if (op->opcode == OSD_OP_WRITE && !immediate_commit && !op->is_internal)
+    {
+        // Save operation for replay when PG goes out of sync
+        // (primary OSD drops our connection in this case)
        cluster_op_t *op_copy = new cluster_op_t();
+        op_copy->is_internal = true;
+        op_copy->orig_op = op;
        op_copy->opcode = op->opcode;
        op_copy->inode = op->inode;
        op_copy->offset = op->offset;
        op_copy->len = op->len;
-        op_copy->buf = malloc(op->len);
-        memcpy(op_copy->buf, op->buf, op->len);
-        unsynced_ops.push_back(op_copy);
-        unsynced_bytes += op->len;
-        if (inmemory_commit)
+        op_copy->buf = malloc_or_die(op->len);
+        op_copy->iov.push_back(op_copy->buf, op->len);
+        op_copy->callback = [](cluster_op_t* op_copy)
        {
-            // Immediately acknowledge write and continue with the copy
-            op->retval = op->len;
-            std::function<void(cluster_op_t*)>(op->callback)(op);
+            if (op_copy->orig_op)
+            {
+                // Acknowledge write and forget the original pointer
+                op_copy->orig_op->retval = op_copy->retval;
+                std::function<void(cluster_op_t*)>(op_copy->orig_op->callback)(op_copy->orig_op);
+                op_copy->orig_op = NULL;
+            }
+        };
+        void *cur_buf = op_copy->buf;
+        for (int i = 0; i < op->iov.count; i++)
+        {
+            memcpy(cur_buf, op->iov.buf[i].iov_base, op->iov.buf[i].iov_len);
+            cur_buf += op->iov.buf[i].iov_len;
+        }
+        unsynced_writes.push_back(op_copy);
+        cur_ops.erase(op);
+        cur_ops.insert(op_copy);
        op = op_copy;
    }
-        if (unsynced_bytes >= inmemory_dirty_limit)
+    if (!op->parts.size())
    {
-            // Push an extra SYNC operation
+        // Slice the operation into parts
+        slice_rw(op);
+    }
+    if (!op->needs_reslice)
+    {
+        // Send unsent parts, if they're not subject to change
+        for (auto & op_part: op->parts)
+        {
+            if (!op_part.sent && !op_part.done)
+            {
+                try_send(op, &op_part);
            }
        }
-    // Slice the request into individual object stripe requests
-    // Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
-    uint64_t pg_block_size = bs_block_size * pg_part_count;
-    uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
-    uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
-    int part_count = 0;
-    for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
+    }
+    if (!op->sent_count)
    {
-        if (op->offset < (stripe+pg_block_size) && (op->offset+op->len) > stripe)
+        if (op->done_count >= op->parts.size())
        {
-            part_count++;
+            // Finished successfully
+            // Even if the PG count has changed in the meantime, we treat it as success
+            // because if some operations were invalid for the new PG count we'd get errors
+            cur_ops.erase(op);
+            op->retval = op->len;
+            std::function<void(cluster_op_t*)>(op->callback)(op);
+            continue_sync();
+            return;
+        }
+        else if (op->retval != 0 && op->retval != -EPIPE)
+        {
+            // Fatal error (not -EPIPE)
+            cur_ops.erase(op);
+            if (!immediate_commit && op->opcode == OSD_OP_WRITE)
+            {
+                for (int i = 0; i < unsynced_writes.size(); i++)
+                {
+                    if (unsynced_writes[i] == op)
+                    {
+                        unsynced_writes.erase(unsynced_writes.begin()+i, unsynced_writes.begin()+i+1);
+                        break;
                    }
                }
-    op->parts.resize(part_count);
-    bool resend = false;
-    int i = 0;
-    for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
-    {
-        uint64_t stripe_end = stripe + pg_block_size;
-        if (op->offset < stripe_end && (op->offset+op->len) > stripe)
-        {
-            pg_num_t pg_num = (op->inode + stripe/pg_stripe_size) % pg_count + 1;
-            op->parts[i] = {
-                .parent = op,
-                .offset = op->offset < stripe ? stripe : op->offset,
-                .len = (uint32_t)((op->offset+op->len) > stripe_end ? pg_block_size : op->offset+op->len-stripe),
-                .pg_num = pg_num,
-                .buf = op->buf + (op->offset < stripe ? stripe-op->offset : 0),
-                .sent = false,
-                .done = false,
-            };
-            if (!try_send(op, &op->parts[i]))
-            {
-                // Part needs to be sent later
-                resend = true;
            }
-            i++;
-        }
-    }
-    if (resend)
+            bool del = op->is_internal;
+            std::function<void(cluster_op_t*)>(op->callback)(op);
+            if (del)
            {
-        unsent_ops.insert(op);
+                if (op->buf)
+                    free(op->buf);
+                delete op;
+            }
+            continue_sync();
+            return;
        }
        else
        {
-        sent_ops.insert(op);
+            // -EPIPE or no error - clear the error
+            op->retval = 0;
+            if (op->needs_reslice)
+            {
+                op->parts.clear();
+                op->done_count = 0;
+                op->needs_reslice = false;
+                continue_rw(op);
+            }
+        }
+    }
+}
+
+void cluster_client_t::slice_rw(cluster_op_t *op)
+{
+    // Slice the request into individual object stripe requests
+    // Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
+    auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
+    uint64_t pg_block_size = bs_block_size * (
+        pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_minsize
+    );
+    uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
+    uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
+    op->retval = 0;
+    op->parts.resize((last_stripe - first_stripe) / pg_block_size + 1);
+    int iov_idx = 0;
+    size_t iov_pos = 0;
+    int i = 0;
+    for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
+    {
+        pg_num_t pg_num = (op->inode + stripe/pool_cfg.pg_stripe_size) % pool_cfg.real_pg_count + 1;
+        uint64_t begin = (op->offset < stripe ? stripe : op->offset);
+        uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
+            ? (stripe + pg_block_size) : (op->offset + op->len);
+        op->parts[i] = {
+            .parent = op,
+            .offset = begin,
+            .len = (uint32_t)(end - begin),
+            .pg_num = pg_num,
+            .sent = false,
+            .done = false,
+        };
+        int left = end-begin;
+        while (left > 0 && iov_idx < op->iov.count)
+        {
+            if (op->iov.buf[iov_idx].iov_len - iov_pos < left)
+            {
+                op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, op->iov.buf[iov_idx].iov_len - iov_pos);
+                left -= (op->iov.buf[iov_idx].iov_len - iov_pos);
+                iov_pos = 0;
+                iov_idx++;
+            }
+            else
+            {
+                op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, left);
+                iov_pos += left;
+                left = 0;
+            }
+        }
+        assert(left == 0);
+        i++;
    }
 }

 bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part)
 {
-    auto pg_it = st_cli.pg_config.find(part->pg_num);
-    if (pg_it != st_cli.pg_config.end() &&
+    auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
+    auto pg_it = pool_cfg.pg_config.find(part->pg_num);
+    if (pg_it != pool_cfg.pg_config.end() &&
        !pg_it->second.pause && pg_it->second.cur_primary)
    {
        osd_num_t primary_osd = pg_it->second.cur_primary;
@ -281,15 +531,7 @@ bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part)
                    handle_op_part(part);
                },
            };
-            part->op.send_list.push_back(part->op.req.buf, OSD_PACKET_SIZE);
-            if (op->opcode == OSD_OP_WRITE)
-            {
-                part->op.send_list.push_back(part->buf, part->len);
-            }
-            else
-            {
-                part->op.buf = part->buf;
-            }
+            part->op.iov = part->iov;
            msgr.outbox_push(&part->op);
            return true;
        }
@ -301,36 +543,185 @@ bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part)
    return false;
 }

+void cluster_client_t::execute_sync(cluster_op_t *op)
+{
+    if (immediate_commit)
+    {
+        // Syncs are not required in the immediate_commit mode
+        op->retval = 0;
+        std::function<void(cluster_op_t*)>(op->callback)(op);
+    }
+    else if (cur_sync != NULL)
+    {
+        next_writes.push_back(op);
+    }
+    else
+    {
+        cur_sync = op;
+        continue_sync();
+    }
+}
+
+void cluster_client_t::continue_sync()
+{
+    if (!cur_sync || cur_sync->parts.size() > 0)
+    {
+        // Already submitted
+        return;
+    }
+    cur_sync->retval = 0;
+    std::set<osd_num_t> sync_osds;
+    for (auto prev_op: unsynced_writes)
+    {
+        if (prev_op->done_count < prev_op->parts.size())
+        {
+            // Writes not finished yet
+            return;
+        }
+        for (auto & part: prev_op->parts)
+        {
+            if (part.osd_num)
+            {
+                sync_osds.insert(part.osd_num);
+            }
+        }
+    }
+    if (!sync_osds.size())
+    {
+        // No dirty writes
+        finish_sync();
+        return;
+    }
+    // Check that all OSD connections are still alive
+    for (auto sync_osd: sync_osds)
+    {
+        auto peer_it = msgr.osd_peer_fds.find(sync_osd);
+        if (peer_it == msgr.osd_peer_fds.end())
+        {
+            // Sending SYNC to a disconnected OSD is pointless
+            return;
+        }
+    }
+    syncing_writes.swap(unsynced_writes);
+    // Post sync to affected OSDs
+    cur_sync->parts.resize(sync_osds.size());
+    int i = 0;
+    for (auto sync_osd: sync_osds)
+    {
+        cur_sync->parts[i] = {
+            .parent = cur_sync,
+            .osd_num = sync_osd,
+            .sent = false,
+            .done = false,
+        };
+        send_sync(cur_sync, &cur_sync->parts[i]);
+        i++;
+    }
+}
+
+void cluster_client_t::finish_sync()
+{
+    int retval = cur_sync->retval;
+    if (retval != 0)
+    {
+        for (auto op: syncing_writes)
+        {
+            if (op->done_count < op->parts.size())
+            {
+                cur_ops.insert(op);
+            }
+        }
+        unsynced_writes.insert(unsynced_writes.begin(), syncing_writes.begin(), syncing_writes.end());
+        syncing_writes.clear();
+    }
+    if (retval == -EPIPE)
+    {
+        // Retry later
+        cur_sync->parts.clear();
+        cur_sync->retval = 0;
+        cur_sync->sent_count = 0;
+        cur_sync->done_count = 0;
+        return;
+    }
+    std::function<void(cluster_op_t*)>(cur_sync->callback)(cur_sync);
+    if (!retval)
+    {
+        for (auto op: syncing_writes)
+        {
+            assert(op->sent_count == 0);
+            if (op->is_internal)
+            {
+                if (op->buf)
+                    free(op->buf);
+                delete op;
+            }
+        }
+        syncing_writes.clear();
+    }
+    cur_sync = NULL;
+    queued_bytes = 0;
+    std::vector<cluster_op_t*> next_wr_copy;
+    next_wr_copy.swap(next_writes);
+    for (auto next_op: next_wr_copy)
+    {
+        execute(next_op);
+    }
+}
+
+void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
+{
+    auto peer_it = msgr.osd_peer_fds.find(part->osd_num);
+    assert(peer_it != msgr.osd_peer_fds.end());
+    part->sent = true;
+    op->sent_count++;
+    part->op = {
+        .op_type = OSD_OP_OUT,
+        .peer_fd = peer_it->second,
+        .req = {
+            .hdr = {
+                .magic = SECONDARY_OSD_OP_MAGIC,
+                .id = op_id++,
+                .opcode = OSD_OP_SYNC,
+            },
+        },
+        .callback = [this, part](osd_op_t *op_part)
+        {
+            handle_op_part(part);
+        },
+    };
+    msgr.outbox_push(&part->op);
+}
+
 void cluster_client_t::handle_op_part(cluster_op_part_t *part)
 {
    cluster_op_t *op = part->parent;
    part->sent = false;
    op->sent_count--;
-    part->op.buf = NULL;
-    if (part->op.reply.hdr.retval != part->op.req.rw.len)
+    int expected = part->op.req.hdr.opcode == OSD_OP_SYNC ? 0 : part->op.req.rw.len;
+    if (part->op.reply.hdr.retval != expected)
    {
        // Operation failed, retry
        printf(
-            "Operation part failed on OSD %lu: retval=%ld (expected %u), reconnecting\n",
-            part->osd_num, part->op.reply.hdr.retval, part->op.req.rw.len
+            "Operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
+            part->osd_num, part->op.reply.hdr.retval, expected
        );
        msgr.stop_client(part->op.peer_fd);
-        if (op->sent_count == op->parts.size() - op->done_count - 1)
+        if (part->op.reply.hdr.retval == -EPIPE)
        {
-            // Resend later when OSDs come up
-            // FIXME: Check for different types of errors
-            // FIXME: Repeat operations after a small timeout, for the case when OSD is coming up
-            sent_ops.erase(op);
-            unsent_ops.insert(op);
+            op->up_wait = true;
+            if (!retry_timeout_id)
+            {
+                retry_timeout_id = tfd->set_timer(up_wait_retry_interval, false, [this](int)
+                {
+                    retry_timeout_id = 0;
+                    continue_ops(true);
+                });
            }
-        if (op->sent_count == 0 && op->needs_reslice)
+        }
+        if (!op->retval || op->retval == -EPIPE)
        {
-            // PG count has changed, reslice the operation
-            unsent_ops.erase(op);
-            op->parts.clear();
-            op->done_count = 0;
-            op->needs_reslice = false;
-            execute(op);
+            // Don't overwrite other errors with -EPIPE
+            op->retval = part->op.reply.hdr.retval;
        }
    }
    else
@ -338,12 +729,17 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
        // OK
        part->done = true;
        op->done_count++;
-        if (op->done_count >= op->parts.size())
+    }
+    if (op->sent_count == 0)
    {
-            // Finished!
-            sent_ops.erase(op);
-            op->retval = op->len;
-            std::function<void(cluster_op_t*)>(op->callback)(op);
+        if (op->opcode == OSD_OP_SYNC)
+        {
+            assert(op == cur_sync);
+            finish_sync();
+        }
+        else
+        {
+            continue_rw(op);
        }
    }
 }
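The range slicing performed by `slice_rw()` in the hunk above can be sketched in isolation. This is a minimal, simplified sketch rather than the client's actual code: `part_t` and `slice_range()` are hypothetical names, the scatter-gather (`iov`) handling is left out, and `pg_block_size`, `pg_stripe_size` and `pg_count` are passed in directly instead of being read from the pool configuration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for cluster_op_part_t
struct part_t
{
    uint64_t offset;
    uint32_t len;
    uint32_t pg_num;
};

// Split [offset, offset+len) into parts that never cross a pg_block_size
// boundary and map each part to a PG number with the same formula as slice_rw()
std::vector<part_t> slice_range(uint64_t inode, uint64_t offset, uint64_t len,
    uint64_t pg_block_size, uint64_t pg_stripe_size, uint64_t pg_count)
{
    assert(len > 0 && pg_block_size > 0 && pg_stripe_size > 0 && pg_count > 0);
    std::vector<part_t> parts;
    // Start offsets of the first and the last block touched by the request
    uint64_t first_stripe = (offset / pg_block_size) * pg_block_size;
    uint64_t last_stripe = ((offset + len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
    for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
    {
        // Clamp the part to the intersection of the request and the current block
        uint64_t begin = offset < stripe ? stripe : offset;
        uint64_t end = (offset + len) > (stripe + pg_block_size)
            ? (stripe + pg_block_size) : (offset + len);
        parts.push_back({
            begin,
            (uint32_t)(end - begin),
            (uint32_t)((inode + stripe / pg_stripe_size) % pg_count + 1),
        });
    }
    return parts;
}
```

With a 128 KiB block, a 200 KiB request starting at offset 100 KiB yields three parts (28 KiB, 128 KiB, 44 KiB). All three land in the same PG as long as they fall within one `pg_stripe_size` window.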
--- a/cluster_client.h
+++ b/cluster_client.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include "messenger.h"
@ -6,9 +9,9 @@
 #define MIN_BLOCK_SIZE 4*1024
 #define MAX_BLOCK_SIZE 128*1024*1024
 #define DEFAULT_BLOCK_SIZE 128*1024
-#define DEFAULT_PG_STRIPE_SIZE 4*1024*1024
 #define DEFAULT_DISK_ALIGNMENT 4096
 #define DEFAULT_BITMAP_GRANULARITY 4096
+#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024

 struct cluster_op_t;

@ -19,7 +22,7 @@ struct cluster_op_part_t
    uint32_t len;
    pg_num_t pg_num;
    osd_num_t osd_num;
-    void *buf;
+    osd_op_buf_list_t iov;
    bool sent;
    bool done;
    osd_op_t op;
@ -32,10 +35,14 @@ struct cluster_op_t
    uint64_t offset;
    uint64_t len;
    int retval;
-    void *buf;
+    osd_op_buf_list_t iov;
    std::function<void(cluster_op_t*)> callback;
 protected:
+    void *buf = NULL;
+    cluster_op_t *orig_op = NULL;
+    bool is_internal = false;
    bool needs_reslice = false;
+    bool up_wait = false;
    int sent_count = 0, done_count = 0;
    std::vector<cluster_op_part_t> parts;
    friend class cluster_client_t;
@ -46,35 +53,50 @@ class cluster_client_t
    timerfd_manager_t *tfd;
    ring_loop_t *ringloop;

-    uint64_t pg_part_count = 2;
-    uint64_t pg_stripe_size = 0;
    uint64_t bs_block_size = 0;
    uint64_t bs_disk_alignment = 0;
    uint64_t bs_bitmap_granularity = 0;
-    uint64_t pg_count = 0;
+    std::map<pool_id_t, uint64_t> pg_counts;
    bool immediate_commit = false;
-    bool inmemory_commit = false;
-    uint64_t inmemory_dirty_limit = 32*1024*1024;
+    // FIXME: Implement inmemory_commit mode. Note that it requires returning overlapping reads from memory.
+    uint64_t client_dirty_limit = 0;
    int log_level;
+    int up_wait_retry_interval = 500; // ms

    uint64_t op_id = 1;
    etcd_state_client_t st_cli;
    osd_messenger_t msgr;
-    std::set<cluster_op_t*> sent_ops, unsent_ops;
+    ring_consumer_t consumer;
+    // operations currently in progress
+    std::set<cluster_op_t*> cur_ops;
+    int retry_timeout_id = 0;
    // unsynced operations are copied in memory to allow replay when cluster isn't in the immediate_commit mode
-    std::vector<cluster_op_t*> unsynced_ops;
-    uint64_t unsynced_bytes = 0;
+    // unsynced_writes are replayed in any order (because only the SYNC operation guarantees ordering)
+    std::vector<cluster_op_t*> unsynced_writes;
+    std::vector<cluster_op_t*> syncing_writes;
+    cluster_op_t* cur_sync = NULL;
+    std::vector<cluster_op_t*> next_writes;
+    std::vector<cluster_op_t*> offline_ops;
+    uint64_t queued_bytes = 0;

 public:
    cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
+    ~cluster_client_t();
    void execute(cluster_op_t *op);
+    void stop();

 protected:
-    void continue_ops();
-    void on_load_config_hook(json11::Json::object & cfg);
+    void continue_ops(bool up_retry = false);
+    void on_load_config_hook(json11::Json::object & config);
    void on_load_pgs_hook(bool success);
    void on_change_hook(json11::Json::object & changes);
    void on_change_osd_state_hook(uint64_t peer_osd);
+    void continue_rw(cluster_op_t *op);
+    void slice_rw(cluster_op_t *op);
    bool try_send(cluster_op_t *op, cluster_op_part_t *part);
+    void execute_sync(cluster_op_t *op);
+    void continue_sync();
+    void finish_sync();
+    void send_sync(cluster_op_t *op, cluster_op_part_t *part);
    void handle_op_part(cluster_op_part_t *part);
 };
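The interplay of `queued_bytes`, `client_dirty_limit` and `next_writes` declared above can be modelled with a small state machine. This is a rough sketch under stated assumptions, not the client's implementation: `write_batcher_t` is a hypothetical type, writes are reduced to their lengths, and the internal SYNC is reduced to a boolean flag.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the write batching used when immediate_commit is false
struct write_batcher_t
{
    uint64_t dirty_limit = 0;
    uint64_t queued_bytes = 0;
    bool sync_in_progress = false;
    int accepted = 0;
    // Lengths of writes postponed until the current SYNC completes
    std::vector<uint64_t> next_writes;

    // Returns true if the write is accepted into the current unsynced batch
    bool submit_write(uint64_t len)
    {
        if (!next_writes.empty())
        {
            // Something is already waiting, so keep the submission order
            next_writes.push_back(len);
            return false;
        }
        if (queued_bytes >= dirty_limit)
        {
            // Limit reached: postpone the write and flush the current batch first
            next_writes.push_back(len);
            sync_in_progress = true;
            return false;
        }
        queued_bytes += len;
        accepted++;
        return true;
    }

    // Called when the SYNC is done: reset the counter and replay postponed writes
    void finish_sync()
    {
        sync_in_progress = false;
        queued_bytes = 0;
        std::vector<uint64_t> replay;
        replay.swap(next_writes);
        for (auto len: replay)
        {
            submit_write(len);
        }
    }
};
```

Note that the limit is checked before the new write is counted, so a batch may exceed `dirty_limit` by up to one write, which matches the check in `execute()`.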
--- a/1
+++ b/1
@ -0,0 +1 @@
+Subproject commit 5dc108754ad40d3b1d024f9bd7cca0595ef1a1db
--- a/dump_journal.cpp
+++ b/dump_journal.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #define _LARGEFILE64_SOURCE
 #include <sys/types.h>
 #include <sys/ioctl.h>
@ -94,7 +97,7 @@ void journal_dump_t::dump_block(void *buf)
    while (pos < journal_block)
    {
        journal_entry *je = (journal_entry*)(buf + pos);
-        if (je->magic != JOURNAL_MAGIC || je->type < JE_START || je->type > JE_DELETE)
+        if (je->magic != JOURNAL_MAGIC || je->type < JE_MIN || je->type > JE_MAX)
        {
            break;
        }
@ -104,10 +107,11 @@ void journal_dump_t::dump_block(void *buf)
        {
            printf("je_start start=%08lx\n", je->start.journal_start);
        }
-        else if (je->type == JE_SMALL_WRITE)
+        else if (je->type == JE_SMALL_WRITE || je->type == JE_SMALL_WRITE_INSTANT)
        {
            printf(
-                "je_small_write oid=%lu:%lu ver=%lu offset=%u len=%u loc=%08lx",
+                "je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u loc=%08lx",
+                je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
                je->small_write.oid.inode, je->small_write.oid.stripe,
                je->small_write.version, je->small_write.offset, je->small_write.len,
                je->small_write.data_offset
@ -139,21 +143,25 @@ void journal_dump_t::dump_block(void *buf)
            );
            printf("\n");
        }
-        else if (je->type == JE_BIG_WRITE)
+        else if (je->type == JE_BIG_WRITE || je->type == JE_BIG_WRITE_INSTANT)
        {
-            printf("je_big_write oid=%lu:%lu ver=%lu loc=%08lx\n", je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location);
+            printf(
+                "je_big_write%s oid=%lx:%lx ver=%lu loc=%08lx\n",
+                je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
+                je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location
+            );
        }
        else if (je->type == JE_STABLE)
        {
-            printf("je_stable oid=%lu:%lu ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
+            printf("je_stable oid=%lx:%lx ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
        }
        else if (je->type == JE_ROLLBACK)
        {
-            printf("je_rollback oid=%lu:%lu ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
+            printf("je_rollback oid=%lx:%lx ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
        }
        else if (je->type == JE_DELETE)
        {
-            printf("je_delete oid=%lu:%lu ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
+            printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
        }
        pos += je->size;
        entry++;
--- a/epoll_manager.cpp
+++ b/epoll_manager.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include <sys/epoll.h>
 #include <sys/poll.h>
 #include <unistd.h>
@ -16,7 +19,7 @@ epoll_manager_t::epoll_manager_t(ring_loop_t *ringloop)
        throw std::runtime_error(std::string("epoll_create: ") + strerror(errno));
    }

-    tfd = new timerfd_manager_t([this](int fd, std::function<void(int, int)> handler) { set_fd_handler(fd, handler); });
+    tfd = new timerfd_manager_t([this](int fd, bool wr, std::function<void(int, int)> handler) { set_fd_handler(fd, wr, handler); });

    handle_epoll_events();
 }
@ -31,14 +34,14 @@ epoll_manager_t::~epoll_manager_t()
    close(epoll_fd);
 }

-void epoll_manager_t::set_fd_handler(int fd, std::function<void(int, int)> handler)
+void epoll_manager_t::set_fd_handler(int fd, bool wr, std::function<void(int, int)> handler)
 {
    if (handler != NULL)
    {
        bool exists = epoll_handlers.find(fd) != epoll_handlers.end();
        epoll_event ev;
        ev.data.fd = fd;
-        ev.events = EPOLLOUT | EPOLLIN | EPOLLRDHUP | EPOLLET;
+        ev.events = (wr ? EPOLLOUT : 0) | EPOLLIN | EPOLLRDHUP | EPOLLET;
        if (epoll_ctl(epoll_fd, exists ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev) < 0)
        {
            throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
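The effect of the new `wr` flag on the epoll event mask can be shown with a tiny helper. `fd_event_mask()` is a hypothetical function that only mirrors the expression used in `set_fd_handler()`. The likely motivation, which is an assumption here, is that descriptors which are never written to do not need `EPOLLOUT` notifications.

```cpp
#include <sys/epoll.h>
#include <cstdint>

// Hypothetical helper: builds the same event mask as set_fd_handler().
// EPOLLOUT is requested only when the caller wants write readiness events.
uint32_t fd_event_mask(bool wr)
{
    return (wr ? EPOLLOUT : 0) | EPOLLIN | EPOLLRDHUP | EPOLLET;
}
```

A caller that only reads from a descriptor would pass `wr = false`, while a socket with an outgoing queue would pass `wr = true`.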
--- a/epoll_manager.h
+++ b/epoll_manager.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include <map>
@ -13,7 +16,7 @@ class epoll_manager_t
 public:
    epoll_manager_t(ring_loop_t *ringloop);
    ~epoll_manager_t();
-    void set_fd_handler(int fd, std::function<void(int, int)> handler);
+    void set_fd_handler(int fd, bool wr, std::function<void(int, int)> handler);
    void handle_epoll_events();

    timerfd_manager_t *tfd;
--- a/etcd_state_client.cpp
+++ b/etcd_state_client.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include "osd_ops.h"
 #include "pg_states.h"
 #include "etcd_state_client.h"
@ -81,7 +84,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
    this->etcd_prefix = config["etcd_prefix"].string_value();
    if (this->etcd_prefix == "")
    {
-        this->etcd_prefix = "/microceph";
+        this->etcd_prefix = "/vitastor";
    }
    else if (this->etcd_prefix[0] != '/')
    {
@ -233,6 +236,11 @@ void etcd_state_client_t::load_global_config()
 void etcd_state_client_t::load_pgs()
 {
    json11::Json::array txn = {
+        json11::Json::object {
+            { "request_range", json11::Json::object {
+                { "key", base64_encode(etcd_prefix+"/config/pools") },
+            } }
+        },
        json11::Json::object {
            { "request_range", json11::Json::object {
                { "key", base64_encode(etcd_prefix+"/config/pgs") },
@ -293,47 +301,162 @@ void etcd_state_client_t::load_pgs()

 void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
 {
-    if (key == etcd_prefix+"/config/pgs")
+    if (key == etcd_prefix+"/config/pools")
    {
-        for (auto & pg_item: this->pg_config)
+        for (auto & pool_item: this->pool_config)
+        {
+            pool_item.second.exists = false;
+        }
+        for (auto & pool_item: value.object_items())
+        {
+            pool_id_t pool_id = stoull_full(pool_item.first);
+            if (!pool_id || pool_id >= POOL_ID_MAX)
+            {
+                printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
+                continue;
+            }
+            if (pool_item.second["pg_size"].uint64_value() < 1 ||
+                pool_item.second["scheme"] == "xor" && pool_item.second["pg_size"].uint64_value() < 3)
+            {
+                printf("Pool %u has invalid pg_size, skipping pool\n", pool_id);
+                continue;
+            }
+            if (pool_item.second["pg_minsize"].uint64_value() < 1 ||
+                pool_item.second["pg_minsize"].uint64_value() > pool_item.second["pg_size"].uint64_value() ||
+                pool_item.second["pg_minsize"].uint64_value() < (pool_item.second["pg_size"].uint64_value() - 1))
+            {
+                printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
+                continue;
+            }
+            if (pool_item.second["pg_count"].uint64_value() < 1)
+            {
+                printf("Pool %u has invalid pg_count, skipping pool\n", pool_id);
+                continue;
+            }
+            if (pool_item.second["name"].string_value() == "")
+            {
+                printf("Pool %u has empty name, skipping pool\n", pool_id);
+                continue;
+            }
+            if (pool_item.second["scheme"] != "replicated" && pool_item.second["scheme"] != "xor")
+            {
+                printf("Pool %u has invalid coding scheme (only \"xor\" and \"replicated\" are allowed), skipping pool\n", pool_id);
+                continue;
+            }
+            if (pool_item.second["max_osd_combinations"].uint64_value() > 0 &&
+                pool_item.second["max_osd_combinations"].uint64_value() < 100)
+            {
+                printf("Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
+                continue;
+            }
+            auto & parsed_cfg = this->pool_config[pool_id];
+            parsed_cfg.exists = true;
+            parsed_cfg.id = pool_id;
+            parsed_cfg.name = pool_item.second["name"].string_value();
+            parsed_cfg.scheme = pool_item.second["scheme"] == "replicated" ? POOL_SCHEME_REPLICATED : POOL_SCHEME_XOR;
+            parsed_cfg.pg_size = pool_item.second["pg_size"].uint64_value();
+            parsed_cfg.pg_minsize = pool_item.second["pg_minsize"].uint64_value();
+            parsed_cfg.pg_count = pool_item.second["pg_count"].uint64_value();
+            parsed_cfg.failure_domain = pool_item.second["failure_domain"].string_value();
+            parsed_cfg.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
+            if (!parsed_cfg.pg_stripe_size)
+            {
+                parsed_cfg.pg_stripe_size = DEFAULT_PG_STRIPE_SIZE;
+            }
+            parsed_cfg.max_osd_combinations = pool_item.second["max_osd_combinations"].uint64_value();
+            if (!parsed_cfg.max_osd_combinations)
+            {
+                parsed_cfg.max_osd_combinations = 10000;
+            }
+            for (auto & pg_item: parsed_cfg.pg_config)
+            {
+                if (pg_item.second.target_set.size() != parsed_cfg.pg_size)
+                {
+                    printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
+                        pool_id, pg_item.first, pg_item.second.target_set.size(), parsed_cfg.pg_size);
+                    pg_item.second.pause = true;
+                }
+            }
+        }
+    }
+    else if (key == etcd_prefix+"/config/pgs")
+    {
+        for (auto & pool_item: this->pool_config)
+        {
+            for (auto & pg_item: pool_item.second.pg_config)
            {
                pg_item.second.exists = false;
            }
-        for (auto & pg_item: value["items"].object_items())
+        }
+        for (auto & pool_item: value["items"].object_items())
+        {
+            pool_id_t pool_id = stoull_full(pool_item.first);
+            if (!pool_id || pool_id >= POOL_ID_MAX)
+            {
+                printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
+                continue;
+            }
+            for (auto & pg_item: pool_item.second.object_items())
            {
                pg_num_t pg_num = stoull_full(pg_item.first);
                if (!pg_num)
                {
-                printf("Bad key in PG configuration: %s (must be a number), skipped\n", pg_item.first.c_str());
+                    printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
                    continue;
                }
-            this->pg_config[pg_num].exists = true;
-            this->pg_config[pg_num].pause = pg_item.second["pause"].bool_value();
-            this->pg_config[pg_num].primary = pg_item.second["primary"].uint64_value();
-            this->pg_config[pg_num].target_set.clear();
-            for (auto pg_osd: pg_item.second["osd_set"].array_items())
+                auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
+                parsed_cfg.exists = true;
+                parsed_cfg.pause = pg_item.second["pause"].bool_value();
+                parsed_cfg.primary = pg_item.second["primary"].uint64_value();
+                parsed_cfg.target_set.clear();
+                for (auto & pg_osd: pg_item.second["osd_set"].array_items())
                {
-                this->pg_config[pg_num].target_set.push_back(pg_osd.uint64_value());
+                    parsed_cfg.target_set.push_back(pg_osd.uint64_value());
                }
-            if (this->pg_config[pg_num].target_set.size() != 3)
+                if (parsed_cfg.target_set.size() != pool_config[pool_id].pg_size)
                {
-                printf("Bad PG %u config format: incorrect osd_set = %s\n", pg_num, pg_item.second["osd_set"].dump().c_str());
-                this->pg_config[pg_num].target_set.resize(3);
-                this->pg_config[pg_num].pause = true;
+                    printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
+                        pool_id, pg_num, parsed_cfg.target_set.size(), pool_config[pool_id].pg_size);
+                    parsed_cfg.pause = true;
                }
            }
        }
+        for (auto & pool_item: this->pool_config)
+        {
+            int n = 0;
+            for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
+            {
+                if (pg_it->second.exists && pg_it->first != ++n)
+                {
+                    printf(
+                        "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
+                        pool_item.second.id, pool_item.second.pg_config.size()
+                    );
+                    for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
+                    {
+                        pg_it->second.exists = false;
+                    }
+                    n = 0;
+                    break;
+                }
+            }
+            pool_item.second.real_pg_count = n;
+        }
+    }
    else if (key.substr(0, etcd_prefix.length()+12) == etcd_prefix+"/pg/history/")
    {
-        // <etcd_prefix>/pg/history/%d
-        pg_num_t pg_num = stoull_full(key.substr(etcd_prefix.length()+12));
-        if (!pg_num)
+        // <etcd_prefix>/pg/history/%d/%d
+        pool_id_t pool_id = 0;
+        pg_num_t pg_num = 0;
+        char null_byte = 0;
+        sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
+        if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
        {
            printf("Bad etcd key %s, ignoring\n", key.c_str());
        }
        else
        {
-            auto & pg_cfg = this->pg_config[pg_num];
+            auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
            pg_cfg.target_history.clear();
            pg_cfg.all_peers.clear();
            // Refuse to start PG if any set of the <osd_sets> has no live OSDs
@ -351,20 +474,29 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
            {
                pg_cfg.all_peers.push_back(pg_osd.uint64_value());
            }
+            // Read epoch
+            pg_cfg.epoch = value["epoch"].uint64_value();
+            if (on_change_pg_history_hook != NULL)
+            {
+                on_change_pg_history_hook(pool_id, pg_num);
+            }
        }
    }
    else if (key.substr(0, etcd_prefix.length()+10) == etcd_prefix+"/pg/state/")
    {
-        // <etcd_prefix>/pg/state/%d
-        pg_num_t pg_num = stoull_full(key.substr(etcd_prefix.length()+10));
-        if (!pg_num)
+        // <etcd_prefix>/pg/state/%d/%d
+        pool_id_t pool_id = 0;
+        pg_num_t pg_num = 0;
+        char null_byte = 0;
+        sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
+        if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
        {
            printf("Bad etcd key %s, ignoring\n", key.c_str());
        }
        else if (value.is_null())
        {
-            this->pg_config[pg_num].cur_primary = 0;
-            this->pg_config[pg_num].cur_state = 0;
+            this->pool_config[pool_id].pg_config[pg_num].cur_primary = 0;
+            this->pool_config[pool_id].pg_config[pg_num].cur_state = 0;
        }
        else
        {
@ -383,7 +515,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
                }
                if (i >= pg_state_bit_count)
                {
-                    printf("Unexpected PG %u state keyword in etcd: %s\n", pg_num, e.dump().c_str());
+                    printf("Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
                    return;
                }
            }
@ -392,11 +524,11 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
                (state & PG_PEERING) && state != PG_PEERING ||
                (state & PG_INCOMPLETE) && state != PG_INCOMPLETE)
            {
-                printf("Unexpected PG %u state in etcd: primary=%lu, state=%s\n", pg_num, cur_primary, value["state"].dump().c_str());
+                printf("Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
                return;
            }
-            this->pg_config[pg_num].cur_primary = cur_primary;
-            this->pg_config[pg_num].cur_state = state;
+            this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
+            this->pool_config[pool_id].pg_config[pg_num].cur_state = state;
        }
    }
    else if (key.substr(0, etcd_prefix.length()+11) == etcd_prefix+"/osd/state/")
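
Note on the key format: the `/pg/history/` and `/pg/state/` branches above now expect `<pool_id>/<pg_num>` after the prefix and use a trailing `%c` conversion to detect extra characters. A minimal standalone sketch of that idea follows. `parse_pool_pg` is a hypothetical helper, not a function in this patch; plain `unsigned` stands in for `pool_id_t` and `pg_num_t`, whose real definitions are not shown here, and the `POOL_ID_MAX` bound is left out.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the patch): parses "<pool_id>/<pg_num>".
// Returns true only when both numbers are present, both are non-zero,
// and nothing follows the second number.
static bool parse_pool_pg(const std::string & suffix, unsigned & pool_id, unsigned & pg_num)
{
    char extra = 0;
    pool_id = 0;
    pg_num = 0;
    // sscanf returns the number of converted fields:
    // 2 means "number/number" with nothing after it,
    // 3 means a trailing character was also read into `extra`.
    int n = std::sscanf(suffix.c_str(), "%u/%u%c", &pool_id, &pg_num, &extra);
    return n == 2 && extra == 0 && pool_id != 0 && pg_num != 0;
}
```

Checking the return value in addition to the sentinel byte is a small variation on the patch, which relies on zero-initialised outputs only. Note that `%u` still accepts leading whitespace and a sign, so stricter validation would need a different parser.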
--- a/etcd_state_client.h
+++ b/etcd_state_client.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include "osd_id.h"
@ -13,6 +16,14 @@
 #define ETCD_SLOW_TIMEOUT 5000
 #define ETCD_QUICK_TIMEOUT 1000

+#define DEFAULT_PG_STRIPE_SIZE (4*1024*1024)
+
+struct json_kv_t
+{
+    std::string key;
+    json11::Json value;
+};
+
 struct pg_config_t
 {
    bool exists;
@ -23,12 +34,22 @@ struct pg_config_t
    bool pause;
    osd_num_t cur_primary;
    int cur_state;
+    uint64_t epoch;
 };

-struct json_kv_t
+struct pool_config_t
 {
-    std::string key;
-    json11::Json value;
+    bool exists;
+    pool_id_t id;
+    std::string name;
+    uint64_t scheme;
+    uint64_t pg_size, pg_minsize;
+    uint64_t pg_count;
+    uint64_t real_pg_count;
+    std::string failure_domain;
+    uint64_t max_osd_combinations;
+    uint64_t pg_stripe_size;
+    std::map<pg_num_t, pg_config_t> pg_config;
 };

 struct etcd_state_client_t
@ -41,14 +62,15 @@ struct etcd_state_client_t
    int etcd_watches_initialised = 0;
    uint64_t etcd_watch_revision = 0;
    websocket_t *etcd_watch_ws = NULL;
-    std::map<pg_num_t, pg_config_t> pg_config;
+    std::map<pool_id_t, pool_config_t> pool_config;
    std::map<osd_num_t, json11::Json> peer_states;

    std::function<void(json11::Json::object &)> on_change_hook;
    std::function<void(json11::Json::object &)> on_load_config_hook;
    std::function<json11::Json()> load_pgs_checks_hook;
    std::function<void(bool)> on_load_pgs_hook;
-    std::function<void(uint64_t)> on_change_osd_state_hook;
+    std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
+    std::function<void(osd_num_t)> on_change_osd_state_hook;

    json_kv_t parse_etcd_kv(const json11::Json & kv_json);
    void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
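
Note on arithmetic macros: an object-like macro whose body is an expression is safest when fully parenthesized, because the preprocessor pastes the body as text and neighbouring operators such as `%` or `/` then bind to its first term only. The sketch below illustrates this with two hypothetical macros, `STRIPE_BARE` and `STRIPE_PAREN`; neither name comes from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macros for illustration only.
#define STRIPE_BARE 4*1024*1024
#define STRIPE_PAREN (4*1024*1024)

// Expands to (x % 4) * 1024 * 1024, which is not an offset within a stripe.
static uint64_t offset_bare(uint64_t x)
{
    return x % STRIPE_BARE;
}

// Expands to x % (4*1024*1024), the offset within a 4 MiB stripe.
static uint64_t offset_paren(uint64_t x)
{
    return x % STRIPE_PAREN;
}
```

For an input of 4 MiB + 5, the parenthesized form yields 5, while the bare form yields 1 MiB.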
--- a/fio_cluster.cpp
+++ b/fio_cluster.cpp
@ -1,19 +1,22 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 // FIO engine to test cluster I/O
 //
 // Random write:
 //
 // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \
-//     -etcd=127.0.0.1:2379 [-etcd_prefix=/microceph] -size=1000M
+//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
 //
 // Linear write:
 //
 // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=128k -direct=1 -fsync=32 -iodepth=32 -rw=write \
-//     -etcd=127.0.0.1:2379 [-etcd_prefix=/microceph] -size=1000M
+//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
 //
 // Random read (run with -iodepth=32 or -iodepth=1):
 //
 // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=32 -rw=randread \
-//     -etcd=127.0.0.1:2379 [-etcd_prefix=/microceph] -size=1000M
+//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M

 #include <sys/types.h>
 #include <sys/socket.h>
@ -49,7 +52,9 @@ struct sec_options
    int __pad;
    char *etcd_host = NULL;
    char *etcd_prefix = NULL;
-    int inode = 0;
+    uint64_t pool = 0;
+    uint64_t inode = 0;
+    int cluster_log = 0;
    int trace = 0;
 };

@ -68,7 +73,16 @@ static struct fio_option options[] = {
        .lname  = "etcd key prefix",
        .type   = FIO_OPT_STR_STORE,
        .off1   = offsetof(struct sec_options, etcd_prefix),
-        .help   = "etcd key prefix, by default /microceph",
+        .help   = "etcd key prefix, by default /vitastor",
+        .category = FIO_OPT_C_ENGINE,
+        .group  = FIO_OPT_G_FILENAME,
+    },
+    {
+        .name   = "pool",
+        .lname  = "pool number for the inode",
+        .type   = FIO_OPT_INT,
+        .off1   = offsetof(struct sec_options, pool),
+        .help   = "pool number for the inode to run tests on",
        .category = FIO_OPT_C_ENGINE,
        .group  = FIO_OPT_G_FILENAME,
    },
@ -81,6 +95,16 @@ static struct fio_option options[] = {
        .category = FIO_OPT_C_ENGINE,
        .group  = FIO_OPT_G_FILENAME,
    },
+    {
+        .name   = "cluster_log_level",
+        .lname  = "cluster log level",
+        .type   = FIO_OPT_INT,
+        .off1   = offsetof(struct sec_options, cluster_log),
+        .help   = "Set log level for the Vitastor client",
+        .def    = "0",
+        .category = FIO_OPT_C_ENGINE,
+        .group  = FIO_OPT_G_FILENAME,
+    },
    {
        .name   = "osd_trace",
        .lname  = "OSD trace",
@ -140,9 +164,17 @@ static int sec_init(struct thread_data *td)

    json11::Json cfg = json11::Json::object {
        { "etcd_address", std::string(o->etcd_host) },
-        { "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/microceph") },
+        { "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
+        { "log_level", o->cluster_log },
    };

+    if (o->pool)
+        o->inode = (o->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | ((uint64_t)o->pool << (64-POOL_ID_BITS));
+    if (!(o->inode >> (64-POOL_ID_BITS)))
+    {
+        td_verror(td, EINVAL, "pool is missing");
+        return 1;
+    }
    bsd->ringloop = new ring_loop_t(512);
    bsd->epmgr = new epoll_manager_t(bsd->ringloop);
    bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
@ -175,7 +207,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
        op->inode = opt->inode;
        op->offset = io->offset;
        op->len = io->xfer_buflen;
-        op->buf = io->xfer_buf;
+        op->iov.push_back(io->xfer_buf, io->xfer_buflen);
        bsd->last_sync = false;
        break;
    case DDIR_WRITE:
@ -183,7 +215,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
        op->inode = opt->inode;
        op->offset = io->offset;
        op->len = io->xfer_buflen;
-        op->buf = io->xfer_buf;
+        op->iov.push_back(io->xfer_buf, io->xfer_buflen);
        bsd->last_sync = false;
        break;
    case DDIR_SYNC:
@ -211,8 +243,16 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)

    if (opt->trace)
    {
-        printf("+++ %s # %d\n", io->ddir == DDIR_READ ? "READ" :
-            (io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), n);
+        if (io->ddir == DDIR_SYNC)
+        {
+            printf("+++ SYNC # %d\n", n);
+        }
+        else
+        {
+            printf("+++ %s # %d 0x%llx+%llx\n",
+                io->ddir == DDIR_READ ? "READ" : "WRITE",
+                n, io->offset, io->xfer_buflen);
+        }
    }

    io->error = 0;
@ -270,7 +310,7 @@ static int sec_invalidate(struct thread_data *td, struct fio_file *f)
 }

 struct ioengine_ops ioengine = {
-    .name               = "microceph_cluster",
+    .name               = "vitastor_cluster",
    .version            = FIO_IOOPS_VERSION,
    .flags              = FIO_MEMALIGN | FIO_DISKLESSIO | FIO_NOEXTEND,
    .setup              = sec_setup,
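
Note on the inode layout: `sec_init()` above places the pool number in the top `POOL_ID_BITS` bits of the 64-bit inode number and refuses to start when those bits are zero. The sketch below shows the packing and unpacking in isolation. The value 16 for `POOL_ID_BITS` is an assumption made for the example (the real definition is not shown in this part of the patch), and `with_pool` and `pool_of` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration; the real constant is defined elsewhere.
constexpr unsigned POOL_ID_BITS = 16;
constexpr unsigned POOL_SHIFT = 64 - POOL_ID_BITS;

// Hypothetical helper: replaces the pool bits of `inode` with `pool`,
// keeping the low (64 - POOL_ID_BITS) bits unchanged.
constexpr uint64_t with_pool(uint64_t inode, uint64_t pool)
{
    return (inode & ((UINT64_C(1) << POOL_SHIFT) - 1)) | (pool << POOL_SHIFT);
}

// Hypothetical helper: extracts the pool number from a packed inode.
constexpr uint64_t pool_of(uint64_t inode)
{
    return inode >> POOL_SHIFT;
}
```

With these assumptions, `-pool=1 -inode=1` corresponds to the packed value `0x0001000000000001`, and an inode whose `pool_of()` is 0 is the case reported as "pool is missing".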
--- a/fio_engine.cpp
+++ b/fio_engine.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 // FIO engine to test Blockstore
 //
 // Initialize storage for tests:
@ -290,7 +293,7 @@ static int bs_invalidate(struct thread_data *td, struct fio_file *f)
 }

 struct ioengine_ops ioengine = {
-    .name               = "microceph_blockstore",
+    .name               = "vitastor_blockstore",
    .version            = FIO_IOOPS_VERSION,
    .flags              = FIO_MEMALIGN | FIO_DISKLESSIO | FIO_NOEXTEND,
    .setup              = bs_setup,
--- a/fio_sec_osd.cpp
+++ b/fio_sec_osd.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 // FIO engine to test Blockstore through Secondary OSD interface
 //
 // Prepare storage like in fio_engine.cpp, then start OSD with ./osd, then test it
@ -205,7 +208,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
    case DDIR_READ:
        if (!opt->single_primary)
        {
-            op.hdr.opcode = OSD_OP_SECONDARY_READ;
+            op.hdr.opcode = OSD_OP_SEC_READ;
            op.sec_rw.oid = {
                .inode = 1,
                .stripe = io->offset >> bsd->block_order,
@ -226,7 +229,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
    case DDIR_WRITE:
        if (!opt->single_primary)
        {
-            op.hdr.opcode = OSD_OP_SECONDARY_WRITE;
+            op.hdr.opcode = OSD_OP_SEC_WRITE;
            op.sec_rw.oid = {
                .inode = 1,
                .stripe = io->offset >> bsd->block_order,
@ -381,7 +384,7 @@ static int sec_invalidate(struct thread_data *td, struct fio_file *f)
 }

 struct ioengine_ops ioengine = {
-    .name               = "microceph_secondary_osd",
+    .name               = "vitastor_secondary_osd",
    .version            = FIO_IOOPS_VERSION,
    .flags              = FIO_MEMALIGN | FIO_DISKLESSIO | FIO_NOEXTEND,
    .setup              = sec_setup,
--- a/http_client.cpp
+++ b/http_client.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include <netinet/tcp.h>
 #include <sys/epoll.h>

@ -156,7 +159,7 @@ http_co_t::~http_co_t()
    }
    if (peer_fd >= 0)
    {
-        tfd->set_fd_handler(peer_fd, NULL);
+        tfd->set_fd_handler(peer_fd, false, NULL);
        close(peer_fd);
        peer_fd = -1;
    }
@ -223,7 +226,7 @@ void http_co_t::start_connection()
        end();
        return;
    }
-    tfd->set_fd_handler(peer_fd, [this](int peer_fd, int epoll_events)
+    tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
    {
        this->epoll_events |= epoll_events;
        handle_events();
@ -276,6 +279,11 @@ void http_co_t::handle_connect_result()
    }
    int one = 1;
    setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
+    tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
+    {
+        this->epoll_events |= epoll_events;
+        handle_events();
+    });
    state = HTTP_CO_SENDING_REQUEST;
    submit_send();
    stackout();
@ -297,15 +305,18 @@ void http_co_t::submit_read()
    {
        res = -errno;
    }
-    if (res == -EAGAIN || res == 0)
+    if (res == -EAGAIN)
    {
        epoll_events = epoll_events & ~EPOLLIN;
    }
-    else if (res < 0)
+    else if (res <= 0)
    {
+        // < 0 means error, 0 means EOF
+        if (!res)
+            epoll_events = epoll_events & ~EPOLLIN;
        end();
    }
-    else if (res > 0)
+    else
    {
        response += std::string(rbuf.data(), res);
        handle_read();
--- a/http_client.h
+++ b/http_client.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once
 #include <string>
 #include <vector>
--- a/1
+++ b/1
@ -0,0 +1 @@
+Subproject commit 97f06cb20c1e136fd37d58fb40f57dd8f8a3a4a7
--- a/lambda_size.cpp
+++ b/lambda_size.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <iostream>
 #include <functional>
 #include <array>
--- a/lp/mon.js
+++ b/lp/mon.js
@ -1,858 +0,0 @@
-const http = require('http');
-const os = require('os');
-const WebSocket = require('ws');
-const LPOptimizer = require('./lp-optimizer.js');
-const stableStringify = require('./stable-stringify.js');
-
-class Mon
-{
-    static etcd_tree = {
-        config: {
-            global: null,
-            /* placement_tree = {
-                levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
-                nodes: { host1: { level: 'host', parent: 'rack1' }, ... },
-                failure_domain: 'host',
-            } */
-            placement_tree: null,
-            osd: {},
-            pgs: {},
-        },
-        osd: {
-            state: {},
-            stats: {},
-        },
-        mon: {
-            master: null,
-        },
-        pg: {
-            change_stamp: null,
-            state: {},
-            stats: {},
-            history: {},
-        },
-    }
-
-    constructor(config)
-    {
-        // FIXME: Maybe prefer local etcd
-        this.etcd_urls = [];
-        for (let url of config.etcd_url.split(/,/))
-        {
-            let scheme = 'http';
-            url = url.trim().replace(/^(https?):\/\//, (m, m1) => { scheme = m1; return ''; });
-            if (!/\/[^\/]/.exec(url))
-                url += '/v3';
-            this.etcd_urls.push(scheme+'://'+url);
-        }
-        this.etcd_prefix = config.etcd_prefix || '/rage';
-        this.etcd_prefix = this.etcd_prefix.replace(/\/\/+/g, '/').replace(/^\/?(.*[^\/])\/?$/, '/$1');
-        this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
-        this.state = JSON.parse(JSON.stringify(Mon.etcd_tree));
-    }
-
-    async start()
-    {
-        await this.load_config();
-        await this.get_lease();
-        await this.become_master();
-        await this.load_cluster_state();
-        await this.start_watcher();
-        await this.recheck_pgs();
-    }
-
-    async load_config()
-    {
-        const res = await this.etcd_call('/txn', { success: [
-            { requestRange: { key: b64(this.etcd_prefix+'/config/global') } }
-        ] }, this.etcd_start_timeout, -1);
-        this.parse_kv(res.responses[0].response_range.kvs[0]);
-        this.check_config();
-    }
-
-    check_config()
-    {
-        this.config.etcd_mon_timeout = Number(this.config.etcd_mon_timeout) || 0;
-        if (this.config.etcd_mon_timeout <= 0)
-        {
-            this.config.etcd_mon_timeout = 1000;
-        }
-        this.config.etcd_mon_retries = Number(this.config.etcd_mon_retries) || 5;
-        if (this.config.etcd_mon_retries < 0)
-        {
-            this.config.etcd_mon_retries = 0;
-        }
-        this.config.mon_change_timeout = Number(this.config.mon_change_timeout) || 1000;
-        if (this.config.mon_change_timeout < 100)
-        {
-            this.config.mon_change_timeout = 100;
-        }
-        this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
-        if (this.config.mon_stats_timeout < 100)
-        {
-            this.config.mon_stats_timeout = 100;
-        }
-        // After this number of seconds, a dead OSD will be removed from PG distribution
-        this.config.osd_out_time = Number(this.config.osd_out_time) || 0;
-        if (!this.config.osd_out_time)
-        {
-            this.config.osd_out_time = 30*60; // 30 minutes by default
-        }
-        this.config.max_osd_combinations = Number(this.config.max_osd_combinations) || 10000;
-        if (this.config.max_osd_combinations < 100)
-        {
-            this.config.max_osd_combinations = 100;
-        }
-    }
-
-    async start_watcher(retries)
-    {
-        let retry = 0;
-        if (retries >= 0 && retries < 1)
-        {
-            retries = 1;
-        }
-        while (retries < 0 || retry < retries)
-        {
-            const base = 'ws'+this.etcd_urls[Math.floor(Math.random()*this.etcd_urls.length)].substr(4);
-            const ok = await new Promise((ok, no) =>
-            {
-                const timer_id = setTimeout(() =>
-                {
-                    this.ws.close();
-                    ok(false);
-                }, timeout);
-                this.ws = new WebSocket(base+'/watch');
-                this.ws.on('open', () =>
-                {
-                    if (timer_id)
-                        clearTimeout(timer_id);
-                    ok(true);
-                });
-            });
-            if (!ok)
-            {
-                this.ws = null;
-            }
-            retry++;
-        }
-        if (!this.ws)
-        {
-            this.die('Failed to open etcd watch websocket');
-        }
-        this.ws.send(JSON.stringify({
-            create_request: {
-                key: b64(this.etcd_prefix+'/'),
-                range_end: b64(this.etcd_prefix+'0'),
-                start_revision: ''+this.etcd_watch_revision,
-                watch_id: 1,
-            },
-        }));
-        this.ws.on('message', (msg) =>
-        {
-            let data;
-            try
-            {
-                data = JSON.parse(msg);
-            }
-            catch (e)
-            {
-            }
-            if (!data || !data.result || !data.result.events)
-            {
-                console.error('Garbage received from watch websocket: '+msg);
-            }
-            else
-            {
-                let stats_changed = false, changed = false;
-                console.log('Revision '+data.result.header.revision+' events: ');
-                for (const e of data.result.events)
-                {
-                    this.parse_kv(e.kv);
-                    const key = e.kv.key.substr(this.etcd_prefix.length);
-                    if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/')
-                    {
-                        stats_changed = true;
-                    }
-                    else if (key != '/stats')
-                    {
-                        changed = true;
-                    }
-                    console.log(e);
-                }
-                if (stats_changed)
-                {
-                    this.schedule_update_stats();
-                }
-                if (changed)
-                {
-                    this.schedule_recheck();
-                }
-            }
-        });
-    }
-
-    async get_lease()
-    {
-        const max_ttl = this.config.etcd_mon_ttl + this.config.etcd_mon_timeout/1000*this.config.etcd_mon_retries;
-        const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
-        this.etcd_lease_id = res.ID;
-        setInterval(async () =>
-        {
-            const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
-            if (!res.result.TTL)
-            {
-                this.die('Lease expired');
-            }
-        }, config.etcd_mon_timeout);
-    }
-
-    async become_master()
-    {
-        const state = { ip: this.local_ips() };
-        while (1)
-        {
-            const res = await this.etcd_call('/txn', {
-                compare: [ { target: 'CREATE', create_revision: 0, key: b64(this.etcd_prefix+'/mon/master') } ],
-                success: [ { key: b64(this.etcd_prefix+'/mon/master'), value: b64(JSON.stringify(state)), lease: ''+this.etcd_lease_id } ],
-            }, this.etcd_start_timeout, 0);
-            if (!res.succeeded)
-            {
-                await new Promise(ok => setTimeout(ok, this.etcd_start_timeout));
-            }
-        }
-    }
-
-    async load_cluster_state()
-    {
-        const res = await this.etcd_call('/txn', { success: [
-            { requestRange: { key: b64(this.etcd_prefix+'/'), range_end: b64(this.etcd_prefix+'0') } },
-        ] }, this.etcd_start_timeout, -1);
-        this.etcd_watch_revision = BigInt(res.header.revision)+BigInt(1);
-        const data = JSON.parse(JSON.stringify(Mon.etcd_tree));
-        for (const response of res.responses)
-        {
-            for (const kv of response.response_range.kvs)
-            {
-                this.parse_kv(kv);
-            }
-        }
-        this.state = data;
-    }
-
-    all_osds()
-    {
-        return Object.keys(this.state.osd.stats);
-    }
-
-    get_osd_tree()
-    {
-        this.state.config.placement_tree = this.state.config.placement_tree||{};
-        const levels = this.state.config.placement_tree.levels||{};
-        levels.host = levels.host || 100;
-        levels.osd = levels.osd || 101;
-        const tree = { '': { children: [] } };
-        for (const node_id in this.state.config.placement_tree.nodes||{})
-        {
-            const node_cfg = this.state.config.placement_tree.nodes[node_id];
-            if (!node_id || /^\d/.exec(node_id) ||
-                !node_cfg.level || !levels[node_cfg.level])
-            {
-                // All nodes must have non-empty non-numeric IDs and valid levels
-                continue;
-            }
-            tree[node_id] = { id: node_id, level: node_cfg.level, parent: node_cfg.parent, children: [] };
-        }
-        // This requires monitor system time to be in sync with OSD system times (at least to some extent)
-        const down_time = Date.now()/1000 - this.config.osd_out_time;
-        for (const osd_num of this.all_osds().sort((a, b) => a - b))
-        {
-            const stat = this.state.osd.stats[osd_num];
-            if (stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
-            {
-                // Numeric IDs are reserved for OSDs
-                const reweight = this.state.config.osd[osd_num] && Number(this.state.config.osd[osd_num].reweight) || 1;
-                tree[osd_num] = tree[osd_num] || { id: osd_num, parent: stat.host };
-                tree[osd_num].level = 'osd';
-                tree[osd_num].size = reweight * stat.size / 1024 / 1024 / 1024 / 1024; // terabytes
-                delete tree[osd_num].children;
-            }
-        }
-        for (const node_id in tree)
-        {
-            if (node_id === '')
-            {
-                continue;
-            }
-            const node_cfg = tree[node_id];
-            const node_level = levels[node_cfg.level] || node_cfg.level;
-            let parent_level = node_cfg.parent && tree[node_cfg.parent] && tree[node_cfg.parent].children
-                && tree[node_cfg.parent].level;
-            parent_level = parent_level ? (levels[parent_level] || parent_level) : null;
-            // Parent's level must be less than child's; OSDs must be leaves
-            const parent = parent_level && parent_level < node_level ? tree[node_cfg.parent] : '';
-            tree[parent].children.push(tree[node_id]);
-            delete node_cfg.parent;
-        }
-        return LPOptimizer.flatten_tree(tree[''].children, levels, this.state.config.failure_domain, 'osd');
-    }
-
-    async stop_all_pgs()
-    {
-        let has_online = false, paused = true;
-        for (const pg in this.state.config.pgs.items||{})
-        {
-            const cur_state = ((this.state.pg.state[pg]||{}).state||[]).join(',');
-            if (cur_state != '' && cur_state != 'offline')
-            {
-                has_online = true;
-            }
-            if (!this.state.config.pgs.items[pg].pause)
-            {
-                paused = false;
-            }
-        }
-        if (!paused)
-        {
-            console.log('Stopping all PGs before changing PG count');
-            const new_cfg = JSON.parse(JSON.stringify(this.state.config.pgs));
-            for (const pg in new_cfg.items)
-            {
-                new_cfg.items[pg].pause = true;
-            }
-            // Check that no OSDs change their state before we pause PGs
-            // Doing this we make sure that OSDs don't wake up in the middle of our "transaction"
-            // and can't see the old PG configuration
-            const checks = [];
-            for (const osd_num of this.all_osds())
-            {
-                const key = b64(this.etcd_prefix+'/osd/state/'+osd_num);
-                checks.push({ key, target: 'MOD', result: 'LESS', mod_revision: ''+this.etcd_watch_revision });
-            }
-            const res = await this.etcd_call('/txn', {
-                compare: [
-                    { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
-                    { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
-                    ...checks,
-                ],
-                success: [
-                    { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_cfg)) } },
-                ],
-            }, this.config.etcd_mon_timeout, 0);
-            if (!res.succeeded)
-            {
-                return false;
-            }
-            this.state.config.pgs = new_cfg;
-        }
-        return !has_online;
-    }
-
-    scale_pg_count(prev_pgs, pg_history, new_pg_count)
-    {
-        const old_pg_count = prev_pgs.length;
-        // Add all possibly intersecting PGs into the history of new PGs
-        if (!(new_pg_count % old_pg_count))
-        {
-            // New PG count is a multiple of the old PG count
-            const mul = (new_pg_count / old_pg_count);
-            for (let i = 0; i < new_pg_count; i++)
-            {
-                const old_i = Math.floor(new_pg_count / mul);
-                pg_history[i] = JSON.parse(JSON.stringify(this.state.pg.history[1+old_i]));
-            }
-        }
-        else if (!(old_pg_count % new_pg_count))
-        {
-            // Old PG count is a multiple of the new PG count
-            const mul = (old_pg_count / new_pg_count);
-            for (let i = 0; i < new_pg_count; i++)
-            {
-                pg_history[i] = {
-                    osd_sets: [],
-                    all_peers: [],
-                };
-                for (let j = 0; j < mul; j++)
-                {
-                    pg_history[i].osd_sets.push(prev_pgs[i*mul]);
-                    const hist = this.state.pg.history[1+i*mul+j];
-                    if (hist && hist.osd_sets && hist.osd_sets.length)
-                    {
-                        Array.prototype.push.apply(pg_history[i].osd_sets, hist.osd_sets);
-                    }
-                    if (hist && hist.all_peers && hist.all_peers.length)
-                    {
-                        Array.prototype.push.apply(pg_history[i].all_peers, hist.all_peers);
-                    }
-                }
-            }
-        }
-        else
-        {
-            // Any PG may intersect with any PG after non-multiple PG count change
-            // So, merge ALL PGs history
-            let all_sets = {};
-            let all_peers = {};
-            for (const pg of prev_pgs)
-            {
-                all_sets[pg.join(' ')] = pg;
-            }
-            for (const pg in this.state.pg.history)
-            {
-                const hist = this.state.pg.history[pg];
-                if (hist && hist.osd_sets)
-                {
-                    for (const pg of hist.osd_sets)
-                    {
-                        all_sets[pg.join(' ')] = pg;
-                    }
-                }
-                if (hist && hist.all_peers)
-                {
-                    for (const osd_num of hist.all_peers)
-                    {
-                        all_peers[osd_num] = Number(osd_num);
-                    }
-                }
-            }
-            all_sets = Object.values(all_sets);
-            all_peers = Object.values(all_peers);
-            for (let i = 0; i < new_pg_count; i++)
-            {
-                pg_history[i] = { osd_sets: all_sets, all_peers };
-            }
-        }
-        // Mark history keys for removed PGs as removed
-        for (let i = new_pg_count; i < old_pg_count; i++)
-        {
-            pg_history[i] = null;
-        }
-        if (old_pg_count < new_pg_count)
-        {
-            for (let i = new_pg_count-1; i >= 0; i--)
-            {
-                prev_pgs[i] = prev_pgs[Math.floor(i/new_pg_count*old_pg_count)];
-            }
-        }
-        else if (old_pg_count > new_pg_count)
-        {
-            for (let i = 0; i < new_pg_count; i++)
-            {
-                prev_pgs[i] = prev_pgs[Math.round(i/new_pg_count*old_pg_count)];
-            }
-            prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
-        }
-    }
-
-    async save_new_pgs(prev_pgs, new_pgs, pg_history, tree_hash)
-    {
-        const txn = [], checks = [];
-        const pg_items = {};
-        new_pgs.map((osd_set, i) =>
-        {
-            osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
-            const alive_set = osd_set.filter(osd_num => osd_num);
-            pg_items[i+1] = {
-                osd_set,
-                primary: alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0,
-            };
-            if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' '))
-            {
-                pg_history[i] = pg_history[i] || {};
-                pg_history[i].osd_sets = pg_history[i].osd_sets || [];
-                pg_history[i].osd_sets.push(prev_pgs[i]);
-            }
-        });
-        for (let i = 0; i < new_pgs.length || i < prev_pgs.length; i++)
-        {
-            checks.push({
-                key: b64(this.etcd_prefix+'/pg/history/'+(i+1)),
-                target: 'MOD',
-                mod_revision: ''+this.etcd_watch_revision,
-                result: 'LESS',
-            });
-            if (pg_history[i])
-            {
-                txn.push({
-                    requestPut: {
-                        key: b64(this.etcd_prefix+'/pg/history/'+(i+1)),
-                        value: b64(JSON.stringify(pg_history[i])),
-                    },
-                });
-            }
-            else
-            {
-                txn.push({
-                    requestDeleteRange: {
-                        key: b64(this.etcd_prefix+'/pg/history/'+(i+1)),
-                    },
-                });
-            }
-        }
-        this.state.config.pgs = {
-            hash: tree_hash,
-            items: pg_items,
-        };
-        const res = await this.etcd_call('/txn', {
-            compare: [
-                { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
-                { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
-                ...checks,
-            ],
-            success: [
-                { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(this.state.config.pgs)) } },
-                ...txn,
-            ],
-        }, this.config.etcd_mon_timeout, 0);
-        return res.succeeded;
-    }
-
-    async recheck_pgs()
-    {
-        // Take configuration and state, check it against the stored configuration hash
-        // Recalculate PGs and save them to etcd if the configuration is changed
-        const tree_cfg = {
-            osd_tree: this.get_osd_tree(),
-            pg_count: this.config.pg_count || Object.keys(this.state.config.pgs.items||{}).length || 128,
-            max_osd_combinations: this.config.max_osd_combinations,
-        };
-        const tree_hash = sha1hex(stableStringify(tree_cfg));
-        if (this.state.config.pgs.hash != tree_hash)
-        {
-            // Something has changed
-            const prev_pgs = [];
-            for (const pg in this.state.config.pgs.items||{})
-            {
-                prev_pgs[pg-1] = this.state.config.pgs.items[pg].osd_set;
-            }
-            const pg_history = [];
-            const old_pg_count = prev_pgs.length;
-            let optimize_result;
-            if (old_pg_count > 0)
-            {
-                if (old_pg_count != tree_cfg.pg_count)
-                {
-                    // PG count changed. Need to bring all PGs down.
-                    if (!await this.stop_all_pgs())
-                    {
-                        this.schedule_recheck();
-                        return;
-                    }
-                    this.scale_pg_count(prev_pgs, pg_history, new_pg_count);
-                }
-                optimize_result = await LPOptimizer.optimize_change(prev_pgs, tree_cfg.osd_tree, tree_cfg.max_osd_combinations);
-            }
-            else
-            {
-                optimize_result = await LPOptimizer.optimize_initial(tree_cfg.osd_tree, tree_cfg.pg_count, tree_cfg.max_osd_combinations);
-            }
-            if (!await this.save_new_pgs(prev_pgs, optimize_result.int_pgs, pg_history, tree_hash))
-            {
-                console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
-                this.schedule_recheck();
-                return;
-            }
-            console.log('PG configuration successfully changed');
-            if (old_pg_count != optimize_result.int_pgs.length)
-            {
-                console.log(`PG count changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`);
-            }
-            LPOptimizer.print_change_stats(optimize_result);
-        }
-    }
-
-    schedule_recheck()
-    {
-        if (this.recheck_timer)
-        {
-            clearTimeout(this.recheck_timer);
-            this.recheck_timer = null;
-        }
-        this.recheck_timer = setTimeout(() =>
-        {
-            this.recheck_timer = null;
-            this.recheck_pgs().catch(console.error);
-        }, this.config.mon_change_timeout || 1000);
-    }
-
-    sum_stats()
-    {
-        let overflow = false;
-        this.prev_stats = this.prev_stats || { op_stats: {}, subop_stats: {}, recovery_stats: {} };
-        const op_stats = {}, subop_stats = {}, recovery_stats = {};
-        for (const osd in this.state.osd.stats)
-        {
-            const st = this.state.osd.stats[osd];
-            for (const op in st.op_stats||{})
-            {
-                op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n };
-                op_stats[op].count += BigInt(st.op_stats.count||0);
-                op_stats[op].usec += BigInt(st.op_stats.usec||0);
-                op_stats[op].bytes += BigInt(st.op_stats.bytes||0);
-            }
-            for (const op in st.subop_stats||{})
-            {
-                subop_stats[op] = subop_stats[op] || { count: 0n, usec: 0n };
-                subop_stats[op].count += BigInt(st.subop_stats.count||0);
-                subop_stats[op].usec += BigInt(st.subop_stats.usec||0);
-            }
-            for (const op in st.recovery_stats||{})
-            {
-                recovery_stats[op] = recovery_stats[op] || { count: 0n, bytes: 0n };
-                recovery_stats[op].count += BigInt(st.recovery_stats.count||0);
-                recovery_stats[op].bytes += BigInt(st.recovery_stats.bytes||0);
-            }
-        }
-        for (const op in op_stats)
-        {
-            if (op_stats[op].count >= 0x10000000000000000n)
-            {
-                if (!this.prev_stats.op_stats[op])
-                {
-                    overflow = true;
-                }
-                else
-                {
-                    op_stats[op].count -= this.prev_stats.op_stats[op].count;
-                    op_stats[op].usec -= this.prev_stats.op_stats[op].usec;
-                    op_stats[op].bytes -= this.prev_stats.op_stats[op].bytes;
-                }
-            }
-        }
-        for (const op in subop_stats)
-        {
-            if (subop_stats[op].count >= 0x10000000000000000n)
-            {
-                if (!this.prev_stats.subop_stats[op])
-                {
-                    overflow = true;
-                }
-                else
-                {
-                    subop_stats[op].count -= this.prev_stats.subop_stats[op].count;
-                    subop_stats[op].usec -= this.prev_stats.subop_stats[op].usec;
-                }
-            }
-        }
-        for (const op in recovery_stats)
-        {
-            if (recovery_stats[op].count >= 0x10000000000000000n)
-            {
-                if (!this.prev_stats.recovery_stats[op])
-                {
-                    overflow = true;
-                }
-                else
-                {
-                    recovery_stats[op].count -= this.prev_stats.recovery_stats[op].count;
-                    recovery_stats[op].bytes -= this.prev_stats.recovery_stats[op].bytes;
-                }
-            }
-        }
-        const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
-        for (const pg_num in this.state.pg.stats)
-        {
-            const st = this.state.pg.stats[pg_num];
-            for (const k in object_counts)
-            {
-                if (st[k+'_count'])
-                {
-                    object_counts[k] += BigInt(st[k+'_count']);
-                }
-            }
-        }
-        return (this.prev_stats = { overflow, op_stats, subop_stats, recovery_stats, object_counts });
-    }
-
-    async update_total_stats()
-    {
-        const stats = this.sum_stats();
-        if (!stats.overflow)
-        {
-            // Convert to strings, serialize and save
-            const ser = {};
-            for (const st of [ 'op_stats', 'subop_stats', 'recovery_stats' ])
-            {
-                ser[st] = {};
-                for (const op in stats[st])
-                {
-                    ser[st][op] = {};
-                    for (const k in stats[st][op])
-                    {
-                        ser[st][op][k] = ''+stats[st][op][k];
-                    }
-                }
-            }
-            ser.object_counts = {};
-            for (const k in stats.object_counts)
-            {
-                ser.object_counts[k] = ''+stats.object_counts[k];
-            }
-            await this.etcd_call('/txn', {
-                success: [ { requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(ser)) } } ],
-            }, this.config.etcd_mon_timeout, 0);
-        }
-    }
-
-    schedule_update_stats()
-    {
-        if (this.stats_timer)
-        {
-            clearTimeout(this.stats_timer);
-            this.stats_timer = null;
-        }
-        this.stats_timer = setTimeout(() =>
-        {
-            this.stats_timer = null;
-            this.update_total_stats().catch(console.error);
-        }, this.config.mon_stats_timeout || 1000);
-    }
-
-    parse_kv(kv)
-    {
-        if (!kv || !kv.key)
-        {
-            return;
-        }
-        kv.key = de64(kv.key);
-        kv.value = kv.value ? JSON.parse(de64(kv.value)) : null;
-        const key = kv.key.substr(this.etcd_prefix.length).replace(/^\/+/, '').split('/');
-        const cur = this.state, orig = Mon.etcd_tree;
-        for (let i = 0; i < key.length-1; i++)
-        {
-            if (!orig[key[i]])
-            {
-                console.log('Bad key in etcd: '+kv.key+' = '+kv.value);
-                return;
-            }
-            orig = orig[key[i]];
-            cur = (cur[key[i]] = cur[key[i]] || {});
-        }
-        if (orig[key.length-1])
-        {
-            console.log('Bad key in etcd: '+kv.key+' = '+kv.value);
-            return;
-        }
-        cur[key[key.length-1]] = kv.value;
-        if (key.join('/') === 'config/global')
-        {
-            this.state.config.global = this.state.config.global || {};
-            this.config = this.state.config.global;
-            this.check_config();
-        }
-    }
-
-    async etcd_call(path, body, timeout, retries)
-    {
-        let retry = 0;
-        if (retries >= 0 && retries < 1)
-        {
-            retries = 1;
-        }
-        while (retries < 0 || retry < retries)
-        {
-            const base = this.etcd_urls[Math.floor(Math.random()*this.etcd_urls.length)];
-            const res = await POST(base+path, body, timeout);
-            if (res.json)
-            {
-                if (res.json.error)
-                {
-                    console.log('etcd returned error: '+res.json.error);
-                    break;
-                }
-                return res.json;
-            }
-            retry++;
-        }
-        this.die();
-    }
-
-    die(err)
-    {
-        // In fact we can just try to rejoin
-        console.fatal(err || 'Cluster connection failed');
-        process.exit(1);
-    }
-
-    local_ips()
-    {
-        const ips = [];
-        const ifaces = os.networkInterfaces();
-        for (const ifname in ifaces)
-        {
-            for (const iface of ifaces[ifname])
-            {
-                if (iface.family == 'IPv4' && !iface.internal)
-                {
-                    ips.push(iface.address);
-                }
-            }
-        }
-        return ips;
-    }
-}
-
-function POST(url, body, timeout)
-{
-    return new Promise((ok, no) =>
-    {
-        const body_text = Buffer.from(JSON.stringify(body));
-        let timer_id = timeout > 0 ? setTimeout(() =>
-        {
-            if (req)
-                req.abort();
-            req = null;
-            ok({ error: 'timeout' });
-        }, timeout) : null;
-        let req = http.request(url, { method: 'POST', headers: {
-            'Content-Type': 'application/json',
-            'Content-Length': body_text,
-        } }, (res) =>
-        {
-            if (!req)
-            {
-                return;
-            }
-            clearTimeout(timer_id);
-            if (res.statusCode != 200)
-            {
-                ok({ error: res.statusCode, response: res });
-                return;
-            }
-            let res_body = '';
-            res.setEncoding('utf8');
-            res.on('data', chunk => { res_body += chunk });
-            res.on('end', () =>
-            {
-                try
-                {
-                    res_body = JSON.parse(res_body);
-                    ok({ response: res, json: res_body });
-                }
-                catch (e)
-                {
-                    ok({ error: e, response: res, body: res_body });
-                }
-            });
-        });
-        req.write(body_text);
-        req.end();
-    });
-}
-
-function b64(str)
-{
-    return Buffer.from(str).toString('base64');
-}
-
-function de64(str)
-{
-    return Buffer.from(str, 'base64').toString();
-}
-
-function sha1hex(str)
-{
-    const hash = crypto.createHash('sha1');
-    hash.update(str);
-    return hash.digest('hex');
-}
--- a/malloc_or_die.h
+++ b/malloc_or_die.h
@ -0,0 +1,52 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
+#pragma once
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <malloc.h>
+
+inline void* memalign_or_die(size_t alignment, size_t size)
+{
+    void *buf = memalign(alignment, size);
+    if (!buf)
+    {
+        printf("Failed to allocate %zu bytes\n", size);
+        exit(1);
+    }
+    return buf;
+}
+
+inline void* malloc_or_die(size_t size)
+{
+    void *buf = malloc(size);
+    if (!buf)
+    {
+        printf("Failed to allocate %zu bytes\n", size);
+        exit(1);
+    }
+    return buf;
+}
+
+inline void* realloc_or_die(void *ptr, size_t size)
+{
+    void *buf = realloc(ptr, size);
+    if (!buf)
+    {
+        printf("Failed to allocate %zu bytes\n", size);
+        exit(1);
+    }
+    return buf;
+}
+
+inline void* calloc_or_die(size_t nmemb, size_t size)
+{
+    void *buf = calloc(nmemb, size);
+    if (!buf)
+    {
+        printf("Failed to allocate %zu bytes\n", size * nmemb);
+        exit(1);
+    }
+    return buf;
+}
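Review note: the point of the `*_or_die` helpers is that callers can drop their own NULL checks, which is what the `in_buf` changes in messenger.cpp rely on. A minimal sketch of that calling pattern follows, assuming a Linux/glibc toolchain; `malloc_or_die_sketch` and `copy_string` are hypothetical names used only for illustration and are not part of this patch.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in that mirrors malloc_or_die() from the header above:
// it either returns a usable buffer or terminates the process.
static inline void* malloc_or_die_sketch(size_t size)
{
    void *buf = malloc(size);
    if (!buf)
    {
        fprintf(stderr, "Failed to allocate %zu bytes\n", size);
        exit(1);
    }
    return buf;
}

// Hypothetical caller: no NULL check is needed after the allocation,
// because the helper never returns NULL.
static inline char* copy_string(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src) + 1;
    char *dst = (char*)malloc_or_die_sketch(len);
    memcpy(dst, src, len);
    return dst;
}
```

The caller still owns the returned buffer and has to `free()` it, exactly as with plain `malloc()`.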
--- a/messenger.cpp
+++ b/messenger.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/socket.h>
@ -22,6 +25,14 @@ osd_op_t::~osd_op_t()
    }
 }

+osd_messenger_t::~osd_messenger_t()
+{
+    while (clients.size() > 0)
+    {
+        stop_client(clients.begin()->first);
+    }
+}
+
 void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
 {
    if (wanted_peers.find(peer_osd) == wanted_peers.end())
@ -63,6 +74,7 @@ void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
    }
    wp.cur_addr = wp.address_list[wp.address_index].string_value();
    wp.cur_port = wp.port;
+    wp.connecting = true;
    try_connect_peer_addr(peer_osd, wp.cur_addr.c_str(), wp.cur_port);
 }

@ -110,9 +122,9 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
        .peer_state = PEER_CONNECTING,
        .connect_timeout_id = timeout_id,
        .osd_num = peer_osd,
-        .in_buf = malloc(receive_buffer_size),
+        .in_buf = malloc_or_die(receive_buffer_size),
    };
-    tfd->set_fd_handler(peer_fd, [this](int peer_fd, int epoll_events)
+    tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
    {
        // Either OUT (connected) or HUP
        handle_connect_epoll(peer_fd);
@ -143,8 +155,7 @@ void osd_messenger_t::handle_connect_epoll(int peer_fd)
    int one = 1;
    setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
    cl.peer_state = PEER_CONNECTED;
-    // FIXME Disable EPOLLOUT on this fd
-    tfd->set_fd_handler(peer_fd, [this](int peer_fd, int epoll_events)
+    tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
    {
        handle_peer_epoll(peer_fd, epoll_events);
    });
@ -169,7 +180,10 @@ void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
        if (cl.read_ready == 1)
        {
            read_ready_clients.push_back(cl.peer_fd);
+            if (ringloop)
                ringloop->wakeup();
+            else
+                read_requests();
        }
    }
 }
@ -214,7 +228,6 @@ void osd_messenger_t::check_peer_config(osd_client_t & cl)
 {
    osd_op_t *op = new osd_op_t();
    op->op_type = OSD_OP_OUT;
-    op->send_list.push_back(op->req.buf, OSD_PACKET_SIZE);
    op->peer_fd = cl.peer_fd;
    op->req = {
        .show_conf = {
@ -248,12 +261,13 @@ void osd_messenger_t::check_peer_config(osd_client_t & cl)
            {
                err = true;
                printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl.osd_num);
-                on_connect_peer(cl.osd_num, -1);
            }
        }
        if (err)
        {
+            osd_num_t osd_num = cl.osd_num;
            stop_client(op->peer_fd);
+            on_connect_peer(osd_num, -1);
            delete op;
            return;
        }
@ -325,7 +339,7 @@ void osd_messenger_t::stop_client(int peer_fd)
        }
    }
    clients.erase(it);
-    tfd->set_fd_handler(peer_fd, NULL);
+    tfd->set_fd_handler(peer_fd, false, NULL);
    if (cl.osd_num)
    {
        osd_peer_fds.erase(cl.osd_num);
@ -381,10 +395,10 @@ void osd_messenger_t::accept_connections(int listen_fd)
            .peer_port = ntohs(addr.sin_port),
            .peer_fd = peer_fd,
            .peer_state = PEER_CONNECTED,
-            .in_buf = malloc(receive_buffer_size),
+            .in_buf = malloc_or_die(receive_buffer_size),
        };
        // Add FD to epoll
-        tfd->set_fd_handler(peer_fd, [this](int peer_fd, int epoll_events)
+        tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
        {
            handle_peer_epoll(peer_fd, epoll_events);
        });
--- a/messenger.h
+++ b/messenger.h
@ -1,15 +1,18 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include <sys/types.h>
 #include <stdint.h>
 #include <arpa/inet.h>
-#include <malloc.h>

 #include <set>
 #include <map>
 #include <deque>
 #include <vector>

+#include "malloc_or_die.h"
 #include "json11/json11.hpp"
 #include "osd_ops.h"
 #include "timerfd_manager.h"
@ -31,13 +34,32 @@
 #define DEFAULT_PEER_CONNECT_INTERVAL 5
 #define DEFAULT_PEER_CONNECT_TIMEOUT 5

+// Kind of a vector with small-list-optimisation
 struct osd_op_buf_list_t
 {
-    int count = 0, alloc = 0, sent = 0;
+    int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
    iovec *buf = NULL;
    iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];

-    ~osd_op_buf_list_t()
+    inline osd_op_buf_list_t()
+    {
+        buf = inline_buf;
+    }
+
+    inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
+    {
+        buf = inline_buf;
+        append(other);
+    }
+
+    inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
+    {
+        reset();
+        append(other);
+        return *this;
+    }
+
+    inline ~osd_op_buf_list_t()
    {
        if (buf && buf != inline_buf)
        {
@ -45,40 +67,103 @@ struct osd_op_buf_list_t
        }
    }

+    inline void reset()
+    {
+        count = 0;
+        done = 0;
+    }
+
    inline iovec* get_iovec()
    {
-        return (buf ? buf : inline_buf) + sent;
+        return buf + done;
    }

    inline int get_size()
    {
-        return count - sent;
+        return count - done;
+    }
+
+    inline void append(const osd_op_buf_list_t & other)
+    {
+        if (count+other.count > alloc)
+        {
+            if (buf == inline_buf)
+            {
+                int old = alloc;
+                alloc = (((count+other.count+15)/16)*16);
+                buf = (iovec*)malloc(sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
+                memcpy(buf, inline_buf, sizeof(iovec) * old);
+            }
+            else
+            {
+                alloc = (((count+other.count+15)/16)*16);
+                buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
+            }
+        }
+        for (int i = 0; i < other.count; i++)
+        {
+            buf[count++] = other.buf[i];
+        }
    }

    inline void push_back(void *nbuf, size_t len)
    {
        if (count >= alloc)
        {
-            if (!alloc)
-            {
-                alloc = OSD_OP_INLINE_BUF_COUNT;
-                buf = inline_buf;
-            }
-            else if (buf == inline_buf)
+            if (buf == inline_buf)
            {
                int old = alloc;
                alloc = ((alloc/16)*16 + 1);
                buf = (iovec*)malloc(sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
                memcpy(buf, inline_buf, sizeof(iovec)*old);
            }
            else
            {
-                alloc = ((alloc/16)*16 + 1);
+                alloc = alloc < 16 ? 16 : (alloc+16);
                buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
            }
        }
        buf[count++] = { .iov_base = nbuf, .iov_len = len };
    }
+
+    inline void eat(int result)
+    {
+        while (result > 0 && done < count)
+        {
+            iovec & iov = buf[done];
+            if (iov.iov_len <= result)
+            {
+                result -= iov.iov_len;
+                done++;
+            }
+            else
+            {
+                iov.iov_len -= result;
+                iov.iov_base = (uint8_t*)iov.iov_base + result;
+                break;
+            }
+        }
+    }
 };

 struct blockstore_op_t;
@ -98,7 +183,7 @@ struct osd_op_t
    osd_primary_op_data_t* op_data = NULL;
    std::function<void(osd_op_t*)> callback;

-    osd_op_buf_list_t send_list;
+    osd_op_buf_list_t iov;

    ~osd_op_t();
 };
@ -117,12 +202,11 @@ struct osd_client_t
    // Read state
    int read_ready = 0;
    osd_op_t *read_op = NULL;
-    int read_reply_id = 0;
    iovec read_iov;
    msghdr read_msg;
-    void *read_buf = NULL;
    int read_remaining = 0;
    int read_state = 0;
+    osd_op_buf_list_t recv_list;

    // Incoming operations
    std::vector<osd_op_t*> received_ops;
@ -131,13 +215,14 @@ struct osd_client_t
    std::deque<osd_op_t*> outbox;
    std::map<int, osd_op_t*> sent_ops;

-    // PGs dirtied by this client's primary-writes (FIXME to drop the connection)
-    std::set<pg_num_t> dirty_pgs;
+    // PGs dirtied by this client's primary-writes
+    std::set<pool_pg_num_t> dirty_pgs;

    // Write state
    osd_op_t *write_op = NULL;
    msghdr write_msg;
    int write_state = 0;
+    osd_op_buf_list_t send_list;
 };

 struct osd_wanted_peer_t
@ -167,10 +252,12 @@ struct osd_messenger_t

    // osd_num_t is only for logging and asserts
    osd_num_t osd_num;
-    int receive_buffer_size = 9000;
+    // FIXME: make receive_buffer_size configurable
+    int receive_buffer_size = 64*1024;
    int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
    int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
    int log_level = 0;
+    bool use_sync_send_recv = false;

    std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
    std::map<uint64_t, int> osd_peer_fds;
@ -179,6 +266,7 @@ struct osd_messenger_t
    std::map<int, osd_client_t> clients;
    std::vector<int> read_ready_clients;
    std::vector<int> write_ready_clients;
+    std::vector<std::function<void()>> set_immediate;

    // op statistics
    osd_op_stats_t stats;
@ -193,6 +281,7 @@ public:
    void read_requests();
    void send_replies();
    void accept_connections(int listen_fd);
+    ~osd_messenger_t();

 protected:
    void try_connect_peer(uint64_t osd_num);
@ -207,7 +296,8 @@ protected:
    void handle_send(int result, int peer_fd);

    bool handle_read(int result, int peer_fd);
-    void handle_finished_read(osd_client_t & cl);
+    bool handle_finished_read(osd_client_t & cl);
    void handle_op_hdr(osd_client_t *cl);
-    void handle_reply_hdr(osd_client_t *cl);
+    bool handle_reply_hdr(osd_client_t *cl);
+    void handle_reply_ready(osd_op_t *op);
 };
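Review note: `osd_op_buf_list_t::eat()` is what lets a short `sendmsg()`/`recvmsg()` result be applied to the buffer list, so that the next call resumes where the previous one stopped. The sketch below restates that logic as a free function over a `std::vector<iovec>`, assuming a POSIX system with `<sys/uio.h>`; `eat_sketch` is a hypothetical name and this is not the code from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/uio.h>
#include <vector>

// Hypothetical free-function version of osd_op_buf_list_t::eat().
// Consumes `result` bytes starting from buffer index `done` and returns
// the index of the first buffer that is not fully processed yet.
static inline int eat_sketch(std::vector<iovec> & bufs, int done, size_t result)
{
    assert(done >= 0);
    while (result > 0 && done < (int)bufs.size())
    {
        iovec & iov = bufs[done];
        if (iov.iov_len <= result)
        {
            // The whole buffer was transferred, move on to the next one
            result -= iov.iov_len;
            done++;
        }
        else
        {
            // Partial transfer: shrink the buffer and advance its start.
            // The cast avoids arithmetic on void*, which is a GNU extension.
            iov.iov_len -= result;
            iov.iov_base = (uint8_t*)iov.iov_base + result;
            break;
        }
    }
    return done;
}
```

For example, with a 4-byte and an 8-byte buffer, consuming 6 bytes finishes the first buffer and leaves the second one with 6 bytes starting at offset 2.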
--- a/mon/PGUtil.js
+++ b/mon/PGUtil.js
@ -0,0 +1,112 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
+module.exports = {
+    scale_pg_count,
+};
+
+function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
+{
+    const old_pg_count = prev_pgs.length;
+    // Add all possibly intersecting PGs into the history of new PGs
+    if (!(new_pg_count % old_pg_count))
+    {
+        // New PG count is a multiple of the old PG count
+        const mul = (new_pg_count / old_pg_count);
+        for (let i = 0; i < new_pg_count; i++)
+        {
+            const old_i = Math.floor(i / mul);
+            new_pg_history[i] = JSON.parse(JSON.stringify(prev_pg_history[1+old_i]));
+        }
+    }
+    else if (!(old_pg_count % new_pg_count))
+    {
+        // Old PG count is a multiple of the new PG count
+        const mul = (old_pg_count / new_pg_count);
+        for (let i = 0; i < new_pg_count; i++)
+        {
+            new_pg_history[i] = {
+                osd_sets: [],
+                all_peers: [],
+                epoch: 0,
+            };
+            for (let j = 0; j < mul; j++)
+            {
+                new_pg_history[i].osd_sets.push(prev_pgs[i*mul+j]);
+                const hist = prev_pg_history[1+i*mul+j];
+                if (hist && hist.osd_sets && hist.osd_sets.length)
+                {
+                    Array.prototype.push.apply(new_pg_history[i].osd_sets, hist.osd_sets);
+                }
+                if (hist && hist.all_peers && hist.all_peers.length)
+                {
+                    Array.prototype.push.apply(new_pg_history[i].all_peers, hist.all_peers);
+                }
+                if (hist && hist.epoch)
+                {
+                    new_pg_history[i].epoch = new_pg_history[i].epoch < hist.epoch ? hist.epoch : new_pg_history[i].epoch;
+                }
+            }
+        }
+    }
+    else
+    {
+        // After a non-multiple PG count change, any new PG may intersect with any old PG
+        // So, merge the history of ALL PGs
+        let all_sets = {};
+        let all_peers = {};
+        let max_epoch = 0;
+        for (const pg of prev_pgs)
+        {
+            all_sets[pg.join(' ')] = pg;
+        }
+        for (const pg in prev_pg_history)
+        {
+            const hist = prev_pg_history[pg];
+            if (hist && hist.osd_sets)
+            {
+                for (const pg of hist.osd_sets)
+                {
+                    all_sets[pg.join(' ')] = pg;
+                }
+            }
+            if (hist && hist.all_peers)
+            {
+                for (const osd_num of hist.all_peers)
+                {
+                    all_peers[osd_num] = Number(osd_num);
+                }
+            }
+            if (hist && hist.epoch)
+            {
+                max_epoch = max_epoch < hist.epoch ? hist.epoch : max_epoch;
+            }
+        }
+        all_sets = Object.values(all_sets);
+        all_peers = Object.values(all_peers);
+        for (let i = 0; i < new_pg_count; i++)
+        {
+            new_pg_history[i] = { osd_sets: all_sets, all_peers, epoch: max_epoch };
+        }
+    }
+    // Mark history keys for removed PGs as removed
+    for (let i = new_pg_count; i < old_pg_count; i++)
+    {
+        new_pg_history[i] = null;
+    }
+    if (old_pg_count < new_pg_count)
+    {
+        for (let i = new_pg_count-1; i >= 0; i--)
+        {
+            prev_pgs[i] = prev_pgs[Math.floor(i/new_pg_count*old_pg_count)];
+        }
+    }
+    else if (old_pg_count > new_pg_count)
+    {
+        for (let i = 0; i < new_pg_count; i++)
+        {
+            prev_pgs[i] = prev_pgs[Math.round(i/new_pg_count*old_pg_count)];
+        }
+        prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
+    }
+}
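
A minimal sketch of the index mapping that scale_pg_count relies on when the PG count changes: new PG i inherits from old PG floor(i * old_pg_count / new_pg_count). The helper names parent_pg_index and remap_pgs are hypothetical and are not part of PGUtil.js. The sketch multiplies before dividing to stay in integer arithmetic, and it uses Math.floor in both directions, whereas the code above uses Math.round when shrinking (the two agree when one count is a multiple of the other).

```javascript
// Hypothetical helper (not in PGUtil.js): index of the old PG
// that new PG <i> takes its OSD set from
function parent_pg_index(i, old_pg_count, new_pg_count)
{
    return Math.floor(i * old_pg_count / new_pg_count);
}

// Hypothetical helper: parent index for every new PG
function remap_pgs(old_pg_count, new_pg_count)
{
    const r = [];
    for (let i = 0; i < new_pg_count; i++)
    {
        r.push(parent_pg_index(i, old_pg_count, new_pg_count));
    }
    return r;
}

console.log(remap_pgs(4, 8)); // [ 0, 0, 1, 1, 2, 2, 3, 3 ]
console.log(remap_pgs(4, 2)); // [ 0, 2 ]
```

When growing from 4 to 8 PGs, every old PG is split into two neighbours; when shrinking from 4 to 2, each new PG keeps the OSD set of the first old PG in its group, and the history of the whole group is merged.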
--- a/mon/afr.js
+++ b/mon/afr.js
@ -0,0 +1,159 @@
+// Functions to calculate Annualized Failure Rate of your cluster
+// if you know AFR of your drives, number of drives, expected rebalance time
+// and replication factor
+// License: VNPL-1.0 (see README.md for details)
+
+const { sprintf } = require('sprintf-js');
+
+module.exports = {
+    cluster_afr_fullmesh,
+    failure_rate_fullmesh,
+    cluster_afr,
+    print_cluster_afr,
+    c_n_k,
+};
+
+print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
+print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, replicas: 2 });
+print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
+print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
+print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
+print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 2 });
+print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
+print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3 });
+print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
+print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
+print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
+print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
+
+/******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/
+
+// Estimate AFR of the cluster
+// n - number of drives
+// afr - annualized failure rate of a single drive
+// l - expected rebalance time in days after a single drive failure
+// k - replication factor / number of drives that must fail at the same time for the cluster to fail
+function cluster_afr_fullmesh(n, afr, l, k)
+{
+    return 1 - (1 - afr * failure_rate_fullmesh(n-(k-1), afr*l/365, k-1)) ** (n-(k-1));
+}
+
+// Probability of at least <f> failures in a cluster with <n> drives with AFR=<a>
+function failure_rate_fullmesh(n, a, f)
+{
+    if (f <= 0)
+    {
+        return (1-a)**n;
+    }
+    let p = 1;
+    for (let i = 0; i < f; i++)
+    {
+        p -= c_n_k(n, i) * (1-a)**(n-i) * a**i;
+    }
+    return p;
+}
+
+/******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/
+
+// <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s,
+// <k> replicas, <pgs> unique peer PGs per OSD
+//
+// For each of n*m drives: P(drive fails within a year) * P(any of its peers fails within the next <l*365> days).
+// More peers per OSD speed up the rebalance (more drives work together to resilver), provided you
+// let the rebalance finish BEFORE replacing the failed drive.
+// At the same time, more peers per OSD increase the probability that one of them fails!
+//
+// The probability that all other drives of a replica group fail is AFR^(k-1).
+// So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)), L being the rebalance time in days.
+// An interesting but reasonable consequence is that, with k=2, the total failure rate doesn't depend
+// on the number of peers per OSD: it grows linearly with the number of peers that may fail
+// and shrinks linearly with the reduced rebalance time.
+function cluster_afr_pgs({ n_hosts, n_drives, afr_drive, capacity, speed, replicas, pgs = 1, degraded_replacement })
+{
+    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
+    const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
+    return 1 - (1 - afr_drive * (1-(1-(afr_drive*l)**(replicas-1))**pgs)) ** (n_hosts*n_drives);
+}
+
+function cluster_afr_pgs_ec({ n_hosts, n_drives, afr_drive, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
+{
+    const ec_total = ec_data+ec_parity;
+    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
+    const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
+    return 1 - (1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, afr_drive*l, ec_parity))**pgs)) ** (n_hosts*n_drives);
+}
+
+// Same as above, but also take server failures into account
+function cluster_afr_pgs_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, replicas, pgs = 1, degraded_replacement })
+{
+    let otherhosts = Math.min(pgs, (n_hosts-1)/(replicas-1));
+    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
+    let pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(replicas-1));
+    const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
+    const lh = n_drives*capacity/pgs/speed/86400/365;
+    const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
+    const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
+    return 1 - ((1 - afr_host * (1-(1-p1**(replicas-1))**pgh)) ** n_hosts) *
+        ((1 - afr_drive * (1-(1-p2**(replicas-1))**pgs)) ** (n_hosts*n_drives));
+}
+
+function cluster_afr_pgs_ec_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
+{
+    const ec_total = ec_data+ec_parity;
+    const otherhosts = Math.min(pgs, (n_hosts-1)/(ec_total-1));
+    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
+    const pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(ec_total-1));
+    const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
+    const lh = n_drives*capacity/pgs/speed/86400/365;
+    const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
+    const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
+    return 1 - ((1 - afr_host * (1-(1-failure_rate_fullmesh(ec_total-1, p1, ec_parity))**pgh)) ** n_hosts) *
+        ((1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, p2, ec_parity))**pgs)) ** (n_hosts*n_drives));
+}
+
+// Wrapper for the 4 functions above
+function cluster_afr(config)
+{
+    if (config.ec && config.afr_host)
+    {
+        return cluster_afr_pgs_ec_hosts(config);
+    }
+    else if (config.ec)
+    {
+        return cluster_afr_pgs_ec(config);
+    }
+    else if (config.afr_host)
+    {
+        return cluster_afr_pgs_hosts(config);
+    }
+    else
+    {
+        return cluster_afr_pgs(config);
+    }
+}
+
+function print_cluster_afr(config)
+{
+    console.log(
+        `${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
+        `, capable of backfilling at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
+        (config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
+        (config.ec ? `, EC ${config.ec[0]}+${config.ec[1]}` : `, ${config.replicas} replicas`)+
+        `, ${config.pgs||1} PG per OSD`+
+        (config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
+    );
+    console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
+}
+
+/******** UTILITY ********/
+
+// Combination count
+function c_n_k(n, k)
+{
+    let r = 1;
+    for (let i = 0; i < k; i++)
+    {
+        r *= (n-i) / (i+1);
+    }
+    return r;
+}
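
A small worked example of the binomial tail behind failure_rate_fullmesh: the probability that at least f of n independent drives fail is 1 minus the sum of the probabilities of exactly 0..f-1 failures. The block below is a standalone sketch, so it re-declares c_n_k and uses a hypothetical name, at_least_failures, instead of requiring afr.js (which prints its sample reports on load). It also omits the special case for f <= 0 that the original function has.

```javascript
// Number of ways to choose k items out of n (same product form as c_n_k in afr.js)
function c_n_k(n, k)
{
    let r = 1;
    for (let i = 0; i < k; i++)
    {
        r *= (n-i) / (i+1);
    }
    return r;
}

// Hypothetical standalone variant of the tail sum:
// P(at least <f> of <n> drives fail), each failing independently with probability <a>
function at_least_failures(n, a, f)
{
    let p = 1;
    for (let i = 0; i < f; i++)
    {
        // subtract P(exactly i failures)
        p -= c_n_k(n, i) * (1-a)**(n-i) * a**i;
    }
    return p;
}

console.log(c_n_k(5, 2)); // 10
console.log(at_least_failures(2, 0.5, 1)); // 0.75
console.log(at_least_failures(2, 0.5, 2)); // 0.25
```

With two drives that each fail with probability 0.5, at least one fails in 3 of 4 equally likely outcomes and both fail in 1 of 4, which matches the printed values.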
--- a/mon/lp-optimizer.js
+++ b/mon/lp-optimizer.js
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 // Data distribution optimizer using linear programming (lp_solve)

 const child_process = require('child_process');
@ -25,7 +28,7 @@ async function lp_solve(text)
    let vars = {};
    for (const line of stdout.split(/\n/))
    {
-        let m = /^(^Value of objective function: ([\d\.]+)|Actual values of the variables:)\s*$/.exec(line);
+        let m = /^(^Value of objective function: (-?[\d\.]+)|Actual values of the variables:)\s*$/.exec(line);
        if (m)
        {
            if (m[2])
@ -47,34 +50,34 @@ async function lp_solve(text)
    return { score, vars };
 }

-async function optimize_initial(osd_tree, pg_count, max_combinations)
+async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
 {
-    max_combinations = max_combinations || 10000;
+    if (!pg_count || !osd_tree)
+    {
+        return null;
+    }
    const all_weights = Object.assign({}, ...Object.values(osd_tree));
    const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
-    let all_pgs = all_combinations(osd_tree, null, true);
-    if (all_pgs.length > max_combinations)
-    {
-        const prob = max_combinations/all_pgs.length;
-        all_pgs = all_pgs.filter(pg => Math.random() < prob);
-    }
+    const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations));
    const pg_per_osd = {};
    for (const pg of all_pgs)
    {
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
+            const osd = pg[i];
            pg_per_osd[osd] = pg_per_osd[osd] || [];
-            pg_per_osd[osd].push("pg_"+pg.join("_"));
+            pg_per_osd[osd].push((i >= pg_minsize ? parity_space+'*' : '')+"pg_"+pg.join("_"));
        }
    }
-    const pg_size = Math.min(Object.keys(osd_tree).length, 3);
+    const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+        + Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
    let lp = '';
    lp += "max: "+all_pgs.map(pg => 'pg_'+pg.join('_')).join(' + ')+";\n";
    for (const osd in pg_per_osd)
    {
        if (osd !== NO_OSD)
        {
-            let osd_pg_count = all_weights[osd]/total_weight*pg_size*pg_count;
+            let osd_pg_count = all_weights[osd]/total_weight*pg_effsize*pg_count;
            lp += pg_per_osd[osd].join(' + ')+' <= '+osd_pg_count+';\n';
        }
    }
@ -86,11 +89,19 @@ async function optimize_initial(osd_tree, pg_count, max_combinations)
    const lp_result = await lp_solve(lp);
    if (!lp_result)
    {
+        console.log(lp);
        throw new Error('Problem is infeasible or unbounded - is it a bug?');
    }
    const int_pgs = make_int_pgs(lp_result.vars, pg_count);
-    const eff = pg_list_space_efficiency(int_pgs, all_weights);
-    return { score: lp_result.score, weights: lp_result.vars, int_pgs, space: eff*pg_size, total_space: total_weight };
+    const eff = pg_list_space_efficiency(int_pgs, all_weights, pg_minsize, parity_space);
+    const res = {
+        score: lp_result.score,
+        weights: lp_result.vars,
+        int_pgs,
+        space: eff * pg_effsize,
+        total_space: total_weight,
+    };
+    return res;
 }

 function make_int_pgs(weights, pg_count)
@ -112,11 +123,117 @@ function make_int_pgs(weights, pg_count)
    return int_pgs;
 }

-// Try to minimize data movement
-async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
+function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
 {
-    max_combinations = max_combinations || 10000;
-    const pg_size = Math.min(Object.keys(osd_tree).length, 3);
+    const move_weights = {};
+    if ((1 << pg_size) < pg_count)
+    {
+        const intersect = {};
+        for (const pg_name in prev_weights)
+        {
+            const pg = pg_name.substr(3).split(/_/);
+            for (let omit = 1; omit < (1 << pg_size); omit++)
+            {
+                let pg_omit = [ ...pg ];
+                let intersect_count = pg_size;
+                for (let i = 0; i < pg_size; i++)
+                {
+                    if (omit & (1 << i))
+                    {
+                        pg_omit[i] = '';
+                        intersect_count--;
+                    }
+                }
+                pg_omit = pg_omit.join(':');
+                intersect[pg_omit] = Math.max(intersect[pg_omit] || 0, intersect_count);
+            }
+        }
+        for (const pg of all_pgs)
+        {
+            let max_int = 0;
+            for (let omit = 1; omit < (1 << pg_size); omit++)
+            {
+                let pg_omit = [ ...pg ];
+                for (let i = 0; i < pg_size; i++)
+                {
+                    if (omit & (1 << i))
+                    {
+                        pg_omit[i] = '';
+                    }
+                }
+                pg_omit = pg_omit.join(':');
+                max_int = Math.max(max_int, intersect[pg_omit] || 0);
+            }
+            move_weights['pg_'+pg.join('_')] = pg_size-max_int;
+        }
+    }
+    else
+    {
+        const prev_pg_hashed = Object.keys(prev_weights).map(pg_name => pg_name.substr(3).split(/_/).reduce((a, c) => { a[c] = 1; return a; }, {}));
+        for (const pg of all_pgs)
+        {
+            if (!prev_weights['pg_'+pg.join('_')])
+            {
+                let max_int = 0;
+                for (const prev_hash of prev_pg_hashed)
+                {
+                    const intersect_count = pg.reduce((a, osd) => a + (prev_hash[osd] ? 1 : 0), 0);
+                    if (max_int < intersect_count)
+                    {
+                        max_int = intersect_count;
+                        if (max_int >= pg_size)
+                        {
+                            break;
+                        }
+                    }
+                }
+                move_weights['pg_'+pg.join('_')] = pg_size-max_int;
+            }
+        }
+    }
+    return move_weights;
+}
+
+function add_valid_previous(osd_tree, prev_weights, all_pgs)
+{
+    // Add previous combinations that are still valid
+    const hosts = Object.keys(osd_tree).sort();
+    const host_per_osd = {};
+    for (const host in osd_tree)
+    {
+        for (const osd in osd_tree[host])
+        {
+            host_per_osd[osd] = host;
+        }
+    }
+    skip_pg: for (const pg_name in prev_weights)
+    {
+        const seen_hosts = {};
+        const pg = pg_name.substr(3).split(/_/);
+        for (const osd of pg)
+        {
+            if (!host_per_osd[osd] || seen_hosts[host_per_osd[osd]])
+            {
+                continue skip_pg;
+            }
+            seen_hosts[host_per_osd[osd]] = true;
+        }
+        if (!all_pgs[pg_name])
+        {
+            all_pgs[pg_name] = pg;
+        }
+    }
+}
+
+// Try to minimize data movement
+async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
+{
+    if (!osd_tree)
+    {
+        return null;
+    }
+    const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+        + Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
    const pg_count = prev_int_pgs.length;
    const prev_weights = {};
    const prev_pg_per_osd = {};
@ -124,70 +241,50 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
    {
        const pg_name = 'pg_'+pg.join('_');
        prev_weights[pg_name] = (prev_weights[pg_name]||0) + 1;
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
+            const osd = pg[i];
            prev_pg_per_osd[osd] = prev_pg_per_osd[osd] || [];
-            prev_pg_per_osd[osd].push(pg_name);
+            prev_pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
        }
    }
    // Get all combinations
-    let all_pgs = all_combinations(osd_tree, null, true);
-    if (all_pgs.length > max_combinations)
-    {
-        const intersecting = all_pgs.filter(pg => prev_weights['pg_'+pg.join('_')]);
-        if (intersecting.length > max_combinations)
-        {
-            const prob = max_combinations/intersecting.length;
-            all_pgs = intersecting.filter(pg => Math.random() < prob);
-        }
-        else
-        {
-            const prob = (max_combinations-intersecting.length)/all_pgs.length;
-            all_pgs = all_pgs.filter(pg => Math.random() < prob || prev_weights['pg_'+pg.join('_')]);
-        }
-    }
+    let all_pgs = random_combinations(osd_tree, pg_size, max_combinations);
+    add_valid_previous(osd_tree, prev_weights, all_pgs);
+    all_pgs = Object.values(all_pgs);
    const pg_per_osd = {};
    for (const pg of all_pgs)
    {
        const pg_name = 'pg_'+pg.join('_');
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
+            const osd = pg[i];
            pg_per_osd[osd] = pg_per_osd[osd] || [];
-            pg_per_osd[osd].push(pg_name);
+            pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
        }
    }
    // Penalize PGs based on their similarity to old PGs
-    const intersect = {};
-    for (const pg_name in prev_weights)
-    {
-        const pg = pg_name.substr(3).split(/_/);
-        intersect[pg[0]+'::'] = intersect[':'+pg[1]+':'] = intersect['::'+pg[2]] = 2;
-        intersect[pg[0]+'::'+pg[2]] = intersect[':'+pg[1]+':'+pg[2]] = intersect[pg[0]+':'+pg[1]+':'] = 1;
-    }
-    const move_weights = {};
-    for (const pg of all_pgs)
-    {
-        move_weights['pg_'+pg.join('_')] =
-            intersect[pg[0]+'::'+pg[2]] || intersect[':'+pg[1]+':'+pg[2]] || intersect[pg[0]+':'+pg[1]+':'] ||
-            intersect[pg[0]+'::'] || intersect[':'+pg[1]+':'] || intersect['::'+pg[2]] ||
-            3;
-    }
+    const move_weights = calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs);
    // Calculate total weight - old PG weights
    const all_pg_names = all_pgs.map(pg => 'pg_'+pg.join('_'));
+    const all_pgs_hash = all_pg_names.reduce((a, c) => { a[c] = true; return a; }, {});
    const all_weights = Object.assign({}, ...Object.values(osd_tree));
    const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
    // Generate the LP problem
    let lp = '';
    lp += 'max: '+all_pg_names.map(pg_name => (
-        prev_weights[pg_name] ? `${4-move_weights[pg_name]}*add_${pg_name} - 4*del_${pg_name}` : `${4-move_weights[pg_name]}*${pg_name}`
+        prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}`
    )).join(' + ')+';\n';
    for (const osd in pg_per_osd)
    {
        if (osd !== NO_OSD)
        {
-            const osd_sum = (pg_per_osd[osd]||[]).map(pg_name => prev_weights[pg_name] ? `add_${pg_name} - del_${pg_name}` : pg_name).join(' + ');
-            const rm_osd_pg_count = (prev_pg_per_osd[osd]||[]).filter(old_pg_name => move_weights[old_pg_name]).length;
-            let osd_pg_count = all_weights[osd]*3/total_weight*pg_count - rm_osd_pg_count;
+            const osd_sum = (pg_per_osd[osd]||[]).map(([ pg_name, space ]) => (
+                prev_weights[pg_name] ? `${space} * add_${pg_name} - ${space} * del_${pg_name}` : `${space} * ${pg_name}`
+            )).join(' + ');
+            const rm_osd_pg_count = (prev_pg_per_osd[osd]||[])
+                .reduce((a, [ old_pg_name, space ]) => (a + (all_pgs_hash[old_pg_name] ? space : 0)), 0);
+            const osd_pg_count = all_weights[osd]*pg_effsize/total_weight*pg_count - rm_osd_pg_count;
            lp += osd_sum + ' <= ' + osd_pg_count + ';\n';
        }
    }
@ -221,7 +318,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
    const weights = { ...prev_weights };
    for (const k in prev_weights)
    {
-        if (!move_weights[k])
+        if (!all_pgs_hash[k])
        {
            delete weights[k];
        }
@ -236,7 +333,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
        {
            weights[k.substr(4)] = (weights[k.substr(4)] || 0) - Number(lp_result.vars[k]);
        }
-        else
+        else if (k.substr(0, 3) === 'pg_')
        {
            weights[k] = Number(lp_result.vars[k]);
        }
@ -258,7 +355,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
        {
            differs++;
        }
-        for (let j = 0; j < 3; j++)
+        for (let j = 0; j < pg_size; j++)
        {
            if (new_pgs[i][j] != prev_int_pgs[i][j])
            {
@ -273,7 +370,7 @@ async function optimize_change(prev_int_pgs, osd_tree, max_combinations)
        int_pgs: new_pgs,
        differs,
        osd_differs,
-        space: pg_size * pg_list_space_efficiency(new_pgs, all_weights),
+        space: pg_effsize * pg_list_space_efficiency(new_pgs, all_weights, pg_minsize, parity_space),
        total_space: total_weight,
    };
 }
@ -391,27 +488,89 @@ function extract_osds(osd_tree, levels, osd_level, osds = {})
    return osds;
 }

-// FIXME: support different pg_sizes, not just 3
+function random_combinations(osd_tree, pg_size, count)
+{
+    let seed = 0x5f020e43;
+    let rng = () =>
+    {
+        seed ^= seed << 13;
+        seed ^= seed >> 17;
+        seed ^= seed << 5;
+        return seed + 2147483648;
+    };
+    const hosts = Object.keys(osd_tree).sort();
+    const osds = Object.keys(osd_tree).reduce((a, c) => { a[c] = Object.keys(osd_tree[c]).sort(); return a; }, {});
+    const r = {};
+    // Generate random combinations including each OSD at least once
+    for (let h = 0; h < hosts.length; h++)
+    {
+        for (let o = 0; o < osds[hosts[h]].length; o++)
+        {
+            const pg = [ osds[hosts[h]][o] ];
+            const cur_hosts = [ ...hosts ];
+            cur_hosts.splice(h, 1);
+            for (let i = 1; i < pg_size && i < hosts.length; i++)
+            {
+                const next_host = rng() % cur_hosts.length;
+                const next_osd = rng() % osds[cur_hosts[next_host]].length;
+                pg.push(osds[cur_hosts[next_host]][next_osd]);
+                cur_hosts.splice(next_host, 1);
+            }
+            while (pg.length < pg_size)
+            {
+                pg.push(NO_OSD);
+            }
+            r['pg_'+pg.join('_')] = pg;
+        }
+    }
+    // Generate purely random combinations
+    restart: while (count > 0)
+    {
+        let host_idx = [];
+        for (let i = 0; i < pg_size && i < hosts.length; i++)
+        {
+            let start = i > 0 ? host_idx[i-1]+1 : 0;
+            if (start >= hosts.length)
+            {
+                continue restart;
+            }
+            host_idx[i] = start + rng() % (hosts.length-start);
+        }
+        let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]);
+        while (pg.length < pg_size)
+        {
+            pg.push(NO_OSD);
+        }
+        r['pg_'+pg.join('_')] = pg;
+        count--;
+    }
+    return r;
+}
+
+// Super-stupid algorithm. Given the current OSD tree, generate all possible OSD combinations
 // osd_tree = { failure_domain1: { osd1: size1, ... }, ... }
-function all_combinations(osd_tree, count, ordered)
+// ordered = skip combinations that differ only in the order of hosts
+function all_combinations(osd_tree, pg_size, ordered, count)
 {
    const hosts = Object.keys(osd_tree).sort();
    const osds = Object.keys(osd_tree).reduce((a, c) => { a[c] = Object.keys(osd_tree[c]).sort(); return a; }, {});
-    while (hosts.length < 3)
+    while (hosts.length < pg_size)
    {
        osds[NO_OSD] = [ NO_OSD ];
        hosts.push(NO_OSD);
    }
-    let host_idx = [ 0, 1, 2 ];
-    let osd_idx = [ 0, 0, 0 ];
+    let host_idx = [];
+    let osd_idx = [];
+    for (let i = 0; i < pg_size; i++)
+    {
+        host_idx.push(i);
+        osd_idx.push(0);
+    }
    const r = [];
    while (!count || count < 0 || r.length < count)
-    {
-        let inc;
-        if (host_idx[2] != host_idx[1] && host_idx[2] != host_idx[0] && host_idx[1] != host_idx[0])
    {
        r.push(host_idx.map((hi, i) => osds[hosts[hi]][osd_idx[i]]));
-            inc = 2;
+        let inc = pg_size-1;
        while (inc >= 0)
        {
            osd_idx[inc]++;
@ -425,33 +584,39 @@ function all_combinations(osd_tree, count, ordered)
                break;
            }
        }
+        if (inc < 0)
+        {
+            // no osds left in the current host combination, select the next one
+            inc = pg_size-1;
+            same_again: while (inc >= 0)
+            {
+                host_idx[inc]++;
+                for (let prev_host = 0; prev_host < inc; prev_host++)
+                {
+                    if (host_idx[prev_host] == host_idx[inc])
+                    {
+                        continue same_again;
+                    }
+                }
+                if (host_idx[inc] < (ordered ? hosts.length-(pg_size-1-inc) : hosts.length))
+                {
+                    while ((++inc) < pg_size)
+                    {
+                        host_idx[inc] = (ordered ? host_idx[inc-1]+1 : 0);
+                    }
+                    break;
                }
                else
                {
-            inc = -1;
+                    inc--;
+                }
            }
            if (inc < 0)
-        {
-            // no osds left in current host combination, select the next one
-            osd_idx = [ 0, 0, 0 ];
-            host_idx[2]++;
-            if (host_idx[2] >= hosts.length)
-            {
-                host_idx[1]++;
-                host_idx[2] = ordered ? host_idx[1]+1 : 0;
-                if ((ordered ? host_idx[2] : host_idx[1]) >= hosts.length)
-                {
-                    host_idx[0]++;
-                    host_idx[1] = ordered ? host_idx[0]+1 : 0;
-                    host_idx[2] = ordered ? host_idx[1]+1 : 0;
-                    if ((ordered ? host_idx[2] : host_idx[0]) >= hosts.length)
            {
                break;
            }
        }
    }
-        }
-    }
    return r;
 }

@ -468,14 +633,15 @@ function pg_weights_space_efficiency(weights, pg_count, osd_sizes)
    return pg_per_osd_space_efficiency(per_osd, pg_count, osd_sizes);
 }

-function pg_list_space_efficiency(pgs, osd_sizes)
+function pg_list_space_efficiency(pgs, osd_sizes, pg_minsize, parity_space)
 {
    const per_osd = {};
    for (const pg of pgs)
    {
-        for (const osd of pg)
+        for (let i = 0; i < pg.length; i++)
        {
-            per_osd[osd] = (per_osd[osd]||0) + 1;
+            const osd = pg[i];
+            per_osd[osd] = (per_osd[osd]||0) + (i >= pg_minsize ? (parity_space||1) : 1);
        }
    }
    return pg_per_osd_space_efficiency(per_osd, pgs.length, osd_sizes);
@ -517,5 +683,6 @@ module.exports = {
    lp_solve,
    make_int_pgs,
    align_pgs,
+    random_combinations,
    all_combinations,
 };
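
random_combinations draws its candidates from a fixed-seed xorshift generator, so the same OSD tree should yield the same candidate list on every run. The sketch below isolates that generator; make_rng is a hypothetical wrapper, while the seed and the shift constants are copied from the code above.

```javascript
// Hypothetical wrapper around the xorshift step used in random_combinations
function make_rng(seed)
{
    return () =>
    {
        // Bitwise operators keep <seed> a signed 32-bit integer
        seed ^= seed << 13;
        seed ^= seed >> 17;
        seed ^= seed << 5;
        // Shift the signed range [-2^31, 2^31) to [0, 2^32)
        return seed + 2147483648;
    };
}

const rng_a = make_rng(0x5f020e43);
const rng_b = make_rng(0x5f020e43);
const seq_a = [ rng_a(), rng_a(), rng_a() ];
const seq_b = [ rng_b(), rng_b(), rng_b() ];
// Same seed, same sequence
console.log(seq_a.every((v, i) => v === seq_b[i])); // true
// Picking an index, as random_combinations does with rng() % length
console.log(rng_a() % 4);
```

A deterministic sequence presumably keeps optimizer runs reproducible; the cost is that the sampled combinations never change for a given tree.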
--- a/mon/make-units.sh
+++ b/mon/make-units.sh
@ -0,0 +1,136 @@
+#!/bin/bash
+# Example startup script generator
+# Of course, this isn't a production solution yet; it is just for tests
+# Copyright (c) Vitaliy Filippov, 2019+
+# License: MIT
+
+IP=`ip -json a s | jq -r '.[].addr_info[] | select(.broadcast == "10.115.0.255") | .local'`
+
+[ "$IP" != "" ] || exit 1
+
+useradd vitastor
+chmod 755 /root
+
+BASE=${IP/*./}
+BASE=$(((BASE-10)*12))
+
+cat >/etc/systemd/system/vitastor.target <<EOF
+[Unit]
+Description=vitastor target
+[Install]
+WantedBy=multi-user.target
+EOF
+
+i=1
+for DEV in `ls /dev/disk/by-id/ | grep ata-INTEL_SSDSC2KB`; do
+    dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2))
+    dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2)) seek=$((1920377991168/1048576))
+cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
+[Unit]
+Description=Vitastor object storage daemon osd.$((BASE+i))
+After=network-online.target local-fs.target time-sync.target
+Wants=network-online.target local-fs.target time-sync.target
+PartOf=vitastor.target
+
+[Service]
+LimitNOFILE=1048576
+LimitNPROC=1048576
+LimitMEMLOCK=infinity
+ExecStart=/root/vitastor/osd \\
+    --etcd_address $IP:2379/v3 \\
+    --bind_address $IP \\
+    --osd_num $((BASE+i)) \\
+    --disable_data_fsync 1 \\
+    --disable_device_lock 1 \\
+    --immediate_commit all \\
+    --flusher_count 8 \\
+    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
+    --journal_no_same_sector_overwrites true \\
+    --journal_sector_buffer_count 1024 \\
+    --journal_offset 0 \\
+    --meta_offset 16777216 \\
+    --data_offset 427814912 \\
+    --data_size $((1920377991168-427814912)) \\
+    --data_device /dev/disk/by-id/$DEV
+WorkingDirectory=/root/vitastor
+ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
+User=vitastor
+PrivateTmp=false
+TasksMax=infinity
+Restart=always
+StartLimitInterval=0
+StartLimitIntervalSec=0
+RestartSec=10
+
+[Install]
+WantedBy=vitastor.target
+EOF
+    systemctl enable vitastor-osd$((BASE+i))
+    i=$((i+1))
+cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
+[Unit]
+Description=Vitastor object storage daemon osd.$((BASE+i))
+After=network-online.target local-fs.target time-sync.target
+Wants=network-online.target local-fs.target time-sync.target
+PartOf=vitastor.target
+
+[Service]
+LimitNOFILE=1048576
+LimitNPROC=1048576
+LimitMEMLOCK=infinity
+ExecStart=/root/vitastor/osd \\
+    --etcd_address $IP:2379/v3 \\
+    --bind_address $IP \\
+    --osd_num $((BASE+i)) \\
+    --disable_data_fsync 1 \\
+    --immediate_commit all \\
+    --flusher_count 8 \\
+    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
+    --journal_no_same_sector_overwrites true \\
+    --journal_sector_buffer_count 1024 \\
+    --journal_offset 1920377991168 \\
+    --meta_offset $((1920377991168+16777216)) \\
+    --data_offset $((1920377991168+427814912)) \\
+    --data_size $((1920377991168-427814912)) \\
+    --data_device /dev/disk/by-id/$DEV
+WorkingDirectory=/root/vitastor
+ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
+User=vitastor
+PrivateTmp=false
+TasksMax=infinity
+Restart=always
+StartLimitInterval=0
+StartLimitIntervalSec=0
+RestartSec=10
+
+[Install]
+WantedBy=vitastor.target
+EOF
+    systemctl enable vitastor-osd$((BASE+i))
+    i=$((i+1))
+done
+
+exit
+
+node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5
+
+podman run -d --network host --restart always -v /var/lib/etcd0.etcd:/etcd0.etcd --name etcd quay.io/coreos/etcd:v3.4.13 etcd -name etcd0 \
+    -advertise-client-urls http://10.115.0.10:2379 -listen-client-urls http://10.115.0.10:2379 \
+    -initial-advertise-peer-urls http://10.115.0.10:2380 -listen-peer-urls http://10.115.0.10:2380 \
+    -initial-cluster-token vitastor-etcd-1 -initial-cluster etcd0=http://10.115.0.10:2380,etcd1=http://10.115.0.11:2380,etcd2=http://10.115.0.12:2380,etcd3=http://10.115.0.13:2380 \
+    -initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
+
+etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/global '{"immediate_commit":"all"}'
+
+etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":48,"failure_domain":"host"}}'
+
+#let pgs = {};
+#for (let n = 0; n < 48; n++) { let i = n/2 | 0; pgs[1+n] = { osd_set: [ (1+i%12+(i/12 | 0)*24), (1+12+i%12+(i/12 | 0)*24) ], primary: (1+(n%2)*12+i%12+(i/12 | 0)*24) }; };
+#console.log(JSON.stringify({ items: { 1: pgs } }));
+#etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pgs ...
+
+#    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
+#    --data_offset 427814912 \\
+
+#    --disk_alignment 4096 --journal_block_size 512 --meta_block_size 512 \\
+#    --data_offset 433434624 \\
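
Reviewer note: the commented-out JavaScript above hand-builds a static PG layout for pool 1 (48 PGs, `pg_size` 2). A runnable sketch of the same arithmetic follows. `buildPgs` is a name made up for this example. Reading OSDs 1..12 and 13..24 (then 25..36 and 37..48) as groups on different hosts is an assumption based on `"failure_domain":"host"`, not something the script states.

```javascript
// Sketch of the commented-out PG layout generator, wrapped in a function.
// buildPgs() is a hypothetical name used only for this example.
function buildPgs(pgCount)
{
    const pgs = {};
    for (let n = 0; n < pgCount; n++)
    {
        const i = n/2 | 0;                      // two consecutive PGs share one OSD pair
        const first = 1 + i%12 + (i/12 | 0)*24; // OSDs 1..12, then 25..36
        const second = first + 12;              // OSDs 13..24, then 37..48
        // Same as primary: 1+(n%2)*12+i%12+(i/12 | 0)*24 in the original snippet
        pgs[1+n] = { osd_set: [ first, second ], primary: (n%2 ? second : first) };
    }
    return pgs;
}

const pgs = buildPgs(48);
console.log(JSON.stringify({ items: { 1: pgs } }));
```

With 48 PGs every OSD from 1 to 48 appears in exactly two PGs and is primary for exactly one of them, so the primary load is spread evenly.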
--- a/mon/mon-main.js
+++ b/mon/mon-main.js
@ -1,5 +1,8 @@
 #!/usr/bin/node

+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 const Mon = require('./mon.js');

 const options = {};
@ -15,8 +18,8 @@ for (let i = 2; i < process.argv.length; i++)

 if (!options.etcd_url)
 {
-    console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' --etcd_url "http://127.0.0.1:2379,..." --etcd_prefix "/rage" --etcd_start_timeout 5');
+    console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' --etcd_url "http://127.0.0.1:2379,..." --etcd_prefix "/vitastor" --etcd_start_timeout 5 [--verbose 1]');
    process.exit();
 }

-new Mon(options).start();
+new Mon(options).start().catch(e => { console.error(e); process.exit(1); });
--- a/mon/mon.js
+++ b/mon/mon.js
--- a/mon/package.json
+++ b/mon/package.json
@ -1,14 +1,15 @@
 {
-  "name": "rage-mon",
+  "name": "vitastor-mon",
  "version": "1.0.0",
-  "description": "RAGE storage monitor service",
-  "main": "mon.js",
+  "description": "Vitastor SDS monitor service",
+  "main": "mon-main.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Vitaliy Filippov",
  "license": "UNLICENSED",
  "dependencies": {
+    "sprintf-js": "^1.1.2",
    "ws": "^7.2.5"
  }
 }
--- a/mon/simple-offsets.js
+++ b/mon/simple-offsets.js
@ -0,0 +1,56 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: MIT
+
+// Simple tool to calculate journal and metadata offsets for a single device
+// Will be replaced by smarter tools in the future
+
+const child_process = require('child_process');
+
+async function run()
+{
+    const options = {
+        object_size: 128*1024,
+        bitmap_granularity: 4096,
+        journal_size: 16*1024*1024,
+        device_block_size: 4096,
+        journal_offset: 0,
+        device_size: 0,
+    };
+    for (let i = 2; i < process.argv.length; i++)
+    {
+        if (process.argv[i].substr(0, 2) == '--')
+        {
+            options[process.argv[i].substr(2)] = process.argv[i+1];
+            i++;
+        }
+    }
+    const device_size = Number(options.device_size || await system("blockdev --getsize64 "+options.device));
+    if (!device_size)
+    {
+        process.stderr.write('Failed to get device size\n');
+        process.exit(1);
+    }
+    options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size;
+    const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
+    const entries_per_block = Math.floor(options.device_block_size / (24 + options.object_size/options.bitmap_granularity/8));
+    const object_count = Math.floor((device_size-meta_offset)/options.object_size);
+    const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
+    const data_offset = meta_offset + meta_size;
+    const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
+        : Math.round(meta_size/1024/1024*100)/100+" MB");
+    process.stdout.write(
+        `Metadata size: ${meta_size_fmt}\n`+
+        `Options for the OSD:\n`+
+        `    --journal_offset ${options.journal_offset}\n`+
+        `    --meta_offset ${meta_offset}\n`+
+        `    --data_offset ${data_offset}\n`+
+        (options.device_size ? `    --data_size ${device_size-data_offset}\n` : '')
+    );
+}
+
+function system(cmd)
+{
+    return new Promise((ok, no) => child_process.exec(cmd, { maxBuffer: 64*1024*1024 }, (err, stdout, stderr) => (err ? no(err) : ok(stdout))));
+}
+
+run().catch(console.error);
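
Reviewer note: a worked example of the arithmetic in `run()`, as a minimal sketch. `calcOffsets` is a hypothetical helper, not part of the tool. It uses the tool's defaults and takes 1920377991168 bytes as the device size, which is the per-OSD region size used in the systemd units earlier in this diff.

```javascript
// Hypothetical helper mirroring the arithmetic in run(); not part of the tool itself.
function calcOffsets({ device_size, object_size = 128*1024, bitmap_granularity = 4096,
    journal_size = 16*1024*1024, device_block_size = 4096, journal_offset = 0 })
{
    // Round up to a multiple of the device block size
    const align = (n) => Math.ceil(n/device_block_size)*device_block_size;
    journal_offset = align(journal_offset);
    const meta_offset = journal_offset + align(journal_size);
    // Per the formula in run(): 24 bytes per entry plus one bitmap bit
    // for every bitmap_granularity bytes of the object
    const entries_per_block = Math.floor(device_block_size / (24 + object_size/bitmap_granularity/8));
    const object_count = Math.floor((device_size - meta_offset)/object_size);
    const meta_size = Math.ceil(object_count/entries_per_block)*device_block_size;
    return { journal_offset, meta_offset, meta_size, data_offset: meta_offset + meta_size };
}

const offsets = calcOffsets({ device_size: 1920377991168 });
console.log(offsets);
```

With these inputs there are 146 entries per 4096-byte block and 14651193 objects, so the metadata area takes 100351 blocks (411037696 bytes). The resulting `meta_offset` of 16777216 and `data_offset` of 427814912 match the values passed to the OSD in the unit files.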
--- a/mon/stable-stringify.js
+++ b/mon/stable-stringify.js
@ -0,0 +1,78 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: MIT
+
+function stableStringify(obj, opts)
+{
+    if (!opts)
+        opts = {};
+    if (typeof opts === 'function')
+        opts = { cmp: opts };
+    let space = opts.space || '';
+    if (typeof space === 'number')
+        space = Array(space+1).join(' ');
+    const cycles = (typeof opts.cycles === 'boolean') ? opts.cycles : false;
+    const cmp = opts.cmp && (function (f)
+    {
+        return function (node)
+        {
+            return function (a, b)
+            {
+                let aobj = { key: a, value: node[a] };
+                let bobj = { key: b, value: node[b] };
+                return f(aobj, bobj);
+            };
+        };
+    })(opts.cmp);
+    const seen = new Map();
+    return (function stringify (parent, key, node, level)
+    {
+        const indent = space ? ('\n' + new Array(level + 1).join(space)) : '';
+        const colonSeparator = space ? ': ' : ':';
+        if (node === undefined)
+        {
+            return;
+        }
+        if (typeof node !== 'object' || node === null)
+        {
+            return JSON.stringify(node);
+        }
+        if (node instanceof Array)
+        {
+            const out = [];
+            for (let i = 0; i < node.length; i++)
+            {
+                const item = stringify(node, i, node[i], level+1) || JSON.stringify(null);
+                out.push(indent + space + item);
+            }
+            return '[' + out.join(',') + indent + ']';
+        }
+        else
+        {
+            if (seen.has(node))
+            {
+                if (cycles)
+                    return JSON.stringify('__cycle__');
+                throw new TypeError('Converting circular structure to JSON');
+            }
+            else
+                seen.set(node, true);
+            const keys = Object.keys(node).sort(cmp && cmp(node));
+            const out = [];
+            for (let i = 0; i < keys.length; i++)
+            {
+                const key = keys[i];
+                const value = stringify(node, key, node[key], level+1);
+                if (!value)
+                    continue;
+                const keyValue = JSON.stringify(key)
+                    + colonSeparator
+                    + value;
+                out.push(indent + space + keyValue);
+            }
+            seen.delete(node);
+            return '{' + out.join(',') + indent + '}';
+        }
+    })({ '': obj }, '', obj, 0);
+}
+
+module.exports = stableStringify;
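
Reviewer note: a minimal sketch of why a key-sorting serializer is useful. `sortedStringify` is a hypothetical, stripped-down stand-in for this example (no `cmp`, `space` or `cycles` options), not the module's export. That the monitor uses the stable output to compare serialized configuration is an assumption here.

```javascript
// Minimal sketch of the key-sorting idea only.
// sortedStringify() is a hypothetical name, not the module's export.
function sortedStringify(node)
{
    if (node === undefined)
        return undefined;
    if (typeof node !== 'object' || node === null)
        return JSON.stringify(node);
    if (Array.isArray(node))
        return '[' + node.map(v => sortedStringify(v) || 'null').join(',') + ']';
    const out = [];
    for (const key of Object.keys(node).sort())
    {
        const value = sortedStringify(node[key]);
        if (value !== undefined)
            out.push(JSON.stringify(key) + ':' + value);
    }
    return '{' + out.join(',') + '}';
}

const a = { pg_size: 2, name: 'testpool' };
const b = { name: 'testpool', pg_size: 2 };
// Plain JSON.stringify follows insertion order, so equal objects may serialize differently
console.log(JSON.stringify(a) === JSON.stringify(b));
// With sorted keys both serialize to the same string
console.log(sortedStringify(a) === sortedStringify(b));
```

Sorting keys makes string equality a usable test for deep equality of plain JSON-like objects.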
--- a/mon/test-nonuniform.js
+++ b/mon/test-nonuniform.js
@ -0,0 +1,130 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
+// Interesting real-world example coming from Ceph with EC and compression enabled.
+// EC parity chunks can't be compressed as efficiently as data chunks,
+// so they occupy about 2.26x more space in OSD object stores.
+// This leads to a really uneven OSD fill ratio in Ceph even when PGs are perfectly balanced.
+// But we support this case with the "parity_space" parameter in optimize_initial()/optimize_change().
+
+const LPOptimizer = require('./lp-optimizer.js');
+
+const osd_tree = {
+    ripper5: {
+        osd0: 3.493144989013672,
+        osd1: 3.493144989013672,
+        osd2: 3.454082489013672,
+        osd12: 3.461894989013672,
+    },
+    ripper7: {
+        osd4: 3.638690948486328,
+        osd5: 3.638690948486328,
+        osd6: 3.638690948486328,
+    },
+    ripper4: {
+        osd9: 3.4609375,
+        osd10: 3.4609375,
+        osd11: 3.4609375,
+    },
+    ripper6: {
+        osd3: 3.5849609375,
+        osd7: 3.5859336853027344,
+        osd8: 3.638690948486328,
+        osd13: 3.461894989013672
+    },
+};
+
+const prev_pgs = [[12,7,5],[6,11,12],[3,6,9],[10,0,5],[2,5,13],[9,8,6],[3,4,12],[7,4,12],[12,11,13],[13,6,0],[4,13,10],[9,7,6],[7,10,0],[10,8,0],[3,10,2],[3,0,4],[6,13,0],[13,10,0],[13,10,5],[8,11,6],[3,9,2],[2,8,5],[8,9,5],[3,12,11],[0,7,4],[13,11,1],[11,3,12],[12,8,10],[7,5,12],[2,13,5],[7,11,0],[13,2,6],[0,6,8],[13,1,6],[0,13,4],[0,8,10],[4,10,0],[8,12,4],[8,12,9],[12,7,4],[13,9,5],[3,2,11],[1,9,7],[1,8,5],[5,12,9],[3,5,12],[2,8,10],[0,8,4],[1,4,11],[7,10,2],[12,13,5],[3,1,11],[7,1,4],[4,12,8],[7,0,9],[11,1,8],[3,0,5],[11,13,0],[1,13,5],[12,7,10],[12,8,4],[11,13,5],[0,11,6],[2,11,3],[13,1,11],[2,7,10],[7,10,12],[7,12,10],[12,11,5],[13,12,10],[2,3,9],[4,3,9],[13,2,5],[7,12,6],[12,10,13],[9,8,1],[13,1,5],[9,5,12],[5,11,7],[6,2,9],[8,11,6],[12,5,8],[6,13,1],[7,6,11],[2,3,6],[8,5,9],[1,13,6],[9,3,2],[7,11,1],[3,10,1],[0,11,7],[3,0,5],[1,3,6],[6,0,9],[3,11,4],[8,10,2],[13,1,9],[12,6,9],[3,12,9],[12,8,9],[7,5,0],[8,12,5],[0,11,3],[12,11,13],[0,7,11],[0,3,10],[1,3,11],[2,7,11],[13,2,6],[9,12,13],[8,2,4],[0,7,4],[5,13,0],[13,12,9],[1,9,8],[0,10,3],[3,5,10],[7,12,9],[2,13,4],[12,7,5],[9,2,7],[3,2,9],[6,2,7],[3,1,9],[4,3,2],[5,3,11],[0,7,6],[1,6,13],[7,10,2],[12,4,8],[13,12,6],[7,5,11],[6,2,3],[2,7,6],[2,3,10],[2,7,10],[11,12,6],[0,13,5],[10,2,4],[13,0,11],[7,0,6],[8,9,4],[8,4,11],[7,11,2],[3,4,2],[6,1,3],[7,2,11],[8,9,4],[11,4,8],[10,3,1],[2,10,13],[1,7,11],[13,11,12],[2,6,9],[10,0,13],[7,10,4],[0,11,13],[13,10,1],[7,5,0],[7,12,10],[3,1,4],[7,1,5],[3,11,5],[7,5,0],[1,3,5],[10,5,12],[0,3,9],[7,1,11],[11,8,12],[3,6,2],[7,12,9],[7,11,12],[4,11,3],[0,11,13],[13,2,5],[1,5,8],[0,11,8],[3,5,1],[11,0,6],[3,11,2],[11,8,12],[4,1,3],[10,13,4],[13,9,6],[2,3,10],[12,7,9],[10,0,4],[10,13,2],[3,11,1],[7,2,9],[1,7,4],[13,1,4],[7,0,6],[5,3,9],[10,0,7],[0,7,10],[3,6,10],[13,0,5],[8,4,1],[3,1,10],[2,10,13],[13,0,5],[13,10,2],[12,7,9],[6,8,10],[6,1,8],[10,8,1],[13,5,0],[5,11,3],[7,6,1],[8,5,9],[2,13,11],[10,12,4],[13,4,1],[2,13,4],[11,7,0],[2,9,7],[1,7,6],[8,0,4],[8,1,9],[7,10,12],[13,9,6],
+    [7,6,11],[13,0,4],[1,8,4],[3,12,5],[10,3,1],[10,2,13],[2,4,8],[6,2,3],[3,0,10],[6,7,12],[8,12,5],[3,0,6],[13,12,10],[11,3,6],[9,0,13],[10,0,6],[7,5,2],[1,3,11],[7,10,2],[2,9,8],[11,13,12],[0,8,4],[8,12,11],[6,0,3],[1,13,4],[11,8,2],[12,3,6],[4,7,1],[7,6,12],[3,10,6],[0,10,7],[8,9,1],[0,10,6],[8,10,1]]
+    .map(pg => pg.map(n => 'osd'+n));
+
+const by_osd = {};
+
+for (let i = 0; i < prev_pgs.length; i++)
+{
+    for (let j = 0; j < prev_pgs[i].length; j++)
+    {
+        by_osd[prev_pgs[i][j]] = by_osd[prev_pgs[i][j]] || [];
+        by_osd[prev_pgs[i][j]][j] = (by_osd[prev_pgs[i][j]][j] || 0) + 1;
+    }
+}
+
+/*
+
+This set of PGs was balanced by hand, by heavily tuning OSD weights in Ceph:
+
+{
+  osd0: 4.2,
+  osd1: 3.5,
+  osd2: 3.45409,
+  osd3: 4.5,
+  osd4: 1.4,
+  osd5: 1.4,
+  osd6: 1.75,
+  osd7: 4.5,
+  osd8: 4.4,
+  osd9: 2.2,
+  osd10: 2.7,
+  osd11: 2,
+  osd12: 3.4,
+  osd13: 3.4,
+}
+
+EC+compression is a nightmare in Ceph, yeah :))
+
+To calculate the average ratio between data chunks and parity chunks we
+calculate the number of PG chunks for each chunk role for each OSD:
+
+{
+  osd12: [ 18, 22, 17 ],
+  osd7: [ 35, 22, 8 ],
+  osd5: [ 6, 17, 27 ],
+  osd6: [ 13, 12, 28 ],
+  osd11: [ 13, 26, 20 ],
+  osd3: [ 30, 20, 10 ],
+  osd9: [ 8, 12, 26 ],
+  osd10: [ 15, 23, 20 ],
+  osd0: [ 22, 22, 14 ],
+  osd2: [ 22, 16, 16 ],
+  osd13: [ 29, 19, 13 ],
+  osd8: [ 20, 18, 12 ],
+  osd4: [ 8, 10, 28 ],
+  osd1: [ 17, 17, 17 ]
+}
+
+And now we can pick a pair of OSDs and determine the ratio by solving the following:
+
+osd5 = 23*X + 27*Y = 3249728140
+osd13 = 48*X + 13*Y = 2991675992
+
+=>
+
+osd5 - 27/13*osd13 = 23*X - 27/13*48*X = -76.6923076923077*X = -2963752766.46154
+
+=>
+
+X = 38644720.1243731
+Y = (osd5-23*X)/27 = 87440725.0792377
+Y/X = 2.26268232239284 ~= 2.26
+
+Which means that parity chunks are compressed ~2.26 times worse than data chunks.
+
+Fine, let's try to optimize for it.
+
+*/
+
+async function run()
+{
+    const all_weights = Object.assign({}, ...Object.values(osd_tree));
+    const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
+    const eff = LPOptimizer.pg_list_space_efficiency(prev_pgs, all_weights, 2, 2.26);
+    const orig = eff*4.26 / total_weight;
+    console.log('Original efficiency was: '+Math.round(orig*10000)/100+' %');
+
+    let prev = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 256, parity_space: 2.26 });
+    LPOptimizer.print_change_stats(prev);
+
+    let next = await LPOptimizer.optimize_change({ prev_pgs, osd_tree, pg_size: 3, max_combinations: 10000, parity_space: 2.26 });
+    LPOptimizer.print_change_stats(next);
+}
+
+run().catch(console.error);
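
Reviewer note: the Y/X ratio derived in the block comment of this test can be re-checked numerically. The sketch below uses Cramer's rule instead of the elimination steps written in the comment; `solve2x2` is a hypothetical helper introduced only for this check.

```javascript
// Solve a*X + b*Y = e and c*X + d*Y = f by Cramer's rule.
// solve2x2() is a hypothetical helper, not part of lp-optimizer.js.
function solve2x2(a, b, e, c, d, f)
{
    const det = a*d - b*c;
    return { X: (e*d - b*f)/det, Y: (a*f - e*c)/det };
}

// osd5:  [ 6, 17, 27 ]  -> 6+17 = 23 data chunks, 27 parity chunks
// osd13: [ 29, 19, 13 ] -> 29+19 = 48 data chunks, 13 parity chunks
const { X, Y } = solve2x2(23, 27, 3249728140, 48, 13, 2991675992);
console.log(X, Y, Y/X);
```

The result agrees with the comment: X is about 38644720, Y is about 87440725, and Y/X is about 2.26, the value passed as `parity_space`.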
--- a/mon/test-optimize-undersized.js
+++ b/mon/test-optimize-undersized.js
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 const LPOptimizer = require('./lp-optimizer.js');

 const crush_tree = [
@ -40,31 +43,31 @@ async function run()
 {
    const cur_tree = {};
    console.log('Empty tree:');
-    let res = await LPOptimizer.optimize_initial(cur_tree, 256);
+    let res = await LPOptimizer.optimize_initial({ osd_tree: cur_tree, pg_size: 3, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nAdding 1st failure domain:');
    cur_tree['dom1'] = osd_tree['dom1'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nAdding 2nd failure domain:');
    cur_tree['dom2'] = osd_tree['dom2'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nAdding 3rd failure domain:');
    cur_tree['dom3'] = osd_tree['dom3'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemoving 3rd failure domain:');
    delete cur_tree['dom3'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemoving 2nd failure domain:');
    delete cur_tree['dom2'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
    console.log('\nRemoving 1st failure domain:');
    delete cur_tree['dom1'];
-    res = await LPOptimizer.optimize_change(res.int_pgs, cur_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
 }

--- a/mon/test-optimize.js
+++ b/mon/test-optimize.js
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 const LPOptimizer = require('./lp-optimizer.js');

 const osd_tree = {
@ -75,19 +78,37 @@ const crush_tree = [

 async function run()
 {
+    let res;
+
    // Test: add 1 OSD of almost the same size. Ideal data movement could be 1/12 = 8.33%. Actual is ~13%
-    // Space efficiency is ~99.5% in both cases.
-    let res = await LPOptimizer.optimize_initial(osd_tree, 256);
+    // Space efficiency is ~99% in all cases.
+
+    console.log('256 PGs, size=2');
+    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 2, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
-    console.log('adding osd.8');
+    console.log('\nAdding osd.8');
    osd_tree[500][8] = 3.58589;
-    res = await LPOptimizer.optimize_change(res.int_pgs, osd_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
    LPOptimizer.print_change_stats(res, false);
-    console.log('removing osd.8');
+    console.log('\nRemoving osd.8');
    delete osd_tree[500][8];
-    res = await LPOptimizer.optimize_change(res.int_pgs, osd_tree);
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
    LPOptimizer.print_change_stats(res, false);
-    res = await LPOptimizer.optimize_initial(LPOptimizer.flatten_tree(crush_tree, {}, 1, 3), 256);
+
+    console.log('\n256 PGs, size=3');
+    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 256 });
+    LPOptimizer.print_change_stats(res, false);
+    console.log('\nAdding osd.8');
+    osd_tree[500][8] = 3.58589;
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 3 });
+    LPOptimizer.print_change_stats(res, false);
+    console.log('\nRemoving osd.8');
+    delete osd_tree[500][8];
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 3 });
+    LPOptimizer.print_change_stats(res, false);
+
+    console.log('\n256 PGs, size=3, failure domain=rack');
+    res = await LPOptimizer.optimize_initial({ osd_tree: LPOptimizer.flatten_tree(crush_tree, {}, 1, 3), pg_size: 3, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
 }

--- a/msgr_receive.cpp
+++ b/msgr_receive.cpp
@ -1,24 +1,42 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include "messenger.h"

 void osd_messenger_t::read_requests()
 {
-    while (read_ready_clients.size() > 0)
+    for (int i = 0; i < read_ready_clients.size(); i++)
    {
-        int peer_fd = read_ready_clients[0];
+        int peer_fd = read_ready_clients[i];
        auto & cl = clients[peer_fd];
-        if (!cl.read_op || cl.read_remaining < receive_buffer_size)
+        if (cl.read_remaining < receive_buffer_size)
        {
            cl.read_iov.iov_base = cl.in_buf;
            cl.read_iov.iov_len = receive_buffer_size;
+            cl.read_msg.msg_iov = &cl.read_iov;
+            cl.read_msg.msg_iovlen = 1;
        }
        else
        {
-            cl.read_iov.iov_base = cl.read_buf;
+            cl.read_iov.iov_base = 0;
            cl.read_iov.iov_len = cl.read_remaining;
+            cl.read_msg.msg_iov = cl.recv_list.get_iovec();
+            cl.read_msg.msg_iovlen = cl.recv_list.get_size();
        }
-        cl.read_msg.msg_iov = &cl.read_iov;
-        cl.read_msg.msg_iovlen = 1;
-        read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + 1);
+        if (ringloop && !use_sync_send_recv)
+        {
+            io_uring_sqe* sqe = ringloop->get_sqe();
+            if (!sqe)
+            {
+                read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + i);
+                return;
+            }
+            ring_data_t* data = ((ring_data_t*)sqe->user_data);
+            data->callback = [this, peer_fd](ring_data_t *data) { handle_read(data->res, peer_fd); };
+            my_uring_prep_recvmsg(sqe, peer_fd, &cl.read_msg, 0);
+        }
+        else
+        {
            int result = recvmsg(peer_fd, &cl.read_msg, 0);
            if (result < 0)
            {
@ -26,18 +44,24 @@ void osd_messenger_t::read_requests()
            }
            handle_read(result, peer_fd);
        }
+    }
+    read_ready_clients.clear();
 }

 bool osd_messenger_t::handle_read(int result, int peer_fd)
 {
+    bool ret = false;
    auto cl_it = clients.find(peer_fd);
    if (cl_it != clients.end())
    {
        auto & cl = cl_it->second;
-        if (result < 0 && result != -EAGAIN)
+        if (result <= 0 && result != -EAGAIN)
+        {
+            // this is a client socket, so don't panic on error. just disconnect it
+            if (result != 0)
            {
-            // this is a client socket, so don't panic. just disconnect it
                printf("Client %d socket read error: %d (%s). Disconnecting client\n", peer_fd, -result, strerror(-result));
+            }
            stop_client(peer_fd);
            return false;
        }
@ -65,27 +89,37 @@ bool osd_messenger_t::handle_read(int result, int peer_fd)
                        cl.read_op = new osd_op_t;
                        cl.read_op->peer_fd = peer_fd;
                        cl.read_op->op_type = OSD_OP_IN;
-                        cl.read_buf = cl.read_op->req.buf;
+                        cl.recv_list.push_back(cl.read_op->req.buf, OSD_PACKET_SIZE);
                        cl.read_remaining = OSD_PACKET_SIZE;
                        cl.read_state = CL_READ_HDR;
                    }
-                    if (cl.read_remaining > remain)
+                    while (cl.recv_list.done < cl.recv_list.count && remain > 0)
                    {
-                        memcpy(cl.read_buf, curbuf, remain);
+                        iovec* cur = cl.recv_list.get_iovec();
+                        if (cur->iov_len > remain)
+                        {
+                            memcpy(cur->iov_base, curbuf, remain);
                            cl.read_remaining -= remain;
-                        cl.read_buf += remain;
+                            cur->iov_len -= remain;
+                            cur->iov_base += remain;
                            remain = 0;
-                        if (cl.read_remaining <= 0)
-                            handle_finished_read(cl);
                        }
                        else
                        {
-                        memcpy(cl.read_buf, curbuf, cl.read_remaining);
-                        curbuf += cl.read_remaining;
-                        remain -= cl.read_remaining;
-                        cl.read_remaining = 0;
-                        cl.read_buf = NULL;
-                        handle_finished_read(cl);
+                            memcpy(cur->iov_base, curbuf, cur->iov_len);
+                            curbuf += cur->iov_len;
+                            cl.read_remaining -= cur->iov_len;
+                            remain -= cur->iov_len;
+                            cur->iov_len = 0;
+                            cl.recv_list.done++;
+                        }
+                    }
+                    if (cl.recv_list.done >= cl.recv_list.count)
+                    {
+                        if (!handle_finished_read(cl))
+                        {
+                            goto fin;
+                        }
                    }
                }
            }
@ -93,27 +127,34 @@ bool osd_messenger_t::handle_read(int result, int peer_fd)
            {
                // Long data
                cl.read_remaining -= result;
-                cl.read_buf += result;
-                if (cl.read_remaining <= 0)
+                cl.recv_list.eat(result);
+                if (cl.recv_list.done >= cl.recv_list.count)
                {
                    handle_finished_read(cl);
                }
            }
            if (result >= cl.read_iov.iov_len)
            {
-                return true;
+                ret = true;
            }
        }
    }
-    return false;
+fin:
+    for (auto cb: set_immediate)
+    {
+        cb();
+    }
+    set_immediate.clear();
+    return ret;
 }

-void osd_messenger_t::handle_finished_read(osd_client_t & cl)
+bool osd_messenger_t::handle_finished_read(osd_client_t & cl)
 {
+    cl.recv_list.reset();
    if (cl.read_state == CL_READ_HDR)
    {
        if (cl.read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
-            handle_reply_hdr(&cl);
+            return handle_reply_hdr(&cl);
        else
            handle_op_hdr(&cl);
    }
@ -121,136 +162,130 @@ void osd_messenger_t::handle_finished_read(osd_client_t & cl)
    {
        // Operation is ready
        cl.received_ops.push_back(cl.read_op);
-        exec_op(cl.read_op);
+        set_immediate.push_back([this, op = cl.read_op]() { exec_op(op); });
        cl.read_op = NULL;
        cl.read_state = 0;
    }
    else if (cl.read_state == CL_READ_REPLY_DATA)
    {
        // Reply is ready
-        auto req_it = cl.sent_ops.find(cl.read_reply_id);
-        osd_op_t *request = req_it->second;
-        cl.sent_ops.erase(req_it);
-        cl.read_reply_id = 0;
-        delete cl.read_op;
+        handle_reply_ready(cl.read_op);
        cl.read_op = NULL;
        cl.read_state = 0;
-        // Measure subop latency
-        timespec tv_end;
-        clock_gettime(CLOCK_REALTIME, &tv_end);
-        stats.subop_stat_count[request->req.hdr.opcode]++;
-        if (!stats.subop_stat_count[request->req.hdr.opcode])
-        {
-            stats.subop_stat_count[request->req.hdr.opcode]++;
-            stats.subop_stat_sum[request->req.hdr.opcode] = 0;
-        }
-        stats.subop_stat_sum[request->req.hdr.opcode] += (
-            (tv_end.tv_sec - request->tv_begin.tv_sec)*1000000 +
-            (tv_end.tv_nsec - request->tv_begin.tv_nsec)/1000
-        );
-        request->callback(request);
    }
    else
    {
        assert(0);
    }
+    return true;
 }

 void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
 {
    osd_op_t *cur_op = cl->read_op;
-    if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ)
+    if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ)
    {
        if (cur_op->req.sec_rw.len > 0)
-            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
+            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
        cl->read_remaining = 0;
    }
-    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
+    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
+        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
        if (cur_op->req.sec_rw.len > 0)
-            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
+            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
        cl->read_remaining = cur_op->req.sec_rw.len;
    }
-    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ||
-        cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK)
+    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
+        cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
    {
        if (cur_op->req.sec_stab.len > 0)
-            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.sec_stab.len);
+            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_stab.len);
        cl->read_remaining = cur_op->req.sec_stab.len;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_READ)
    {
-        if (cur_op->req.rw.len > 0)
-            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.rw.len);
        cl->read_remaining = 0;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
    {
        if (cur_op->req.rw.len > 0)
-            cur_op->buf = memalign(MEM_ALIGNMENT, cur_op->req.rw.len);
+            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.rw.len);
        cl->read_remaining = cur_op->req.rw.len;
    }
    if (cl->read_remaining > 0)
    {
        // Read data
-        cl->read_buf = cur_op->buf;
+        cl->recv_list.push_back(cur_op->buf, cl->read_remaining);
        cl->read_state = CL_READ_DATA;
    }
    else
    {
        // Operation is ready
+        cl->received_ops.push_back(cur_op);
+        set_immediate.push_back([this, cur_op]() { exec_op(cur_op); });
        cl->read_op = NULL;
        cl->read_state = 0;
-        cl->received_ops.push_back(cur_op);
-        exec_op(cur_op);
    }
 }

-void osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
+bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
 {
-    osd_op_t *cur_op = cl->read_op;
-    auto req_it = cl->sent_ops.find(cur_op->req.hdr.id);
+    auto req_it = cl->sent_ops.find(cl->read_op->req.hdr.id);
    if (req_it == cl->sent_ops.end())
    {
        // Command out of sync. Drop connection
-        printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cur_op->req.hdr.id);
+        printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id);
        stop_client(cl->peer_fd);
-        return;
+        return false;
    }
    osd_op_t *op = req_it->second;
-    memcpy(op->reply.buf, cur_op->req.buf, OSD_PACKET_SIZE);
-    if ((op->reply.hdr.opcode == OSD_OP_SECONDARY_READ || op->reply.hdr.opcode == OSD_OP_READ) &&
+    memcpy(op->reply.buf, cl->read_op->req.buf, OSD_PACKET_SIZE);
+    cl->sent_ops.erase(req_it);
+    if ((op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ) &&
        op->reply.hdr.retval > 0)
    {
        // Read data. In this case we assume that the buffer is preallocated by the caller (!)
-        assert(op->buf);
+        assert(op->iov.count > 0);
+        cl->recv_list.append(op->iov);
+        delete cl->read_op;
+        cl->read_op = op;
        cl->read_state = CL_READ_REPLY_DATA;
-        cl->read_reply_id = op->req.hdr.id;
-        cl->read_buf = op->buf;
        cl->read_remaining = op->reply.hdr.retval;
    }
-    else if (op->reply.hdr.opcode == OSD_OP_SECONDARY_LIST && op->reply.hdr.retval > 0)
+    else if (op->reply.hdr.opcode == OSD_OP_SEC_LIST && op->reply.hdr.retval > 0)
    {
-        op->buf = memalign(MEM_ALIGNMENT, sizeof(obj_ver_id) * op->reply.hdr.retval);
+        assert(!op->iov.count);
+        delete cl->read_op;
+        cl->read_op = op;
        cl->read_state = CL_READ_REPLY_DATA;
-        cl->read_reply_id = op->req.hdr.id;
-        cl->read_buf = op->buf;
        cl->read_remaining = sizeof(obj_ver_id) * op->reply.hdr.retval;
+        op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
+        cl->recv_list.push_back(op->buf, cl->read_remaining);
    }
    else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
    {
-        op->buf = malloc(op->reply.hdr.retval);
+        assert(!op->iov.count);
+        delete cl->read_op;
+        cl->read_op = op;
        cl->read_state = CL_READ_REPLY_DATA;
-        cl->read_reply_id = op->req.hdr.id;
-        cl->read_buf = op->buf;
        cl->read_remaining = op->reply.hdr.retval;
+        op->buf = malloc_or_die(op->reply.hdr.retval);
+        cl->recv_list.push_back(op->buf, op->reply.hdr.retval);
    }
    else
    {
-        delete cl->read_op;
-        cl->read_state = 0;
-        cl->read_op = NULL;
-        cl->sent_ops.erase(req_it);
+        // It's fine to reuse cl->read_op for the next reply
+        handle_reply_ready(op);
+        cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
+        cl->read_remaining = OSD_PACKET_SIZE;
+        cl->read_state = CL_READ_HDR;
+    }
+    return true;
+}
+
+void osd_messenger_t::handle_reply_ready(osd_op_t *op)
+{
    // Measure subop latency
    timespec tv_end;
    clock_gettime(CLOCK_REALTIME, &tv_end);
@ -264,7 +299,9 @@ void osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
        (tv_end.tv_sec - op->tv_begin.tv_sec)*1000000 +
        (tv_end.tv_nsec - op->tv_begin.tv_nsec)/1000
    );
+    set_immediate.push_back([this, op]()
+    {
        // Copy lambda to be unaffected by `delete op`
        std::function<void(osd_op_t*)>(op->callback)(op);
-    }
+    });
 }
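Review note on handle_reply_ready(): the callback is now queued in set_immediate, and it is copied before being invoked because it may `delete op`. The sketch below illustrates that pattern in isolation. It is a minimal sketch, not the messenger's code: demo_op_t, demo_loop_t, defer_callback() and run_immediate() are hypothetical names, and only the copy-then-call expression is taken from the diff.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for osd_op_t: only the callback matters here.
struct demo_op_t
{
    int id = 0;
    std::function<void(demo_op_t*)> callback;
};

// Hypothetical stand-in for the object that owns the set_immediate queue.
struct demo_loop_t
{
    std::vector<std::function<void()>> set_immediate;

    // Queue the op's callback instead of calling it right away.
    void defer_callback(demo_op_t *op)
    {
        set_immediate.push_back([op]()
        {
            // Copy the callback first: it may `delete op`, which would destroy
            // op->callback while that very object is still executing.
            std::function<void(demo_op_t*)>(op->callback)(op);
        });
    }

    // Run everything queued so far; returns the number of callbacks invoked.
    int run_immediate()
    {
        std::vector<std::function<void()>> batch;
        batch.swap(set_immediate);
        for (auto & cb: batch)
            cb();
        return (int)batch.size();
    }
};

// Example: the callback frees its own op, which is safe because of the copy.
// Returns the id seen by the callback, or -2 if it ran too early.
inline int demo_deferred_delete()
{
    demo_loop_t loop;
    int seen_id = -1;
    demo_op_t *op = new demo_op_t;
    op->id = 42;
    op->callback = [&seen_id](demo_op_t *o) { seen_id = o->id; delete o; };
    loop.defer_callback(op);
    int before = seen_id; // nothing has run yet
    loop.run_immediate();
    return before == -1 ? seen_id : -2;
}
```

Swapping the queue into a local batch before running it lets callbacks queue further work without invalidating the vector being iterated. Whether the real set_immediate consumer does the same is not visible in this diff.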
--- a/msgr_send.cpp
+++ b/msgr_send.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include "messenger.h"

 void osd_messenger_t::outbox_push(osd_op_t *cur_op)
@ -28,7 +31,14 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
        }
    }
    cl.outbox.push_back(cur_op);
-    if (cl.write_op || cl.outbox.size() > 1 || !try_send(cl))
+    if (!ringloop)
+    {
+        while (cl.write_op || cl.outbox.size())
+        {
+            try_send(cl);
+        }
+    }
+    else if (cl.write_op || cl.outbox.size() > 1 || !try_send(cl))
    {
        if (cl.write_state == 0)
        {
@ -69,30 +79,71 @@ bool osd_messenger_t::try_send(osd_client_t & cl)
            {
                stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.rw.len;
            }
-            else if (cl.write_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
-                cl.write_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
+            else if (cl.write_op->req.hdr.opcode == OSD_OP_SEC_READ ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
            {
                stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.sec_rw.len;
            }
+            cl.send_list.push_back(cl.write_op->reply.buf, OSD_PACKET_SIZE);
+            if (cl.write_op->req.hdr.opcode == OSD_OP_READ ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_READ ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_LIST ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
+            {
+                cl.send_list.append(cl.write_op->iov);
            }
        }
-    cl.write_msg.msg_iov = cl.write_op->send_list.get_iovec();
-    cl.write_msg.msg_iovlen = cl.write_op->send_list.get_size();
+        else
+        {
+            cl.send_list.push_back(cl.write_op->req.buf, OSD_PACKET_SIZE);
+            if (cl.write_op->req.hdr.opcode == OSD_OP_WRITE ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
+                cl.write_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
+            {
+                cl.send_list.append(cl.write_op->iov);
+            }
+        }
+    }
+    cl.write_msg.msg_iov = cl.send_list.get_iovec();
+    cl.write_msg.msg_iovlen = cl.send_list.get_size();
+    if (ringloop && !use_sync_send_recv)
+    {
+        io_uring_sqe* sqe = ringloop->get_sqe();
+        if (!sqe)
+        {
+            return false;
+        }
+        ring_data_t* data = ((ring_data_t*)sqe->user_data);
+        data->callback = [this, peer_fd](ring_data_t *data) { handle_send(data->res, peer_fd); };
+        my_uring_prep_sendmsg(sqe, peer_fd, &cl.write_msg, 0);
+    }
+    else
+    {
        int result = sendmsg(peer_fd, &cl.write_msg, MSG_NOSIGNAL);
        if (result < 0)
+        {
            result = -errno;
+        }
        handle_send(result, peer_fd);
+    }
    return true;
 }

 void osd_messenger_t::send_replies()
 {
-    while (write_ready_clients.size() > 0)
+    for (int i = 0; i < write_ready_clients.size(); i++)
    {
-        auto & cl = clients[write_ready_clients[0]];
-        write_ready_clients.erase(write_ready_clients.begin(), write_ready_clients.begin() + 1);
-        try_send(cl);
+        int peer_fd = write_ready_clients[i];
+        if (!try_send(clients[peer_fd]))
+        {
+            write_ready_clients.erase(write_ready_clients.begin(), write_ready_clients.begin() + i);
+            return;
        }
+    }
+    write_ready_clients.clear();
 }

 void osd_messenger_t::handle_send(int result, int peer_fd)
@ -110,28 +161,14 @@ void osd_messenger_t::handle_send(int result, int peer_fd)
        }
        if (result >= 0)
        {
-            osd_op_t *cur_op = cl.write_op;
-            while (result > 0 && cur_op->send_list.sent < cur_op->send_list.count)
-            {
-                iovec & iov = cur_op->send_list.buf[cur_op->send_list.sent];
-                if (iov.iov_len <= result)
-                {
-                    result -= iov.iov_len;
-                    cur_op->send_list.sent++;
-                }
-                else
-                {
-                    iov.iov_len -= result;
-                    iov.iov_base += result;
-                    break;
-                }
-            }
-            if (cur_op->send_list.sent >= cur_op->send_list.count)
+            cl.send_list.eat(result);
+            if (cl.send_list.done >= cl.send_list.count)
            {
                // Done
-                if (cur_op->op_type == OSD_OP_IN)
+                cl.send_list.reset();
+                if (cl.write_op->op_type == OSD_OP_IN)
                {
-                    delete cur_op;
+                    delete cl.write_op;
                }
                else
                {
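Review note on handle_send(): the open-coded loop over cur_op->send_list is replaced by cl.send_list.eat(result). The implementation of eat() is not part of this hunk, so the following is only a sketch of what such a helper presumably does, reconstructed from the removed lines. demo_iov_list_t and demo_partial_send() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <sys/uio.h>

// Hypothetical miniature of a scatter/gather send list.
struct demo_iov_list_t
{
    std::vector<iovec> buf;
    size_t done = 0; // number of fully sent entries

    void push_back(void *p, size_t len)
    {
        iovec v;
        v.iov_base = p;
        v.iov_len = len;
        buf.push_back(v);
    }

    // Account for `result` bytes reported by sendmsg().
    void eat(size_t result)
    {
        while (result > 0 && done < buf.size())
        {
            iovec & iov = buf[done];
            if (iov.iov_len <= result)
            {
                // Entry sent completely, move on to the next one
                result -= iov.iov_len;
                done++;
            }
            else
            {
                // Entry sent partially, shrink it in place
                iov.iov_len -= result;
                iov.iov_base = (char*)iov.iov_base + result;
                break;
            }
        }
    }

    bool finished() const { return done >= buf.size(); }
};

// Example: 8-byte header + 16-byte payload, sendmsg() reports 12 bytes sent.
// Returns the number of bytes still left in the partially sent entry.
inline size_t demo_partial_send()
{
    char hdr[8], data[16];
    demo_iov_list_t list;
    list.push_back(hdr, sizeof(hdr));
    list.push_back(data, sizeof(data));
    list.eat(12);
    return list.finished() ? 0 : list.buf[list.done].iov_len;
}
```

With the list owned by the client rather than by each operation, the header and payload of an operation can be appended right before sending, which is what the new try_send() code does.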
--- a/nbd_proxy.cpp
+++ b/nbd_proxy.cpp
@ -0,0 +1,673 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+// Similar to qemu-nbd, but sets timeout and uses io_uring
+
+#include <linux/nbd.h>
+#include <sys/ioctl.h>
+
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <arpa/inet.h>
+#include <sys/un.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <signal.h>
+
+#include "epoll_manager.h"
+#include "cluster_client.h"
+
+const char *exe_name = NULL;
+
+class nbd_proxy
+{
+protected:
+    uint64_t inode = 0;
+
+    ring_loop_t *ringloop = NULL;
+    epoll_manager_t *epmgr = NULL;
+    cluster_client_t *cli = NULL;
+    ring_consumer_t consumer;
+
+    std::vector<iovec> send_list, next_send_list;
+    std::vector<void*> to_free;
+    int nbd_fd = -1;
+    void *recv_buf = NULL;
+    int receive_buffer_size = 9000;
+    nbd_request cur_req;
+    cluster_op_t *cur_op = NULL;
+    void *cur_buf = NULL;
+    int cur_left = 0;
+    int read_state = 0;
+    int read_ready = 0;
+    msghdr read_msg = { 0 }, send_msg = { 0 };
+    iovec read_iov = { 0 };
+
+public:
+    static json11::Json::object parse_args(int narg, const char *args[])
+    {
+        json11::Json::object cfg;
+        int pos = 0;
+        for (int i = 1; i < narg; i++)
+        {
+            if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
+            {
+                help();
+            }
+            else if (args[i][0] == '-' && args[i][1] == '-')
+            {
+                const char *opt = args[i]+2;
+                cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
+            }
+            else if (pos == 0)
+            {
+                cfg["command"] = args[i];
+                pos++;
+            }
+            else if (pos == 1 && (cfg["command"] == "map" || cfg["command"] == "unmap"))
+            {
+                int n = 0;
+                if (sscanf(args[i], "/dev/nbd%d", &n) > 0)
+                    cfg["dev_num"] = n;
+                else
+                    cfg["dev_num"] = args[i];
+                pos++;
+            }
+        }
+        return cfg;
+    }
+
+    void exec(json11::Json cfg)
+    {
+        if (cfg["command"] == "map")
+        {
+            start(cfg);
+        }
+        else if (cfg["command"] == "unmap")
+        {
+            if (cfg["dev_num"].is_null())
+            {
+                fprintf(stderr, "device name or number is missing\n");
+                exit(1);
+            }
+            unmap(cfg["dev_num"].uint64_value());
+        }
+        else if (cfg["command"] == "list" || cfg["command"] == "list-mapped")
+        {
+            auto mapped = list_mapped();
+            print_mapped(mapped, !cfg["json"].is_null());
+        }
+        else
+        {
+            help();
+        }
+    }
+
+    static void help()
+    {
+        printf(
+            "Vitastor NBD proxy\n"
+            "(c) Vitaliy Filippov, 2020 (VNPL-1.0 or GNU GPL 2.0+)\n\n"
+            "USAGE:\n"
+            "  %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n"
+            "  %s unmap /dev/nbd0\n"
+            "  %s list [--json]\n",
+            exe_name, exe_name, exe_name
+        );
+        exit(0);
+    }
+
+    void unmap(int dev_num)
+    {
+        char path[64] = { 0 };
+        sprintf(path, "/dev/nbd%d", dev_num);
+        int r, nbd = open(path, O_RDWR);
+        if (nbd < 0)
+        {
+            perror("open");
+            exit(1);
+        }
+        r = ioctl(nbd, NBD_DISCONNECT);
+        if (r < 0)
+        {
+            perror("NBD_DISCONNECT");
+            exit(1);
+        }
+        close(nbd);
+    }
+
+    void start(json11::Json cfg)
+    {
+        // Check options
+        if (cfg["etcd_address"].string_value() == "")
+        {
+            fprintf(stderr, "etcd_address is missing\n");
+            exit(1);
+        }
+        if (!cfg["size"].uint64_value())
+        {
+            fprintf(stderr, "device size is missing\n");
+            exit(1);
+        }
+        inode = cfg["inode"].uint64_value();
+        uint64_t pool = cfg["pool"].uint64_value();
+        if (pool)
+        {
+            inode = (inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
+        }
+        if (!(inode >> (64-POOL_ID_BITS)))
+        {
+            fprintf(stderr, "pool is missing\n");
+            exit(1);
+        }
+        // Initialize NBD
+        int sockfd[2];
+        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockfd) < 0)
+        {
+            perror("socketpair");
+            exit(1);
+        }
+        fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
+        nbd_fd = sockfd[0];
+        load_module();
+        if (!cfg["dev_num"].is_null())
+        {
+            if (run_nbd(sockfd, cfg["dev_num"].int64_value(), cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30) < 0)
+            {
+                perror("run_nbd");
+                exit(1);
+            }
+        }
+        else
+        {
+            // Find an unused device
+            int i = 0;
+            while (true)
+            {
+                int r = run_nbd(sockfd, i, cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30);
+                if (r == 0)
+                {
+                    printf("/dev/nbd%d\n", i);
+                    break;
+                }
+                else if (r == -1 && errno == ENOENT)
+                {
+                    fprintf(stderr, "No free NBD devices found\n");
+                    exit(1);
+                }
+                else if (r == -2 && errno == EBUSY)
+                {
+                    i++;
+                }
+                else
+                {
+                    fprintf(stderr, "run_nbd failed with code %d: %s\n", r, strerror(errno));
+                    exit(1);
+                }
+            }
+        }
+        if (cfg["foreground"].is_null())
+        {
+            daemonize();
+        }
+        // Create client
+        ringloop = new ring_loop_t(512);
+        epmgr = new epoll_manager_t(ringloop);
+        cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
+        // Initialize read state
+        read_state = CL_READ_HDR;
+        recv_buf = malloc_or_die(receive_buffer_size);
+        cur_buf = &cur_req;
+        cur_left = sizeof(nbd_request);
+        consumer.loop = [this]()
+        {
+            submit_read();
+            submit_send();
+            ringloop->submit();
+        };
+        ringloop->register_consumer(&consumer);
+        // Add FD to epoll
+        epmgr->tfd->set_fd_handler(sockfd[0], false, [this](int peer_fd, int epoll_events)
+        {
+            read_ready++;
+            submit_read();
+        });
+        while (1)
+        {
+            ringloop->loop();
+            ringloop->wait();
+        }
+    }
+
+    void load_module()
+    {
+        // access() returns 0 when the path exists, i.e. the module is already loaded
+        if (access("/sys/module/nbd", F_OK) == 0)
+        {
+            return;
+        }
+        int r;
+        if ((r = system("modprobe nbd")) != 0)
+        {
+            if (r < 0)
+                perror("Failed to load NBD kernel module");
+            else
+                fprintf(stderr, "Failed to load NBD kernel module\n");
+            exit(1);
+        }
+    }
+
+    void daemonize()
+    {
+        if (fork())
+            exit(0);
+        setsid();
+        if (fork())
+            exit(0);
+        chdir("/");
+        close(0);
+        close(1);
+        close(2);
+        open("/dev/null", O_RDONLY);
+        open("/dev/null", O_WRONLY);
+        open("/dev/null", O_WRONLY);
+    }
+
+    json11::Json::object list_mapped()
+    {
+        const char *self_filename = exe_name;
+        for (int i = 0; exe_name[i] != 0; i++)
+        {
+            if (exe_name[i] == '/')
+                self_filename = exe_name+i+1;
+        }
+        char path[64] = { 0 };
+        json11::Json::object mapped;
+        int dev_num = -1;
+        int pid;
+        while (true)
+        {
+            dev_num++;
+            sprintf(path, "/sys/block/nbd%d", dev_num);
+            if (access(path, F_OK) != 0)
+                break;
+            sprintf(path, "/sys/block/nbd%d/pid", dev_num);
+            std::string pid_str = read_file(path);
+            if (pid_str == "")
+                continue;
+            if (sscanf(pid_str.c_str(), "%d", &pid) < 1)
+            {
+                printf("Failed to read pid from /sys/block/nbd%d/pid\n", dev_num);
+                continue;
+            }
+            sprintf(path, "/proc/%d/cmdline", pid);
+            std::string cmdline = read_file(path);
+            std::vector<const char*> argv;
+            int last = 0;
+            for (int i = 0; i < cmdline.size(); i++)
+            {
+                if (cmdline[i] == 0)
+                {
+                    argv.push_back(cmdline.c_str()+last);
+                    last = i+1;
+                }
+            }
+            if (argv.size() > 0)
+            {
+                const char *pid_filename = argv[0];
+                for (int i = 0; argv[0][i] != 0; i++)
+                {
+                    if (argv[0][i] == '/')
+                        pid_filename = argv[0]+i+1;
+                }
+                if (!strcmp(pid_filename, self_filename))
+                {
+                    json11::Json::object cfg = nbd_proxy::parse_args(argv.size(), argv.data());
+                    if (cfg["command"] == "map")
+                    {
+                        cfg.erase("command");
+                        cfg["pid"] = pid;
+                        mapped["/dev/nbd"+std::to_string(dev_num)] = cfg;
+                    }
+                }
+            }
+        }
+        return mapped;
+    }
+
+    void print_mapped(json11::Json mapped, bool json)
+    {
+        if (json)
+        {
+            printf("%s\n", mapped.dump().c_str());
+        }
+        else
+        {
+            for (auto & dev: mapped.object_items())
+            {
+                printf("%s\n", dev.first.c_str());
+                for (auto & k: dev.second.object_items())
+                {
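+                    // Non-string values (for example, the numeric pid) are printed as JSON
+                    printf("%s: %s\n", k.first.c_str(), k.second.is_string() ? k.second.string_value().c_str() : k.second.dump().c_str());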
+                }
+                printf("\n");
+            }
+        }
+    }
+
+    std::string read_file(char *path)
+    {
+        int fd = open(path, O_RDONLY);
+        if (fd < 0)
+        {
+            if (errno == ENOENT)
+                return "";
+            auto err = "open "+std::string(path);
+            perror(err.c_str());
+            exit(1);
+        }
+        std::string r;
+        while (true)
+        {
+            int l = r.size();
+            r.resize(l + 1024);
+            int rd = read(fd, &r[l], 1024);
+            if (rd <= 0)
+            {
+                r.resize(l);
+                break;
+            }
+            r.resize(l + rd);
+        }
+        close(fd);
+        return r;
+    }
+
+protected:
+    int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout)
+    {
+        // Check handle size
+        assert(sizeof(cur_req.handle) == 8);
+        char path[64] = { 0 };
+        sprintf(path, "/dev/nbd%d", dev_num);
+        int r, nbd = open(path, O_RDWR), qd_fd;
+        if (nbd < 0)
+        {
+            return -1;
+        }
+        r = ioctl(nbd, NBD_SET_SOCK, sockfd[1]);
+        if (r < 0)
+        {
+            goto end_close;
+        }
+        r = ioctl(nbd, NBD_SET_BLKSIZE, 4096);
+        if (r < 0)
+        {
+            goto end_unmap;
+        }
+        r = ioctl(nbd, NBD_SET_SIZE, size);
+        if (r < 0)
+        {
+            goto end_unmap;
+        }
+        ioctl(nbd, NBD_SET_FLAGS, flags);
+        if (timeout >= 0)
+        {
+            r = ioctl(nbd, NBD_SET_TIMEOUT, (unsigned long)timeout);
+            if (r < 0)
+            {
+                goto end_unmap;
+            }
+        }
+        // Configure request size
+        sprintf(path, "/sys/block/nbd%d/queue/max_sectors_kb", dev_num);
+        qd_fd = open(path, O_WRONLY);
+        if (qd_fd < 0)
+        {
+            goto end_unmap;
+        }
+        write(qd_fd, "32768", 5);
+        close(qd_fd);
+        if (!fork())
+        {
+            // Run in child
+            close(sockfd[0]);
+            r = ioctl(nbd, NBD_DO_IT);
+            if (r < 0)
+            {
+                fprintf(stderr, "NBD device terminated with error: %s\n", strerror(errno));
+                kill(getppid(), SIGTERM);
+            }
+            close(sockfd[1]);
+            ioctl(nbd, NBD_CLEAR_QUE);
+            ioctl(nbd, NBD_CLEAR_SOCK);
+            exit(0);
+        }
+        close(sockfd[1]);
+        close(nbd);
+        return 0;
+    end_close:
+        r = errno;
+        close(nbd);
+        errno = r;
+        return -2;
+    end_unmap:
+        r = errno;
+        ioctl(nbd, NBD_CLEAR_SOCK);
+        close(nbd);
+        errno = r;
+        return -3;
+    }
+
+    void submit_send()
+    {
+        if (!send_list.size() || send_msg.msg_iovlen > 0)
+        {
+            return;
+        }
+        io_uring_sqe* sqe = ringloop->get_sqe();
+        if (!sqe)
+        {
+            return;
+        }
+        ring_data_t* data = ((ring_data_t*)sqe->user_data);
+        data->callback = [this](ring_data_t *data) { handle_send(data->res); };
+        send_msg.msg_iov = send_list.data();
+        send_msg.msg_iovlen = send_list.size();
+        my_uring_prep_sendmsg(sqe, nbd_fd, &send_msg, MSG_ZEROCOPY);
+    }
+
+    void handle_send(int result)
+    {
+        send_msg.msg_iovlen = 0;
+        if (result < 0 && result != -EAGAIN)
+        {
+            fprintf(stderr, "Socket disconnected: %s\n", strerror(-result));
+            exit(1);
+        }
+        int to_eat = 0;
+        while (result > 0 && to_eat < send_list.size())
+        {
+            if (result >= send_list[to_eat].iov_len)
+            {
+                free(to_free[to_eat]);
+                result -= send_list[to_eat].iov_len;
+                to_eat++;
+            }
+            else
+            {
+                send_list[to_eat].iov_base += result;
+                send_list[to_eat].iov_len -= result;
+                break;
+            }
+        }
+        if (to_eat > 0)
+        {
+            send_list.erase(send_list.begin(), send_list.begin() + to_eat);
+            to_free.erase(to_free.begin(), to_free.begin() + to_eat);
+        }
+        for (int i = 0; i < next_send_list.size(); i++)
+        {
+            send_list.push_back(next_send_list[i]);
+        }
+        next_send_list.clear();
+        if (send_list.size() > 0)
+        {
+            ringloop->wakeup();
+        }
+    }
+
+    void submit_read()
+    {
+        if (!read_ready || read_msg.msg_iovlen > 0)
+        {
+            return;
+        }
+        io_uring_sqe* sqe = ringloop->get_sqe();
+        if (!sqe)
+        {
+            return;
+        }
+        ring_data_t* data = ((ring_data_t*)sqe->user_data);
+        data->callback = [this](ring_data_t *data) { handle_read(data->res); };
+        if (cur_left < receive_buffer_size)
+        {
+            read_iov.iov_base = recv_buf;
+            read_iov.iov_len = receive_buffer_size;
+        }
+        else
+        {
+            read_iov.iov_base = cur_buf;
+            read_iov.iov_len = cur_left;
+        }
+        read_msg.msg_iov = &read_iov;
+        read_msg.msg_iovlen = 1;
+        my_uring_prep_recvmsg(sqe, nbd_fd, &read_msg, 0);
+    }
+
+    void handle_read(int result)
+    {
+        read_msg.msg_iovlen = 0;
+        if (result < 0 && result != -EAGAIN)
+        {
+            fprintf(stderr, "Socket disconnected: %s\n", strerror(-result));
+            exit(1);
+        }
+        if (result == -EAGAIN || result < read_iov.iov_len)
+        {
+            read_ready--;
+        }
+        if (read_ready > 0)
+        {
+            ringloop->wakeup();
+        }
+        void *b = recv_buf;
+        while (result > 0)
+        {
+            if (read_iov.iov_base == recv_buf)
+            {
+                int inc = result >= cur_left ? cur_left : result;
+                memcpy(cur_buf, b, inc);
+                cur_left -= inc;
+                result -= inc;
+                cur_buf += inc;
+                b += inc;
+            }
+            else
+            {
+                assert(result <= cur_left);
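+                // Data was read directly into cur_buf: advance it so that the next
+                // partial read continues where this one stopped
+                cur_buf += result;
+                cur_left -= result;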
+                result = 0;
+            }
+            if (cur_left <= 0)
+            {
+                handle_finished_read();
+            }
+        }
+    }
+
+    void handle_finished_read()
+    {
+        if (read_state == CL_READ_HDR)
+        {
+            int req_type = be32toh(cur_req.type);
+            if (be32toh(cur_req.magic) != NBD_REQUEST_MAGIC ||
+                (req_type != NBD_CMD_READ && req_type != NBD_CMD_WRITE && req_type != NBD_CMD_FLUSH))
+            {
+                printf("Unexpected request: magic=%x type=%x, terminating\n", be32toh(cur_req.magic), req_type);
+                exit(1);
+            }
+            uint64_t handle;
+            memcpy(&handle, cur_req.handle, sizeof(handle));
+#ifdef DEBUG
+            printf("request %lx +%x %lx\n", be64toh(cur_req.from), be32toh(cur_req.len), handle);
+#endif
+            void *buf = NULL;
+            cluster_op_t *op = new cluster_op_t;
+            if (req_type == NBD_CMD_READ || req_type == NBD_CMD_WRITE)
+            {
+                op->opcode = req_type == NBD_CMD_READ ? OSD_OP_READ : OSD_OP_WRITE;
+                op->inode = inode;
+                op->offset = be64toh(cur_req.from);
+                op->len = be32toh(cur_req.len);
+                buf = malloc_or_die(sizeof(nbd_reply) + op->len);
+                op->iov.push_back(buf + sizeof(nbd_reply), op->len);
+            }
+            else if (req_type == NBD_CMD_FLUSH)
+            {
+                op->opcode = OSD_OP_SYNC;
+                buf = malloc_or_die(sizeof(nbd_reply));
+            }
+            op->callback = [this, buf, handle](cluster_op_t *op)
+            {
+#ifdef DEBUG
+                printf("reply %lx e=%d\n", handle, op->retval);
+#endif
+                nbd_reply *reply = (nbd_reply*)buf;
+                reply->magic = htobe32(NBD_REPLY_MAGIC);
+                memcpy(reply->handle, &handle, 8);
+                reply->error = htobe32(op->retval < 0 ? -op->retval : 0);
+                auto & to_list = send_msg.msg_iovlen > 0 ? next_send_list : send_list;
+                if (op->retval < 0 || op->opcode != OSD_OP_READ)
+                    to_list.push_back({ .iov_base = buf, .iov_len = sizeof(nbd_reply) });
+                else
+                    to_list.push_back({ .iov_base = buf, .iov_len = sizeof(nbd_reply) + op->len });
+                to_free.push_back(buf);
+                delete op;
+                ringloop->wakeup();
+            };
+            if (req_type == NBD_CMD_WRITE)
+            {
+                cur_op = op;
+                cur_buf = buf + sizeof(nbd_reply);
+                cur_left = op->len;
+                read_state = CL_READ_DATA;
+            }
+            else
+            {
+                cur_op = NULL;
+                cur_buf = &cur_req;
+                cur_left = sizeof(nbd_request);
+                read_state = CL_READ_HDR;
+                cli->execute(op);
+            }
+        }
+        else
+        {
+            cli->execute(cur_op);
+            cur_op = NULL;
+            cur_buf = &cur_req;
+            cur_left = sizeof(nbd_request);
+            read_state = CL_READ_HDR;
+        }
+    }
+};
+
+int main(int narg, const char *args[])
+{
+    setvbuf(stdout, NULL, _IONBF, 0);
+    setvbuf(stderr, NULL, _IONBF, 0);
+    exe_name = args[0];
+    nbd_proxy *p = new nbd_proxy();
+    p->exec(nbd_proxy::parse_args(narg, args));
+    return 0;
+}
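Review note on nbd_proxy::start(): the pool ID is packed into the top POOL_ID_BITS bits of the 64-bit inode number, and a zero pool part is rejected as "pool is missing". The sketch below shows that packing in isolation. The value of POOL_ID_BITS is defined elsewhere in the tree and is not visible in this diff, so DEMO_POOL_ID_BITS = 16 is purely an assumption for illustration, as are the helper names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 16 is an illustrative value, not necessarily the real POOL_ID_BITS.
constexpr unsigned DEMO_POOL_ID_BITS = 16;

// Combine a pool ID and a per-pool inode number into one 64-bit inode ID.
// Any pool bits already present in `inode` are replaced by `pool`.
constexpr uint64_t demo_make_inode(uint64_t pool, uint64_t inode)
{
    return (inode & (((uint64_t)1 << (64-DEMO_POOL_ID_BITS)) - 1)) | (pool << (64-DEMO_POOL_ID_BITS));
}

// Extract the pool ID; 0 means that no pool was specified.
constexpr uint64_t demo_inode_pool(uint64_t inode)
{
    return inode >> (64-DEMO_POOL_ID_BITS);
}
```

An unsigned 64-bit constant is used for the mask so that the shift is well defined regardless of the width of `long` on the target platform.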
--- a/object_id.h
+++ b/object_id.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include <stdint.h>
--- a/osd.cpp
+++ b/osd.cpp
@ -1,6 +1,7 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <sys/socket.h>
-#include <sys/epoll.h>
-#include <sys/eventfd.h>
 #include <sys/poll.h>
 #include <netinet/in.h>
 #include <netinet/tcp.h>
@ -8,25 +9,6 @@

 #include "osd.h"

-#define MAX_EPOLL_EVENTS 64
-
-const char* osd_op_names[] = {
-    "",
-    "read",
-    "write",
-    "sync",
-    "stabilize",
-    "rollback",
-    "delete",
-    "sync_stab_all",
-    "list",
-    "show_config",
-    "primary_read",
-    "primary_write",
-    "primary_sync",
-    "primary_delete",
-};
-
 osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop)
 {
    this->config = config;
@ -39,19 +21,9 @@ osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringlo

    parse_config(config);

-    epoll_fd = epoll_create(1);
-    if (epoll_fd < 0)
-    {
-        throw std::runtime_error(std::string("epoll_create: ") + strerror(errno));
-    }
-    event_fd = eventfd(0, EFD_NONBLOCK);
-    if (event_fd < 0)
-    {
-        throw std::runtime_error(std::string("eventfd: ") + strerror(errno));
-    }
+    epmgr = new epoll_manager_t(ringloop);
+    this->tfd = epmgr->tfd;

-    this->tfd = new timerfd_manager_t(ringloop);
-    this->tfd->set_fd_handler = [this](int fd, std::function<void(int, int)> handler) { set_fd_handler(fd, handler); };
    this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
    {
        print_stats();
@ -66,40 +38,12 @@ osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringlo

    consumer.loop = [this]() { loop(); };
    ringloop->register_consumer(&consumer);
-    epoll_thread = new std::thread([this]()
-    {
-        int nfds;
-        epoll_event events[MAX_EPOLL_EVENTS];
-        while (1)
-        {
-            nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, -1);
-            {
-                std::lock_guard<std::mutex> guard(epoll_mutex);
-                for (int i = 0; i < nfds; i++)
-                {
-                    int fd = events[i].data.fd;
-                    int ev = events[i].events;
-                    epoll_ready[fd] |= ev;
-                }
-                uint64_t n = 1;
-                write(event_fd, &n, 8);
-            }
-        }
-    });
 }

 osd_t::~osd_t()
 {
-    close(epoll_fd);
-    epoll_thread->join();
-    delete epoll_thread;
-    if (tfd)
-    {
-        delete tfd;
-        tfd = NULL;
-    }
    ringloop->unregister_consumer(&consumer);
-    close(event_fd);
+    delete epmgr;
    close(listen_fd);
 }

@ -139,12 +83,6 @@ void osd_t::parse_config(blockstore_config_t & config)
        if (client_queue_depth < 128)
            client_queue_depth = 128;
    }
-    if (config.find("pg_stripe_size") != config.end())
-    {
-        pg_stripe_size = strtoull(config["pg_stripe_size"].c_str(), NULL, 10);
-        if (!pg_stripe_size || !bs_block_size || pg_stripe_size < bs_block_size || (pg_stripe_size % bs_block_size) != 0)
-            pg_stripe_size = DEFAULT_PG_STRIPE_SIZE;
-    }
    recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
    if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
        recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
@ -211,20 +149,10 @@ void osd_t::bind_socket()

    fcntl(listen_fd, F_SETFL, fcntl(listen_fd, F_GETFL, 0) | O_NONBLOCK);

-    epoll_event ev;
-    ev.data.fd = listen_fd;
-    ev.events = EPOLLIN | EPOLLET;
-    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
-    {
-        close(listen_fd);
-        close(epoll_fd);
-        throw std::runtime_error(std::string("epoll_ctl (add listen_fd): ") + strerror(errno));
-    }
-
-    epoll_handlers[listen_fd] = [this](int peer_fd, int epoll_events)
+    epmgr->set_fd_handler(listen_fd, false, [this](int fd, int events)
    {
        c_cli.accept_connections(listen_fd);
-    };
+    });
 }

 bool osd_t::shutdown()
@ -239,81 +167,12 @@ bool osd_t::shutdown()

 void osd_t::loop()
 {
-    std::map<int,int> cur_epoll;
-    {
-        std::lock_guard<std::mutex> guard(epoll_mutex);
-        cur_epoll.swap(epoll_ready);
-    }
-    for (auto p: cur_epoll)
-    {
-        auto cb_it = epoll_handlers.find(p.first);
-        if (cb_it != epoll_handlers.end())
-        {
-            cb_it->second(p.first, p.second);
-        }
-    }
-    if (!(wait_state & 2))
-    {
-        handle_eventfd();
-        wait_state = wait_state | 2;
-    }
    handle_peers();
    c_cli.read_requests();
    c_cli.send_replies();
    ringloop->submit();
 }

-void osd_t::set_fd_handler(int fd, std::function<void(int, int)> handler)
-{
-    if (handler != NULL)
-    {
-        bool exists = epoll_handlers.find(fd) != epoll_handlers.end();
-        epoll_event ev;
-        ev.data.fd = fd;
-        ev.events = EPOLLOUT | EPOLLIN | EPOLLRDHUP | EPOLLET;
-        if (epoll_ctl(epoll_fd, exists ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev) < 0)
-        {
-            throw std::runtime_error(std::string(exists ? "epoll_ctl (mod fd): " : "epoll_ctl (add fd): ") + strerror(errno));
-        }
-        epoll_handlers[fd] = handler;
-    }
-    else
-    {
-        if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL) < 0 && errno != ENOENT)
-        {
-            throw std::runtime_error(std::string("epoll_ctl (remove fd): ") + strerror(errno));
-        }
-        epoll_handlers.erase(fd);
-    }
-}
-
-void osd_t::handle_eventfd()
-{
-    io_uring_sqe *sqe = ringloop->get_sqe();
-    if (!sqe)
-    {
-        throw std::runtime_error("can't get SQE, will fall out of sync with eventfd");
-    }
-    ring_data_t *data = ((ring_data_t*)sqe->user_data);
-    my_uring_prep_poll_add(sqe, event_fd, POLLIN);
-    data->callback = [this](ring_data_t *data)
-    {
-        if (data->res < 0)
-        {
-            throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
-        }
-        handle_eventfd();
-    };
-    ringloop->submit();
-    uint64_t n = 0;
-    size_t res = read(event_fd, &n, 8);
-    if (res == 8)
-    {
-        // No need to do anything, the loop has already woken up
-        ringloop->wakeup();
-    }
-}
-
 void osd_t::exec_op(osd_op_t *cur_op)
 {
    clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
@ -324,21 +183,28 @@ void osd_t::exec_op(osd_op_t *cur_op)
        return;
    }
    inflight_ops++;
-    cur_op->send_list.push_back(cur_op->reply.buf, OSD_PACKET_SIZE);
    if (cur_op->req.hdr.magic != SECONDARY_OSD_OP_MAGIC ||
        cur_op->req.hdr.opcode < OSD_OP_MIN || cur_op->req.hdr.opcode > OSD_OP_MAX ||
-        (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ || cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE) &&
-        (cur_op->req.sec_rw.len > OSD_RW_MAX || cur_op->req.sec_rw.len % bs_disk_alignment || cur_op->req.sec_rw.offset % bs_disk_alignment) ||
-        (cur_op->req.hdr.opcode == OSD_OP_READ || cur_op->req.hdr.opcode == OSD_OP_WRITE || cur_op->req.hdr.opcode == OSD_OP_DELETE) &&
-        (cur_op->req.rw.len > OSD_RW_MAX || cur_op->req.rw.len % bs_disk_alignment || cur_op->req.rw.offset % bs_disk_alignment))
+        ((cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
+            cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
+            cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
+            (cur_op->req.sec_rw.len > OSD_RW_MAX ||
+            cur_op->req.sec_rw.len % bs_disk_alignment ||
+            cur_op->req.sec_rw.offset % bs_disk_alignment)) ||
+        ((cur_op->req.hdr.opcode == OSD_OP_READ ||
+            cur_op->req.hdr.opcode == OSD_OP_WRITE ||
+            cur_op->req.hdr.opcode == OSD_OP_DELETE) &&
+            (cur_op->req.rw.len > OSD_RW_MAX ||
+            cur_op->req.rw.len % bs_disk_alignment ||
+            cur_op->req.rw.offset % bs_disk_alignment)))
    {
        // Bad command
        finish_op(cur_op, -EINVAL);
        return;
    }
    if (readonly &&
-        cur_op->req.hdr.opcode != OSD_OP_SECONDARY_READ &&
-        cur_op->req.hdr.opcode != OSD_OP_SECONDARY_LIST &&
+        cur_op->req.hdr.opcode != OSD_OP_SEC_READ &&
+        cur_op->req.hdr.opcode != OSD_OP_SEC_LIST &&
        cur_op->req.hdr.opcode != OSD_OP_READ &&
        cur_op->req.hdr.opcode != OSD_OP_SHOW_CONFIG)
    {
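Reviewer sketch (not part of the patch): the reworked condition in exec_op() applies the same three checks to both the secondary and the primary read/write opcodes. The helper below isolates that check. The name sketch_rw_range_valid and its parameters are hypothetical; max_len and alignment stand in for OSD_RW_MAX and bs_disk_alignment, whose values are not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not present in the patch.
// Returns true when a read/write range passes the checks made in exec_op():
// the length does not exceed the limit, and both length and offset are
// multiples of the disk alignment.
inline bool sketch_rw_range_valid(uint64_t offset, uint64_t len, uint64_t alignment, uint64_t max_len)
{
    // A zero alignment would make the modulo below undefined
    assert(alignment != 0);
    return len <= max_len && len % alignment == 0 && offset % alignment == 0;
}
```

With a helper of this shape, the condition in exec_op() could call it once with req.sec_rw and once with req.rw instead of spelling out the three comparisons twice.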
--- a/osd.h
+++ b/osd.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #include <sys/types.h>
@ -12,12 +15,11 @@

 #include <set>
 #include <deque>
-#include <mutex>
-#include <thread>

 #include "blockstore.h"
 #include "ringloop.h"
 #include "timerfd_manager.h"
+#include "epoll_manager.h"
 #include "osd_peering_pg.h"
 #include "messenger.h"
 #include "etcd_state_client.h"
@ -35,12 +37,9 @@
 #define DEFAULT_AUTOSYNC_INTERVAL 5
 #define MAX_RECOVERY_QUEUE 2048
 #define DEFAULT_RECOVERY_QUEUE 4
-#define DEFAULT_PG_STRIPE_SIZE 4*1024*1024 // 4 MB by default

 //#define OSD_STUB

-extern const char* osd_op_names[];
-
 struct osd_object_id_t
 {
    osd_num_t osd_num;
@ -51,7 +50,6 @@ struct osd_recovery_op_t
 {
    int st = 0;
    bool degraded = false;
-    pg_num_t pg_num = 0;
    object_id oid = { 0 };
    osd_op_t *osd_op = NULL;
 };
@ -85,18 +83,19 @@ class osd_t
    std::string etcd_lease_id;
    json11::Json self_state;
    bool loading_peer_config = false;
-    std::set<pg_num_t> pg_state_dirty;
+    std::set<pool_pg_num_t> pg_state_dirty;
    bool pg_config_applied = false;
    bool etcd_reporting_pg_state = false;
    bool etcd_reporting_stats = false;

    // peers and PGs

-    std::map<pg_num_t, pg_t> pgs;
-    std::set<pg_num_t> dirty_pgs;
+    std::map<pool_id_t, pg_num_t> pg_counts;
+    std::map<pool_pg_num_t, pg_t> pgs;
+    std::set<pool_pg_num_t> dirty_pgs;
+    std::set<osd_num_t> dirty_osds;
    uint64_t misplaced_objects = 0, degraded_objects = 0, incomplete_objects = 0;
    int peering_state = 0;
-    unsigned pg_count = 0;
    std::map<object_id, osd_recovery_op_t> recovery_ops;
    osd_op_t *autosync_op = NULL;

@ -110,20 +109,13 @@ class osd_t
    int inflight_ops = 0;
    blockstore_t *bs;
    uint32_t bs_block_size, bs_disk_alignment;
-    uint64_t pg_stripe_size = DEFAULT_PG_STRIPE_SIZE;
    ring_loop_t *ringloop;
    timerfd_manager_t *tfd = NULL;
+    epoll_manager_t *epmgr = NULL;

-    int wait_state = 0;
-    int epoll_fd = 0;
-    int event_fd = 0;
-    std::thread *epoll_thread = NULL;
-    std::mutex epoll_mutex;
-    std::map<int, int> epoll_ready;
    int listening_port = 0;
    int listen_fd = 0;
    ring_consumer_t consumer;
-    std::map<int, std::function<void(int, int)>> epoll_handlers;

    // op statistics
    osd_op_stats_t prev_stats;
@ -134,7 +126,8 @@ class osd_t
    // cluster connection
    void parse_config(blockstore_config_t & config);
    void init_cluster();
-    void on_change_osd_state_hook(uint64_t osd_num);
+    void on_change_osd_state_hook(osd_num_t peer_osd);
+    void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
    void on_change_etcd_state_hook(json11::Json::object & changes);
    void on_load_config_hook(json11::Json::object & changes);
    json11::Json on_load_pgs_checks_hook();
@ -155,24 +148,22 @@ class osd_t

    // event loop, socket read/write
    void loop();
-    void set_fd_handler(int fd, std::function<void(int, int)> handler);
-    void handle_eventfd();

    // peer handling (primary OSD logic)
    void parse_test_peer(std::string peer);
    void handle_peers();
    void repeer_pgs(osd_num_t osd_num);
-    void start_pg_peering(pg_num_t pg_num);
+    void start_pg_peering(pg_t & pg);
    void submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);
    void submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);
    void discard_list_subop(osd_op_t *list_op);
-    bool stop_pg(pg_num_t pg_num);
+    bool stop_pg(pg_t & pg);
    void finish_stop_pg(pg_t & pg);

    // flushing, recovery and backfill
-    void submit_pg_flush_ops(pg_num_t pg_num);
-    void handle_flush_op(bool rollback, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval);
-    void submit_flush_op(pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
+    void submit_pg_flush_ops(pg_t & pg);
+    void handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval);
+    void submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
    bool pick_next_recovery(osd_recovery_op_t &op);
    void submit_recovery_op(osd_recovery_op_t *op);
    bool continue_recovery();
@ -203,13 +194,16 @@ class osd_t
    void handle_primary_bs_subop(osd_op_t *subop);
    void add_bs_subop_stats(osd_op_t *subop);
    void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
-    void submit_primary_subops(int submit_type, int read_pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
-    void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, pg_osd_set_t & loc_set);
+    void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
+    void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
    void submit_primary_sync_subops(osd_op_t *cur_op);
    void submit_primary_stab_subops(osd_op_t *cur_op);

-    inline pg_num_t map_to_pg(object_id oid)
+    inline pg_num_t map_to_pg(object_id oid, uint64_t pg_stripe_size)
    {
+        uint64_t pg_count = pg_counts[INODE_POOL(oid.inode)];
+        if (!pg_count)
+            pg_count = 1;
        return (oid.inode + oid.stripe / pg_stripe_size) % pg_count + 1;
    }
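Reviewer sketch (not part of the patch): a standalone version of the per-pool mapping done by map_to_pg(), assuming the pool ID is stored in the top 16 bits of the inode number as INODE_POOL() in osd_id.h suggests. sketch_object_id and sketch_map_to_pg are hypothetical names. Unlike the patched method, the sketch uses find() so that looking up an unknown pool does not insert a zero entry into the map.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

typedef uint32_t pool_id_t;
typedef uint32_t pg_num_t;

#define POOL_ID_BITS 16
#define INODE_POOL(inode) ((pool_id_t)((inode) >> (64 - POOL_ID_BITS)))

// Hypothetical stand-in for object_id (the real type lives in object_id.h)
struct sketch_object_id
{
    uint64_t inode;
    uint64_t stripe;
};

// Same arithmetic as osd_t::map_to_pg(), with the PG count table passed in explicitly
inline pg_num_t sketch_map_to_pg(const std::map<pool_id_t, pg_num_t> & pg_counts,
    sketch_object_id oid, uint64_t pg_stripe_size)
{
    // A zero stripe size would divide by zero below
    assert(pg_stripe_size != 0);
    auto it = pg_counts.find(INODE_POOL(oid.inode));
    uint64_t pg_count = it != pg_counts.end() ? it->second : 0;
    // Unknown or empty pools fall back to a single PG, as in the patch
    if (!pg_count)
        pg_count = 1;
    // PG numbers are 1-based
    return (oid.inode + oid.stripe / pg_stripe_size) % pg_count + 1;
}
```

Consecutive stripes of pg_stripe_size bytes within one inode rotate through the PGs of that inode's pool, and every pool has its own PG count.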

--- a/osd_cluster.cpp
+++ b/osd_cluster.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "osd.h"
 #include "base64.h"
 #include "etcd_state_client.h"
@ -14,7 +17,7 @@ void osd_t::init_cluster()
    {
        if (run_primary)
        {
-            // Test version of clustering code with 1 PG and 2 peers
+            // Test version of clustering code with 1 pool, 1 PG and 2 peers
            // Example: peers = 2:127.0.0.1:11204,3:127.0.0.1:11205
            std::string peerstr = config["peers"];
            while (peerstr.size())
@ -27,15 +30,16 @@ void osd_t::init_cluster()
            {
                throw std::runtime_error("run_primary requires at least 2 peers");
            }
-            pgs[1] = (pg_t){
+            pgs[{ 1, 1 }] = (pg_t){
                .state = PG_PEERING,
                .pg_cursize = 0,
+                .pool_id = 1,
                .pg_num = 1,
                .target_set = { 1, 2, 3 },
                .cur_set = { 0, 0, 0 },
            };
-            report_pg_state(pgs[1]);
-            pg_count = 1;
+            report_pg_state(pgs[{ 1, 1 }]);
+            pg_counts[1] = 1;
        }
        bind_socket();
    }
@ -43,7 +47,8 @@ void osd_t::init_cluster()
    {
        st_cli.tfd = tfd;
        st_cli.log_level = log_level;
-        st_cli.on_change_osd_state_hook = [this](uint64_t peer_osd) { on_change_osd_state_hook(peer_osd); };
+        st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
+        st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
        st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_etcd_state_hook(changes); };
        st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
        st_cli.load_pgs_checks_hook = [this]() { return on_load_pgs_checks_hook(); };
@ -182,7 +187,7 @@ void osd_t::report_statistics()
        pg_stats["write_osd_set"] = pg.cur_set;
        txn.push_back(json11::Json::object {
            { "request_put", json11::Json::object {
-                { "key", base64_encode(st_cli.etcd_prefix+"/pg/stats/"+std::to_string(pg.pg_num)) },
+                { "key", base64_encode(st_cli.etcd_prefix+"/pg/stats/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num)) },
                { "value", base64_encode(json11::Json(pg_stats).dump()) },
            } }
        });
@ -207,7 +212,7 @@ void osd_t::report_statistics()
    });
 }

-void osd_t::on_change_osd_state_hook(uint64_t peer_osd)
+void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
 {
    if (c_cli.wanted_peers.find(peer_osd) != c_cli.wanted_peers.end())
    {
@ -222,6 +227,30 @@ void osd_t::on_change_etcd_state_hook(json11::Json::object & changes)
    apply_pg_config();
 }

+void osd_t::on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num)
+{
+    auto pg_it = pgs.find({
+        .pool_id = pool_id,
+        .pg_num = pg_num,
+    });
+    if (pg_it != pgs.end() && pg_it->second.epoch > pg_it->second.reported_epoch &&
+        st_cli.pool_config[pool_id].pg_config[pg_num].epoch >= pg_it->second.epoch)
+    {
+        pg_it->second.reported_epoch = st_cli.pool_config[pool_id].pg_config[pg_num].epoch;
+        object_id oid = { 0 };
+        bool first = true;
+        for (auto op: pg_it->second.write_queue)
+        {
+            if (first || oid != op.first)
+            {
+                oid = op.first;
+                first = false;
+                continue_primary_write(op.second);
+            }
+        }
+    }
+}
+
 void osd_t::on_load_config_hook(json11::Json::object & global_config)
 {
    blockstore_config_t osd_config = this->config;
@ -429,23 +458,19 @@ void osd_t::on_load_pgs_hook(bool success)

 void osd_t::apply_pg_count()
 {
-    pg_num_t pg_count = st_cli.pg_config.size();
-    if (pg_count > 0 && (st_cli.pg_config.begin()->first != 1 || std::prev(st_cli.pg_config.end())->first != pg_count))
+    for (auto & pool_item: st_cli.pool_config)
    {
-        printf("Invalid PG configuration: PG numbers don't cover the whole 1..%d range\n", pg_count);
-        force_stop(1);
-        return;
-    }
-    if (this->pg_count != 0 && this->pg_count != pg_count)
+        if (pool_item.second.real_pg_count != 0 &&
+            pool_item.second.real_pg_count != pg_counts[pool_item.first])
        {
-        // Check that all PGs are offline. It is not allowed to change PG count when any PGs are online
+            // Check that all pool PGs are offline. It is not allowed to change PG count when any PGs are online
            // The external tool must wait for all PGs to come down before changing PG count
            // If it doesn't wait, a restarted OSD may apply the new count immediately which will lead to bugs
            // So an OSD just dies if it detects PG count change while there are active PGs
            int still_active = 0;
            for (auto & kv: pgs)
            {
-            if (kv.second.state & PG_ACTIVE)
+                if (kv.first.pool_id == pool_item.first && (kv.second.state & PG_ACTIVE))
                {
                    still_active++;
                }
@ -457,24 +482,28 @@ void osd_t::apply_pg_count()
                return;
            }
        }
-    this->pg_count = pg_count;
+        this->pg_counts[pool_item.first] = pool_item.second.real_pg_count;
+    }
 }

 void osd_t::apply_pg_config()
 {
    bool all_applied = true;
-    for (auto & kv: st_cli.pg_config)
+    for (auto & pool_item: st_cli.pool_config)
+    {
+        auto pool_id = pool_item.first;
+        for (auto & kv: pool_item.second.pg_config)
        {
            pg_num_t pg_num = kv.first;
            auto & pg_cfg = kv.second;
            bool take = pg_cfg.exists && pg_cfg.primary == this->osd_num &&
                !pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num);
-        bool currently_taken = this->pgs.find(pg_num) != this->pgs.end() &&
-            this->pgs[pg_num].state != PG_OFFLINE;
+            auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
+            bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;
            if (currently_taken && !take)
            {
                // Stop this PG
-            stop_pg(pg_num);
+                stop_pg(pg_it->second);
            }
            else if (take)
            {
@ -506,9 +535,9 @@ void osd_t::apply_pg_config()
                }
                if (currently_taken)
                {
-                if (this->pgs[pg_num].state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING))
+                    if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING))
                    {
-                    if (this->pgs[pg_num].target_set == pg_cfg.target_set)
+                        if (pg_it->second.target_set == pg_cfg.target_set)
                        {
                            // No change in osd_set; history changes are ignored
                            continue;
@ -516,18 +545,18 @@ void osd_t::apply_pg_config()
                        else
                        {
                            // Stop PG, reapply change after stopping
-                        stop_pg(pg_num);
+                            stop_pg(pg_it->second);
                            all_applied = false;
                            continue;
                        }
                    }
-                else if (this->pgs[pg_num].state & PG_STOPPING)
+                    else if (pg_it->second.state & PG_STOPPING)
                    {
                        // Reapply change after stopping
                        all_applied = false;
                        continue;
                    }
-                else if (this->pgs[pg_num].state & PG_STARTING)
+                    else if (pg_it->second.state & PG_STARTING)
                    {
                        if (pg_cfg.cur_primary == this->osd_num)
                        {
@ -542,19 +571,25 @@ void osd_t::apply_pg_config()
                    }
                    else
                    {
-                    throw std::runtime_error("Unexpected PG "+std::to_string(pg_num)+" state: "+std::to_string(this->pgs[pg_num].state));
+                        throw std::runtime_error("Unexpected PG "+std::to_string(pg_num)+" state: "+std::to_string(pg_it->second.state));
                    }
                }
-            this->pgs[pg_num] = (pg_t){
+                auto & pg = this->pgs[{ .pool_id = pool_id, .pg_num = pg_num }];
+                pg = (pg_t){
                    .state = pg_cfg.cur_primary == this->osd_num ? PG_PEERING : PG_STARTING,
+                    .scheme = pool_item.second.scheme,
                    .pg_cursize = 0,
+                    .pg_size = pool_item.second.pg_size,
+                    .pg_minsize = pool_item.second.pg_minsize,
+                    .pool_id = pool_id,
                    .pg_num = pg_num,
+                    .reported_epoch = pg_cfg.epoch,
                    .target_history = pg_cfg.target_history,
                    .all_peers = std::vector<osd_num_t>(all_peers.begin(), all_peers.end()),
                    .target_set = pg_cfg.target_set,
                };
-            this->pg_state_dirty.insert(pg_num);
-            this->pgs[pg_num].print_state();
+                this->pg_state_dirty.insert({ .pool_id = pool_id, .pg_num = pg_num });
+                pg.print_state();
                if (pg_cfg.cur_primary == this->osd_num)
                {
                    // Add peers
@ -565,7 +600,7 @@ void osd_t::apply_pg_config()
                            c_cli.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
                        }
                    }
-                start_pg_peering(pg_num);
+                    start_pg_peering(pg);
                }
                else
                {
@ -574,6 +609,7 @@ void osd_t::apply_pg_config()
                }
            }
        }
+    }
    report_pg_states();
    this->pg_config_applied = all_applied;
 }
@ -584,8 +620,7 @@ void osd_t::report_pg_states()
    {
        return;
    }
-    etcd_reporting_pg_state = true;
-    std::vector<std::pair<pg_num_t,bool>> reporting_pgs;
+    std::vector<std::pair<pool_pg_num_t,bool>> reporting_pgs;
    json11::Json::array checks;
    json11::Json::array success;
    json11::Json::array failure;
@ -597,8 +632,8 @@ void osd_t::report_pg_states()
            continue;
        }
        auto & pg = pg_it->second;
-        reporting_pgs.push_back({ pg.pg_num, pg.history_changed });
-        std::string state_key_base64 = base64_encode(st_cli.etcd_prefix+"/pg/state/"+std::to_string(pg.pg_num));
+        reporting_pgs.push_back({ *it, pg.history_changed });
+        std::string state_key_base64 = base64_encode(st_cli.etcd_prefix+"/pg/state/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num));
        if (pg.state == PG_STARTING)
        {
            // Check that the PG key does not exist
@ -640,7 +675,7 @@ void osd_t::report_pg_states()
            }
            success.push_back(json11::Json::object {
                { "request_put", json11::Json::object {
-                    { "key", base64_encode(st_cli.etcd_prefix+"/pg/state/"+std::to_string(pg.pg_num)) },
+                    { "key", state_key_base64 },
                    { "value", base64_encode(json11::Json(json11::Json::object {
                        { "primary", this->osd_num },
                        { "state", pg_state_keywords },
@ -651,28 +686,28 @@ void osd_t::report_pg_states()
            });
            if (pg.history_changed)
            {
+                // Prevent race conditions (for the case when the monitor is updating this key at the same time)
                pg.history_changed = false;
-                if (pg.state == PG_ACTIVE)
-                {
-                    success.push_back(json11::Json::object {
-                        { "request_delete_range", json11::Json::object {
-                            { "key", base64_encode(st_cli.etcd_prefix+"/pg/history/"+std::to_string(pg.pg_num)) },
-                        } }
+                std::string history_key = base64_encode(st_cli.etcd_prefix+"/pg/history/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num));
+                json11::Json::object history_value = {
+                    { "epoch", pg.epoch },
+                    { "all_peers", pg.all_peers },
+                    { "osd_sets", pg.target_history },
+                };
+                checks.push_back(json11::Json::object {
+                    { "target", "MOD" },
+                    { "key", history_key },
+                    { "result", "LESS" },
+                    { "mod_revision", st_cli.etcd_watch_revision+1 },
                });
-                }
-                else if (pg.state == (PG_ACTIVE|PG_LEFT_ON_DEAD))
-                {
                success.push_back(json11::Json::object {
                    { "request_put", json11::Json::object {
-                            { "key", base64_encode(st_cli.etcd_prefix+"/pg/history/"+std::to_string(pg.pg_num)) },
-                            { "value", base64_encode(json11::Json(json11::Json::object {
-                                { "all_peers", pg.all_peers },
-                            }).dump()) },
+                        { "key", history_key },
+                        { "value", base64_encode(json11::Json(history_value).dump()) },
                    } }
                });
            }
        }
-        }
        failure.push_back(json11::Json::object {
            { "request_range", json11::Json::object {
                { "key", state_key_base64 },
@ -680,6 +715,7 @@ void osd_t::report_pg_states()
        });
    }
    pg_state_dirty.clear();
+    etcd_reporting_pg_state = true;
    st_cli.etcd_txn(json11::Json::object {
        { "compare", checks }, { "success", success }, { "failure", failure }
    }, ETCD_QUICK_TIMEOUT, [this, reporting_pgs](std::string err, json11::Json data)
@ -705,17 +741,26 @@ void osd_t::report_pg_states()
                if (res["kvs"].array_items().size())
                {
                    auto kv = st_cli.parse_etcd_kv(res["kvs"][0]);
-                    pg_num_t pg_num = stoull_full(kv.key.substr(st_cli.etcd_prefix.length()+10));
-                    auto pg_it = pgs.find(pg_num);
+                    if (kv.key.substr(0, st_cli.etcd_prefix.length()+10) == st_cli.etcd_prefix+"/pg/state/")
+                    {
+                        pool_id_t pool_id = 0;
+                        pg_num_t pg_num = 0;
+                        char null_byte = 0;
+                        sscanf(kv.key.c_str() + st_cli.etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
+                        if (null_byte == 0)
+                        {
+                            auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
                            if (pg_it != pgs.end() && pg_it->second.state != PG_OFFLINE && pg_it->second.state != PG_STARTING)
                            {
                                // Live PG state update failed
-                        printf("Failed to report state of PG %u which is live. Race condition detected, exiting\n", pg_num);
+                                printf("Failed to report state of pool %u PG %u which is live. Race condition detected, exiting\n", pool_id, pg_num);
                                force_stop(1);
                                return;
                            }
                        }
                    }
+                }
+            }
            // Retry after a short pause (hope we'll get some updates and update PG states accordingly)
            tfd->set_timer(500, false, [this](int) { report_pg_states(); });
        }
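Reviewer sketch (not part of the patch): parsing a PG state key of the form <etcd_prefix>/pg/state/<pool_id>/<pg_num>, which is the layout the new report_pg_states() code writes. The helper name sketch_parse_pg_state_key is hypothetical, and "/vitastor" in the usage note is only an example prefix. The sketch checks the sscanf() return value instead of a sentinel byte: a result of exactly 2 means both numbers were read and nothing followed them.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

typedef uint32_t pool_id_t;
typedef uint32_t pg_num_t;

// Hypothetical helper, not present in the patch.
// Returns true and fills pool_id / pg_num when key is "<prefix>/pg/state/<pool>/<pg>".
inline bool sketch_parse_pg_state_key(const std::string & etcd_prefix, const std::string & key,
    pool_id_t & pool_id, pg_num_t & pg_num)
{
    const std::string base = etcd_prefix + "/pg/state/";
    // compare(0, n, base) is zero only when key starts with base
    if (key.compare(0, base.size(), base) != 0)
        return false;
    assert(base.size() <= key.size());
    unsigned pool = 0, pg = 0;
    char extra = 0;
    // %c is only assigned when some character follows the second number
    int n = sscanf(key.c_str() + base.size(), "%u/%u%c", &pool, &pg, &extra);
    if (n != 2)
        return false;
    pool_id = pool;
    pg_num = pg;
    return true;
}
```

For example, with the prefix "/vitastor" the key "/vitastor/pg/state/2/17" yields pool 2 and PG 17, while a key with trailing characters or a missing PG number is rejected.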
--- a/osd_flush.cpp
+++ b/osd_flush.cpp
@ -1,10 +1,12 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "osd.h"

 #define FLUSH_BATCH 512

-void osd_t::submit_pg_flush_ops(pg_num_t pg_num)
+void osd_t::submit_pg_flush_ops(pg_t & pg)
 {
-    pg_t & pg = pgs[pg_num];
    pg_flush_batch_t *fb = new pg_flush_batch_t();
    pg.flush_batch = fb;
    auto it = pg.flush_actions.begin(), prev_it = pg.flush_actions.begin();
@ -45,7 +47,7 @@ void osd_t::submit_pg_flush_ops(pg_num_t pg_num)
        if (l.second.size() > 0)
        {
            fb->flush_ops++;
-            submit_flush_op(pg.pg_num, fb, true, l.first, l.second.size(), l.second.data());
+            submit_flush_op(pg.pool_id, pg.pg_num, fb, true, l.first, l.second.size(), l.second.data());
        }
    }
    for (auto & l: fb->stable_lists)
@ -53,14 +55,15 @@ void osd_t::submit_pg_flush_ops(pg_num_t pg_num)
        if (l.second.size() > 0)
        {
            fb->flush_ops++;
-            submit_flush_op(pg.pg_num, fb, false, l.first, l.second.size(), l.second.data());
+            submit_flush_op(pg.pool_id, pg.pg_num, fb, false, l.first, l.second.size(), l.second.data());
        }
    }
 }

-void osd_t::handle_flush_op(bool rollback, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval)
+void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval)
 {
-    if (pgs.find(pg_num) == pgs.end() || pgs[pg_num].flush_batch != fb)
+    pool_pg_num_t pg_id = { .pool_id = pool_id, .pg_num = pg_num };
+    if (pgs.find(pg_id) == pgs.end() || pgs[pg_id].flush_batch != fb)
    {
        // Throw the result away
        return;
@ -92,7 +95,7 @@ void osd_t::handle_flush_op(bool rollback, pg_num_t pg_num, pg_flush_batch_t *fb
    {
        // This flush batch is done
        std::vector<osd_op_t*> continue_ops;
-        auto & pg = pgs[pg_num];
+        auto & pg = pgs[pg_id];
        auto it = pg.flush_actions.begin(), prev_it = it;
        auto erase_start = it;
        while (1)
@ -153,11 +156,11 @@ void osd_t::handle_flush_op(bool rollback, pg_num_t pg_num, pg_flush_batch_t *fb
    }
 }

-void osd_t::submit_flush_op(pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data)
+void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data)
 {
    osd_op_t *op = new osd_op_t();
    // Copy buffer so it gets freed along with the operation
-    op->buf = malloc(sizeof(obj_ver_id) * count);
+    op->buf = malloc_or_die(sizeof(obj_ver_id) * count);
    memcpy(op->buf, data, sizeof(obj_ver_id) * count);
    if (peer_osd == this->osd_num)
    {
@ -165,10 +168,10 @@ void osd_t::submit_flush_op(pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback
        clock_gettime(CLOCK_REALTIME, &op->tv_begin);
        op->bs_op = new blockstore_op_t({
            .opcode = (uint64_t)(rollback ? BS_OP_ROLLBACK : BS_OP_STABLE),
-            .callback = [this, op, pg_num, fb](blockstore_op_t *bs_op)
+            .callback = [this, op, pool_id, pg_num, fb](blockstore_op_t *bs_op)
            {
                add_bs_subop_stats(op);
-                handle_flush_op(bs_op->opcode == BS_OP_ROLLBACK, pg_num, fb, this->osd_num, bs_op->retval);
+                handle_flush_op(bs_op->opcode == BS_OP_ROLLBACK, pool_id, pg_num, fb, this->osd_num, bs_op->retval);
                delete op->bs_op;
                op->bs_op = NULL;
                delete op;
@ -183,22 +186,21 @@ void osd_t::submit_flush_op(pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback
        // Peer
        int peer_fd = c_cli.osd_peer_fds[peer_osd];
        op->op_type = OSD_OP_OUT;
-        op->send_list.push_back(op->req.buf, OSD_PACKET_SIZE);
-        op->send_list.push_back(op->buf, count * sizeof(obj_ver_id));
+        op->iov.push_back(op->buf, count * sizeof(obj_ver_id));
        op->peer_fd = peer_fd;
        op->req = {
            .sec_stab = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
-                    .opcode = (uint64_t)(rollback ? OSD_OP_SECONDARY_ROLLBACK : OSD_OP_SECONDARY_STABILIZE),
+                    .opcode = (uint64_t)(rollback ? OSD_OP_SEC_ROLLBACK : OSD_OP_SEC_STABILIZE),
                },
                .len = count * sizeof(obj_ver_id),
            },
        };
-        op->callback = [this, pg_num, fb, peer_osd](osd_op_t *op)
+        op->callback = [this, pool_id, pg_num, fb, peer_osd](osd_op_t *op)
        {
-            handle_flush_op(op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK, pg_num, fb, peer_osd, op->reply.hdr.retval);
+            handle_flush_op(op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK, pool_id, pg_num, fb, peer_osd, op->reply.hdr.retval);
            delete op;
        };
        c_cli.outbox_push(op);
@ -216,7 +218,6 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
                if (recovery_ops.find(obj_it->first) == recovery_ops.end())
                {
                    op.degraded = true;
-                    op.pg_num = pg_it->first;
                    op.oid = obj_it->first;
                    return true;
                }
@ -232,7 +233,6 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
                if (recovery_ops.find(obj_it->first) == recovery_ops.end())
                {
                    op.degraded = false;
-                    op.pg_num = pg_it->first;
                    op.oid = obj_it->first;
                    return true;
                }
--- a/osd_id.h
+++ b/osd_id.h
@ -1,4 +1,27 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

+#define POOL_SCHEME_REPLICATED 1
+#define POOL_SCHEME_XOR 2
+#define POOL_ID_MAX 0x10000
+#define POOL_ID_BITS 16
+#define INODE_POOL(inode) ((pool_id_t)((inode) >> (64 - POOL_ID_BITS)))
+
+// Pool ID is 16 bits long (the top POOL_ID_BITS bits of the inode number), stored in a 32-bit type
+typedef uint32_t pool_id_t;
+
 typedef uint64_t osd_num_t;
 typedef uint32_t pg_num_t;
+
+struct pool_pg_num_t
+{
+    pool_id_t pool_id;
+    pg_num_t pg_num;
+};
+
+inline bool operator < (const pool_pg_num_t & a, const pool_pg_num_t & b)
+{
+    return a.pool_id < b.pool_id || (a.pool_id == b.pool_id && a.pg_num < b.pg_num);
+}
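Reviewer sketch (not part of the patch): operator < orders pool_pg_num_t by pool first and PG number second, which is what lets the struct serve as a std::map or std::set key. A side effect is that all PGs of one pool are contiguous in the map, so a range scan can visit a single pool. The helper sketch_count_pool_pgs and the std::string value type are hypothetical; the real map stores pg_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

typedef uint32_t pool_id_t;
typedef uint32_t pg_num_t;

struct pool_pg_num_t
{
    pool_id_t pool_id;
    pg_num_t pg_num;
};

inline bool operator < (const pool_pg_num_t & a, const pool_pg_num_t & b)
{
    return a.pool_id < b.pool_id || (a.pool_id == b.pool_id && a.pg_num < b.pg_num);
}

// Hypothetical helper: count the PGs of one pool without scanning other pools
inline size_t sketch_count_pool_pgs(const std::map<pool_pg_num_t, std::string> & pgs, pool_id_t pool_id)
{
    // Pool IDs are limited to 16 bits (POOL_ID_MAX is 0x10000), so pool_id + 1 cannot overflow
    assert(pool_id < 0x10000);
    // Keys of one pool occupy the half-open range [{pool_id, 0}, {pool_id + 1, 0})
    auto begin = pgs.lower_bound({ pool_id, 0 });
    auto end = pgs.lower_bound({ pool_id + 1, 0 });
    size_t n = 0;
    for (auto it = begin; it != end; it++)
        n++;
    return n;
}
```

The same range trick could replace the full scan with a pool_id comparison in apply_pg_count(), although for small PG maps the difference is unlikely to matter.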
--- a/osd_main.cpp
+++ b/osd_main.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "osd.h"

 #include <signal.h>
@ -18,6 +21,8 @@ static void handle_sigint(int sig)

 int main(int narg, char *args[])
 {
+    setvbuf(stdout, NULL, _IONBF, 0);
+    setvbuf(stderr, NULL, _IONBF, 0);
    if (sizeof(osd_any_op_t) > OSD_PACKET_SIZE ||
        sizeof(osd_any_reply_t) > OSD_PACKET_SIZE)
    {
--- a/osd_ops.cpp
+++ b/osd_ops.cpp
@ -0,0 +1,22 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
+#include "osd_ops.h"
+
+const char* osd_op_names[] = {
+    "",
+    "read",
+    "write",
+    "write_stable",
+    "sync",
+    "stabilize",
+    "rollback",
+    "delete",
+    "sync_stab_all",
+    "list",
+    "show_config",
+    "primary_read",
+    "primary_write",
+    "primary_sync",
+    "primary_delete",
+};
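Reviewer sketch (not part of the patch): osd_op_names[] is indexed directly by opcode, so it has to stay in step with the OSD_OP_* numbering in osd_ops.h (15 entries for opcodes 0 to 14 after this change). The snippet below shows one way to guard that invariant at compile time and to look up a name safely. sketch_op_names and sketch_op_name are hypothetical names, and the "unknown" fallback is an assumption, not behavior taken from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#define OSD_OP_MIN 1
#define OSD_OP_MAX 14

// Copy of the table from osd_ops.cpp, renamed to keep the sketch standalone
static const char* sketch_op_names[] = {
    "", "read", "write", "write_stable", "sync", "stabilize", "rollback",
    "delete", "sync_stab_all", "list", "show_config",
    "primary_read", "primary_write", "primary_sync", "primary_delete",
};

// Fails to compile if an opcode is added without a matching name
static_assert(sizeof(sketch_op_names) / sizeof(sketch_op_names[0]) == OSD_OP_MAX + 1,
    "op name table must have one entry per opcode");

// Hypothetical helper: bounds-checked name lookup
inline const char* sketch_op_name(uint64_t opcode)
{
    if (opcode < OSD_OP_MIN || opcode > OSD_OP_MAX)
        return "unknown";
    return sketch_op_names[opcode];
}
```

Since this commit inserts OSD_OP_SEC_WRITE_STABLE at position 3 and shifts every later opcode by one, a check of this kind would catch a table that was not updated together with the defines.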
--- a/osd_ops.h
+++ b/osd_ops.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include "object_id.h"
@ -10,20 +13,21 @@
 #define OSD_PACKET_SIZE             0x80
 // Opcodes
 #define OSD_OP_MIN                  1
-#define OSD_OP_SECONDARY_READ       1
-#define OSD_OP_SECONDARY_WRITE      2
-#define OSD_OP_SECONDARY_SYNC       3
-#define OSD_OP_SECONDARY_STABILIZE  4
-#define OSD_OP_SECONDARY_ROLLBACK   5
-#define OSD_OP_SECONDARY_DELETE     6
-#define OSD_OP_TEST_SYNC_STAB_ALL   7
-#define OSD_OP_SECONDARY_LIST       8
-#define OSD_OP_SHOW_CONFIG          9
-#define OSD_OP_READ                 10
-#define OSD_OP_WRITE                11
-#define OSD_OP_SYNC                 12
-#define OSD_OP_DELETE               13
-#define OSD_OP_MAX                  13
+#define OSD_OP_SEC_READ             1
+#define OSD_OP_SEC_WRITE            2
+#define OSD_OP_SEC_WRITE_STABLE     3
+#define OSD_OP_SEC_SYNC             4
+#define OSD_OP_SEC_STABILIZE        5
+#define OSD_OP_SEC_ROLLBACK         6
+#define OSD_OP_SEC_DELETE           7
+#define OSD_OP_TEST_SYNC_STAB_ALL   8
+#define OSD_OP_SEC_LIST             9
+#define OSD_OP_SHOW_CONFIG          10
+#define OSD_OP_READ                 11
+#define OSD_OP_WRITE                12
+#define OSD_OP_SYNC                 13
+#define OSD_OP_DELETE               14
+#define OSD_OP_MAX                  14
 // Alignment & limit for read/write operations
 #ifndef MEM_ALIGNMENT
 #define MEM_ALIGNMENT               512
@ -134,7 +138,10 @@ struct __attribute__((__packed__)) osd_op_secondary_list_t
    osd_op_header_t header;
    // placement group total number and total count
    pg_num_t list_pg, pg_count;
+    // size of an area that maps to one PG continuously
    uint64_t pg_stripe_size;
+    // inode range (used to select pools)
+    uint64_t min_inode, max_inode;
 };

 struct __attribute__((__packed__)) osd_reply_secondary_list_t
@ -202,3 +209,5 @@ union osd_any_reply_t
    osd_reply_sync_t sync;
    uint8_t buf[OSD_PACKET_SIZE];
 };
+
+extern const char* osd_op_names[];
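Review note: the new `min_inode`/`max_inode` fields select a pool by inode range. A minimal sketch of how that range follows from the pool ID, assuming `POOL_ID_BITS` from osd_id.h and mirroring the expressions used in `submit_list_subop`; the function names `pool_min_inode` and `pool_max_inode` are hypothetical and do not exist in the patch.

```cpp
#include <cassert>
#include <cstdint>

#define POOL_ID_BITS 16
typedef uint32_t pool_id_t;

// First inode number that belongs to the pool (pool ID in the top 16 bits, the rest zero)
inline uint64_t pool_min_inode(pool_id_t pool_id)
{
    return (uint64_t)pool_id << (64 - POOL_ID_BITS);
}

// Last inode number that belongs to the pool (one less than the next pool's first inode).
// For the highest pool ID the shift wraps to 0 and the result is UINT64_MAX,
// which is still the correct upper bound because unsigned arithmetic is modular.
inline uint64_t pool_max_inode(pool_id_t pool_id)
{
    return ((uint64_t)(pool_id + 1) << (64 - POOL_ID_BITS)) - 1;
}
```

The ranges of consecutive pools are adjacent and do not overlap, so an inclusive `[min_inode, max_inode]` filter on the listing side is enough to separate pools.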
--- a/osd_peering.cpp
+++ b/osd_peering.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <netinet/tcp.h>
 #include <sys/epoll.h>

@ -26,7 +29,7 @@ void osd_t::handle_peers()
                    degraded_objects += p.second.degraded_objects.size();
                    if ((p.second.state & (PG_ACTIVE | PG_HAS_UNCLEAN)) == (PG_ACTIVE | PG_HAS_UNCLEAN))
                        peering_state = peering_state | OSD_FLUSHING_PGS;
-                    else
+                    else if (p.second.state & PG_ACTIVE)
                        peering_state = peering_state | OSD_RECOVERING;
                }
                else
@ -50,7 +53,7 @@ void osd_t::handle_peers()
            {
                if (!p.second.flush_batch)
                {
-                    submit_pg_flush_ops(p.first);
+                    submit_pg_flush_ops(p.second);
                }
                still = true;
            }
@ -89,16 +92,15 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
            {
                // Repeer this pg
                printf("[PG %u] Repeer because of OSD %lu\n", p.second.pg_num, peer_osd);
-                start_pg_peering(p.second.pg_num);
+                start_pg_peering(p.second);
            }
        }
    }
 }

 // Repeer on each connect/disconnect peer event
-void osd_t::start_pg_peering(pg_num_t pg_num)
+void osd_t::start_pg_peering(pg_t & pg)
 {
-    auto & pg = pgs[pg_num];
    pg.state = PG_PEERING;
    this->peering_state |= OSD_PEERING_PGS;
    report_pg_state(pg);
@ -123,16 +125,32 @@ void osd_t::start_pg_peering(pg_num_t pg_num)
        cancel_primary_write(p.second);
    }
    pg.write_queue.clear();
+    uint64_t pg_stripe_size = st_cli.pool_config[pg.pool_id].pg_stripe_size;
    for (auto it = unstable_writes.begin(); it != unstable_writes.end(); )
    {
        // Forget this PG's unstable writes
-        pg_num_t n = (it->first.oid.inode + it->first.oid.stripe / pg_stripe_size) % pg_count + 1;
-        if (n == pg.pg_num)
+        if (INODE_POOL(it->first.oid.inode) == pg.pool_id && map_to_pg(it->first.oid, pg_stripe_size) == pg.pg_num)
            unstable_writes.erase(it++);
        else
            it++;
    }
-    dirty_pgs.erase(pg.pg_num);
+    dirty_pgs.erase({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+    // Drop connections of clients who have this PG in dirty_pgs
+    if (immediate_commit != IMMEDIATE_ALL)
+    {
+        std::vector<int> to_stop;
+        for (auto & cp: c_cli.clients)
+        {
+            if (cp.second.dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) != cp.second.dirty_pgs.end())
+            {
+                to_stop.push_back(cp.first);
+            }
+        }
+        for (auto peer_fd: to_stop)
+        {
+            c_cli.stop_client(peer_fd);
+        }
+    }
    // Calculate current write OSD set
    pg.pg_cursize = 0;
    pg.cur_set.resize(pg.target_set.size());
@ -170,6 +188,7 @@ void osd_t::start_pg_peering(pg_num_t pg_num)
            {
                pg.state = PG_INCOMPLETE;
                report_pg_state(pg);
+                return;
            }
        }
    }
@ -177,6 +196,7 @@ void osd_t::start_pg_peering(pg_num_t pg_num)
    {
        pg.state = PG_INCOMPLETE;
        report_pg_state(pg);
+        return;
    }
    std::set<osd_num_t> cur_peers;
    for (auto pg_osd: pg.all_peers)
@ -233,6 +253,7 @@ void osd_t::start_pg_peering(pg_num_t pg_num)
    if (!pg.peering_state)
    {
        pg.peering_state = new pg_peering_state_t();
+        pg.peering_state->pool_id = pg.pool_id;
        pg.peering_state->pg_num = pg.pg_num;
    }
    for (osd_num_t peer_osd: cur_peers)
@ -287,14 +308,13 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
        auto & cl = c_cli.clients.at(c_cli.osd_peer_fds[role_osd]);
        osd_op_t *op = new osd_op_t();
        op->op_type = OSD_OP_OUT;
-        op->send_list.push_back(op->req.buf, OSD_PACKET_SIZE);
        op->peer_fd = cl.peer_fd;
        op->req = {
            .sec_sync = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
-                    .opcode = OSD_OP_SECONDARY_SYNC,
+                    .opcode = OSD_OP_SEC_SYNC,
                },
            },
        };
@ -329,8 +349,10 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
        clock_gettime(CLOCK_REALTIME, &op->tv_begin);
        op->bs_op = new blockstore_op_t();
        op->bs_op->opcode = BS_OP_LIST;
-        op->bs_op->oid.stripe = pg_stripe_size;
-        op->bs_op->len = pg_count;
+        op->bs_op->oid.stripe = st_cli.pool_config[ps->pool_id].pg_stripe_size;
+        op->bs_op->oid.inode = ((uint64_t)ps->pool_id << (64 - POOL_ID_BITS));
+        op->bs_op->version = ((uint64_t)(ps->pool_id+1) << (64 - POOL_ID_BITS)) - 1;
+        op->bs_op->len = pg_counts[ps->pool_id];
        op->bs_op->offset = ps->pg_num-1;
        op->bs_op->callback = [this, ps, op, role_osd](blockstore_op_t *bs_op)
        {
@ -361,18 +383,19 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
        // Peer
        osd_op_t *op = new osd_op_t();
        op->op_type = OSD_OP_OUT;
-        op->send_list.push_back(op->req.buf, OSD_PACKET_SIZE);
        op->peer_fd = c_cli.osd_peer_fds[role_osd];
        op->req = {
            .sec_list = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
-                    .opcode = OSD_OP_SECONDARY_LIST,
+                    .opcode = OSD_OP_SEC_LIST,
                },
                .list_pg = ps->pg_num,
-                .pg_count = pg_count,
-                .pg_stripe_size = pg_stripe_size,
+                .pg_count = pg_counts[ps->pool_id],
+                .pg_stripe_size = st_cli.pool_config[ps->pool_id].pg_stripe_size,
+                .min_inode = ((uint64_t)(ps->pool_id) << (64 - POOL_ID_BITS)),
+                .max_inode = ((uint64_t)(ps->pool_id+1) << (64 - POOL_ID_BITS)) - 1,
            },
        };
        op->callback = [this, ps, role_osd](osd_op_t *op)
@ -428,14 +451,8 @@ void osd_t::discard_list_subop(osd_op_t *list_op)
    }
 }

-bool osd_t::stop_pg(pg_num_t pg_num)
+bool osd_t::stop_pg(pg_t & pg)
 {
-    auto pg_it = pgs.find(pg_num);
-    if (pg_it == pgs.end())
-    {
-        return false;
-    }
-    auto & pg = pg_it->second;
    if (pg.peering_state)
    {
        // Stop peering
@ -478,7 +495,7 @@ void osd_t::finish_stop_pg(pg_t & pg)
 void osd_t::report_pg_state(pg_t & pg)
 {
    pg.print_state();
-    this->pg_state_dirty.insert(pg.pg_num);
+    this->pg_state_dirty.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
    if (pg.state == PG_ACTIVE && (pg.target_history.size() > 0 || pg.all_peers.size() > pg.target_set.size()))
    {
        // Clear history of active+clean PGs
--- a/osd_peering_pg.cpp
+++ b/osd_peering_pg.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "osd_peering_pg.h"

 struct obj_ver_role
@ -33,6 +36,7 @@ struct obj_piece_ver_t
 struct pg_obj_state_check_t
 {
    pg_t *pg;
+    bool replicated = false;
    std::vector<obj_ver_role> list;
    int list_pos;
    int obj_start = 0, obj_end = 0, ver_start = 0, ver_end = 0;
@ -41,7 +45,7 @@ struct pg_obj_state_check_t
    uint64_t last_ver = 0;
    uint64_t target_ver = 0;
    uint64_t n_copies = 0, has_roles = 0, n_roles = 0, n_stable = 0, n_mismatched = 0;
-    uint64_t n_unstable = 0, n_buggy = 0;
+    uint64_t n_unstable = 0, n_invalid = 0;
    pg_osd_set_t osd_set;
    int log_level;

@ -73,6 +77,12 @@ void pg_obj_state_check_t::walk()
    {
        finish_object();
    }
+    if (pg->state & PG_HAS_INVALID)
+    {
+        // Stop PGs with "invalid" objects
+        pg->state = PG_INCOMPLETE | PG_HAS_INVALID;
+        return;
+    }
    if (pg->pg_cursize < pg->pg_size)
    {
        pg->state |= PG_DEGRADED;
@ -92,7 +102,7 @@ void pg_obj_state_check_t::start_object()
    target_ver = 0;
    ver_start = list_pos;
    has_roles = n_copies = n_roles = n_stable = n_mismatched = 0;
-    n_unstable = n_buggy = 0;
+    n_unstable = n_invalid = 0;
 }

 void pg_obj_state_check_t::handle_version()
@ -111,11 +121,11 @@ void pg_obj_state_check_t::handle_version()
            has_roles = n_copies = n_roles = n_stable = n_mismatched = 0;
            last_ver = list[list_pos].version;
        }
-        int replica = (list[list_pos].oid.stripe & STRIPE_MASK);
+        unsigned replica = (list[list_pos].oid.stripe & STRIPE_MASK);
        n_copies++;
-        if (replica >= pg->pg_size)
+        if ((replicated && replica > 0) || replica >= pg->pg_size)
        {
-            n_buggy++;
+            n_invalid++;
        }
        else
        {
@ -123,6 +133,23 @@ void pg_obj_state_check_t::handle_version()
            {
                n_stable++;
            }
+            if (replicated)
+            {
+                int i;
+                for (i = 0; i < pg->cur_set.size(); i++)
+                {
+                    if (pg->cur_set[i] == list[list_pos].osd_num)
+                    {
+                        break;
+                    }
+                }
+                if (i == pg->cur_set.size())
+                {
+                    n_mismatched++;
+                }
+            }
+            else
+            {
                if (pg->cur_set[replica] != list[list_pos].osd_num)
                {
                    n_mismatched++;
@ -134,6 +161,7 @@ void pg_obj_state_check_t::handle_version()
                }
            }
        }
+    }
    if (!list[list_pos].is_stable)
    {
        n_unstable++;
@ -151,11 +179,14 @@ void pg_obj_state_check_t::finish_object()
    obj_end = list_pos;
    // Remember the decision
    uint64_t state = 0;
-    if (n_buggy > 0)
+    if (n_invalid > 0)
    {
-        state = OBJ_BUGGY;
-        // FIXME: bring pg offline
-        throw std::runtime_error("buggy object state");
+        // It's not allowed to change the replication scheme for a pool other than by recreating it
+        // So we must bring the PG offline
+        state = OBJ_INCOMPLETE;
+        pg->state |= PG_HAS_INVALID;
+        pg->total_count++;
+        return;
    }
    if (n_unstable > 0)
    {
@ -201,7 +232,7 @@ void pg_obj_state_check_t::finish_object()
    {
        return;
    }
-    if (n_roles < pg->pg_minsize)
+    if (!replicated && n_roles < pg->pg_minsize)
    {
        if (log_level > 1)
        {
@ -210,7 +241,7 @@ void pg_obj_state_check_t::finish_object()
        state = OBJ_INCOMPLETE;
        pg->state = pg->state | PG_HAS_INCOMPLETE;
    }
-    else if (n_roles < pg->pg_cursize)
+    else if ((replicated ? n_copies : n_roles) < pg->pg_cursize)
    {
        if (log_level > 1)
        {
@ -219,29 +250,31 @@ void pg_obj_state_check_t::finish_object()
        state = OBJ_DEGRADED;
        pg->state = pg->state | PG_HAS_DEGRADED;
    }
-    if (n_mismatched > 0)
+    else if (n_mismatched > 0)
    {
-        if (n_roles >= pg->pg_cursize && log_level > 1)
+        if (log_level > 1 && (replicated || n_roles >= pg->pg_cursize))
        {
            printf("Object is misplaced: inode=%lu stripe=%lu version=%lu/%lu\n", oid.inode, oid.stripe, target_ver, max_ver);
        }
        state |= OBJ_MISPLACED;
        pg->state = pg->state | PG_HAS_MISPLACED;
    }
-    if (log_level > 1 && (n_roles < pg->pg_cursize || n_mismatched > 0))
+    if (log_level > 1 && ((replicated ? n_copies : n_roles) < pg->pg_cursize || n_mismatched > 0))
    {
        if (log_level > 2)
        {
            for (int i = obj_start; i < obj_end; i++)
            {
-                printf("v%lu present on: osd %lu, role %ld%s\n", list[i].version, list[i].osd_num, (list[i].oid.stripe & STRIPE_MASK), list[i].is_stable ? " (stable)" : "");
+                printf("v%lu present on: osd %lu, role %ld%s\n", list[i].version, list[i].osd_num,
+                    (list[i].oid.stripe & STRIPE_MASK), list[i].is_stable ? " (stable)" : "");
            }
        }
        else
        {
            for (int i = ver_start; i < ver_end; i++)
            {
-                printf("Target version present on: osd %lu, role %ld%s\n", list[i].osd_num, (list[i].oid.stripe & STRIPE_MASK), list[i].is_stable ? " (stable)" : "");
+                printf("Target version present on: osd %lu, role %ld%s\n", list[i].osd_num,
+                    (list[i].oid.stripe & STRIPE_MASK), list[i].is_stable ? " (stable)" : "");
            }
        }
    }
@ -278,11 +311,14 @@ void pg_obj_state_check_t::finish_object()
                    .osd_num = list[i].osd_num,
                    .outdated = true,
                });
+                if (!(state & (OBJ_INCOMPLETE | OBJ_DEGRADED)))
+                {
                    state |= OBJ_MISPLACED;
                    pg->state = pg->state | PG_HAS_MISPLACED;
                }
            }
        }
+    }
    if (target_ver < max_ver)
    {
        pg->ver_override[oid] = target_ver;
@ -297,6 +333,23 @@ void pg_obj_state_check_t::finish_object()
        if (it == pg->state_dict.end())
        {
            std::vector<uint64_t> read_target;
+            if (replicated)
+            {
+                for (auto & o: osd_set)
+                {
+                    if (!o.outdated)
+                    {
+                        read_target.push_back(o.osd_num);
+                    }
+                }
+                while (read_target.size() < pg->pg_size)
+                {
+                    // FIXME: This is because we then use .data() and assume it's at least <pg_size> long
+                    read_target.push_back(0);
+                }
+            }
+            else
+            {
                read_target.resize(pg->pg_size);
                for (int i = 0; i < pg->pg_size; i++)
                {
@ -309,6 +362,7 @@ void pg_obj_state_check_t::finish_object()
                        read_target[o.role] = o.osd_num;
                    }
                }
+            }
            pg->state_dict[osd_set] = {
                .read_target = read_target,
                .osd_set = osd_set,
@ -343,7 +397,9 @@ void pg_t::calc_object_states(int log_level)
    pg_obj_state_check_t st;
    st.log_level = log_level;
    st.pg = this;
+    st.replicated = (this->scheme == POOL_SCHEME_REPLICATED);
    auto ps = peering_state;
+    epoch = 0;
    for (auto it: ps->list_results)
    {
        auto nstab = it.second.stable_count;
@ -354,6 +410,10 @@ void pg_t::calc_object_states(int log_level)
        obj_ver_id *ov = it.second.buf;
        for (uint64_t i = 0; i < n; i++, ov++)
        {
+            if ((ov->version >> (64-PG_EPOCH_BITS)) > epoch)
+            {
+                epoch = (ov->version >> (64-PG_EPOCH_BITS));
+            }
            st.list[start+i] = {
                .oid = ov->oid,
                .version = ov->version,
@ -369,12 +429,17 @@ void pg_t::calc_object_states(int log_level)
    std::sort(st.list.begin(), st.list.end());
    // Walk over it and check object states
    st.walk();
+    if (this->state & (PG_DEGRADED|PG_LEFT_ON_DEAD))
+    {
+        assert(epoch != ((1ul << PG_EPOCH_BITS)-1));
+        epoch++;
+    }
 }

 void pg_t::print_state()
 {
    printf(
-        "[PG %u] is %s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pg_num,
+        "[PG %u] is %s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pg_num,
        (state & PG_STARTING) ? "starting" : "",
        (state & PG_OFFLINE) ? "offline" : "",
        (state & PG_PEERING) ? "peering" : "",
@ -386,6 +451,8 @@ void pg_t::print_state()
        (state & PG_HAS_DEGRADED) ? " + has_degraded" : "",
        (state & PG_HAS_MISPLACED) ? " + has_misplaced" : "",
        (state & PG_HAS_UNCLEAN) ? " + has_unclean" : "",
+        (state & PG_HAS_INVALID) ? " + has_invalid" : "",
+        (state & PG_LEFT_ON_DEAD) ? " + left_on_dead" : "",
        total_count
    );
 }
--- a/osd_peering_pg.h
+++ b/osd_peering_pg.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <map>
 #include <unordered_map>
 #include <vector>
@ -9,6 +12,8 @@
 #include "osd_ops.h"
 #include "pg_states.h"

+#define PG_EPOCH_BITS 48
+
 struct pg_obj_loc_t
 {
    uint64_t role;
@ -42,6 +47,7 @@ struct pg_peering_state_t
    // osd_num -> list result
    std::unordered_map<osd_num_t, osd_op_t*> list_ops;
    std::unordered_map<osd_num_t, pg_list_result_t> list_results;
+    pool_id_t pool_id = 0;
    pg_num_t pg_num = 0;
 };

@ -69,9 +75,13 @@ struct pg_flush_batch_t
 struct pg_t
 {
    int state = 0;
-    uint64_t pg_cursize = 3, pg_size = 3, pg_minsize = 2;
-    pg_num_t pg_num;
+    uint64_t scheme = 0;
+    uint64_t pg_cursize = 0, pg_size = 0, pg_minsize = 0;
+    pool_id_t pool_id = 0;
+    pg_num_t pg_num = 0;
    uint64_t clean_count = 0, total_count = 0;
+    // epoch number - should increase with each non-clean activation of the PG
+    uint64_t epoch = 0, reported_epoch = 0;
    // target history and all potential peers
    std::vector<std::vector<osd_num_t>> target_history;
    std::vector<osd_num_t> all_peers;
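Review note: with `PG_EPOCH_BITS` set to 48, an object version carries the PG epoch in its upper 48 bits and a per-epoch counter in the lower 16 bits. The sketch below shows how the next write version can be derived under that layout, following the logic of `continue_primary_write` in this diff. The helper names are hypothetical, and the overflow assertion on the epoch from the patch is omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>

#define PG_EPOCH_BITS 48

// Upper PG_EPOCH_BITS of the version: the PG epoch
inline uint64_t version_epoch(uint64_t version)
{
    return version >> (64 - PG_EPOCH_BITS);
}

// Lower (64 - PG_EPOCH_BITS) bits of the version: the counter within the epoch.
// Note the parentheses: '-' binds tighter than '<<', so the shift must be wrapped.
inline uint64_t version_counter(uint64_t version)
{
    return version & ((UINT64_C(1) << (64 - PG_EPOCH_BITS)) - 1);
}

// Hypothetical helper: pick the version for the next write of an object
// whose current version is fact_ver, in a PG with epoch pg_epoch.
inline uint64_t next_version(uint64_t fact_ver, uint64_t & pg_epoch)
{
    const uint64_t counter_max = (UINT64_C(1) << (64 - PG_EPOCH_BITS)) - 1;
    if (version_epoch(fact_ver) < pg_epoch)
    {
        // The object was last written in an older epoch: start the new epoch at counter 1
        return (pg_epoch << (64 - PG_EPOCH_BITS)) | 1;
    }
    if (version_counter(fact_ver) == counter_max)
    {
        // The counter is exhausted: the increment below carries into the epoch bits,
        // so the PG epoch has to advance as well
        pg_epoch++;
    }
    return fact_ver + 1;
}
```

Versions written after a non-clean activation therefore always compare greater than any version written in an earlier epoch, regardless of the counter value.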
--- a/osd_peering_pg_test.cpp
+++ b/osd_peering_pg_test.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #define _LARGEFILE64_SOURCE

 #include "osd_peering_pg.h"
@ -28,7 +31,7 @@ int main(int argc, char *argv[])
    for (uint64_t osd_num = 1; osd_num <= 3; osd_num++)
    {
        pg_list_result_t r = {
-            .buf = (obj_ver_id*)malloc(sizeof(obj_ver_id) * 1024*1024*8),
+            .buf = (obj_ver_id*)malloc_or_die(sizeof(obj_ver_id) * 1024*1024*8),
            .total_count = 1024*1024*8,
            .stable_count = (uint64_t)(1024*1024*8 - (osd_num == 1 ? 10 : 0)),
        };
--- a/osd_primary.cpp
+++ b/osd_primary.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "osd_primary.h"

 // read: read directly or read paired stripe(s), reconstruct, return
@ -13,15 +16,17 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
 {
    // PG number is calculated from the offset
    // Our EC scheme stores data in fixed chunks equal to (K*block size)
-    // K = pg_minsize and will be a property of the inode. Not it's hardcoded (FIXME)
-    uint64_t pg_block_size = bs_block_size * 2;
+    // K = pg_minsize in case of EC/XOR, or 1 for replicated pools
+    pool_id_t pool_id = INODE_POOL(cur_op->req.rw.inode);
+    auto & pool_cfg = st_cli.pool_config[pool_id];
+    uint64_t pg_block_size = bs_block_size * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_minsize);
    object_id oid = {
        .inode = cur_op->req.rw.inode,
        // oid.stripe = starting offset of the parity stripe
        .stripe = (cur_op->req.rw.offset/pg_block_size)*pg_block_size,
    };
-    pg_num_t pg_num = (cur_op->req.rw.inode + oid.stripe/pg_stripe_size) % pg_count + 1;
-    auto pg_it = pgs.find(pg_num);
+    pg_num_t pg_num = (cur_op->req.rw.inode + oid.stripe/pool_cfg.pg_stripe_size) % pg_counts[pool_id] + 1;
+    auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
    if (pg_it == pgs.end() || !(pg_it->second.state & PG_ACTIVE))
    {
        // This OSD is not primary for this PG or the PG is inactive
@ -35,14 +40,16 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
        finish_op(cur_op, -EINVAL);
        return false;
    }
-    osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc(
-        sizeof(osd_primary_op_data_t) + sizeof(osd_rmw_stripe_t) * pg_it->second.pg_size, 1
+    osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc_or_die(
+        1, sizeof(osd_primary_op_data_t) + sizeof(osd_rmw_stripe_t) * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size)
    );
    op_data->pg_num = pg_num;
    op_data->oid = oid;
    op_data->stripes = ((osd_rmw_stripe_t*)(op_data+1));
+    op_data->scheme = pool_cfg.scheme;
    cur_op->op_data = op_data;
-    split_stripes(pg_it->second.pg_minsize, bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes);
+    split_stripes((pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_minsize),
+        bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes);
    pg_it->second.inflight++;
    return true;
 }
@ -86,8 +93,8 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
    if (op_data->st == 1)      goto resume_1;
    else if (op_data->st == 2) goto resume_2;
    {
-        auto & pg = pgs[op_data->pg_num];
-        for (int role = 0; role < pg.pg_minsize; role++)
+        auto & pg = pgs[{ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num }];
+        for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_minsize); role++)
        {
            op_data->stripes[role].read_start = op_data->stripes[role].req_start;
            op_data->stripes[role].read_end = op_data->stripes[role].req_end;
@ -95,12 +102,13 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
        // Determine version
        auto vo_it = pg.ver_override.find(op_data->oid);
        op_data->target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
-        if (pg.state == PG_ACTIVE)
+        if (pg.state == PG_ACTIVE || op_data->scheme == POOL_SCHEME_REPLICATED)
        {
            // Fast happy-path
-            cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_minsize, 0);
-            submit_primary_subops(SUBMIT_READ, pg.pg_minsize, pg.cur_set.data(), cur_op);
-            cur_op->send_list.push_back(cur_op->buf, cur_op->req.rw.len);
+            cur_op->buf = alloc_read_buffer(op_data->stripes,
+                (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_minsize), 0);
+            submit_primary_subops(SUBMIT_READ, op_data->target_ver,
+                (op_data->scheme == POOL_SCHEME_REPLICATED ? pg.pg_size : pg.pg_minsize), pg.cur_set.data(), cur_op);
            op_data->st = 1;
        }
        else
@ -117,7 +125,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
            op_data->pg_size = pg.pg_size;
            op_data->degraded = 1;
            cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_size, 0);
-            submit_primary_subops(SUBMIT_READ, pg.pg_size, cur_set, cur_op);
+            submit_primary_subops(SUBMIT_READ, op_data->target_ver, pg.pg_size, cur_set, cur_op);
            op_data->st = 1;
        }
    }
@ -138,18 +146,22 @@ resume_2:
        {
            if (stripes[role].read_end != 0 && stripes[role].missing)
            {
-                reconstruct_stripe(stripes, op_data->pg_size, role);
+                reconstruct_stripe_xor(stripes, op_data->pg_size, role);
            }
            if (stripes[role].req_end != 0)
            {
                // Send buffer in parts to avoid copying
-                cur_op->send_list.push_back(
+                cur_op->iov.push_back(
                    stripes[role].read_buf + (stripes[role].req_start - stripes[role].read_start),
                    stripes[role].req_end - stripes[role].req_start
                );
            }
        }
    }
+    else
+    {
+        cur_op->iov.push_back(cur_op->buf, cur_op->req.rw.len);
+    }
    finish_op(cur_op, cur_op->req.rw.len);
 }

@ -187,7 +199,7 @@ void osd_t::continue_primary_write(osd_op_t *cur_op)
        return;
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
-    auto & pg = pgs[op_data->pg_num];
+    auto & pg = pgs[{ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num }];
    if (op_data->st == 1)      goto resume_1;
    else if (op_data->st == 2) goto resume_2;
    else if (op_data->st == 3) goto resume_3;
@ -197,6 +209,7 @@ void osd_t::continue_primary_write(osd_op_t *cur_op)
    else if (op_data->st == 7) goto resume_7;
    else if (op_data->st == 8) goto resume_8;
    else if (op_data->st == 9) goto resume_9;
+    else if (op_data->st == 10) goto resume_10;
    assert(op_data->st == 0);
    if (!check_write_queue(cur_op, pg))
    {
@ -205,12 +218,30 @@ void osd_t::continue_primary_write(osd_op_t *cur_op)
 resume_1:
    // Determine blocks to read and write
    // Missing chunks are allowed to be overwritten even in incomplete objects
-    // FIXME: Allow to do small writes to the old (degraded/misplaced) OSD set for the lower performance impact
+    // FIXME: Allow to do small writes to the old (degraded/misplaced) OSD set for lower performance impact
    op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
+    if (op_data->scheme == POOL_SCHEME_REPLICATED)
+    {
+        // Simplified algorithm
+        op_data->stripes[0].write_start = op_data->stripes[0].req_start;
+        op_data->stripes[0].write_end = op_data->stripes[0].req_end;
+        op_data->stripes[0].write_buf = cur_op->buf;
+        if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
+            op_data->stripes[0].write_end != bs_block_size))
+        {
+            // Object is degraded/misplaced and will be moved to <write_osd_set>
+            op_data->stripes[0].read_start = 0;
+            op_data->stripes[0].read_end = bs_block_size;
+            cur_op->rmw_buf = op_data->stripes[0].read_buf = memalign_or_die(MEM_ALIGNMENT, bs_block_size);
+        }
+    }
+    else
+    {
        cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
            pg.pg_size, pg.pg_minsize, pg.pg_cursize, pg.cur_set.data(), bs_block_size);
+    }
    // Read required blocks
-    submit_primary_subops(SUBMIT_RMW_READ, pg.pg_size, op_data->prev_set, cur_op);
+    submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
 resume_2:
    op_data->st = 2;
    return;
@ -222,10 +253,56 @@ resume_3:
    }
    // Save version override for parallel reads
    pg.ver_override[op_data->oid] = op_data->fact_ver;
+    if (op_data->scheme == POOL_SCHEME_REPLICATED)
+    {
+        // Only (possibly) copy new data from the request into the recovery buffer
+        if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
+            op_data->stripes[0].write_end != bs_block_size))
+        {
+            memcpy(
+                op_data->stripes[0].read_buf + op_data->stripes[0].req_start,
+                op_data->stripes[0].write_buf,
+                op_data->stripes[0].req_end - op_data->stripes[0].req_start
+            );
+            op_data->stripes[0].write_buf = op_data->stripes[0].read_buf;
+            op_data->stripes[0].write_start = 0;
+            op_data->stripes[0].write_end = bs_block_size;
+        }
+    }
+    else
+    {
        // Recover missing stripes, calculate parity
-    calc_rmw_parity(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size);
+        calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size);
+    }
    // Send writes
-    submit_primary_subops(SUBMIT_WRITE, pg.pg_size, pg.cur_set.data(), cur_op);
+    if ((op_data->fact_ver >> (64-PG_EPOCH_BITS)) < pg.epoch)
+    {
+        op_data->target_ver = ((uint64_t)pg.epoch << (64-PG_EPOCH_BITS)) | 1;
+    }
+    else
+    {
+        if ((op_data->fact_ver & ((1ul<<(64-PG_EPOCH_BITS)) - 1)) == ((1ul<<(64-PG_EPOCH_BITS)) - 1))
+        {
+            assert(pg.epoch != ((1ul << PG_EPOCH_BITS)-1));
+            pg.epoch++;
+        }
+        op_data->target_ver = op_data->fact_ver + 1;
+    }
+    if (pg.epoch > pg.reported_epoch)
+    {
+        // Report newer epoch before writing
+        // FIXME: We may report only one PG state here...
+        this->pg_state_dirty.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+        pg.history_changed = true;
+        report_pg_states();
+resume_10:
+        if (pg.epoch > pg.reported_epoch)
+        {
+            op_data->st = 10;
+            return;
+        }
+    }
+    submit_primary_subops(SUBMIT_WRITE, op_data->target_ver, pg.pg_size, pg.cur_set.data(), cur_op);
 resume_4:
    op_data->st = 4;
    return;
@ -251,7 +328,7 @@ resume_5:
                recovery_stat_count[0][recovery_type]++;
                recovery_stat_bytes[0][recovery_type] = 0;
            }
-            for (int role = 0; role < pg.pg_size; role++)
+            for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
            {
                recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
            }
@ -259,7 +336,7 @@ resume_5:
        if (op_data->object_state->state & OBJ_MISPLACED)
        {
            // Remove extra chunks
-            submit_primary_del_subops(cur_op, pg.cur_set.data(), op_data->object_state->osd_set);
+            submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set);
            if (op_data->n_subops > 0)
            {
 resume_8:
@ -316,6 +393,9 @@ bool osd_t::remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t &
    }
    if (immediate_commit == IMMEDIATE_ALL)
    {
+        if (op_data->scheme != POOL_SCHEME_REPLICATED)
+        {
+            // Send STABILIZE ops immediately
            op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
            op_data->unstable_writes = new obj_ver_id[loc_set.size()];
            {
@ -353,11 +433,15 @@ resume_7:
                return false;
            }
        }
+    }
    else
    {
-        // Remember version as unstable
+        if (op_data->scheme != POOL_SCHEME_REPLICATED)
+        {
+            // Remember version as unstable for EC/XOR
            for (auto & chunk: loc_set)
            {
+                this->dirty_osds.insert(chunk.osd_num);
                this->unstable_writes[(osd_object_id_t){
                    .osd_num = chunk.osd_num,
                    .oid = {
@ -366,10 +450,19 @@ resume_7:
                    },
                }] = op_data->fact_ver;
            }
+        }
+        else
+        {
+            // Only remember to sync OSDs for replicated pools
+            for (auto & chunk: loc_set)
+            {
+                this->dirty_osds.insert(chunk.osd_num);
+            }
+        }
        // Remember PG as dirty to drop the connection when PG goes offline
        // (this is required because of the "lazy sync")
-        c_cli.clients[cur_op->peer_fd].dirty_pgs.insert(op_data->pg_num);
-        dirty_pgs.insert(op_data->pg_num);
+        c_cli.clients[cur_op->peer_fd].dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+        dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
    }
    return true;
 }
@ -379,7 +472,7 @@ void osd_t::continue_primary_sync(osd_op_t *cur_op)
 {
    if (!cur_op->op_data)
    {
-        cur_op->op_data = (osd_primary_op_data_t*)calloc(sizeof(osd_primary_op_data_t), 1);
+        cur_op->op_data = (osd_primary_op_data_t*)calloc_or_die(1, sizeof(osd_primary_op_data_t));
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
    if (op_data->st == 1)      goto resume_1;
@ -403,7 +496,7 @@ resume_1:
        syncs_in_progress.push_back(cur_op);
    }
 resume_2:
-    if (unstable_writes.size() == 0)
+    if (dirty_osds.size() == 0)
    {
        // Nothing to sync
        goto finish;
@ -411,11 +504,10 @@ resume_2:
    // Save and clear unstable_writes
    // In theory it is possible to do it on a per-client basis, but this seems to be an unnecessary complication
    // It would be cool not to copy these here at all, but someone has to deduplicate them by object IDs anyway
+    if (unstable_writes.size() > 0)
    {
        op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
        op_data->unstable_writes = new obj_ver_id[this->unstable_writes.size()];
-        op_data->dirty_pgs = new pg_num_t[dirty_pgs.size()];
-        op_data->dirty_pg_count = dirty_pgs.size();
        osd_num_t last_osd = 0;
        int last_start = 0, last_end = 0;
        for (auto it = this->unstable_writes.begin(); it != this->unstable_writes.end(); it++)
@ -447,6 +539,14 @@ resume_2:
                .len = last_end - last_start,
            });
        }
+        this->unstable_writes.clear();
+    }
+    {
+        void *dirty_buf = malloc_or_die(sizeof(pool_pg_num_t)*dirty_pgs.size() + sizeof(osd_num_t)*dirty_osds.size());
+        op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
+        op_data->dirty_osds = (osd_num_t*)((uint8_t*)dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
+        op_data->dirty_pg_count = dirty_pgs.size();
+        op_data->dirty_osd_count = dirty_osds.size();
        int dpg = 0;
        for (auto dirty_pg_num: dirty_pgs)
        {
@ -454,7 +554,12 @@ resume_2:
            op_data->dirty_pgs[dpg++] = dirty_pg_num;
        }
        dirty_pgs.clear();
-        this->unstable_writes.clear();
+        dpg = 0;
+        for (auto osd_num: dirty_osds)
+        {
+            op_data->dirty_osds[dpg++] = osd_num;
+        }
+        dirty_osds.clear();
    }
    if (immediate_commit != IMMEDIATE_ALL)
    {
@ -469,13 +574,27 @@ resume_4:
            goto resume_6;
        }
    }
-    // Stabilize version sets
+    if (op_data->unstable_writes)
+    {
+        // Stabilize version sets, if any
        submit_primary_stab_subops(cur_op);
 resume_5:
        op_data->st = 5;
        return;
+    }
 resume_6:
    if (op_data->errors > 0)
+    {
+        // Return PGs and OSDs back into their dirty sets
+        for (int i = 0; i < op_data->dirty_pg_count; i++)
+        {
+            dirty_pgs.insert(op_data->dirty_pgs[i]);
+        }
+        for (int i = 0; i < op_data->dirty_osd_count; i++)
+        {
+            dirty_osds.insert(op_data->dirty_osds[i]);
+        }
+        if (op_data->unstable_writes)
        {
            // Return objects back into the unstable write set
            for (auto unstable_osd: *(op_data->unstable_write_osds))
@ -484,7 +603,10 @@ resume_6:
                {
                    // Except those from peered PGs
                    auto & w = op_data->unstable_writes[i];
-                pg_num_t wpg = map_to_pg(w.oid);
+                    pool_pg_num_t wpg = {
+                        .pool_id = INODE_POOL(w.oid.inode),
+                        .pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
+                    };
                    if (pgs[wpg].state & PG_ACTIVE)
                    {
                        uint64_t & dest = this->unstable_writes[(osd_object_id_t){
@ -497,6 +619,7 @@ resume_6:
                }
            }
        }
+    }
    for (int i = 0; i < op_data->dirty_pg_count; i++)
    {
        auto & pg = pgs.at(op_data->dirty_pgs[i]);
@ -507,11 +630,16 @@ resume_6:
        }
    }
    // FIXME: Free those in the destructor?
-    delete op_data->dirty_pgs;
+    free(op_data->dirty_pgs);
+    op_data->dirty_pgs = NULL;
+    op_data->dirty_osds = NULL;
+    if (op_data->unstable_writes)
+    {
        delete op_data->unstable_write_osds;
        delete[] op_data->unstable_writes;
        op_data->unstable_writes = NULL;
        op_data->unstable_write_osds = NULL;
+    }
    if (op_data->errors > 0)
    {
        finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
@ -590,7 +718,7 @@ void osd_t::continue_primary_del(osd_op_t *cur_op)
        return;
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
-    auto & pg = pgs[op_data->pg_num];
+    auto & pg = pgs[{ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num }];
    if (op_data->st == 1)      goto resume_1;
    else if (op_data->st == 2) goto resume_2;
    else if (op_data->st == 3) goto resume_3;
@ -611,7 +739,7 @@ resume_1:
    // Determine which OSDs contain this object and delete it
    op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
    // Submit 1 read to determine the actual version number
-    submit_primary_subops(SUBMIT_RMW_READ, pg.pg_size, op_data->prev_set, cur_op);
+    submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
 resume_2:
    op_data->st = 2;
    return;
@ -625,7 +753,7 @@ resume_3:
    pg.ver_override[op_data->oid] = op_data->fact_ver;
    // Submit deletes
    op_data->fact_ver++;
-    submit_primary_del_subops(cur_op, NULL, op_data->object_state ? op_data->object_state->osd_set : pg.cur_loc_set);
+    submit_primary_del_subops(cur_op, NULL, 0, op_data->object_state ? op_data->object_state->osd_set : pg.cur_loc_set);
 resume_4:
    op_data->st = 4;
    return;
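Note on the sync path in osd_primary.cpp above: dirty PGs and dirty OSDs are copied into a single allocation, put back into their sets if the sync fails, and released with one free(). The following is only a minimal standalone sketch of that pattern, not the OSD code itself. The pool_pg_num_t layout, dirty_snapshot_t and the three helper functions are hypothetical stand-ins introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <set>

// Hypothetical stand-ins for the real OSD types; the field layout is an assumption
typedef uint64_t osd_num_t;
struct pool_pg_num_t
{
    uint32_t pool_id;
    uint32_t pg_num;
};

inline bool operator < (const pool_pg_num_t & a, const pool_pg_num_t & b)
{
    return a.pool_id < b.pool_id || (a.pool_id == b.pool_id && a.pg_num < b.pg_num);
}

// Hypothetical holder mirroring the four sync-related fields of osd_primary_op_data_t
struct dirty_snapshot_t
{
    pool_pg_num_t *dirty_pgs = NULL;
    int dirty_pg_count = 0;
    osd_num_t *dirty_osds = NULL;
    int dirty_osd_count = 0;
};

// Move both dirty sets into one heap block laid out as [ PGs | OSDs ]
dirty_snapshot_t take_dirty_snapshot(std::set<pool_pg_num_t> & pgs, std::set<osd_num_t> & osds)
{
    // The OSD array starts right after the PG array, so the offset must keep its alignment
    static_assert(sizeof(pool_pg_num_t) % alignof(osd_num_t) == 0, "OSD array would be misaligned");
    dirty_snapshot_t s;
    size_t pg_bytes = sizeof(pool_pg_num_t) * pgs.size();
    size_t total = pg_bytes + sizeof(osd_num_t) * osds.size();
    // Use a byte pointer: arithmetic on void* is a GNU extension, not standard C++
    uint8_t *buf = (uint8_t*)malloc(total > 0 ? total : 1);
    if (!buf)
        abort();
    s.dirty_pgs = (pool_pg_num_t*)buf;
    s.dirty_osds = (osd_num_t*)(buf + pg_bytes);
    s.dirty_pg_count = (int)pgs.size();
    s.dirty_osd_count = (int)osds.size();
    int i = 0;
    for (auto & pg: pgs)
        s.dirty_pgs[i++] = pg;
    assert(i == s.dirty_pg_count);
    i = 0;
    for (auto osd_num: osds)
        s.dirty_osds[i++] = osd_num;
    assert(i == s.dirty_osd_count);
    pgs.clear();
    osds.clear();
    return s;
}

// On a failed sync, return everything to the dirty sets so the next sync retries it
void restore_dirty_snapshot(const dirty_snapshot_t & s, std::set<pool_pg_num_t> & pgs, std::set<osd_num_t> & osds)
{
    for (int i = 0; i < s.dirty_pg_count; i++)
        pgs.insert(s.dirty_pgs[i]);
    for (int i = 0; i < s.dirty_osd_count; i++)
        osds.insert(s.dirty_osds[i]);
}

// Both arrays live in one block, so only the first pointer is passed to free()
void free_dirty_snapshot(dirty_snapshot_t & s)
{
    free(s.dirty_pgs);
    s.dirty_pgs = NULL;
    s.dirty_osds = NULL;
    s.dirty_pg_count = s.dirty_osd_count = 0;
}
```

Packing both arrays into one block saves an allocation per sync, at the cost of having to remember that dirty_osds must never be freed on its own.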
--- a/osd_primary.h
+++ b/osd_primary.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #include "osd.h"
@ -20,6 +23,7 @@ struct osd_primary_op_data_t
    object_id oid;
    uint64_t target_ver;
    uint64_t fact_ver = 0;
+    uint64_t scheme = 0;
    int n_subops = 0, done = 0, errors = 0, epipe = 0;
    int degraded = 0, pg_size, pg_minsize;
    osd_rmw_stripe_t *stripes;
@ -29,7 +33,9 @@ struct osd_primary_op_data_t

    // for sync. oops, requires freeing
    std::vector<unstable_osd_num_t> *unstable_write_osds = NULL;
-    pg_num_t *dirty_pgs = NULL;
+    pool_pg_num_t *dirty_pgs = NULL;
    int dirty_pg_count = 0;
+    osd_num_t *dirty_osds = NULL;
+    int dirty_osd_count = 0;
    obj_ver_id *unstable_writes = NULL;
 };
--- a/osd_primary_subops.cpp
+++ b/osd_primary_subops.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "osd_primary.h"

 void osd_t::autosync()
@ -37,7 +40,7 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
    {
        if (cur_op->op_data->pg_num > 0)
        {
-            auto & pg = pgs[cur_op->op_data->pg_num];
+            auto & pg = pgs[{ .pool_id = INODE_POOL(cur_op->op_data->oid.inode), .pg_num = cur_op->op_data->pg_num }];
            pg.inflight--;
            assert(pg.inflight >= 0);
            if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
@ -76,11 +79,12 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
    }
 }

-void osd_t::submit_primary_subops(int submit_type, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
+void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
 {
-    bool w = submit_type == SUBMIT_WRITE;
+    bool wr = submit_type == SUBMIT_WRITE;
    osd_primary_op_data_t *op_data = cur_op->op_data;
    osd_rmw_stripe_t *stripes = op_data->stripes;
+    bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
    // Allocate subops
    int n_subops = 0, zero_read = -1;
    for (int role = 0; role < pg_size; role++)
@ -89,12 +93,12 @@ void osd_t::submit_primary_subops(int submit_type, int pg_size, const uint64_t*
        {
            zero_read = role;
        }
-        if (osd_set[role] != 0 && (w || stripes[role].read_end != 0))
+        if (osd_set[role] != 0 && (wr || (!rep && stripes[role].read_end != 0)))
        {
            n_subops++;
        }
    }
-    if (!n_subops && submit_type == SUBMIT_RMW_READ)
+    if (!n_subops && (submit_type == SUBMIT_RMW_READ || rep))
    {
        n_subops = 1;
    }
@ -102,7 +106,6 @@ void osd_t::submit_primary_subops(int submit_type, int pg_size, const uint64_t*
    {
        zero_read = -1;
    }
-    uint64_t op_version = w ? op_data->fact_ver+1 : (submit_type == SUBMIT_RMW_READ ? UINT64_MAX : op_data->target_ver);
    osd_op_t *subops = new osd_op_t[n_subops];
    op_data->fact_ver = 0;
    op_data->done = op_data->errors = 0;
@ -112,36 +115,37 @@ void osd_t::submit_primary_subops(int submit_type, int pg_size, const uint64_t*
    for (int role = 0; role < pg_size; role++)
    {
        // We always submit zero-length writes to all replicas, even if the stripe is not modified
-        if (!(w || stripes[role].read_end != 0 || zero_read == role))
+        if (!(wr || (!rep && stripes[role].read_end != 0) || zero_read == role))
        {
            continue;
        }
        osd_num_t role_osd_num = osd_set[role];
        if (role_osd_num != 0)
        {
+            int stripe_num = rep ? 0 : role;
            if (role_osd_num == this->osd_num)
            {
                clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
                subops[i].op_type = (uint64_t)cur_op;
                subops[i].bs_op = new blockstore_op_t({
-                    .opcode = (uint64_t)(w ? BS_OP_WRITE : BS_OP_READ),
+                    .opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ),
                    .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
                    {
                        handle_primary_bs_subop(subop);
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
-                        .stripe = op_data->oid.stripe | role,
+                        .stripe = op_data->oid.stripe | stripe_num,
                    },
                    .version = op_version,
-                    .offset = w ? stripes[role].write_start : stripes[role].read_start,
-                    .len = w ? stripes[role].write_end - stripes[role].write_start : stripes[role].read_end - stripes[role].read_start,
-                    .buf = w ? stripes[role].write_buf : stripes[role].read_buf,
+                    .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
+                    .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
+                    .buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
                });
 #ifdef OSD_DEBUG
                printf(
-                    "Submit %s to local: %lu:%lu v%lu %u-%u\n", w ? "write" : "read",
-                    op_data->oid.inode, op_data->oid.stripe | role, op_version,
+                    "Submit %s to local: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read",
+                    op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
                    subops[i].bs_op->offset, subops[i].bs_op->len
                );
 #endif
@ -150,40 +154,46 @@ void osd_t::submit_primary_subops(int submit_type, int pg_size, const uint64_t*
            else
            {
                subops[i].op_type = OSD_OP_OUT;
-                subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
                subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
                subops[i].req.sec_rw = {
                    .header = {
                        .magic = SECONDARY_OSD_OP_MAGIC,
                        .id = c_cli.next_subop_id++,
-                        .opcode = (uint64_t)(w ? OSD_OP_SECONDARY_WRITE : OSD_OP_SECONDARY_READ),
+                        .opcode = (uint64_t)(wr ? (rep ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
-                        .stripe = op_data->oid.stripe | role,
+                        .stripe = op_data->oid.stripe | stripe_num,
                    },
                    .version = op_version,
-                    .offset = w ? stripes[role].write_start : stripes[role].read_start,
-                    .len = w ? stripes[role].write_end - stripes[role].write_start : stripes[role].read_end - stripes[role].read_start,
+                    .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
+                    .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
                };
 #ifdef OSD_DEBUG
                printf(
-                    "Submit %s to osd %lu: %lu:%lu v%lu %u-%u\n", w ? "write" : "read", role_osd_num,
-                    op_data->oid.inode, op_data->oid.stripe | role, op_version,
+                    "Submit %s to osd %lu: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read", role_osd_num,
+                    op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
                    subops[i].req.sec_rw.offset, subops[i].req.sec_rw.len
                );
 #endif
-                subops[i].buf = w ? stripes[role].write_buf : stripes[role].read_buf;
-                if (w && stripes[role].write_end > 0)
+                if (wr)
                {
-                    subops[i].send_list.push_back(stripes[role].write_buf, stripes[role].write_end - stripes[role].write_start);
+                    if (stripes[stripe_num].write_end > stripes[stripe_num].write_start)
+                    {
+                        subops[i].iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
+                    }
+                }
+                else
+                {
+                    if (stripes[stripe_num].read_end > stripes[stripe_num].read_start)
+                    {
+                        subops[i].iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
+                    }
                }
                subops[i].callback = [cur_op, this](osd_op_t *subop)
                {
-                    int fail_fd = subop->req.hdr.opcode == OSD_OP_SECONDARY_WRITE &&
+                    int fail_fd = subop->req.hdr.opcode == OSD_OP_SEC_WRITE &&
                        subop->reply.hdr.retval != subop->req.sec_rw.len ? subop->peer_fd : -1;
-                    // so it doesn't get freed
-                    subop->buf = NULL;
                    handle_primary_subop(subop, cur_op);
                    if (fail_fd >= 0)
                    {
@ -200,21 +210,23 @@ void osd_t::submit_primary_subops(int submit_type, int pg_size, const uint64_t*

 static uint64_t bs_op_to_osd_op[] = {
    0,
-    OSD_OP_SECONDARY_READ,      // BS_OP_READ
-    OSD_OP_SECONDARY_WRITE,     // BS_OP_WRITE
-    OSD_OP_SECONDARY_SYNC,      // BS_OP_SYNC
-    OSD_OP_SECONDARY_STABILIZE, // BS_OP_STABLE
-    OSD_OP_SECONDARY_DELETE,    // BS_OP_DELETE
-    OSD_OP_SECONDARY_LIST,      // BS_OP_LIST
-    OSD_OP_SECONDARY_ROLLBACK,  // BS_OP_ROLLBACK
-    OSD_OP_TEST_SYNC_STAB_ALL,  // BS_OP_SYNC_STAB_ALL
+    OSD_OP_SEC_READ,            // BS_OP_READ = 1
+    OSD_OP_SEC_WRITE,           // BS_OP_WRITE = 2
+    OSD_OP_SEC_WRITE_STABLE,    // BS_OP_WRITE_STABLE = 3
+    OSD_OP_SEC_SYNC,            // BS_OP_SYNC = 4
+    OSD_OP_SEC_STABILIZE,       // BS_OP_STABLE = 5
+    OSD_OP_SEC_DELETE,          // BS_OP_DELETE = 6
+    OSD_OP_SEC_LIST,            // BS_OP_LIST = 7
+    OSD_OP_SEC_ROLLBACK,        // BS_OP_ROLLBACK = 8
+    OSD_OP_TEST_SYNC_STAB_ALL,  // BS_OP_SYNC_STAB_ALL = 9
 };

 void osd_t::handle_primary_bs_subop(osd_op_t *subop)
 {
    osd_op_t *cur_op = (osd_op_t*)subop->op_type;
    blockstore_op_t *bs_op = subop->bs_op;
-    int expected = bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE ? bs_op->len : 0;
+    int expected = bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE
+        || bs_op->opcode == BS_OP_WRITE_STABLE ? bs_op->len : 0;
    if (bs_op->retval != expected && bs_op->opcode != BS_OP_READ)
    {
        // die
@ -226,7 +238,7 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
    add_bs_subop_stats(subop);
    subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
    subop->reply.hdr.retval = bs_op->retval;
-    if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE)
+    if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
    {
        subop->req.sec_rw.len = bs_op->len;
        subop->reply.sec_rw.version = bs_op->version;
@ -253,7 +265,7 @@ void osd_t::add_bs_subop_stats(osd_op_t *subop)
        (tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
        (tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
    );
-    if (opcode == OSD_OP_SECONDARY_READ || opcode == OSD_OP_SECONDARY_WRITE)
+    if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
    {
        c_cli.stats.op_stat_bytes[opcode] += subop->bs_op->len;
    }
@ -263,8 +275,8 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
 {
    uint64_t opcode = subop->req.hdr.opcode;
    int retval = subop->reply.hdr.retval;
-    int expected = opcode == OSD_OP_SECONDARY_READ || opcode == OSD_OP_SECONDARY_WRITE
-        ? subop->req.sec_rw.len : 0;
+    int expected = opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE
+        || opcode == OSD_OP_SEC_WRITE_STABLE ? subop->req.sec_rw.len : 0;
    osd_primary_op_data_t *op_data = cur_op->op_data;
    if (retval != expected)
    {
@ -278,7 +290,7 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
    else
    {
        op_data->done++;
-        if (opcode == OSD_OP_SECONDARY_READ || opcode == OSD_OP_SECONDARY_WRITE)
+        if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE || opcode == OSD_OP_SEC_WRITE_STABLE)
        {
            uint64_t version = subop->reply.sec_rw.version;
 #ifdef OSD_DEBUG
@ -341,13 +353,27 @@ void osd_t::cancel_primary_write(osd_op_t *cur_op)
    }
 }

-void osd_t::submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, pg_osd_set_t & loc_set)
+static bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num)
+{
+    for (uint64_t i = 0; i < size; i++)
+    {
+        if (osd_set[i] == osd_num)
+        {
+            return true;
+        }
+    }
+    return false;
+}
+
+void osd_t::submit_primary_del_subops(osd_op_t *cur_op, osd_num_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
+    bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
    int extra_chunks = 0;
+    // ordered comparison for EC/XOR, unordered for replicated pools
    for (auto & chunk: loc_set)
    {
-        if (!cur_set || chunk.osd_num != cur_set[chunk.role])
+        if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role]))
        {
            extra_chunks++;
        }
@ -363,8 +389,9 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, pg_os
    int i = 0;
    for (auto & chunk: loc_set)
    {
-        if (!cur_set || chunk.osd_num != cur_set[chunk.role])
+        if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role]))
        {
+            int stripe_num = rep ? 0 : chunk.role;
            if (chunk.osd_num == this->osd_num)
            {
                clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
@ -377,7 +404,7 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, pg_os
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
-                        .stripe = op_data->oid.stripe | chunk.role,
+                        .stripe = op_data->oid.stripe | stripe_num,
                    },
                    // Same version as write
                    .version = op_data->fact_ver,
@ -387,17 +414,16 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, pg_os
            else
            {
                subops[i].op_type = OSD_OP_OUT;
-                subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
                subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
                subops[i].req.sec_del = {
                    .header = {
                        .magic = SECONDARY_OSD_OP_MAGIC,
                        .id = c_cli.next_subop_id++,
-                        .opcode = OSD_OP_SECONDARY_DELETE,
+                        .opcode = OSD_OP_SEC_DELETE,
                    },
                    .oid = {
                        .inode = op_data->oid.inode,
-                        .stripe = op_data->oid.stripe | chunk.role,
+                        .stripe = op_data->oid.stripe | stripe_num,
                    },
                    // Same version as write
                    .version = op_data->fact_ver,
@ -422,14 +448,14 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, pg_os
 void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
-    int n_osds = op_data->unstable_write_osds->size();
+    int n_osds = op_data->dirty_osd_count;
    osd_op_t *subops = new osd_op_t[n_osds];
    op_data->done = op_data->errors = 0;
    op_data->n_subops = n_osds;
    op_data->subops = subops;
    for (int i = 0; i < n_osds; i++)
    {
-        osd_num_t sync_osd = (*(op_data->unstable_write_osds))[i].osd_num;
+        osd_num_t sync_osd = op_data->dirty_osds[i];
        if (sync_osd == this->osd_num)
        {
            clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
@ -446,13 +472,12 @@ void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
        else
        {
            subops[i].op_type = OSD_OP_OUT;
-            subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
            subops[i].peer_fd = c_cli.osd_peer_fds.at(sync_osd);
            subops[i].req.sec_sync = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
-                    .opcode = OSD_OP_SECONDARY_SYNC,
+                    .opcode = OSD_OP_SEC_SYNC,
                },
            };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
@ -499,17 +524,16 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
        else
        {
            subops[i].op_type = OSD_OP_OUT;
-            subops[i].send_list.push_back(subops[i].req.buf, OSD_PACKET_SIZE);
            subops[i].peer_fd = c_cli.osd_peer_fds.at(stab_osd.osd_num);
            subops[i].req.sec_stab = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
-                    .opcode = OSD_OP_SECONDARY_STABILIZE,
+                    .opcode = OSD_OP_SEC_STABILIZE,
                },
                .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
            };
-            subops[i].send_list.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
+            subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
                int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
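Note on submit_primary_del_subops() above: for EC/XOR pools a chunk is treated as extra when its OSD differs from the one expected at the same role, while for replicated pools any position in the current set is acceptable. The sketch below isolates that decision under stated assumptions. chunk_loc_t and count_extra_chunks() are hypothetical names, and the bounds assertion is an addition that is not present in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t osd_num_t;

// Hypothetical stand-in for one element of pg_osd_set_t
struct chunk_loc_t
{
    uint64_t role;
    osd_num_t osd_num;
};

static bool contains_osd(const osd_num_t *osd_set, uint64_t size, osd_num_t osd_num)
{
    for (uint64_t i = 0; i < size; i++)
    {
        if (osd_set[i] == osd_num)
            return true;
    }
    return false;
}

// Count the chunks of loc_set that do not belong to cur_set and would have to be deleted
int count_extra_chunks(bool replicated, const osd_num_t *cur_set, uint64_t set_size,
    const std::vector<chunk_loc_t> & loc_set)
{
    int extra_chunks = 0;
    for (auto & chunk: loc_set)
    {
        bool extra;
        if (!cur_set)
        {
            // No target set given: every existing chunk is deleted
            extra = true;
        }
        else if (replicated)
        {
            // Unordered comparison: replicas are identical, so position does not matter
            extra = !contains_osd(cur_set, set_size, chunk.osd_num);
        }
        else
        {
            // Ordered comparison: each role holds different data (EC/XOR)
            assert(chunk.role < set_size);
            extra = chunk.osd_num != cur_set[chunk.role];
        }
        if (extra)
            extra_chunks++;
    }
    return extra_chunks;
}
```

With a current set of OSDs 1, 2, 3 and chunks located on OSDs 2, 1, 4 (roles 0, 1, 2), the replicated rule flags only the chunk on OSD 4, whereas the ordered rule flags all three.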
--- a/osd_rmw.cpp
+++ b/osd_rmw.cpp
@ -1,8 +1,11 @@
-#include <malloc.h>
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <string.h>
 #include <assert.h>
 #include "xor.h"
 #include "osd_rmw.h"
+#include "malloc_or_die.h"

 static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & stripe)
 {
@ -72,7 +75,7 @@ void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start,
    }
 }

-void reconstruct_stripe(osd_rmw_stripe_t *stripes, int pg_size, int role)
+void reconstruct_stripe_xor(osd_rmw_stripe_t *stripes, int pg_size, int role)
 {
    int prev = -2;
    for (int other = 0; other < pg_size; other++)
@ -152,7 +155,7 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
        }
    }
    // Allocate buffer
-    void *buf = memalign(MEM_ALIGNMENT, buf_size);
+    void *buf = memalign_or_die(MEM_ALIGNMENT, buf_size);
    uint64_t buf_pos = add_size;
    for (int role = 0; role < read_pg_size; role++)
    {
@ -207,9 +210,8 @@ void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_
        // Object is degraded/misplaced and will be moved to <write_osd_set>
        for (int role = 0; role < pg_size; role++)
        {
-            if (write_osd_set[role] != read_osd_set[role])
+            if (write_osd_set[role] != read_osd_set[role] && write_osd_set[role] != 0)
            {
-                // FIXME: For EC more than 2+1: handle case when write_osd_set == 0 and read_osd_set != 0
                // We need to get data for any moved / recovered chunk
                // And we need a continuous write buffer so we'll only optimize
                // for the case when the whole chunk is overwritten in the request
@ -357,21 +359,22 @@ static void xor_multiple_buffers(buf_len_t *xor1, int n1, buf_len_t *xor2, int n
    }
 }

-void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size)
+void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size)
 {
    int pg_minsize = pg_size-1;
    for (int role = 0; role < pg_size; role++)
    {
        if (stripes[role].read_end != 0 && stripes[role].missing)
        {
-            // Reconstruct missing stripe (EC k+1)
-            reconstruct_stripe(stripes, pg_size, role);
+            // Reconstruct missing stripe (XOR k+1)
+            reconstruct_stripe_xor(stripes, pg_size, role);
            break;
        }
    }
    uint32_t start = 0, end = 0;
-    if (!stripes[pg_minsize].missing || write_osd_set != read_osd_set)
+    if (write_osd_set[pg_minsize] != 0 || write_osd_set != read_osd_set)
    {
+        // Required for the next two if()s
        for (int role = 0; role < pg_minsize; role++)
        {
            if (stripes[role].req_end != 0)
@ -385,10 +388,9 @@ void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_
    {
        for (int role = 0; role < pg_minsize; role++)
        {
-            if (write_osd_set[role] != read_osd_set[role] &&
+            if (write_osd_set[role] != read_osd_set[role] && write_osd_set[role] != 0 &&
                (stripes[role].req_start != 0 || stripes[role].req_end != chunk_size))
            {
-                // FIXME again, handle case when write_osd_set[role] is 0
                // Copy modified chunk into the read buffer to write it back
                memcpy(
                    stripes[role].read_buf + stripes[role].req_start,
@ -401,9 +403,9 @@ void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_
            }
        }
    }
-    if (!stripes[pg_minsize].missing && end != 0)
+    if (write_osd_set[pg_minsize] != 0 && end != 0)
    {
-        // Calculate new parity (EC k+1)
+        // Calculate new parity (XOR k+1)
        int parity = pg_minsize, prev = -2;
        for (int other = 0; other < pg_minsize; other++)
        {
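Note on the XOR k+1 scheme that reconstruct_stripe_xor() and calc_rmw_parity_xor() above are named after: the parity chunk is the bytewise XOR of all data chunks, so any single missing chunk equals the XOR of the parity with the remaining chunks. The sketch below shows only that property on whole in-memory buffers. chunk_buf_t and xor_chunks() are hypothetical helpers, and the partial read/write ranges handled by the real functions are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<uint8_t> chunk_buf_t;

// XOR all given chunks together; every chunk must have the same length
chunk_buf_t xor_chunks(const std::vector<chunk_buf_t> & chunks)
{
    assert(!chunks.empty());
    chunk_buf_t result(chunks[0].size(), 0);
    for (auto & chunk: chunks)
    {
        assert(chunk.size() == result.size());
        for (size_t i = 0; i < chunk.size(); i++)
            result[i] ^= chunk[i];
    }
    return result;
}
```

The same helper covers both directions: xor_chunks(data_chunks) yields the parity, and xor_chunks() over the surviving data chunks plus the parity yields the lost chunk.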
--- a/osd_rmw.h
+++ b/osd_rmw.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #include <stdint.h>
@ -25,7 +28,7 @@ struct osd_rmw_stripe_t

 void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t len, osd_rmw_stripe_t *stripes);

-void reconstruct_stripe(osd_rmw_stripe_t *stripes, int pg_size, int role);
+void reconstruct_stripe_xor(osd_rmw_stripe_t *stripes, int pg_size, int role);

 int extend_missing_stripes(osd_rmw_stripe_t *stripes, osd_num_t *osd_set, int minsize, int size);

@ -34,4 +37,4 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
 void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
    uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size);

-void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size);
+void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size);
--- a/osd_rmw_test.cpp
+++ b/osd_rmw_test.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <string.h>
 #include "osd_rmw.cpp"
 #include "test_pattern.h"
@ -58,7 +61,7 @@ Cases:
     input buffer: [ write0, write1 ],
     rmw buffer: [ write2, read0, read1, read2 ],
   }
-   then, after calc_rmw_parity(): {
+   then, after calc_rmw_parity_xor(): {
     write: [ [ 128K-4K, 128K ], [ 0, 128K ], [ 0, 128K ] ],
     write1==read1,
   }
@ -82,7 +85,7 @@ Cases:
     input buffer: NULL,
     rmw buffer: [ read0, read1, read2 ],
   }
-   then, after calc_rmw_parity(): {
+   then, after calc_rmw_parity_xor(): {
     write: [ [ 0, 128K ], [ 0, 0 ], [ 0, 0 ] ],
     write0==read0,
   }
@ -182,7 +185,7 @@ void test4()
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
    set_pattern(stripes[1].read_buf, 128*1024-4096, UINT64_MAX); // didn't read it, it's missing
    set_pattern(stripes[2].read_buf, 128*1024-4096, 0); // old parity = 0
-    calc_rmw_parity(stripes, 3, osd_set, osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, osd_set, 128*1024);
    check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
    check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
    check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
@ -268,7 +271,7 @@ void test7()
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
    set_pattern(stripes[1].read_buf, 128*1024, UINT64_MAX); // didn't read it, it's missing
    set_pattern(stripes[2].read_buf, 128*1024, 0); // old parity = 0
-    calc_rmw_parity(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -306,7 +309,7 @@ void test8()
    // Test 8.2
    set_pattern(write_buf, 128*1024+4096, PATTERN0);
    set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN1);
-    calc_rmw_parity(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); // recheck again
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);     // recheck again
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again
@ -344,10 +347,10 @@ void test9()
    assert(stripes[0].write_buf == NULL);
    assert(stripes[1].write_buf == NULL);
    assert(stripes[2].write_buf == NULL);
-    // Test 8.2
+    // Test 9.2
    set_pattern(stripes[1].read_buf, 128*1024, 0);
    set_pattern(stripes[2].read_buf, 128*1024, PATTERN1);
-    calc_rmw_parity(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
--- a/osd_secondary.cpp
+++ b/osd_secondary.cpp
@ -1,30 +1,34 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include "osd.h"

 #include "json11/json11.hpp"

 void osd_t::secondary_op_callback(osd_op_t *op)
 {
-    if (op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
-        op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
+    if (op->req.hdr.opcode == OSD_OP_SEC_READ ||
+        op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
+        op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
        op->reply.sec_rw.version = op->bs_op->version;
    }
-    else if (op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE)
+    else if (op->req.hdr.opcode == OSD_OP_SEC_DELETE)
    {
        op->reply.sec_del.version = op->bs_op->version;
    }
-    if (op->req.hdr.opcode == OSD_OP_SECONDARY_READ &&
+    if (op->req.hdr.opcode == OSD_OP_SEC_READ &&
        op->bs_op->retval > 0)
    {
-        op->send_list.push_back(op->buf, op->bs_op->retval);
+        op->iov.push_back(op->buf, op->bs_op->retval);
    }
-    else if (op->req.hdr.opcode == OSD_OP_SECONDARY_LIST)
+    else if (op->req.hdr.opcode == OSD_OP_SEC_LIST)
    {
        // allocated by blockstore
        op->buf = op->bs_op->buf;
        if (op->bs_op->retval > 0)
        {
-            op->send_list.push_back(op->buf, op->bs_op->retval * sizeof(obj_ver_id));
+            op->iov.push_back(op->buf, op->bs_op->retval * sizeof(obj_ver_id));
        }
        op->reply.sec_list.stable_count = op->bs_op->version;
    }
@ -38,16 +42,18 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
 {
    cur_op->bs_op = new blockstore_op_t();
    cur_op->bs_op->callback = [this, cur_op](blockstore_op_t* bs_op) { secondary_op_callback(cur_op); };
-    cur_op->bs_op->opcode = (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ? BS_OP_READ
-        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE ? BS_OP_WRITE
-        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_SYNC ? BS_OP_SYNC
-        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ? BS_OP_STABLE
-        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK ? BS_OP_ROLLBACK
-        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE ? BS_OP_DELETE
-        : (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_LIST ? BS_OP_LIST
-        : -1)))))));
-    if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
-        cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
+    cur_op->bs_op->opcode = (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ? BS_OP_READ
+        : (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ? BS_OP_WRITE
+        : (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ? BS_OP_WRITE_STABLE
+        : (cur_op->req.hdr.opcode == OSD_OP_SEC_SYNC ? BS_OP_SYNC
+        : (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ? BS_OP_STABLE
+        : (cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ? BS_OP_ROLLBACK
+        : (cur_op->req.hdr.opcode == OSD_OP_SEC_DELETE ? BS_OP_DELETE
+        : (cur_op->req.hdr.opcode == OSD_OP_SEC_LIST ? BS_OP_LIST
+        : -1))))))));
+    if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
+        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
+        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
        cur_op->bs_op->oid = cur_op->req.sec_rw.oid;
        cur_op->bs_op->version = cur_op->req.sec_rw.version;
@ -58,7 +64,7 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
        cur_op->bs_op->retval = cur_op->bs_op->len;
 #endif
    }
-    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE)
+    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_DELETE)
    {
        cur_op->bs_op->oid = cur_op->req.sec_del.oid;
        cur_op->bs_op->version = cur_op->req.sec_del.version;
@ -66,8 +72,8 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
        cur_op->bs_op->retval = 0;
 #endif
    }
-    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ||
-        cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK)
+    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
+        cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
    {
        cur_op->bs_op->len = cur_op->req.sec_stab.len/sizeof(obj_ver_id);
        cur_op->bs_op->buf = cur_op->buf;
@ -75,11 +81,12 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
        cur_op->bs_op->retval = 0;
 #endif
    }
-    else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_LIST)
+    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_LIST)
    {
        if (cur_op->req.sec_list.pg_count < cur_op->req.sec_list.list_pg)
        {
            // requested pg number is greater than total pg count
+            printf("Invalid LIST request: pg count %u < pg number %u\n", cur_op->req.sec_list.pg_count, cur_op->req.sec_list.list_pg);
            cur_op->bs_op->retval = -EINVAL;
            secondary_op_callback(cur_op);
            return;
@ -87,6 +94,8 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
        cur_op->bs_op->oid.stripe = cur_op->req.sec_list.pg_stripe_size;
        cur_op->bs_op->len = cur_op->req.sec_list.pg_count;
        cur_op->bs_op->offset = cur_op->req.sec_list.list_pg - 1;
+        cur_op->bs_op->oid.inode = cur_op->req.sec_list.min_inode;
+        cur_op->bs_op->version = cur_op->req.sec_list.max_inode;
 #ifdef OSD_STUB
        cur_op->bs_op->retval = 0;
        cur_op->bs_op->buf = NULL;
@ -103,9 +112,9 @@ void osd_t::exec_show_config(osd_op_t *cur_op)
 {
    // FIXME: Send the real config, not its source
    std::string cfg_str = json11::Json(config).dump();
-    cur_op->buf = malloc(cfg_str.size()+1);
+    cur_op->buf = malloc_or_die(cfg_str.size()+1);
    memcpy(cur_op->buf, cfg_str.c_str(), cfg_str.size()+1);
-    cur_op->send_list.push_back(cur_op->buf, cfg_str.size()+1);
+    cur_op->iov.push_back(cur_op->buf, cfg_str.size()+1);
    finish_op(cur_op, cfg_str.size()+1);
 }

--- a/osd_test.cpp
+++ b/osd_test.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <netinet/in.h>
@ -184,7 +187,7 @@ uint64_t test_read(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t ver
    osd_any_reply_t reply;
    op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
    op.hdr.id = 1;
-    op.hdr.opcode = OSD_OP_SECONDARY_READ;
+    op.hdr.opcode = OSD_OP_SEC_READ;
    op.sec_rw.oid = {
        .inode = inode,
        .stripe = stripe,
@ -208,8 +211,8 @@ uint64_t test_read(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t ver
        return 0;
    }
    free(data);
-    printf("Read %lu:%lu v%lu = v%lu\n", inode, stripe, version, reply.sec_rw.version);
-    op.hdr.opcode = OSD_OP_SECONDARY_LIST;
+    printf("Read %lx:%lx v%lu = v%lu\n", inode, stripe, version, reply.sec_rw.version);
+    op.hdr.opcode = OSD_OP_SEC_LIST;
    op.sec_list.list_pg = 1;
    op.sec_list.pg_count = 1;
    op.sec_list.pg_stripe_size = 4*1024*1024;
@ -232,7 +235,7 @@ uint64_t test_read(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t ver
    {
        if (ov[i].oid.inode == inode && (ov[i].oid.stripe & ~(4096-1)) == (stripe & ~(4096-1)))
        {
-            printf("list: %lu:%lu v%lu stable=%d\n", ov[i].oid.inode, ov[i].oid.stripe, ov[i].version, i < reply.sec_list.stable_count ? 1 : 0);
+            printf("list: %lx:%lx v%lu stable=%d\n", ov[i].oid.inode, ov[i].oid.stripe, ov[i].version, i < reply.sec_list.stable_count ? 1 : 0);
        }
    }
    return 0;
@ -244,7 +247,7 @@ uint64_t test_write(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t ve
    osd_any_reply_t reply;
    op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
    op.hdr.id = 1;
-    op.hdr.opcode = OSD_OP_SECONDARY_WRITE;
+    op.hdr.opcode = OSD_OP_SEC_WRITE;
    op.sec_rw.oid = {
        .inode = inode,
        .stripe = stripe,
@ -354,7 +357,7 @@ void test_list_stab(int connect_fd)
    osd_any_reply_t reply;
    op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
    op.hdr.id = 1;
-    op.hdr.opcode = OSD_OP_SECONDARY_LIST;
+    op.hdr.opcode = OSD_OP_SEC_LIST;
    op.sec_list.pg_count = 0;
    assert(write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE) == OSD_PACKET_SIZE);
    int r = read_blocking(connect_fd, reply.buf, OSD_PACKET_SIZE);
@ -370,7 +373,7 @@ void test_list_stab(int connect_fd)
        // Stabilize in portions of 32 entries
        if (i - last_start >= 32 || i == total_count)
        {
-            op.hdr.opcode = OSD_OP_SECONDARY_STABILIZE;
+            op.hdr.opcode = OSD_OP_SEC_STABILIZE;
            op.sec_stab.len = sizeof(obj_ver_id) * (i - last_start);
            assert(write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE) == OSD_PACKET_SIZE);
            assert(write_blocking(connect_fd, data + last_start, op.sec_stab.len) == op.sec_stab.len);
--- a/pg_states.cpp
+++ b/pg_states.cpp
@ -1,8 +1,11 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include "pg_states.h"

-const int pg_state_bit_count = 13;
+const int pg_state_bit_count = 14;

-const int pg_state_bits[13] = {
+const int pg_state_bits[14] = {
    PG_STARTING,
    PG_PEERING,
    PG_INCOMPLETE,
@ -14,10 +17,11 @@ const int pg_state_bits[13] = {
    PG_HAS_DEGRADED,
    PG_HAS_MISPLACED,
    PG_HAS_UNCLEAN,
+    PG_HAS_INVALID,
    PG_LEFT_ON_DEAD,
 };

-const char *pg_state_names[13] = {
+const char *pg_state_names[14] = {
    "starting",
    "peering",
    "incomplete",
@ -29,5 +33,6 @@ const char *pg_state_names[13] = {
    "has_degraded",
    "has_misplaced",
    "has_unclean",
+    "has_invalid",
    "left_on_dead",
 };
--- a/pg_states.h
+++ b/pg_states.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 // Placement group states
@ -15,9 +18,11 @@
 #define PG_HAS_DEGRADED (1<<8)
 #define PG_HAS_MISPLACED (1<<9)
 #define PG_HAS_UNCLEAN (1<<10)
-#define PG_LEFT_ON_DEAD (1<<11)
+#define PG_HAS_INVALID (1<<11)
+#define PG_LEFT_ON_DEAD (1<<12)

-// FIXME: Safe default that doesn't depend on pg_stripe_size or pg_block_size
+// Lower bits that represent object role (EC 0/1/2... or always 0 with replication)
+// 12 bits is a safe default that doesn't depend on pg_stripe_size or pg_block_size
 #define STRIPE_MASK ((uint64_t)4096 - 1)

 // OSD object states
@ -26,7 +31,6 @@
 #define OBJ_MISPLACED 0x08
 #define OBJ_NEEDS_STABLE 0x10000
 #define OBJ_NEEDS_ROLLBACK 0x20000
-#define OBJ_BUGGY 0x80000

 extern const int pg_state_bits[];
 extern const char *pg_state_names[];
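Note on the pg_states.h hunk above: PG_HAS_INVALID takes bit 11 and PG_LEFT_ON_DEAD moves to bit 12, while STRIPE_MASK reserves the low 12 bits of a stripe offset for the object role. The following is a minimal sketch of how these constants could be used, not code from this commit. The helper names (stripe_role, stripe_base, describe_pg_flags) and the "+" separator are assumptions made for illustration, and only the five state bits visible in the hunk are covered.

```cpp
#include <cstdint>
#include <string>

// Values copied from the pg_states.h hunk above
#define PG_HAS_DEGRADED (1<<8)
#define PG_HAS_MISPLACED (1<<9)
#define PG_HAS_UNCLEAN (1<<10)
#define PG_HAS_INVALID (1<<11)
#define PG_LEFT_ON_DEAD (1<<12)
#define STRIPE_MASK ((uint64_t)4096 - 1)

// Hypothetical helper: object role stored in the low 12 bits of the stripe offset
static uint64_t stripe_role(uint64_t stripe)
{
    return stripe & STRIPE_MASK;
}

// Hypothetical helper: stripe offset with the role bits cleared
static uint64_t stripe_base(uint64_t stripe)
{
    return stripe & ~STRIPE_MASK;
}

// Hypothetical helper: join the names of the set bits with "+"
// (covers only the subset of states shown in the hunk)
static std::string describe_pg_flags(int state)
{
    static const int bits[] = {
        PG_HAS_DEGRADED, PG_HAS_MISPLACED, PG_HAS_UNCLEAN, PG_HAS_INVALID, PG_LEFT_ON_DEAD,
    };
    static const char *names[] = {
        "has_degraded", "has_misplaced", "has_unclean", "has_invalid", "left_on_dead",
    };
    std::string out;
    for (int i = 0; i < 5; i++)
    {
        if (state & bits[i])
        {
            if (!out.empty())
                out += "+";
            out += names[i];
        }
    }
    return out;
}
```

Because the bit values are renumbered, any state bitmask persisted or exchanged with the old numbering would decode differently after this change.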
--- a/qemu_driver.c
+++ b/qemu_driver.c
@ -0,0 +1,400 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
+// QEMU block driver
+
+#define _GNU_SOURCE
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "block/block_int.h"
+#include "block/qdict.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qerror.h"
+#include "qemu/uri.h"
+#include "qemu/error-report.h"
+#include "qemu/module.h"
+#include "qemu/option.h"
+#include "qemu/cutils.h"
+
+#include "qemu_proxy.h"
+
+typedef struct VitastorClient
+{
+    void *proxy;
+    char *etcd_host;
+    char *etcd_prefix;
+    uint64_t inode;
+    uint64_t pool;
+    uint64_t size;
+    int readonly;
+    QemuMutex mutex;
+} VitastorClient;
+
+typedef struct VitastorRPC
+{
+    BlockDriverState *bs;
+    Coroutine *co;
+    QEMUIOVector *iov;
+    int ret;
+    int complete;
+} VitastorRPC;
+
+static char *qemu_rbd_next_tok(char *src, char delim, char **p)
+{
+    char *end;
+    *p = NULL;
+    for (end = src; *end; ++end)
+    {
+        if (*end == delim)
+            break;
+        if (*end == '\\' && end[1] != '\0')
+            end++;
+    }
+    if (*end == delim)
+    {
+        *p = end + 1;
+        *end = '\0';
+    }
+    return src;
+}
+
+static void qemu_rbd_unescape(char *src)
+{
+    char *p;
+    for (p = src; *src; ++src, ++p)
+    {
+        if (*src == '\\' && src[1] != '\0')
+            src++;
+        *p = *src;
+    }
+    *p = '\0';
+}
+
+// vitastor[:key=value]*
+// vitastor:etcd_host=127.0.0.1:inode=1:pool=1:size=1073741824
+static void vitastor_parse_filename(const char *filename, QDict *options, Error **errp)
+{
+    const char *start;
+    char *p, *buf;
+
+    if (!strstart(filename, "vitastor:", &start))
+    {
+        error_setg(errp, "File name must start with 'vitastor:'");
+        return;
+    }
+
+    buf = g_strdup(start);
+    p = buf;
+
+    // The following are all key/value pairs
+    while (p)
+    {
+        char *name, *value;
+        name = qemu_rbd_next_tok(p, '=', &p);
+        if (!p)
+        {
+            error_setg(errp, "conf option %s has no value", name);
+            goto out;
+        }
+        qemu_rbd_unescape(name);
+        value = qemu_rbd_next_tok(p, ':', &p);
+        qemu_rbd_unescape(value);
+        if (!strcmp(name, "inode") || !strcmp(name, "pool") || !strcmp(name, "size"))
+        {
+            unsigned long long num_val;
+            if (parse_uint_full(value, &num_val, 0))
+            {
+                error_setg(errp, "Illegal %s: %s", name, value);
+                goto out;
+            }
+            qdict_put_int(options, name, num_val);
+        }
+        else
+        {
+            qdict_put_str(options, name, value);
+        }
+    }
+    if (!qdict_get_try_int(options, "inode", 0))
+    {
+        error_setg(errp, "inode is missing");
+        goto out;
+    }
+    if (!(qdict_get_try_int(options, "inode", 0) >> (64-POOL_ID_BITS)) &&
+        !qdict_get_try_int(options, "pool", 0))
+    {
+        error_setg(errp, "pool number is missing");
+        goto out;
+    }
+    if (!qdict_get_try_int(options, "size", 0))
+    {
+        error_setg(errp, "size is missing");
+        goto out;
+    }
+    if (!qdict_get_try_str(options, "etcd_host"))
+    {
+        error_setg(errp, "etcd_host is missing");
+        goto out;
+    }
+
+out:
+    g_free(buf);
+    return;
+}
+
+static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, Error **errp)
+{
+    VitastorClient *client = bs->opaque;
+    int64_t ret = 0;
+    client->etcd_host = g_strdup(qdict_get_try_str(options, "etcd_host"));
+    client->etcd_prefix = g_strdup(qdict_get_try_str(options, "etcd_prefix"));
+    client->inode = qdict_get_int(options, "inode");
+    client->pool = qdict_get_try_int(options, "pool", 0);
+    if (client->pool)
+        client->inode = (client->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
+    client->size = qdict_get_int(options, "size");
+    client->readonly = (flags & BDRV_O_RDWR) ? 0 : 1;
+    client->proxy = vitastor_proxy_create(bdrv_get_aio_context(bs), client->etcd_host, client->etcd_prefix);
+    //client->aio_context = bdrv_get_aio_context(bs);
+    bs->total_sectors = client->size / BDRV_SECTOR_SIZE;
+    qdict_del(options, "etcd_host");
+    qdict_del(options, "etcd_prefix");
+    qdict_del(options, "inode");
+    qdict_del(options, "pool");
+    qdict_del(options, "size");
+    qemu_mutex_init(&client->mutex);
+    return ret;
+}
+
+static void vitastor_close(BlockDriverState *bs)
+{
+    VitastorClient *client = bs->opaque;
+    vitastor_proxy_destroy(client->proxy);
+    qemu_mutex_destroy(&client->mutex);
+    g_free(client->etcd_host);
+    if (client->etcd_prefix)
+        g_free(client->etcd_prefix);
+}
+
+static int vitastor_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
+{
+    bsz->phys = 4096;
+    bsz->log = 4096;
+    return 0;
+}
+
+static int coroutine_fn vitastor_co_create_opts(
+#if QEMU_VERSION_MAJOR >= 4
+    BlockDriver *drv,
+#endif
+    const char *url, QemuOpts *opts, Error **errp)
+{
+    QDict *options;
+    int ret;
+
+    options = qdict_new();
+    vitastor_parse_filename(url, options, errp);
+    if (*errp)
+    {
+        ret = -1;
+        goto out;
+    }
+
+    // Inodes don't require creation in Vitastor. FIXME: They will once inode metadata is implemented
+
+    ret = 0;
+out:
+    qobject_unref(options);
+    return ret;
+}
+
+static int coroutine_fn vitastor_co_truncate(BlockDriverState *bs, int64_t offset,
+#if QEMU_VERSION_MAJOR >= 4
+    bool exact,
+#endif
+    PreallocMode prealloc, Error **errp)
+{
+    VitastorClient *client = bs->opaque;
+
+    if (prealloc != PREALLOC_MODE_OFF)
+    {
+        error_setg(errp, "Unsupported preallocation mode '%s'", PreallocMode_str(prealloc));
+        return -ENOTSUP;
+    }
+
+    // TODO: Resize inode to <offset> bytes
+    client->size = offset;
+
+    return 0;
+}
+
+static int vitastor_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    bdi->cluster_size = 4096;
+    return 0;
+}
+
+static int64_t vitastor_getlength(BlockDriverState *bs)
+{
+    VitastorClient *client = bs->opaque;
+    return client->size;
+}
+
+static void vitastor_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+    bs->bl.request_alignment = 4096;
+    bs->bl.min_mem_alignment = 4096;
+    bs->bl.opt_mem_alignment = 4096;
+}
+
+static int64_t vitastor_get_allocated_file_size(BlockDriverState *bs)
+{
+    return 0;
+}
+
+static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task)
+{
+    *task = (VitastorRPC) {
+        .co     = qemu_coroutine_self(),
+        .bs     = bs,
+    };
+}
+
+static void vitastor_co_generic_bh_cb(int retval, void *opaque)
+{
+    VitastorRPC *task = opaque;
+    task->ret = retval;
+    task->complete = 1;
+    if (qemu_coroutine_self() != task->co)
+    {
+        aio_co_wake(task->co);
+    }
+}
+
+static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags)
+{
+    VitastorClient *client = bs->opaque;
+    VitastorRPC task;
+    vitastor_co_init_task(bs, &task);
+    task.iov = iov;
+
+    qemu_mutex_lock(&client->mutex);
+    vitastor_proxy_rw(0, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
+    qemu_mutex_unlock(&client->mutex);
+
+    while (!task.complete)
+    {
+        qemu_coroutine_yield();
+    }
+
+    return task.ret;
+}
+
+static int coroutine_fn vitastor_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags)
+{
+    VitastorClient *client = bs->opaque;
+    VitastorRPC task;
+    vitastor_co_init_task(bs, &task);
+    task.iov = iov;
+
+    qemu_mutex_lock(&client->mutex);
+    vitastor_proxy_rw(1, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
+    qemu_mutex_unlock(&client->mutex);
+
+    while (!task.complete)
+    {
+        qemu_coroutine_yield();
+    }
+
+    return task.ret;
+}
+
+static int coroutine_fn vitastor_co_flush(BlockDriverState *bs)
+{
+    VitastorClient *client = bs->opaque;
+    VitastorRPC task;
+    vitastor_co_init_task(bs, &task);
+
+    qemu_mutex_lock(&client->mutex);
+    vitastor_proxy_sync(client->proxy, vitastor_co_generic_bh_cb, &task);
+    qemu_mutex_unlock(&client->mutex);
+
+    while (!task.complete)
+    {
+        qemu_coroutine_yield();
+    }
+
+    return task.ret;
+}
+
+static QemuOptsList vitastor_create_opts = {
+    .name = "vitastor-create-opts",
+    .head = QTAILQ_HEAD_INITIALIZER(vitastor_create_opts.head),
+    .desc = {
+        {
+            .name = BLOCK_OPT_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Virtual disk size"
+        },
+        { /* end of list */ }
+    }
+};
+
+static const char *vitastor_strong_runtime_opts[] = {
+    "inode",
+    "pool",
+    "etcd_host",
+    "etcd_prefix",
+
+    NULL
+};
+
+static BlockDriver bdrv_vitastor = {
+    .format_name                    = "vitastor",
+    .protocol_name                  = "vitastor",
+
+    .instance_size                  = sizeof(VitastorClient),
+    .bdrv_parse_filename            = vitastor_parse_filename,
+
+    .bdrv_has_zero_init             = bdrv_has_zero_init_1,
+#if QEMU_VERSION_MAJOR >= 4
+    .bdrv_has_zero_init_truncate    = bdrv_has_zero_init_1,
+#endif
+    .bdrv_get_info                  = vitastor_get_info,
+    .bdrv_getlength                 = vitastor_getlength,
+    .bdrv_probe_blocksizes          = vitastor_probe_blocksizes,
+    .bdrv_refresh_limits            = vitastor_refresh_limits,
+
+    // FIXME: Implement it along with per-inode statistics
+    //.bdrv_get_allocated_file_size   = vitastor_get_allocated_file_size,
+
+    .bdrv_file_open                 = vitastor_file_open,
+    .bdrv_close                     = vitastor_close,
+
+    // Option list for the create operation
+    .create_opts                    = &vitastor_create_opts,
+
+    // For qmp_blockdev_create(), used by the qemu monitor / QAPI
+    // Requires patching QAPI IDL, thus unimplemented
+    //.bdrv_co_create                 = vitastor_co_create,
+
+    // For bdrv_create(), used by qemu-img
+    .bdrv_co_create_opts            = vitastor_co_create_opts,
+
+    .bdrv_co_truncate               = vitastor_co_truncate,
+
+    .bdrv_co_preadv                 = vitastor_co_preadv,
+    .bdrv_co_pwritev                = vitastor_co_pwritev,
+    .bdrv_co_flush_to_disk          = vitastor_co_flush,
+
+#if QEMU_VERSION_MAJOR >= 4
+    .strong_runtime_opts            = vitastor_strong_runtime_opts,
+#endif
+};
+
+static void vitastor_block_init(void)
+{
+    bdrv_register(&bdrv_vitastor);
+}
+
+block_init(vitastor_block_init);
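Note on vitastor_file_open() above: the driver folds the pool number into the top POOL_ID_BITS bits of the 64-bit inode number, and vitastor_parse_filename() accepts a missing "pool" option when those top bits are already set. A minimal sketch of that packing follows, assuming POOL_ID_BITS is 16 as in qemu_proxy.h from this commit. The helper names make_inode and inode_pool are hypothetical and do not appear in the diff.

```cpp
#include <cstdint>

#ifndef POOL_ID_BITS
#define POOL_ID_BITS 16
#endif

// Hypothetical helper: combine a pool number and an inode number the same way
// vitastor_file_open() does. A zero pool leaves the inode unchanged.
static uint64_t make_inode(uint64_t inode, uint64_t pool)
{
    if (!pool)
        return inode;
    // Mask that keeps the low (64-POOL_ID_BITS) bits of the inode number
    const uint64_t inode_mask = (((uint64_t)1) << (64-POOL_ID_BITS)) - 1;
    return (inode & inode_mask) | (pool << (64-POOL_ID_BITS));
}

// Hypothetical helper: extract the pool number, as the filename parser does
// when it checks whether "pool" may be omitted
static uint64_t inode_pool(uint64_t inode)
{
    return inode >> (64-POOL_ID_BITS);
}
```

With these definitions, inode=1 and pool=1 give 0x0001000000000001. An explicit non-zero pool overrides whatever pool bits the inode number already carried.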
--- a/qemu_proxy.cpp
+++ b/qemu_proxy.cpp
@ -0,0 +1,130 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
+// C-C++ proxy for the QEMU driver
+// (QEMU headers don't compile with g++)
+
+#include <sys/epoll.h>
+
+#include "cluster_client.h"
+
+typedef void* AioContext;
+#include "qemu_proxy.h"
+
+extern "C"
+{
+    // QEMU
+    typedef void IOHandler(void *opaque);
+    void aio_set_fd_handler(AioContext *ctx, int fd, int is_external, IOHandler *fd_read, IOHandler *fd_write, void *poll_fn, void *opaque);
+}
+
+struct QemuProxyData
+{
+    int fd;
+    std::function<void(int, int)> callback;
+};
+
+class QemuProxy
+{
+    std::map<int, QemuProxyData> handlers;
+
+public:
+
+    timerfd_manager_t *tfd;
+    cluster_client_t *cli;
+    AioContext *ctx;
+
+    QemuProxy(AioContext *ctx, const char *etcd_host, const char *etcd_prefix)
+    {
+        this->ctx = ctx;
+        json11::Json cfg = json11::Json::object {
+            { "etcd_address", std::string(etcd_host) },
+            { "etcd_prefix", std::string(etcd_prefix ? etcd_prefix : "/vitastor") },
+        };
+        tfd = new timerfd_manager_t([this](int fd, bool wr, std::function<void(int, int)> callback) { set_fd_handler(fd, wr, callback); });
+        cli = new cluster_client_t(NULL, tfd, cfg);
+    }
+
+    ~QemuProxy()
+    {
+        cli->stop();
+        delete cli;
+        delete tfd;
+    }
+
+    void set_fd_handler(int fd, bool wr, std::function<void(int, int)> callback)
+    {
+        if (callback != NULL)
+        {
+            handlers[fd] = { .fd = fd, .callback = callback };
+            aio_set_fd_handler(ctx, fd, false, &QemuProxy::read_handler, wr ? &QemuProxy::write_handler : NULL, NULL, &handlers[fd]);
+        }
+        else
+        {
+            handlers.erase(fd);
+            aio_set_fd_handler(ctx, fd, false, NULL, NULL, NULL, NULL);
+        }
+    }
+
+    static void read_handler(void *opaque)
+    {
+        QemuProxyData *data = (QemuProxyData *)opaque;
+        data->callback(data->fd, EPOLLIN);
+    }
+
+    static void write_handler(void *opaque)
+    {
+        QemuProxyData *data = (QemuProxyData *)opaque;
+        data->callback(data->fd, EPOLLOUT);
+    }
+};
+
+extern "C" {
+
+void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix)
+{
+    QemuProxy *p = new QemuProxy(ctx, etcd_host, etcd_prefix);
+    return p;
+}
+
+void vitastor_proxy_destroy(void *client)
+{
+    QemuProxy *p = (QemuProxy*)client;
+    delete p;
+}
+
+void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
+    iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque)
+{
+    QemuProxy *p = (QemuProxy*)client;
+    cluster_op_t *op = new cluster_op_t;
+    op->opcode = write ? OSD_OP_WRITE : OSD_OP_READ;
+    op->inode = inode;
+    op->offset = offset;
+    op->len = len;
+    for (int i = 0; i < iovcnt; i++)
+    {
+        op->iov.push_back(iov[i].iov_base, iov[i].iov_len);
+    }
+    op->callback = [cb, opaque](cluster_op_t *op)
+    {
+        cb(op->retval, opaque);
+        delete op;
+    };
+    p->cli->execute(op);
+}
+
+void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque)
+{
+    QemuProxy *p = (QemuProxy*)client;
+    cluster_op_t *op = new cluster_op_t;
+    op->opcode = OSD_OP_SYNC;
+    op->callback = [cb, opaque](cluster_op_t *op)
+    {
+        cb(op->retval, opaque);
+        delete op;
+    };
+    p->cli->execute(op);
+}
+
+}
--- a/qemu_proxy.h
+++ b/qemu_proxy.h
@ -0,0 +1,29 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
+#ifndef VITASTOR_QEMU_PROXY_H
+#define VITASTOR_QEMU_PROXY_H
+
+#ifndef POOL_ID_BITS
+#define POOL_ID_BITS 16
+#endif
+#include <stdint.h>
+#include <sys/uio.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Our exports
+typedef void VitastorIOHandler(int retval, void *opaque);
+void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix);
+void vitastor_proxy_destroy(void *client);
+void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
+    struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque);
+void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
--- a/ringloop.cpp
+++ b/ringloop.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include "ringloop.h"

 ring_loop_t::ring_loop_t(int qd)
--- a/ringloop.h
+++ b/ringloop.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #ifndef _LARGEFILE64_SOURCE
--- a/rw_blocking.cpp
+++ b/rw_blocking.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include <errno.h>
 #include <stdlib.h>
 #include <stdio.h>
--- a/rw_blocking.h
+++ b/rw_blocking.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include <unistd.h>
--- a/stub_bench.cpp
+++ b/stub_bench.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 /**
 * Stub benchmarker
 */
@ -123,7 +126,7 @@ void run_bench(int peer_fd)
        // read
        op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
        op.hdr.id = 1;
-        op.hdr.opcode = OSD_OP_SECONDARY_READ;
+        op.hdr.opcode = OSD_OP_SEC_READ;
        op.sec_rw.oid.inode = 3;
        op.sec_rw.oid.stripe = (rand() << 17) % (1 << 29); // 512 MB
        op.sec_rw.version = 0;
@ -149,7 +152,7 @@ void run_bench(int peer_fd)
        // write
        op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
        op.hdr.id = 1;
-        op.hdr.opcode = OSD_OP_SECONDARY_WRITE;
+        op.hdr.opcode = OSD_OP_SEC_WRITE;
        op.sec_rw.oid.inode = 3;
        op.sec_rw.oid.stripe = (rand() << 17) % (1 << 29); // 512 MB
        op.sec_rw.version = 0;
--- a/stub_osd.cpp
+++ b/stub_osd.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 /**
 * Stub "OSD" to test & compare network performance with sync read/write and io_uring
 *
@ -130,7 +133,7 @@ void run_stub(int peer_fd)
        reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
        reply.hdr.id = op.hdr.id;
        reply.hdr.opcode = op.hdr.opcode;
-        if (op.hdr.opcode == OSD_OP_SECONDARY_READ)
+        if (op.hdr.opcode == OSD_OP_SEC_READ)
        {
            reply.hdr.retval = op.sec_rw.len;
            buf = malloc(op.sec_rw.len);
@ -141,7 +144,7 @@ void run_stub(int peer_fd)
            if (r < op.sec_rw.len)
                break;
        }
-        else if (op.hdr.opcode == OSD_OP_SECONDARY_WRITE)
+        else if (op.hdr.opcode == OSD_OP_SEC_WRITE || op.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
        {
            buf = malloc(op.sec_rw.len);
            r = read_blocking(peer_fd, buf, op.sec_rw.len);
--- a/stub_uring_osd.cpp
+++ b/stub_uring_osd.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 /**
 * Stub "OSD" implemented on top of osd_messenger to test & compare
 * network performance with sync read/write and io_uring
@ -38,7 +41,7 @@ int main(int narg, char *args[])
    msgr->exec_op = [msgr](osd_op_t *op) { stub_exec_op(msgr, op); };
    // Accept new connections
    int listen_fd = bind_stub("0.0.0.0", 11203);
-    epmgr->set_fd_handler(listen_fd, [listen_fd, msgr](int fd, int events)
+    epmgr->set_fd_handler(listen_fd, false, [listen_fd, msgr](int fd, int events)
    {
        msgr->accept_connections(listen_fd);
    });
@ -105,14 +108,13 @@ void stub_exec_op(osd_messenger_t *msgr, osd_op_t *op)
    op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
    op->reply.hdr.id = op->req.hdr.id;
    op->reply.hdr.opcode = op->req.hdr.opcode;
-    op->send_list.push_back(op->reply.buf, OSD_PACKET_SIZE);
-    if (op->req.hdr.opcode == OSD_OP_SECONDARY_READ)
+    if (op->req.hdr.opcode == OSD_OP_SEC_READ)
    {
        op->reply.hdr.retval = op->req.sec_rw.len;
        op->buf = malloc(op->req.sec_rw.len);
-        op->send_list.push_back(op->buf, op->req.sec_rw.len);
+        op->iov.push_back(op->buf, op->req.sec_rw.len);
    }
-    else if (op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
+    else if (op->req.hdr.opcode == OSD_OP_SEC_WRITE || op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
        op->reply.hdr.retval = op->req.sec_rw.len;
    }
--- a/test.cpp
+++ b/test.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #define _LARGEFILE64_SOURCE
 #include <sys/types.h>
 #include <sys/ioctl.h>
--- a/test_allocator.cpp
+++ b/test_allocator.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <stdio.h>
 #include "allocator.h"

--- a/test_blockstore.cpp
+++ b/test_blockstore.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #include <malloc.h>
 #include "timerfd_interval.h"
 #include "blockstore.h"
--- a/test_pattern.h
+++ b/test_pattern.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 (see README.md for details)
+
 #pragma once

 #include <assert.h>
--- a/timerfd_interval.cpp
+++ b/timerfd_interval.cpp
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include <sys/timerfd.h>
 #include <sys/poll.h>
 #include <unistd.h>
--- a/timerfd_interval.h
+++ b/timerfd_interval.h
@ -1,3 +1,6 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include "ringloop.h"
--- a/timerfd_manager.cpp
+++ b/timerfd_manager.cpp
@ -1,24 +1,32 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #include <sys/timerfd.h>
 #include <sys/poll.h>
+#include <sys/epoll.h>
 #include <unistd.h>
+#include <errno.h>
+#include <string.h>
 #include "timerfd_manager.h"

-timerfd_manager_t::timerfd_manager_t(ring_loop_t *ringloop)
+timerfd_manager_t::timerfd_manager_t(std::function<void(int, bool, std::function<void(int, int)>)> set_fd_handler)
 {
+    this->set_fd_handler = set_fd_handler;
    wait_state = 0;
    timerfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    if (timerfd < 0)
    {
        throw std::runtime_error(std::string("timerfd_create: ") + strerror(errno));
    }
-    consumer.loop = [this]() { loop(); };
-    ringloop->register_consumer(&consumer);
-    this->ringloop = ringloop;
+    set_fd_handler(timerfd, false, [this](int fd, int events)
+    {
+        handle_readable();
+    });
 }

 timerfd_manager_t::~timerfd_manager_t()
 {
-    ringloop->unregister_consumer(&consumer);
+    set_fd_handler(timerfd, false, NULL);
    close(timerfd);
 }

@ -48,7 +56,6 @@ int timerfd_manager_t::set_timer(uint64_t millis, bool repeat, std::function<voi
    });
    inc_timer(timers[timers.size()-1]);
    set_nearest();
-    set_wait();
    return timer_id;
 }

@ -69,7 +76,6 @@ void timerfd_manager_t::clear_timer(int timer_id)
                nearest--;
            }
            set_nearest();
-            set_wait();
            break;
        }
    }
@ -154,36 +160,3 @@ void timerfd_manager_t::trigger_nearest()
    cb(nearest_id);
    nearest = -1;
 }
-
-void timerfd_manager_t::loop()
-{
-    if (!(wait_state & 1) && timers.size())
-    {
-        set_nearest();
-    }
-    set_wait();
-}
-
-void timerfd_manager_t::set_wait()
-{
-    if ((wait_state & 3) == 1)
-    {
-        io_uring_sqe *sqe = ringloop->get_sqe();
-        if (!sqe)
-        {
-            return;
-        }
-        ring_data_t *data = ((ring_data_t*)sqe->user_data);
-        my_uring_prep_poll_add(sqe, timerfd, POLLIN);
-        data->callback = [this](ring_data_t *data)
-        {
-            if (data->res < 0)
-            {
-                throw std::runtime_error(std::string("waiting for timer failed: ") + strerror(-data->res));
-            }
-            handle_readable();
-            set_wait();
-        };
-        wait_state = 3;
-    }
-}
--- a/timerfd_manager.h
+++ b/timerfd_manager.h
@ -1,7 +1,11 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
+
 #pragma once

 #include <time.h>
-#include "ringloop.h"
+#include <vector>
+#include <functional>

 struct timerfd_timer_t
 {
@ -19,20 +23,15 @@ class timerfd_manager_t
    int nearest = -1;
    int id = 1;
    std::vector<timerfd_timer_t> timers;
-    ring_loop_t *ringloop;
-    ring_consumer_t consumer;

    void inc_timer(timerfd_timer_t & t);
    void set_nearest();
    void trigger_nearest();
    void handle_readable();
-    void set_wait();
-    void loop();
 public:
-    // FIXME shouldn't be here
-    std::function<void(int, std::function<void(int, int)>)> set_fd_handler;
+    std::function<void(int, bool, std::function<void(int, int)>)> set_fd_handler;

-    timerfd_manager_t(ring_loop_t *ringloop);
+    timerfd_manager_t(std::function<void(int, bool, std::function<void(int, int)>)> set_fd_handler);
    ~timerfd_manager_t();
    int set_timer(uint64_t millis, bool repeat, std::function<void(int)> callback);
    void clear_timer(int timer_id);
Author	SHA1	Message	Date
Vitaliy Filippov	9f2a948712	Make pg_stripe_size a per-pool config	2020-10-01 18:51:49 +03:00
Vitaliy Filippov	ba74eece4a	More fixes to the failure model (why am I doing this?..)	2020-10-01 18:38:30 +03:00
Vitaliy Filippov	2fdd8a1b38	More correct failure model (I hope so)	2020-10-01 02:33:48 +03:00
Vitaliy Filippov	526983f7a9	Add usable CLI commands for NBD proxy (map/unmap/list)	2020-09-29 02:06:19 +03:00
Vitaliy Filippov	8e36f04482	One more experiment with cluster AFR%	2020-09-27 19:42:42 +03:00
Vitaliy Filippov	f460d8c1c8	Add note about NBD	2020-09-26 00:11:55 +03:00
Vitaliy Filippov	7619a789c0	Set request size in NBD	2020-09-26 00:01:23 +03:00
Vitaliy Filippov	e65a28e27e	Implement a simple NBD proxy (does not daemonize yet)	2020-09-25 20:51:01 +03:00
Vitaliy Filippov	6852f299ae	Add functions to calculate AFR for a cluster	2020-09-24 23:15:26 +03:00
Vitaliy Filippov	1967269c13	Resume operations in cluster_client when PGs are loaded (fixes a hang in qemu-img)	2020-09-20 01:50:19 +03:00
Vitaliy Filippov	7574183ba6	Make qemu driver build with QEMU 3.x	2020-09-20 01:50:19 +03:00
Vitaliy Filippov	108cd6312d	Correct some typos in README, add note about qemu-img	2020-09-20 01:50:19 +03:00
Vitaliy Filippov	588b9e6393	Add README	2020-09-17 23:07:50 +03:00
Vitaliy Filippov	0471b09b9c	Add license notices to all source code files	2020-09-17 23:07:06 +03:00
Vitaliy Filippov	ef911555ed	Add cpp-btree and json11 submodules	2020-09-17 23:07:06 +03:00
Vitaliy Filippov	9d20839a02	Add license texts	2020-09-17 23:07:06 +03:00
Vitaliy Filippov	67a2e5640c	Fix a GIANT memory leak on read :D	2020-09-17 00:45:59 +03:00
Vitaliy Filippov	28a0f08ce7	Add a very simple tool for calculating device offsets	2020-09-17 00:45:59 +03:00
Vitaliy Filippov	9b4e5b64ae	Move monitor to mon/	2020-09-16 02:15:26 +03:00
Vitaliy Filippov	4ca2eeafff	Prefer data OSDs for EC/XOR because they can actually read something locally	2020-09-16 01:34:18 +03:00
Vitaliy Filippov	79156e0ee1	Add test systemd unit generation script	2020-09-13 12:50:15 +03:00
Vitaliy Filippov	ed26c33f85	React to down OSDs instantly, set timer to recheck PGs after <osd_out_time>	2020-09-13 12:23:18 +03:00
Vitaliy Filippov	18692517be	Increase receive_buffer_size	2020-09-13 00:39:20 +03:00
Vitaliy Filippov	de6919b02b	Add option to disable multiple overwrites of the same journal sector. This makes sense for some SSDs like Intel D3-4510 because they don't like overwrites of the same sector: $ fio -direct=1 -rw=write -bs=4k -size=4k -loops=100000 -iodepth=1 -> write: IOPS=3142, BW=12.3MiB/s (12.9MB/s)(97.9MiB/7977msec); $ fio -direct=1 -rw=write -bs=4k -size=128k -loops=100000 -iodepth=1 -> write: IOPS=20.8k, BW=81.4MiB/s (85.3MB/s)(543MiB/6675msec)	2020-09-13 00:37:39 +03:00
Vitaliy Filippov	8f9f438e25	Allow zero reweights, fix changing pgs	2020-09-11 23:59:30 +03:00
Vitaliy Filippov	db4b82089e	connecting=true was also forgotten	2020-09-11 19:51:17 +03:00
Vitaliy Filippov	faa871090f	Do not die in mon on bad JSON in etcd	2020-09-11 19:25:52 +03:00
Vitaliy Filippov	49ec8c7c63	Add --verbose 1 flag for mon	2020-09-11 18:49:07 +03:00
Vitaliy Filippov	eadd454992	Fix etcd key regexps	2020-09-11 18:33:49 +03:00
Vitaliy Filippov	a15bd23ebd	Missed a bad PG key	2020-09-11 17:56:18 +03:00
Vitaliy Filippov	e3f502b466	Oops, I forgot a file	2020-09-11 17:49:44 +03:00
Vitaliy Filippov	6e72cf2732	Disable stdout/stderr buffering	2020-09-11 16:52:51 +03:00
Vitaliy Filippov	53832d184a	Allow to use lazy sync with replicated pools	2020-09-06 12:08:44 +03:00
Vitaliy Filippov	352caeba14	Fix one more bug with replicated reads	2020-09-06 02:27:44 +03:00
Vitaliy Filippov	fb533991b7	"Lock" retried objects from other flushers when accounting for overruns. Fixes a rare 100% CPU consuming hang	2020-09-06 02:19:36 +03:00
Vitaliy Filippov	73e26dbbea	Add up_wait_retry_interval to config and fix it so it actually works	2020-09-05 22:05:21 +03:00
Vitaliy Filippov	44973e7f27	Fix replicated pool bugs	2020-09-05 21:45:04 +03:00
Vitaliy Filippov	242d9a42a2	Change object format in prints to %lx:%lx v%lu	2020-09-05 17:44:05 +03:00
Vitaliy Filippov	68c3e96e46	Add pool setting to fio and qemu drivers	2020-09-05 17:44:00 +03:00
Vitaliy Filippov	cc4714a3a7	Basic fixes for the Monitor	2020-09-05 02:14:43 +03:00
Vitaliy Filippov	e051db5a73	Check for unsuccessful memory allocations	2020-09-05 01:42:11 +03:00
Vitaliy Filippov	4f9b5286a0	Add replicated pool support to OSD logic ...in theory :-D now it needs some testing	2020-09-05 01:42:11 +03:00
Vitaliy Filippov	168cc2c803	Add pool support to OSD, part 1. This just fixes all the code so it builds and works like before, but doesn't yet bring the support for replicated pools.	2020-09-04 17:04:17 +03:00
Vitaliy Filippov	4cdad634b5	Add pool support to the cluster client	2020-09-03 00:52:41 +03:00
Vitaliy Filippov	293cb5bd1d	Parse pool configuration in etcd_state_client	2020-09-02 21:54:32 +03:00
Vitaliy Filippov	0918ea08fa	Implement min/max inode filters in LIST operation	2020-09-02 14:42:40 +03:00
Vitaliy Filippov	a8b3cbd6af	Implement per-pool PG calculation, fix some lint warnings	2020-09-01 18:53:54 +03:00
Vitaliy Filippov	fe0d78bf8e	Filter configuration keys with regexps instead of "osd_tree" value checks	2020-09-01 17:07:06 +03:00
Vitaliy Filippov	085c145a18	Document etcd data (to-be state with pools) at least in some form	2020-09-01 16:29:45 +03:00
Vitaliy Filippov	30da4bddbe	Extract scale_pg_count into a separate file	2020-09-01 16:18:58 +03:00
Vitaliy Filippov	14b4a4617e	(re)move placement_tree	2020-09-01 16:18:58 +03:00
Vitaliy Filippov	3932c9b2e2	Add WRITE_STABLE to the secondary OSD for the upcoming replication support	2020-09-01 16:18:58 +03:00
Vitaliy Filippov	2e8c69fc5b	Rename OSD_OP_SECONDARY_* to OSD_OP_SEC_*	2020-08-31 23:57:50 +03:00
Vitaliy Filippov	a86788fe3b	Support optimizing for the case when parity chunks occupy more space than data chunks. Mostly as an experiment because the problem solved by this commit comes from Ceph's EC+compression implementation details and I'm not sure if my implementation will be the same	2020-08-17 01:44:19 +03:00
Vitaliy Filippov	95ebfad283	Final name is Vitastor	2020-08-03 23:50:59 +03:00
Vitaliy Filippov	6022f28dc9	Add pseudo-random PG generation	2020-07-07 23:13:07 +03:00
Vitaliy Filippov	9d10a4d057	Support arbitrary pg_size in LPOptimizer	2020-07-05 20:28:05 +03:00
Vitaliy Filippov	ec7acc8f3a	Add WRITE_STABLE operation for future replication support	2020-07-05 01:48:02 +03:00
Vitaliy Filippov	416a80b099	Make blockstore object state a combination of type and workflow	2020-07-04 22:20:32 +03:00
Vitaliy Filippov	a7929931eb	Implement PG epochs to prevent the "version split". The "version split" is when: (1) a block is written to 1 OSD out of 3, all of them die; (2) OSDs 2 and 3 come up, the same block is written to both of them; (3) the remaining OSD comes up. Now all 3 OSDs have the same version of the same object, but with different data.	2020-07-04 00:55:27 +03:00
Vitaliy Filippov	e680d6c1c3	Rename reconstruct_stripe and calc_rmw_parity to indicate that they are only for XOR N+1	2020-06-30 10:40:43 +03:00
Vitaliy Filippov	9b33f598d3	Fix two more cluster client bugs: 1) Sync could delete an unfinished write due to the lack of ordering (fixed by introducing syncing_writes); 2) Writes could be postponed indefinitely due to bad resuming of operations after a sync	2020-06-27 02:13:35 +03:00
Vitaliy Filippov	592bcd3699	Fix QEMU driver bugs (QEMU and qemu-img now work! hooray!)	2020-06-26 18:25:43 +03:00
Vitaliy Filippov	5e1e39633d	Implement QEMU block driver	2020-06-25 11:59:43 +03:00
Vitaliy Filippov	41c2655edd	Disconnect sockets when read returns zero	2020-06-24 01:32:19 +03:00
Vitaliy Filippov	d68370304e	Support iovecs in cluster_client_t	2020-06-24 01:31:48 +03:00
Vitaliy Filippov	a22d9f38aa	Only use EPOLLOUT while connecting	2020-06-23 20:18:31 +03:00
Vitaliy Filippov	8736b3ad32	Add destructors, make ringloop optional in cluster_client_t	2020-06-23 20:10:33 +03:00
Vitaliy Filippov	62343c8022	Allow to turn synchronous recvmsg/sendmsg on with a config option	2020-06-23 01:15:07 +03:00
Vitaliy Filippov	9abaf5b735	Use epoll_manager in osd	2020-06-20 01:28:18 +03:00
Vitaliy Filippov	badf68c039	Support iovecs for read operations	2020-06-19 19:47:05 +03:00
Vitaliy Filippov	0f6d193d73	Postpone op callbacks to the end of handle_read(), fix a bug where primary OSD could reply -EPIPE with data to a read operation	2020-06-16 01:36:38 +03:00
Vitaliy Filippov	27ee14a4e6	Fix bugs in cluster_client	2020-06-16 00:08:45 +03:00
Vitaliy Filippov	64afec03ec	In theory, implement syncs and replay for the non-immediate commit mode	2020-06-15 00:04:16 +03:00
				`@ -0,0 +1 @@`
				`Subproject commit 5dc108754ad40d3b1d024f9bd7cca0595ef1a1db`
				`@ -0,0 +1 @@`
				`Subproject commit 97f06cb20c1e136fd37d58fb40f57dd8f8a3a4a7`