Compare commits

...

51 Commits

SHA1 Message Date
bb2d9a3afe Release 0.5.5
- Transition to CMake build system
- Fix Monitor being unable to change PG sizes
- Fix PG optimizer not using some OSDs in some cases
- Fix inability to change PG count online
- Improve journal flusher performance
- Add a little better systemd unit generator
- Use w=8 with jerasure (breaking change for EC pools)
2021-02-26 01:59:18 +03:00
e899ed2c25 Make OSDs with 256 flushers (as they are now dynamic) 2021-02-26 01:59:18 +03:00
e21b14b72c Fix rpm specs for building with CMake 2021-02-26 01:59:18 +03:00
5af8eddaa9 Add the remaining build script for Debian 2021-02-26 01:59:18 +03:00
4f5a94c07a Modify instructions for the CMake build 2021-02-26 00:28:57 +03:00
e16b87ecc8 Rename random_combinations() parameter from "unordered" to "ordered" as it's more correct 2021-02-25 23:59:34 +03:00
fcb4aa0a11 Fix Monitor being unable to change PG sizes 2021-02-25 23:59:34 +03:00
12adfa470c Add a test for changing PG size 2021-02-25 23:59:33 +03:00
7f15e0c084 Add a simple test for the PG optimizer 2021-02-25 23:59:33 +03:00
08d4bef419 Fix PG optimizer removing PGs without adding new ones
This happened when the distribution was already valid for the current OSD tree,
but didn't use all OSDs. For example, with OSDs 1, 2 and 3 and all PGs equal to [ 1, 2 ],
the distribution remained unchanged.
2021-02-25 23:59:33 +03:00
2d73b19a6c Fix online PG count change bugs 2021-02-25 23:59:33 +03:00
69c87009e9 Add a test for changing PG count 2021-02-25 23:59:33 +03:00
c974cb539c Make flusher_count adaptive and limit write iodepth 2021-02-25 23:59:33 +03:00
00e98f64f3 A little better systemd unit generator 2021-02-25 23:59:33 +03:00
91a70dfb1b Add a test for the no_same_sector_overwrites mode 2021-02-25 23:59:33 +03:00
178388ac8c Use packages/ subdir instead of build/ for Docker package builds 2021-02-25 23:59:04 +03:00
bf9a175efc Move C/C++ sources to src subdirectory 2021-02-25 23:59:03 +03:00
08aed962de Use CMake 2021-02-25 23:58:08 +03:00
8c65e890b9 Slightly clean up the build script 2021-02-25 23:56:54 +03:00
8cda70b889 Allow to enable AddressSanitizer with "ASAN=1 make" 2021-02-25 23:55:33 +03:00
61ab22403a Use w=8 with jerasure 2021-02-25 23:55:33 +03:00
16da663a66 Add another test for failure domains 2021-02-25 23:55:33 +03:00
4a2dcf7b6b Update the license to VNPL 1.1
VNPL 1.1 is slightly reworded to make it clear that proprietary software
interacting with Vitastor and providing some kind of service to end users isn't
a "Proxy Program" if it's not specially designed to be used with Vitastor.

For example, Windows OS running in a virtual machine stored in a Vitastor
cluster clearly isn't.
2021-02-25 23:55:33 +03:00
8d48cc56b0 Generate randomly permutated OSD combinations when optimizing for compressed chunks 2021-02-25 23:55:33 +03:00
9f58f01425 Mirror afr.js from /vitalif/ceph-afr-calc 2021-02-25 23:55:33 +03:00
b9e7d31aa1 Release v0.5.4
- Fix a rare hang, more or less reproducible with very slow drives
- Fix a hang with the no_same_sector_overwrites mode
2021-02-24 01:40:30 +03:00
2d9f09dcb6 Attempt forced trim when stopping an overrun flusher
Fixes a rare hang happening in the event of journal space running out without
new work to do for flushers except the current sector.
The hang could be reproduced more or less consistently with very slow drives.
2021-02-24 01:33:01 +03:00
7cc59260c5 Fix no_same_sector_overwrites related bug 2021-02-23 18:50:51 +03:00
ca0a11ec85 Release 0.5.3 2021-02-03 00:38:57 +03:00
51c0b5afee Whitelist more leaks 2021-02-02 02:05:41 +03:00
e1e01d042e Rename sector_info.usage_count to flush_count 2021-02-02 01:32:23 +03:00
534a4a657e Rename space_check.sectors_required to sectors_to_write 2021-02-02 01:30:23 +03:00
9b5d8b9ad4 Fix multiple-sector journal writes, add assertions to not miss any SQEs 2021-02-02 01:29:11 +03:00
e66ed47515 Clear SQEs before returning them to the caller to prevent erroneous double submissions 2021-02-02 01:26:54 +03:00
036c6d4c42 Add a simple test case 2021-02-01 19:43:10 +03:00
4cb79a3bf8 Allow to calculate simple-offsets for files 2021-02-01 19:43:10 +03:00
3bf53754c2 Fix several I/O bugs 2021-02-01 19:43:10 +03:00
6023cac361 Do not stop clients before they are connected 2021-02-01 19:31:10 +03:00
915d04c446 Allow empty global configuration, report OSD statistics faster 2021-02-01 19:31:10 +03:00
21e06ea40d Fix memory leaks in fio engines 2021-02-01 19:31:10 +03:00
9ef7f865b0 Fix incorrect calls to prepare_journal_sector_write() when flushing multiple sectors 2021-02-01 19:31:10 +03:00
9dd20a31aa Do not use pg_minsize in the client code! 2021-02-01 19:31:10 +03:00
28be049909 Dump only actual part of the journal by default 2021-01-01 23:04:30 +03:00
78fbaacf1f Externalize jerasure's w into defines
In fact, w=8 looks better than w=32, so it may be changed in the future
2020-12-31 19:15:22 +03:00
1526c5a213 Add lp_solve into dependencies 2020-12-31 01:32:31 +03:00
c7cc414c90 Skip removed descriptors in epoll (this is possible in real clusters) 2020-12-30 17:04:18 +03:00
f4ea313707 Fix cl->read_op being freed without calling the completion callback 2020-12-30 16:55:54 +03:00
b88b76f316 Parallel usage of multiple network interfaces was a sick fantasy 2020-12-30 00:05:17 +03:00
4a17a61d1f Make rm_inode work with incomplete and degraded objects, allow to wait before deleting objects 2020-12-28 16:38:08 +03:00
ccabbbfbcb For reference: include a spec patch for building QEMU 4.2 or CentOS 7 2020-12-06 15:43:38 +03:00
26dac57083 State that jerasure is now supported 2020-12-06 15:25:48 +03:00
128 changed files with 1714 additions and 920 deletions

@@ -1,5 +1,6 @@
 .git
 build
+packages
 mon/node_modules
 *.o
 *.so
@@ -15,3 +16,4 @@ fio
 qemu
 rpm/*.Dockerfile
 debian/*.Dockerfile
+Dockerfile

CMakeLists.txt (new file, 5 lines)

@@ -0,0 +1,5 @@
cmake_minimum_required(VERSION 2.8)
project(vitastor)
add_subdirectory(src)
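
The new top-level CMakeLists.txt simply delegates to src/. Based on the updated README further down in this diff, an out-of-tree build would look roughly like the sketch below; the QEMU_PLUGINDIR value is only an example for RHEL-like systems, and everything else is the stock CMake workflow rather than something quoted verbatim from this commit range.

```
# Sketch of the new CMake build flow (replaces "make -j8" + "make install")
mkdir build && cd build
# QEMU_PLUGINDIR is the cmake option the README says must be "qemu-kvm" on RHEL
cmake .. -DQEMU_PLUGINDIR=qemu-kvm
make -j8
sudo make install
```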

@@ -1,46 +0,0 @@
#!/usr/bin/perl
use strict;
my $deps = {};
for my $line (split /\n/, `grep '^#include "' *.cpp *.h`)
{
if ($line =~ /^([^:]+):\#include "([^"]+)"/s)
{
$deps->{$1}->{$2} = 1;
}
}
my $added;
do
{
$added = 0;
for my $file (keys %$deps)
{
for my $dep (keys %{$deps->{$file}})
{
if ($deps->{$dep})
{
for my $subdep (keys %{$deps->{$dep}})
{
if (!$deps->{$file}->{$subdep})
{
$added = 1;
$deps->{$file}->{$subdep} = 1;
}
}
}
}
}
} while ($added);
for my $file (sort keys %$deps)
{
if ($file =~ /\.cpp$/)
{
my $obj = $file;
$obj =~ s/\.cpp$/.o/s;
print "$obj: $file ".join(" ", sort keys %{$deps->{$file}})."\n";
print "\tg++ \$(CXXFLAGS) -c -o \$\@ \$\<\n";
}
}

Makefile (deleted, 195 lines)

@@ -1,195 +0,0 @@
BINDIR ?= /usr/bin
LIBDIR ?= /usr/lib/x86_64-linux-gnu
QEMU_PLUGINDIR ?= /usr/lib/x86_64-linux-gnu/qemu
BLOCKSTORE_OBJS := allocator.o blockstore.o blockstore_impl.o blockstore_init.o blockstore_open.o blockstore_journal.o blockstore_read.o \
blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o
# -fsanitize=address
CXXFLAGS := -g -O3 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fPIC -fdiagnostics-color=always -I/usr/include/jerasure
all: libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal qemu_driver.so nbd_proxy rm_inode
clean:
rm -f *.o libblockstore.so libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal qemu_driver.so nbd_proxy rm_inode
install: all
mkdir -p $(DESTDIR)$(LIBDIR)/vitastor
install -m 0755 libfio_sec_osd.so $(DESTDIR)$(LIBDIR)/vitastor/
install -m 0755 libfio_cluster.so $(DESTDIR)$(LIBDIR)/vitastor/
install -m 0755 libfio_blockstore.so $(DESTDIR)$(LIBDIR)/vitastor/
install -m 0755 libblockstore.so $(DESTDIR)$(LIBDIR)/vitastor/
mkdir -p $(DESTDIR)$(BINDIR)
install -m 0755 osd $(DESTDIR)$(BINDIR)/vitastor-osd
install -m 0755 dump_journal $(DESTDIR)$(BINDIR)/vitastor-dump-journal
install -m 0755 nbd_proxy $(DESTDIR)$(BINDIR)/vitastor-nbd
install -m 0755 rm_inode $(DESTDIR)$(BINDIR)/vitastor-rm
mkdir -p $(DESTDIR)$(QEMU_PLUGINDIR)
install -m 0755 qemu_driver.so $(DESTDIR)$(QEMU_PLUGINDIR)/block-vitastor.so
dump_journal: dump_journal.cpp crc32c.o blockstore_journal.h
g++ $(CXXFLAGS) -o $@ $< crc32c.o
libblockstore.so: $(BLOCKSTORE_OBJS)
g++ $(CXXFLAGS) -o $@ -shared $(BLOCKSTORE_OBJS) -ltcmalloc_minimal -luring
libfio_blockstore.so: ./libblockstore.so fio_engine.o json11.o
g++ $(CXXFLAGS) -Wl,-rpath,'$(LIBDIR)/vitastor',-rpath,'$$ORIGIN' -shared -o $@ fio_engine.o json11.o libblockstore.so -ltcmalloc_minimal -luring
OSD_OBJS := osd.o osd_secondary.o msgr_receive.o msgr_send.o osd_peering.o osd_flush.o osd_peering_pg.o \
osd_primary.o osd_primary_subops.o etcd_state_client.o messenger.o osd_cluster.o http_client.o osd_ops.o pg_states.o \
osd_rmw.o json11.o base64.o timerfd_manager.o epoll_manager.o
osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
g++ $(CXXFLAGS) -Wl,-rpath,'$(LIBDIR)/vitastor',-rpath,'$$ORIGIN' -o $@ osd_main.cpp $(OSD_OBJS) libblockstore.so -ltcmalloc_minimal -luring -lJerasure
stub_osd: stub_osd.o rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_osd.o rw_blocking.o -ltcmalloc_minimal
osd_rmw_test: osd_rmw_test.o
g++ $(CXXFLAGS) -o $@ osd_rmw_test.o -lJerasure -fsanitize=address
STUB_URING_OSD_OBJS := stub_uring_osd.o epoll_manager.o messenger.o msgr_send.o msgr_receive.o ringloop.o timerfd_manager.o json11.o
stub_uring_osd: $(STUB_URING_OSD_OBJS)
g++ $(CXXFLAGS) -o $@ -ltcmalloc_minimal $(STUB_URING_OSD_OBJS) -luring
stub_bench: stub_bench.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_bench.cpp rw_blocking.o -ltcmalloc_minimal
osd_test: osd_test.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ osd_test.cpp rw_blocking.o -ltcmalloc_minimal
osd_peering_pg_test: osd_peering_pg_test.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o $@ $< osd_peering_pg.o -ltcmalloc_minimal
libfio_sec_osd.so: fio_sec_osd.o rw_blocking.o
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ fio_sec_osd.o rw_blocking.o
FIO_CLUSTER_OBJS := cluster_client.o epoll_manager.o etcd_state_client.o \
messenger.o msgr_send.o msgr_receive.o ringloop.o json11.o http_client.o osd_ops.o pg_states.o timerfd_manager.o base64.o
libfio_cluster.so: fio_cluster.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $< $(FIO_CLUSTER_OBJS) -luring
nbd_proxy: nbd_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -o $@ $< $(FIO_CLUSTER_OBJS) -luring
rm_inode: rm_inode.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -o $@ $< $(FIO_CLUSTER_OBJS) -luring
qemu_driver.o: qemu_driver.c qemu_proxy.h
gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-I qemu/include $(CXXFLAGS) -c -o $@ $<
qemu_driver.so: qemu_driver.o qemu_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $(FIO_CLUSTER_OBJS) qemu_driver.o qemu_proxy.o -luring
test_blockstore: ./libblockstore.so test_blockstore.cpp timerfd_interval.o
g++ $(CXXFLAGS) -Wl,-rpath,'$(LIBDIR)/vitastor',-rpath,'$$ORIGIN' -o test_blockstore test_blockstore.cpp timerfd_interval.o libblockstore.so -ltcmalloc_minimal -luring
test_shit: test_shit.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o test_shit test_shit.cpp -luring -lm
test_allocator: test_allocator.cpp allocator.o
g++ $(CXXFLAGS) -o test_allocator test_allocator.cpp allocator.o
crc32c.o: crc32c.c crc32c.h
g++ $(CXXFLAGS) -c -o $@ $<
json11.o: json11/json11.cpp
g++ $(CXXFLAGS) -c -o json11.o json11/json11.cpp
# Autogenerated
allocator.o: allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
base64.o: base64.cpp base64.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore.o: blockstore.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_flush.o: blockstore_flush.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_impl.o: blockstore_impl.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_init.o: blockstore_init.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_journal.o: blockstore_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_open.o: blockstore_open.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_read.o: blockstore_read.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_rollback.o: blockstore_rollback.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_stable.o: blockstore_stable.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_sync.o: blockstore_sync.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_write.o: blockstore_write.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
cluster_client.o: cluster_client.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
dump_journal.o: dump_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
epoll_manager.o: epoll_manager.cpp epoll_manager.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
etcd_state_client.o: etcd_state_client.cpp base64.h etcd_state_client.h http_client.h json11/json11.hpp object_id.h osd_id.h osd_ops.h pg_states.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_cluster.o: fio_cluster.cpp cluster_client.h epoll_manager.h etcd_state_client.h fio/arch/arch.h fio/fio.h fio/optgroup.h fio_headers.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_engine.o: fio_engine.cpp blockstore.h fio/arch/arch.h fio/fio.h fio/optgroup.h fio_headers.h json11/json11.hpp object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_sec_osd.o: fio_sec_osd.cpp fio/arch/arch.h fio/fio.h fio/optgroup.h fio_headers.h object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
http_client.o: http_client.cpp http_client.h json11/json11.hpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
messenger.o: messenger.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_receive.o: msgr_receive.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_send.o: msgr_send.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
nbd_proxy.o: nbd_proxy.cpp cluster_client.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd.o: osd.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_cluster.o: osd_cluster.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_flush.o: osd_flush.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_main.o: osd_main.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_ops.o: osd_ops.cpp object_id.h osd_id.h osd_ops.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering.o: osd_peering.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg.o: osd_peering_pg.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg_test.o: osd_peering_pg_test.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary.o: osd_primary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary_subops.o: osd_primary_subops.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw.o: osd_rmw.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw_test.o: osd_rmw_test.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.cpp osd_rmw.h test_pattern.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_secondary.o: osd_secondary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_test.o: osd_test.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h test_pattern.h
g++ $(CXXFLAGS) -c -o $@ $<
pg_states.o: pg_states.cpp pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
qemu_proxy.o: qemu_proxy.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h qemu_proxy.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
ringloop.o: ringloop.cpp ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
rm_inode.o: rm_inode.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
rw_blocking.o: rw_blocking.cpp rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_bench.o: stub_bench.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_osd.o: stub_osd.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_uring_osd.o: stub_uring_osd.cpp epoll_manager.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
test_allocator.o: test_allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
test_blockstore.o: test_blockstore.cpp blockstore.h object_id.h ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
test_shit.o: test_shit.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_interval.o: timerfd_interval.cpp ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_manager.o: timerfd_manager.cpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<

@@ -16,7 +16,8 @@ breaking changes in the future. However, the following is implemented:
 - Basic part: highly-available block storage with symmetric clustering and no SPOF
 - Performance ;-D
-- Two redundancy schemes: Replication and XOR n+1 (simplest case of EC)
+- Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
+  based on jerasure library with any number of data and parity drives in a group
 - Configuration via simple JSON data structures in etcd
 - Automatic data distribution over OSDs, with support for:
   - Mathematical optimization for better uniformity and less data movement
@@ -39,8 +40,6 @@ breaking changes in the future. However, the following is implemented:
 - OSD creation tool (OSDs currently have to be created by hand)
 - Other administrative tools
 - Per-inode I/O and space usage statistics
-- jerasure EC support with any number of data and parity drives in a group
-- Parallel usage of multiple network interfaces
 - Proxmox and OpenNebula plugins
 - iSCSI proxy
 - Inode metadata storage in etcd
@@ -50,6 +49,7 @@ breaking changes in the future. However, the following is implemented:
 - Checksums
 - SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
 - RDMA and NVDIMM support
+- Web GUI
 - Compression (possibly)
 - Read caching using system page cache (possibly)
@@ -336,9 +336,8 @@ Vitastor with single-thread NBD on the same hardware:
 - You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
   See `qemu-*.*-vitastor.patch`.
 - Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
-- Build Vitastor with `make -j8`.
-- Run `make install` (optionally with `LIBDIR=/usr/lib64 QEMU_PLUGINDIR=/usr/lib64/qemu-kvm`
-  if you're using an RPM-based distro).
+- Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
+  Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
 
 ## Running
@@ -349,20 +348,16 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
   with lazy fsync, but prepare for inferior single-thread latency.
 - Get a fast network (at least 10 Gbit/s).
 - Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
-- Start etcd with `--max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision` options.
-- Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
-  (if all your drives have capacitors).
-- Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
-- Calculate offsets for your drives with `node /usr/lib/vitastor/mon/simple-offsets.js --device /dev/sdX`.
-- Make systemd units for your OSDs. Look at `/usr/lib/vitastor/mon/make-units.sh` for example.
-  Notable configuration variables from the example:
+- Check `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh` and
+  put desired values into the variables at the top of these files.
+- Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
+- Create systemd units for your OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
+- You can edit the units and change OSD configuration. Notable configuration variables:
   - `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
   - `immediate_commit all` - use this if all your drives are server-grade.
   - `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
-  - `flusher_count 16` - flusher is a micro-thread that removes old data from the journal.
-    More flushers mean more aggressive journal flushing which allows for more throughput
-    but slightly hurts latency under less load. Flushing will probably be improved in the future
-    because currently high queue depths sometimes lead to performance degradation.
+  - `flusher_count 256` - flusher is a micro-thread that removes old data from the journal.
+    You don't have to worry about this parameter anymore, 256 is enough.
   - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
     block size of your SSDs which is 4096 on most drives.
   - `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
@@ -373,18 +368,22 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
   setting is set, it is also required to raise `journal_sector_buffer_count` setting, which is the
   number of dirty journal sectors that may be written to at the same time.
 - `systemctl start vitastor.target` everywhere.
-- Start any number of monitors: `node /usr/lib/vitastor/mon/mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5`.
+- Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
+  (if all your drives have capacitors).
+- Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
+  For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
 - At this point, one of the monitors will configure PGs and OSDs will start them.
 - You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
-- Run tests with (for example): `fio -thread -ioengine=/usr/lib/x86_64-linux-gnu/vitastor/libfio_cluster.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
+- Run tests with (for example): `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
 - Upload VM disk image with qemu-img (for example):
 ```
-LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img convert -f qcow2 debian10.qcow2 -p
-  -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
+qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
 ```
+  Note that the command requires to be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
+  if you use unmodified QEMU.
 - Run QEMU with (for example):
 ```
-LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-system-x86_64 -enable-kvm -m 1024
+qemu-system-x86_64 -enable-kvm -m 1024
   -drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
   -vnc 0.0.0.0:0
@@ -398,10 +397,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
 - Object deletion requests may currently lead to 'incomplete' objects if your OSDs crash during
   deletion because proper handling of object cleanup in a cluster should be "three-phase"
-  and it's currently not implemented. Inode removal tool currently can't handle unclean
-  objects, so incomplete objects become undeletable. This will be fixed in near future
-  by allowing the inode removal tool to delete unclean objects. With this problem fixed
-  you'll be able just to repeat the removal again.
+  and it's currently not implemented. Just to repeat the removal again in this case.
 
 ## Implementation Principles
@@ -423,22 +419,27 @@ Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
 You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
 
 All server-side code (OSD, Monitor and so on) is licensed under the terms of
-Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on
+Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
 GNU GPLv3.0 with the additional "Network Interaction" clause which requires
 opensourcing all programs directly or indirectly interacting with Vitastor
-through a computer network ("Proxy Programs"). Proxy Programs may be made public
-not only under the terms of the same license, but also under the terms of any
-GPL-Compatible Free Software License, as listed by the Free Software Foundation.
+through a computer network and expressly designed to be used in conjunction
+with it ("Proxy Programs"). Proxy Programs may be made public not only under
+the terms of the same license, but also under the terms of any GPL-Compatible
+Free Software License, as listed by the Free Software Foundation.
 This is a stricter copyleft license than the Affero GPL.
 
+Please note that VNPL doesn't require you to open the code of proprietary
+software running inside a VM if it's not specially designed to be used with
+Vitastor.
+
 Basically, you can't use the software in a proprietary environment to provide
 its functionality to users without opensourcing all intermediary components
 standing between the user and Vitastor or purchasing a commercial license
 from the author 😀.
 
 Client libraries (cluster_client and so on) are dual-licensed under the same
-VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
+VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
 software like QEMU and fio.
 
-You can find the full text of VNPL-1.0 in the file [VNPL-1.0.txt](VNPL-1.0.txt).
+You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
 GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).
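
Taken together, the Running-section changes above replace the manual etcd/monitor/OSD setup with the make-units.sh and make-osd.sh helpers. The following is a condensed sketch of the new flow, not an exact transcript of this commit range; the etcd endpoint, the disk UUID and the pool JSON are placeholders copied from the README examples.

```
# Edit the variables at the top of both helper scripts first, then:
/usr/lib/vitastor/mon/make-units.sh                          # units for etcd + monitor
/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX  # one unit per OSD disk
systemctl start vitastor.target

# Cluster configuration now goes into etcd after the services are up:
etcdctl --endpoints=http://10.115.0.10:2379 put /vitastor/config/global '{"immediate_commit":"all"}'
etcdctl --endpoints=http://10.115.0.10:2379 put /vitastor/config/pools \
  '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'

# Check that PGs become 'active' and run a quick benchmark:
etcdctl --endpoints=http://10.115.0.10:2379 get --prefix /vitastor/pg/state
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 \
  -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G
```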

@@ -1,7 +1,7 @@
 VITASTOR NETWORK PUBLIC LICENSE
-Version 1, 17 September 2020
+Version 1.1, 6 February 2021
 
-Copyright (C) 2020 Vitaliy Filippov <vitalif@yourcmc.ru>
+Copyright (C) 2021 Vitaliy Filippov <vitalif@yourcmc.ru>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
@@ -540,12 +540,15 @@ License would be to refrain entirely from conveying the Program.
 13. Remote Network Interaction.
 
-Notwithstanding any other provision of this License, if you provide
-any user an opportunity to interact with the covered work directly
-or indirectly through a computer network, an imitation of such network,
-or an additional program (hereinafter referred to as a "Proxy Program")
-that, in turn, interacts with the covered work through a computer network,
-an imitation of such network, or another Proxy Program itself,
+A "Proxy Program" means a separate program which is specially designed to
+be used in conjunction with the covered work and interacts with it directly
+or indirectly through any kind of API (application programming interfaces),
+a computer network, an imitation of such network, or another Proxy Program
+itself.
+
+Notwithstanding any other provision of this License, if you provide any user
+with an opportunity to interact with the covered work through a computer
+network, an imitation of such network, or any number of "Proxy Programs",
 you must prominently offer that user an opportunity to receive the
 Corresponding Source of the covered work and all Proxy Programs from a
 network server at no charge, through some standard or customary means of

@@ -1,6 +1,6 @@
 #!/bin/bash
-gcc -E -o fio_headers.i fio_headers.h
+gcc -I. -E -o fio_headers.i src/fio_headers.h
 rm -rf fio-copy
 for i in `grep -Po 'fio/[^"]+' fio_headers.i | sort | uniq`; do

@@ -5,7 +5,7 @@
 #cd b/qemu; make qapi
 gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-    -I qemu/include -E -o qemu_driver.i qemu_driver.c
+    -I qemu/include -E -o qemu_driver.i src/qemu_driver.c
 rm -rf qemu-copy
 for i in `grep -Po 'qemu/[^"]+' qemu_driver.i | sort | uniq`; do

debian/build-vitastor-bullseye.sh (new executable file, 7 lines)

@@ -0,0 +1,7 @@
#!/bin/bash
sed 's/$REL/bullseye/' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

debian/build-vitastor-buster.sh (new executable file, 7 lines)

@@ -0,0 +1,7 @@
#!/bin/bash
sed 's/$REL/buster/' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile
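
Both Debian build scripts follow the same pattern: render vitastor.Dockerfile for the target release, build it with podman, and collect the resulting packages through the bind-mounted packages/ directory. A hedged usage sketch follows; the exact output path under packages/ is an assumption based on the vitastor.Dockerfile changes further below.

```
# Presumably run from the repository's debian/ subdirectory: the script writes
# ../Dockerfile, builds it with podman (via sudo) and leaves the resulting
# .deb files under ../packages/.
cd debian
./build-vitastor-bullseye.sh
ls ../packages/vitastor-bullseye/
```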

debian/changelog (6 lines changed)

@@ -1,3 +1,9 @@
+vitastor (0.5.5-1) unstable; urgency=medium
+
+  * Bugfixes
+
+ -- Vitaliy Filippov <vitalif@yourcmc.ru>  Tue, 02 Feb 2021 23:01:24 +0300
+
 vitastor (0.5.1-1) unstable; urgency=medium
 
   * Add jerasure support

debian/control (2 lines changed)

@@ -9,7 +9,7 @@ Rules-Requires-Root: no
 Package: vitastor
 Architecture: amd64
-Depends: ${shlibs:Depends}, ${misc:Depends}, fio (= ${dep:fio}), qemu (= ${dep:qemu}), nodejs (>= 10), node-sprintf-js, node-ws (>= 7), libjerasure2
+Depends: ${shlibs:Depends}, ${misc:Depends}, fio (= ${dep:fio}), qemu (= ${dep:qemu}), nodejs (>= 10), node-sprintf-js, node-ws (>= 7), libjerasure2, lp-solve
 Description: Vitastor, a fast software-defined clustered block storage
  Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
  architecturally similar to Ceph which means strong consistency, primary-replication,

debian/copyright (13 lines changed)

@@ -5,16 +5,17 @@ Source: https://vitastor.io
 Files: *
 Copyright: 2019+ Vitaliy Filippov <vitalif@yourcmc.ru>
-License: Multiple licenses VNPL-1.0 and/or GPL-2.0+
+License: Multiple licenses VNPL-1.1 and/or GPL-2.0+
  All server-side code (OSD, Monitor and so on) is licensed under the terms of
- Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on
+ Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
  GNU GPLv3.0 with the additional "Network Interaction" clause which requires
  opensourcing all programs directly or indirectly interacting with Vitastor
- through a computer network ("Proxy Programs"). Proxy Programs may be made public
- not only under the terms of the same license, but also under the terms of any
- GPL-Compatible Free Software License, as listed by the Free Software Foundation.
+ through a computer network and expressly designed to be used in conjunction
+ with it ("Proxy Programs"). Proxy Programs may be made public not only under
+ the terms of the same license, but also under the terms of any GPL-Compatible
+ Free Software License, as listed by the Free Software Foundation.
  This is a stricter copyleft license than the Affero GPL.
  .
  Client libraries (cluster_client and so on) are dual-licensed under the same
- VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
+ VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
  software like QEMU and fio.

debian/install (2 lines changed)

@@ -1,3 +1,3 @@
-VNPL-1.0.txt usr/share/doc/vitastor
+VNPL-1.1.txt usr/share/doc/vitastor
 GPL-2.0.txt usr/share/doc/vitastor
 mon usr/lib/vitastor

@@ -1,13 +1,8 @@
 # Build patched QEMU for Debian Buster or Bullseye/Sid inside a container
-# cd ..; podman build --build-arg REL=bullseye -v `pwd`/build:/root/build -f debian/patched-qemu.Dockerfile .
+# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/patched-qemu.Dockerfile .
-ARG REL=bullseye
 FROM debian:$REL
-# again, it doesn't work otherwise
-ARG REL=bullseye
 WORKDIR /root
 RUN if [ "$REL" = "buster" ]; then \
@@ -30,20 +25,20 @@ RUN apt-get --download-only source fio
 ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/
 RUN set -e; \
-    mkdir -p /root/build/qemu-$REL; \
-    rm -rf /root/build/qemu-$REL/*; \
-    cd /root/build/qemu-$REL; \
+    mkdir -p /root/packages/qemu-$REL; \
+    rm -rf /root/packages/qemu-$REL/*; \
+    cd /root/packages/qemu-$REL; \
     dpkg-source -x /root/qemu*.dsc; \
-    if [ -d /root/build/qemu-$REL/qemu-5.0 ]; then \
-        cp /root/vitastor/qemu-5.0-vitastor.patch /root/build/qemu-$REL/qemu-5.0/debian/patches; \
-        echo qemu-5.0-vitastor.patch >> /root/build/qemu-$REL/qemu-5.0/debian/patches/series; \
+    if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
+        cp /root/vitastor/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
+        echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
     else \
-        cp /root/vitastor/qemu-5.1-vitastor.patch /root/build/qemu-$REL/qemu-*/debian/patches; \
-        P=`ls -d /root/build/qemu-$REL/qemu-*/debian/patches`; \
+        cp /root/vitastor/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
+        P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
         echo qemu-5.1-vitastor.patch >> $P/series; \
     fi; \
-    cd /root/build/qemu-$REL/qemu-*/; \
+    cd /root/packages/qemu-$REL/qemu-*/; \
     V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor1; \
     DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Plug Vitastor block driver'; \
     DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
-    rm -rf /root/build/qemu-$REL/qemu-*/
+    rm -rf /root/packages/qemu-$REL/qemu-*/

@@ -1,13 +1,8 @@
 # Build Vitastor packages for Debian Buster or Bullseye/Sid inside a container
-# cd ..; podman build --build-arg REL=bullseye -v `pwd`/build:/root/build -f debian/vitastor.Dockerfile .
+# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
-ARG REL=bullseye
 FROM debian:$REL
-# again, it doesn't work otherwise
-ARG REL=bullseye
 WORKDIR /root
 RUN if [ "$REL" = "buster" ]; then \
@@ -27,7 +22,7 @@ RUN apt-get -y build-dep qemu
 RUN apt-get -y build-dep fio
 RUN apt-get --download-only source qemu
 RUN apt-get --download-only source fio
-RUN apt-get -y install libjerasure-dev
+RUN apt-get -y install libjerasure-dev cmake
 ADD . /root/vitastor
 RUN set -e -x; \
@@ -35,20 +30,20 @@ RUN set -e -x; \
     cd /root/fio-build/; \
     rm -rf /root/fio-build/*; \
    dpkg-source -x /root/fio*.dsc; \
-    cd /root/build/qemu-$REL/; \
+    cd /root/packages/qemu-$REL/; \
     rm -rf qemu*/; \
     dpkg-source -x qemu*.dsc; \
-    cd /root/build/qemu-$REL/qemu*/; \
+    cd /root/packages/qemu-$REL/qemu*/; \
     debian/rules b/configure-stamp; \
     cd b/qemu; \
-    make -j8 qapi; \
-    mkdir -p /root/build/vitastor-$REL; \
-    rm -rf /root/build/vitastor-$REL/*; \
-    cd /root/build/vitastor-$REL; \
-    cp -r /root/vitastor vitastor-0.5.1; \
-    ln -s /root/build/qemu-$REL/qemu-*/ vitastor-0.5.1/qemu; \
-    ln -s /root/fio-build/fio-*/ vitastor-0.5.1/fio; \
-    cd vitastor-0.5.1; \
+    make -j8 qapi/qapi-builtin-types.h; \
+    mkdir -p /root/packages/vitastor-$REL; \
+    rm -rf /root/packages/vitastor-$REL/*; \
+    cd /root/packages/vitastor-$REL; \
+    cp -r /root/vitastor vitastor-0.5.5; \
+    ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.5/qemu; \
+    ln -s /root/fio-build/fio-*/ vitastor-0.5.5/fio; \
+    cd vitastor-0.5.5; \
     FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
     QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
     sh copy-qemu-includes.sh; \
@@ -60,13 +55,13 @@ RUN set -e -x; \
     diff -NaurpbB a b > debian/patches/qemu-fio-headers.patch || true; \
     echo qemu-fio-headers.patch >> debian/patches/series; \
     rm -rf a b; \
-    rm -rf /root/build/qemu-$REL/qemu*/; \
+    rm -rf /root/packages/qemu-$REL/qemu*/; \
     echo "dep:fio=$FIO" > debian/substvars; \
     echo "dep:qemu=$QEMU" >> debian/substvars; \
-    cd /root/build/vitastor-$REL; \
-    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.1.orig.tar.xz vitastor-0.5.1; \
-    cd vitastor-0.5.1; \
+    cd /root/packages/vitastor-$REL; \
+    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.5.orig.tar.xz vitastor-0.5.5; \
+    cd vitastor-0.5.5; \
     V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
     DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
     DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
-    rm -rf /root/build/vitastor-$REL/vitastor-*/
+    rm -rf /root/packages/vitastor-$REL/vitastor-*/

@@ -1,5 +1,5 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.0 (see README.md for details)
+// License: VNPL-1.1 (see README.md for details)
 
 #include <iostream>
 #include <functional>

@@ -1,5 +1,5 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.0 (see README.md for details)
+// License: VNPL-1.1 (see README.md for details)
 
 module.exports = {
     scale_pg_count,
@@ -8,7 +8,7 @@ module.exports = {
 function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
 {
     const old_pg_count = prev_pgs.length;
-    // Add all possibly intersecting PGs into the history of new PGs
+    // Add all possibly intersecting PGs to the history of new PGs
     if (!(new_pg_count % old_pg_count))
     {
         // New PG count is a multiple of the old PG count
@@ -16,7 +16,7 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
         for (let i = 0; i < new_pg_count; i++)
         {
             const old_i = Math.floor(new_pg_count / mul);
-            new_pg_history[i] = JSON.parse(JSON.stringify(prev_pg_history[1+old_i]));
+            new_pg_history[i] = prev_pg_history[old_i] ? JSON.parse(JSON.stringify(prev_pg_history[old_i])) : undefined;
         }
     }
     else if (!(old_pg_count % new_pg_count))

@@ -1,31 +1,16 @@
// Functions to calculate Annualized Failure Rate of your cluster // Functions to calculate Annualized Failure Rate of your cluster
// if you know AFR of your drives, number of drives, expected rebalance time // if you know AFR of your drives, number of drives, expected rebalance time
// and replication factor // and replication factor
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md for details) or AGPL-3.0
// Author: Vitaliy Filippov, 2020+
const { sprintf } = require('sprintf-js');
module.exports = { module.exports = {
cluster_afr_fullmesh, cluster_afr_fullmesh,
failure_rate_fullmesh, failure_rate_fullmesh,
cluster_afr, cluster_afr,
print_cluster_afr,
c_n_k, c_n_k,
}; };
print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
/******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/ /******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/
// Estimate AFR of the cluster // Estimate AFR of the cluster
@@ -56,93 +41,38 @@ function failure_rate_fullmesh(n, a, f)
/******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/ /******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/
// <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s, // <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s,
// <k> replicas, <pgs> unique peer PGs per OSD // <k> replicas, <pgs> unique peer PGs per OSD (~50 for 100 PG-per-OSD in a big cluster)
// //
// For each of n*m drives: P(drive fails in a year) * P(any of its peers fail in <l*365> next days). // For each of n*m drives: P(drive fails in a year) * P(any of its peers fail in <l*365> next days).
// More peers per OSD increase rebalance speed (more drives work together to resilver) if you // More peers per OSD increase rebalance speed (more drives work together to resilver) if you
// let them finish rebalance BEFORE replacing the failed drive. // let them finish rebalance BEFORE replacing the failed drive (degraded_replacement=false).
Old (the four per-case estimators and their wrapper, removed):

// At the same time, more peers per OSD increase probability of any of them to fail!
//
// Probability of all except one drives in a replica group to fail is (AFR^(k-1)).
// So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)). Interesting but reasonable consequence
// is that, with k=2, total failure rate doesn't depend on number of peers per OSD,
// because it gets increased linearly by increased number of peers to fail
// and decreased linearly by reduced rebalance time.
function cluster_afr_pgs({ n_hosts, n_drives, afr_drive, capacity, speed, replicas, pgs = 1, degraded_replacement })
{
    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
    const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
    return 1 - (1 - afr_drive * (1-(1-(afr_drive*l)**(replicas-1))**pgs)) ** (n_hosts*n_drives);
}

function cluster_afr_pgs_ec({ n_hosts, n_drives, afr_drive, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
{
    const ec_total = ec_data+ec_parity;
    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
    const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
    return 1 - (1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, afr_drive*l, ec_parity))**pgs)) ** (n_hosts*n_drives);
}

// Same as above, but also take server failures into account
function cluster_afr_pgs_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, replicas, pgs = 1, degraded_replacement })
{
    let otherhosts = Math.min(pgs, (n_hosts-1)/(replicas-1));
    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
    let pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(replicas-1));
    const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
    const lh = n_drives*capacity/pgs/speed/86400/365;
    const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
    const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
    return 1 - ((1 - afr_host * (1-(1-p1**(replicas-1))**pgh)) ** n_hosts) *
        ((1 - afr_drive * (1-(1-p2**(replicas-1))**pgs)) ** (n_hosts*n_drives));
}

function cluster_afr_pgs_ec_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
{
    const ec_total = ec_data+ec_parity;
    const otherhosts = Math.min(pgs, (n_hosts-1)/(ec_total-1));
    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
    const pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(ec_total-1));
    const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
    const lh = n_drives*capacity/pgs/speed/86400/365;
    const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
    const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
    return 1 - ((1 - afr_host * (1-(1-failure_rate_fullmesh(ec_total-1, p1, ec_parity))**pgh)) ** n_hosts) *
        ((1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, p2, ec_parity))**pgs)) ** (n_hosts*n_drives));
}

// Wrapper for 4 above functions
function cluster_afr(config)
{
    if (config.ec && config.afr_host)
    {
        return cluster_afr_pgs_ec_hosts(config);
    }
    else if (config.ec)
    {
        return cluster_afr_pgs_ec(config);
    }
    else if (config.afr_host)
    {
        return cluster_afr_pgs_hosts(config);
    }
    else
    {
        return cluster_afr_pgs(config);
    }
}

function print_cluster_afr(config)
{
    console.log(
        `${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
        `, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
        (config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
        (config.ec ? `, EC ${config.ec[0]}+${config.ec[1]}` : `, ${config.replicas} replicas`)+
        `, ${config.pgs||1} PG per OSD`+
        (config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
    );
    console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
}

New (a single estimator replaces them; print_cluster_afr moves to mon/afr_test.js, shown below):

// At the same time, more peers per OSD increase probability of any of them to fail!
// osd_rm=true means that failed OSDs' data is rebalanced over all other hosts,
// not over the same host as it's in Ceph by default (dead OSDs are marked 'out').
//
// Probability of all except one drives in a replica group to fail is (AFR^(k-1)).
// So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)). Interesting but reasonable consequence
// is that, with k=2, total failure rate doesn't depend on number of peers per OSD,
// because it gets increased linearly by increased number of peers to fail
// and decreased linearly by reduced rebalance time.
function cluster_afr({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec, ec_data, ec_parity, replicas, pgs = 1, osd_rm, degraded_replacement, down_out_interval = 600 })
{
    const pg_size = (ec ? ec_data+ec_parity : replicas);
    pgs = Math.min(pgs, (n_hosts-1)*n_drives/(pg_size-1));
    const host_pgs = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(pg_size-1));
    const resilver_disk = n_drives == 1 || osd_rm ? pgs : (n_drives-1);
    const disk_heal_time = (down_out_interval + capacity/(degraded_replacement ? 1 : resilver_disk)/speed)/86400/365;
    const host_heal_time = (down_out_interval + n_drives*capacity/pgs/speed)/86400/365;
    const disk_heal_fail = ((afr_drive+afr_host/n_drives)*disk_heal_time);
    const host_heal_fail = ((afr_drive+afr_host/n_drives)*host_heal_time);
    const disk_pg_fail = ec
        ? failure_rate_fullmesh(ec_data+ec_parity-1, disk_heal_fail, ec_parity)
        : disk_heal_fail**(replicas-1);
    const host_pg_fail = ec
        ? failure_rate_fullmesh(ec_data+ec_parity-1, host_heal_fail, ec_parity)
        : host_heal_fail**(replicas-1);
    return 1 - ((1 - afr_drive * (1-(1-disk_pg_fail)**pgs)) ** (n_hosts*n_drives))
        * ((1 - afr_host * (1-(1-host_pg_fail)**host_pgs)) ** n_hosts);
}

/******** UTILITY ********/
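To make the comment block above concrete, here is a small sketch that traces the intermediate quantities the new cluster_afr() computes for a replicated pool. All numbers are illustrative assumptions, not values taken from the repository; it merely restates the formulas from the new code.

const cfg = { n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05,
    capacity: 4000, speed: 0.1, replicas: 2, pgs: 1, down_out_interval: 600 };
// Assumed example: 4 hosts with 6 drives of 4 TB each (capacity is in GB),
// 100 MB/s backfill (speed is in GB/s), 3% drive AFR, 5% host AFR, 2 replicas.
// Without osd_rm, a dead OSD is healed by the n_drives-1 other drives of its host:
const resilver_disk = cfg.n_drives - 1;
// Heal time in years: (600 + 4000/5/0.1) seconds / 86400 / 365 ≈ 0.00027
const disk_heal_time = (cfg.down_out_interval + cfg.capacity/resilver_disk/cfg.speed)/86400/365;
// Probability that one specific peer OSD also fails inside that window: ≈ 1e-5
const disk_heal_fail = (cfg.afr_drive + cfg.afr_host/cfg.n_drives)*disk_heal_time;
// With 2 replicas a PG is lost if its single remaining copy fails during the heal:
console.log('per-PG loss probability during one heal:', disk_heal_fail**(cfg.replicas-1));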

mon/afr_test.js (new file, +28)

@@ -0,0 +1,28 @@
const { sprintf } = require('sprintf-js');
const { cluster_afr } = require('./afr.js');
print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
function print_cluster_afr(config)
{
console.log(
`${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
`, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
(config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
(config.ec ? `, EC ${config.ec_data}+${config.ec_parity}` : `, ${config.replicas} replicas`)+
`, ${config.pgs||1} PG per OSD`+
(config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
);
console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
}


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// Data distribution optimizer using linear programming (lp_solve) // Data distribution optimizer using linear programming (lp_solve)
@@ -58,7 +58,7 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
} }
const all_weights = Object.assign({}, ...Object.values(osd_tree)); const all_weights = Object.assign({}, ...Object.values(osd_tree));
const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0); const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations)); const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1));
const pg_per_osd = {}; const pg_per_osd = {};
for (const pg of all_pgs) for (const pg of all_pgs)
{ {
@@ -249,7 +249,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
} }
} }
// Get all combinations // Get all combinations
let all_pgs = random_combinations(osd_tree, pg_size, max_combinations); let all_pgs = random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1);
add_valid_previous(osd_tree, prev_weights, all_pgs); add_valid_previous(osd_tree, prev_weights, all_pgs);
all_pgs = Object.values(all_pgs); all_pgs = Object.values(all_pgs);
const pg_per_osd = {}; const pg_per_osd = {};
@@ -275,6 +275,11 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
lp += 'max: '+all_pg_names.map(pg_name => ( lp += 'max: '+all_pg_names.map(pg_name => (
prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}` prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}`
)).join(' + ')+';\n'; )).join(' + ')+';\n';
lp += all_pg_names
.map(pg_name => (prev_weights[pg_name] ? `add_${pg_name} - del_${pg_name}` : `${pg_name}`))
.join(' + ')+' = '+(pg_count
- Object.keys(prev_weights).reduce((a, old_pg_name) => (a + (all_pgs_hash[old_pg_name] ? prev_weights[old_pg_name] : 0)), 0)
)+';\n';
for (const osd in pg_per_osd) for (const osd in pg_per_osd)
{ {
if (osd !== NO_OSD) if (osd !== NO_OSD)
@@ -488,7 +493,8 @@ function extract_osds(osd_tree, levels, osd_level, osds = {})
return osds; return osds;
} }
Old:

function random_combinations(osd_tree, pg_size, count)

New:

// ordered = don't treat (x,y) and (y,x) as equal
function random_combinations(osd_tree, pg_size, count, ordered)
{ {
let seed = 0x5f020e43; let seed = 0x5f020e43;
let rng = () => let rng = () =>
@@ -516,25 +522,47 @@ function random_combinations(osd_tree, pg_size, count)
Old:

                pg.push(osds[cur_hosts[next_host]][next_osd]);
                cur_hosts.splice(next_host, 1);
            }
            while (pg.length < pg_size)
            {
                pg.push(NO_OSD);
            }
            r['pg_'+pg.join('_')] = pg;
        }
    }
    // Generate purely random combinations
    restart: while (count > 0)
    {
        let host_idx = [];
        for (let i = 0; i < pg_size && i < hosts.length; i++)
        {
            let start = i > 0 ? host_idx[i-1]+1 : 0;
            if (start >= hosts.length)
            {
                continue restart;
            }
            host_idx[i] = start + rng() % (hosts.length-start);
        }
        let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]);
        while (pg.length < pg_size)

New:

                pg.push(osds[cur_hosts[next_host]][next_osd]);
                cur_hosts.splice(next_host, 1);
            }
            const cyclic_pgs = [ pg ];
            if (ordered)
            {
                for (let i = 1; i < pg.size; i++)
                {
                    cyclic_pgs.push([ ...pg.slice(i), ...pg.slice(0, i) ]);
                }
            }
            for (const pg of cyclic_pgs)
            {
                while (pg.length < pg_size)
                {
                    pg.push(NO_OSD);
                }
                r['pg_'+pg.join('_')] = pg;
            }
        }
    }
    // Generate purely random combinations
    while (count > 0)
    {
        let host_idx = [];
        const cur_hosts = [ ...hosts.map((h, i) => i) ];
        const max_hosts = pg_size < hosts.length ? pg_size : hosts.length;
        if (ordered)
        {
            for (let i = 0; i < max_hosts; i++)
            {
                const r = rng() % cur_hosts.length;
                host_idx[i] = cur_hosts[r];
                cur_hosts.splice(r, 1);
            }
        }
        else
        {
            for (let i = 0; i < max_hosts; i++)
            {
                const r = rng() % (cur_hosts.length - (max_hosts - i - 1));
                host_idx[i] = cur_hosts[r];
                cur_hosts.splice(0, r+1);
            }
        }
        let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]);
        while (pg.length < pg_size)
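A minimal illustration of the new `ordered` mode (not code from the repository): when combinations are ordered, a PG and its cyclic rotations count as different placements, so chunk roles that are tied to a position get spread across the selected OSDs.

// [ 1, 2, 3 ] and its rotations are all emitted as separate PG keys:
const pg = [ 1, 2, 3 ];
const rotations = pg.map((_, i) => [ ...pg.slice(i), ...pg.slice(0, i) ]);
console.log(rotations.map(p => 'pg_'+p.join('_')));
// -> [ 'pg_1_2_3', 'pg_2_3_1', 'pg_3_1_2' ]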

mon/make-osd.sh (new executable file, +76)

@@ -0,0 +1,76 @@
#!/bin/bash
# Very simple systemd unit generator for vitastor-osd services
# Not the final solution yet, mostly for tests
# Copyright (c) Vitaliy Filippov, 2019+
# License: MIT
# USAGE: ./make-osd.sh /dev/disk/by-partuuid/xxx [ /dev/disk/by-partuuid/yyy]...
IP_SUBSTR="10.200.1."
ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
set -e -x
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
[ "$IP" != "" ] || exit 1
ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
D=`dirname $0`
# Create OSDs on all passed devices
OSD_NUM=1
for DEV in $*; do
# Ugly :) -> node.js rework pending
while true; do
ST=$(etcdctl --endpoints="$ETCD_MON" get --print-value-only /vitastor/osd/stats/$OSD_NUM)
if [ "$ST" = "" ]; then
break
fi
OSD_NUM=$((OSD_NUM+1))
done
etcdctl --endpoints="$ETCD_MON" put /vitastor/osd/stats/$OSD_NUM '{}'
echo Creating OSD $OSD_NUM on $DEV
OPT=`node $D/simple-offsets.js --device $DEV --format options | tr '\n' ' '`
META=`echo $OPT | grep -Po '(?<=data_offset )\d+'`
dd if=/dev/zero of=$DEV bs=1048576 count=$(((META+1048575)/1048576)) oflag=direct
cat >/etc/systemd/system/vitastor-osd$OSD_NUM.service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$OSD_NUM
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/usr/bin/vitastor-osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $OSD_NUM \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 256 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
$OPT
WorkingDirectory=/
ExecStartPre=+chown vitastor:vitastor $DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$OSD_NUM
done

mon/make-units.sh (Normal file → Executable file, 138 lines)

@@ -1,19 +1,25 @@
#!/bin/bash #!/bin/bash
# Example startup script generator # Very simple systemd unit generator for etcd & vitastor-mon services
# Of course this isn't a production solution yet, this is just for tests # Not the final solution yet, mostly for tests
# Copyright (c) Vitaliy Filippov, 2019+ # Copyright (c) Vitaliy Filippov, 2019+
# License: MIT # License: MIT
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.broadcast == "10.115.0.255") | .local'` # USAGE: ./make-units.sh
IP_SUBSTR="10.200.1."
ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
# determine IP
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
[ "$IP" != "" ] || exit 1 [ "$IP" != "" ] || exit 1
ETCD_NUM=${ETCD_HOSTS/$IP*/}
[ "$ETCD_NUM" != "$ETCD_HOSTS" ] || exit 1
ETCD_NUM=$(echo $ETCD_NUM | tr -d -c , | wc -c)
BASE=${IP/*./} # etcd
BASE=$((BASE-10))
useradd etcd useradd etcd
mkdir -p /var/lib/etcd$BASE.etcd mkdir -p /var/lib/etcd$ETCD_NUM.etcd
cat >/etc/systemd/system/etcd.service <<EOF cat >/etc/systemd/system/etcd.service <<EOF
[Unit] [Unit]
Description=etcd for vitastor Description=etcd for vitastor
@@ -22,19 +28,18 @@ Wants=network-online.target local-fs.target time-sync.target
[Service] [Service]
Restart=always Restart=always
ExecStart=/usr/local/bin/etcd -name etcd$BASE --data-dir /var/lib/etcd$BASE.etcd \\ ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_NUM.etcd \\
--advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\ --advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
--initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\ --initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
--initial-cluster-token vitastor-etcd-1 --initial-cluster etcd0=http://10.115.0.10:2380,etcd1=http://10.115.0.11:2380,etcd2=http://10.115.0.12:2380,etcd3=http://10.115.0.13:2380 \\ --initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
--initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision --initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
WorkingDirectory=/var/lib/etcd$BASE.etcd WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
ExecStartPre=+chown -R etcd /var/lib/etcd$BASE.etcd ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
User=etcd User=etcd
PrivateTmp=false PrivateTmp=false
TasksMax=infinity TasksMax=infinity
Restart=always Restart=always
StartLimitInterval=0 StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10 RestartSec=10
[Install] [Install]
@@ -48,9 +53,7 @@ systemctl start etcd
useradd vitastor useradd vitastor
chmod 755 /root chmod 755 /root
BASE=${IP/*./} # Vitastor target
BASE=$(((BASE-10)*12))
cat >/etc/systemd/system/vitastor.target <<EOF cat >/etc/systemd/system/vitastor.target <<EOF
[Unit] [Unit]
Description=vitastor target Description=vitastor target
@@ -58,116 +61,25 @@ Description=vitastor target
WantedBy=multi-user.target WantedBy=multi-user.target
EOF EOF
New (the script now generates a vitastor-mon service):

# Monitor unit
ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
cat >/etc/systemd/system/vitastor-mon.service <<EOF
[Unit]
Description=Vitastor monitor
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
[Service]
Restart=always
ExecStart=node /usr/lib/vitastor/mon/mon-main.js --etcd_url '$ETCD_MON' --etcd_prefix '/vitastor' --etcd_start_timeout 5
WorkingDirectory=/
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF

Old (removed hard-coded per-drive OSD unit generation; the removed block continues below):

i=1
for DEV in `ls /dev/disk/by-id/ | grep ata-INTEL_SSDSC2KB`; do
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2))
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2)) seek=$((1920377991168/1048576))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$((BASE+i))
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--disable_device_lock 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 0 \\
--meta_offset 16777216 \\
--data_offset 427814912 \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$((BASE+i))
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 1920377991168 \\
--meta_offset $((1920377991168+16777216)) \\
--data_offset $((1920377991168+427814912)) \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
done
exit
node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5
podman run -d --network host --restart always -v /var/lib/etcd0.etcd:/etcd0.etcd --name etcd quay.io/coreos/etcd:v3.4.13 etcd -name etcd0 \
-advertise-client-urls http://10.115.0.10:2379 -listen-client-urls http://10.115.0.10:2379 \
-initial-advertise-peer-urls http://10.115.0.10:2380 -listen-peer-urls http://10.115.0.10:2380 \
-initial-cluster-token vitastor-etcd-1 -initial-cluster etcd0=http://10.115.0.10:2380,etcd1=http://10.115.0.11:2380,etcd2=http://10.115.0.12:2380,etcd3=http://10.115.0.13:2380 \
-initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/global '{"immediate_commit":"all"}'
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":48,"failure_domain":"host"}}'
#let pgs = {};
#for (let n = 0; n < 48; n++) { let i = n/2 | 0; pgs[1+n] = { osd_set: [ (1+i%12+(i/12 | 0)*24), (1+12+i%12+(i/12 | 0)*24) ], primary: (1+(n%2)*12+i%12+(i/12 | 0)*24) }; };
#console.log(JSON.stringify({ items: { 1: pgs } }));
#etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pgs ...
# --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
# --data_offset 427814912 \\
# --disk_alignment 4096 --journal_block_size 512 --meta_block_size 512 \\
# --data_offset 433434624 \\


@@ -1,7 +1,7 @@
#!/usr/bin/node #!/usr/bin/node
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const Mon = require('./mon.js'); const Mon = require('./mon.js');


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const http = require('http'); const http = require('http');
const crypto = require('crypto'); const crypto = require('crypto');
@@ -253,7 +253,10 @@ class Mon
const res = await this.etcd_call('/kv/txn', { success: [ const res = await this.etcd_call('/kv/txn', { success: [
{ requestRange: { key: b64(this.etcd_prefix+'/config/global') } } { requestRange: { key: b64(this.etcd_prefix+'/config/global') } }
] }, this.etcd_start_timeout, -1); ] }, this.etcd_start_timeout, -1);
this.parse_kv(res.responses[0].response_range.kvs[0]); if (res.responses[0].response_range.kvs)
{
this.parse_kv(res.responses[0].response_range.kvs[0]);
}
this.check_config(); this.check_config();
} }
@@ -808,6 +811,17 @@ class Mon
} }
PGUtil.scale_pg_count(prev_pgs, this.state.pg.history[pool_id]||{}, pg_history, pool_cfg.pg_count); PGUtil.scale_pg_count(prev_pgs, this.state.pg.history[pool_id]||{}, pg_history, pool_cfg.pg_count);
} }
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
while (pg.length > pool_cfg.pg_size)
{
pg.pop();
}
}
optimize_result = await LPOptimizer.optimize_change({ optimize_result = await LPOptimizer.optimize_change({
prev_pgs, prev_pgs,
osd_tree: pool_tree, osd_tree: pool_tree,
@@ -837,7 +851,7 @@ class Mon
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, prev_pgs, optimize_result.int_pgs, pg_history); this.save_new_pgs_txn(etcd_request, pool_id, up_osds, prev_pgs, optimize_result.int_pgs, pg_history);
} }
this.state.config.pgs.hash = tree_hash; this.state.config.pgs.hash = tree_hash;
await this.save_pg_config(); await this.save_pg_config(etcd_request);
} }
else else
{ {
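A tiny illustration of the padding loop added above (values are hypothetical): before re-optimization, each previous PG is resized to the pool's new pg_size, with 0 apparently standing for "no OSD".

const pool_cfg = { pg_size: 3 };                // assumed new PG size
const prev_pgs = [ [ 1, 2 ], [ 4, 5, 6, 7 ] ];  // old PGs of size 2 and 4
for (const pg of prev_pgs)
{
    while (pg.length < pool_cfg.pg_size)
        pg.push(0);
    while (pg.length > pool_cfg.pg_size)
        pg.pop();
}
console.log(prev_pgs); // [ [ 1, 2, 0 ], [ 4, 5, 6 ] ]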


@@ -4,6 +4,7 @@
// Simple tool to calculate journal and metadata offsets for a single device // Simple tool to calculate journal and metadata offsets for a single device
// Will be replaced by smarter tools in the future // Will be replaced by smarter tools in the future
const fs = require('fs').promises;
const child_process = require('child_process'); const child_process = require('child_process');
async function run() async function run()
@@ -15,6 +16,7 @@ async function run()
device_block_size: 4096, device_block_size: 4096,
journal_offset: 0, journal_offset: 0,
device_size: 0, device_size: 0,
format: 'text',
}; };
for (let i = 2; i < process.argv.length; i++) for (let i = 2; i < process.argv.length; i++)
{ {
@@ -24,7 +26,22 @@ async function run()
i++; i++;
} }
} }
Old:

    const device_size = Number(options.device_size || await system("blockdev --getsize64 "+options.device));

New:

    if (!options.device)
    {
        process.stderr.write('USAGE: nodejs '+process.argv[1]+' --device /dev/sdXXX\n');
        process.exit(1);
    }
    options.device_size = Number(options.device_size);
    let device_size = options.device_size;
    if (!device_size)
    {
        const st = await fs.stat(options.device);
        options.device_block_size = st.blksize;
        if (st.isBlockDevice())
            device_size = Number(await system("/sbin/blockdev --getsize64 "+options.device))
        else
            device_size = st.size;
    }
if (!device_size) if (!device_size)
{ {
process.stderr.write('Failed to get device size\n'); process.stderr.write('Failed to get device size\n');
@@ -32,25 +49,45 @@ async function run()
} }
options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size; options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size;
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size; const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + options.object_size/options.bitmap_granularity/8)); const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size); const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size; const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size; const data_offset = meta_offset + meta_size;
const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB" const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
: Math.round(meta_size/1024/1024*100)/100+" MB"); : Math.round(meta_size/1024/1024*100)/100+" MB");
Old:

    process.stdout.write(
        `Metadata size: ${meta_size_fmt}\n`+
        `Options for the OSD:\n`+
        ` --journal_offset ${options.journal_offset}\n`+
        ` --meta_offset ${meta_offset}\n`+
        ` --data_offset ${data_offset}\n`+
        (options.device_size ? ` --data_size ${device_size-data_offset}\n` : '')
    );

New:

    if (options.format == 'text' || options.format == 'options')
    {
        if (options.format == 'text')
        {
            process.stderr.write(
                `Metadata size: ${meta_size_fmt}\n`+
                `Options for the OSD:\n`
            );
        }
        process.stdout.write(
            ` --data_device ${options.device}\n`+
            ` --journal_offset ${options.journal_offset}\n`+
            ` --meta_offset ${meta_offset}\n`+
            ` --data_offset ${data_offset}\n`+
            (options.device_size ? ` --data_size ${device_size-data_offset}\n` : '')
        );
    }
    else if (options.format == 'env')
    {
        process.stdout.write(
            `journal_offset=${options.journal_offset}\n`+
            `meta_offset=${meta_offset}\n`+
            `data_offset=${data_offset}\n`+
            `data_size=${device_size-data_offset}\n`
        );
    }
    else
        process.stdout.write('Unknown format: '+options.format);
} }
function system(cmd) function system(cmd)
{ {
return new Promise((ok, no) => child_process.exec(cmd, { maxBuffer: 64*1024*1024 }, (err, stdout, stderr) => (err ? no(err) : ok(stdout)))); return new Promise((ok, no) => child_process.exec(cmd, { maxBuffer: 64*1024*1024 }, (err, stdout, stderr) => (err ? no(err.message) : ok(stdout))));
} }
run().catch(console.error); run().catch(err => { console.error(err); process.exit(1); });
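For reference, here is a stand-alone sketch of the same layout math as above. All input values are assumptions for illustration (a 1 TB device, 16 MB journal, 4 KB blocks, 128 KB objects, 4 KB bitmap granularity); the script's actual defaults may differ and can be overridden on the command line.

// Assumed inputs, not the script's real defaults:
const device_size = 1000 * 1000 * 1000 * 1000;
const block_size = 4096, journal_offset = 0, journal_size = 16777216;
const object_size = 131072, bitmap_granularity = 4096;
const meta_offset = journal_offset + Math.ceil(journal_size/block_size)*block_size;
// One metadata block holds floor(4096 / (24 + 2*object_size/bitmap_granularity/8)) = 128 entries:
const entries_per_block = Math.floor(block_size / (24 + 2*object_size/bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/object_size);
const meta_size = Math.ceil(object_count/entries_per_block)*block_size;
const data_offset = meta_offset + meta_size;
console.log({ meta_offset, data_offset, data_size: device_size-data_offset });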


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// Interesting real-world example coming from Ceph with EC and compression enabled. // Interesting real-world example coming from Ceph with EC and compression enabled.
// EC parity chunks can't be compressed as efficiently as data chunks, // EC parity chunks can't be compressed as efficiently as data chunks,


@@ -0,0 +1,25 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js');
async function run()
{
const osd_tree = { a: { 1: 1 }, b: { 2: 1 }, c: { 3: 1 } };
let res;
console.log('16 PGs, size=3');
res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16 });
LPOptimizer.print_change_stats(res, false);
console.log('\nReduce PG size to 2');
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs.map(pg => pg.slice(0, 2)), osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemove OSD 3');
delete osd_tree['c'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
}
run().catch(console.error);


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js'); const LPOptimizer = require('./lp-optimizer.js');


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js'); const LPOptimizer = require('./lp-optimizer.js');


@@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'` QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.5.1/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.1$(rpm --eval '%dist').tar.gz * tar --transform 's#^#vitastor-0.5.5/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.5$(rpm --eval '%dist').tar.gz *


@@ -1,5 +1,5 @@
# Build packages for CentOS 8 inside a container # Build packages for CentOS 8 inside a container
# cd ..; podman build -t qemu-el8 -v `pwd`/build:/root/build -f rpm/qemu-el8.Dockerfile . # cd ..; podman build -t qemu-el8 -v `pwd`/packages:/root/packages -f rpm/qemu-el8.Dockerfile .
FROM centos:8 FROM centos:8
@@ -14,8 +14,8 @@ RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-k
ADD qemu-*-vitastor.patch /root/vitastor/ ADD qemu-*-vitastor.patch /root/vitastor/
RUN set -e; \ RUN set -e; \
mkdir -p /root/build/qemu-el8; \ mkdir -p /root/packages/qemu-el8; \
rm -rf /root/build/qemu-el8/*; \ rm -rf /root/packages/qemu-el8/*; \
rpm --nomd5 -i /root/qemu*.src.rpm; \ rpm --nomd5 -i /root/qemu*.src.rpm; \
cd ~/rpmbuild/SPECS; \ cd ~/rpmbuild/SPECS; \
PN=$(grep ^Patch qemu-kvm.spec | tail -n1 | perl -pe 's/Patch(\d+).*/$1/'); \ PN=$(grep ^Patch qemu-kvm.spec | tail -n1 | perl -pe 's/Patch(\d+).*/$1/'); \
@@ -27,5 +27,5 @@ RUN set -e; \
perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \ perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \ cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
rpmbuild --nocheck -ba qemu-kvm.spec; \ rpmbuild --nocheck -ba qemu-kvm.spec; \
cp ~/rpmbuild/RPMS/*/*qemu* /root/build/qemu-el8/; \ cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
cp ~/rpmbuild/SRPMS/*qemu* /root/build/qemu-el8/ cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/

rpm/qemu-kvm-el7.spec.patch (new file, +257)

@@ -0,0 +1,257 @@
--- qemu-kvm.spec.orig 2020-11-09 23:41:03.000000000 +0000
+++ qemu-kvm.spec 2020-12-06 10:44:24.207640963 +0000
@@ -2,7 +2,7 @@
%global SLOF_gittagcommit 899d9883
%global have_usbredir 1
-%global have_spice 1
+%global have_spice 0
%global have_opengl 1
%global have_fdt 0
%global have_gluster 1
@@ -56,7 +56,7 @@ Requires: %{name}-block-curl = %{epoch}:
Requires: %{name}-block-gluster = %{epoch}:%{version}-%{release} \
%endif \
Requires: %{name}-block-iscsi = %{epoch}:%{version}-%{release} \
-Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
+#Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
Requires: %{name}-block-ssh = %{epoch}:%{version}-%{release}
# Macro to properly setup RHEL/RHEV conflict handling
@@ -67,7 +67,7 @@ Obsoletes: %1-rhev
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 4.2.0
-Release: 29.vitastor%{?dist}.6
+Release: 30.vitastor%{?dist}.6
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY
@@ -99,8 +99,8 @@ Source30: kvm-s390x.conf
Source31: kvm-x86.conf
Source32: qemu-pr-helper.service
Source33: qemu-pr-helper.socket
-Source34: 81-kvm-rhel.rules
-Source35: udev-kvm-check.c
+#Source34: 81-kvm-rhel.rules
+#Source35: udev-kvm-check.c
Source36: README.tests
@@ -825,7 +825,9 @@ Patch331: kvm-Drop-bogus-IPv6-messages.p
Patch333: kvm-virtiofsd-Whitelist-fchmod.patch
# For bz#1883869 - virtiofsd core dump in KATA Container [rhel-8.2.1.z]
Patch334: kvm-virtiofsd-avoid-proc-self-fd-tempdir.patch
-Patch335: qemu-4.2-vitastor.patch
+Patch335: qemu-use-sphinx-1.2.patch
+Patch336: qemu-config-tcmalloc-warning.patch
+Patch337: qemu-4.2-vitastor.patch
BuildRequires: wget
BuildRequires: rpm-build
@@ -842,7 +844,8 @@ BuildRequires: pciutils-devel
BuildRequires: libiscsi-devel
BuildRequires: ncurses-devel
BuildRequires: libattr-devel
-BuildRequires: libusbx-devel >= 1.0.22
+BuildRequires: gperftools-devel
+BuildRequires: libusbx-devel >= 1.0.21
%if %{have_usbredir}
BuildRequires: usbredir-devel >= 0.7.1
%endif
@@ -856,12 +859,12 @@ BuildRequires: virglrenderer-devel
# For smartcard NSS support
BuildRequires: nss-devel
%endif
-BuildRequires: libseccomp-devel >= 2.4.0
+#Requires: libseccomp >= 2.4.0
# For network block driver
BuildRequires: libcurl-devel
BuildRequires: libssh-devel
-BuildRequires: librados-devel
-BuildRequires: librbd-devel
+#BuildRequires: librados-devel
+#BuildRequires: librbd-devel
%if %{have_gluster}
# For gluster block driver
BuildRequires: glusterfs-api-devel
@@ -955,25 +958,25 @@ hardware for a full system such as a PC
%package -n qemu-kvm-core
Summary: qemu-kvm core components
+Requires: gperftools-libs
Requires: qemu-img = %{epoch}:%{version}-%{release}
%ifarch %{ix86} x86_64
Requires: seabios-bin >= 1.10.2-1
Requires: sgabios-bin
-Requires: edk2-ovmf
%endif
%ifarch aarch64
Requires: edk2-aarch64
%endif
%ifnarch aarch64 s390x
-Requires: seavgabios-bin >= 1.12.0-3
-Requires: ipxe-roms-qemu >= 20170123-1
+Requires: seavgabios-bin >= 1.11.0-1
+Requires: ipxe-roms-qemu >= 20181214-1
+Requires: /usr/share/ipxe.efi
%endif
%ifarch %{power64}
Requires: SLOF >= %{SLOF_gittagdate}-1.git%{SLOF_gittagcommit}
%endif
Requires: %{name}-common = %{epoch}:%{version}-%{release}
-Requires: libseccomp >= 2.4.0
# For compressed guest memory dumps
Requires: lzo snappy
%if %{have_kvm_setup}
@@ -1085,15 +1088,15 @@ This package provides the additional iSC
Install this package if you want to access iSCSI volumes.
-%package block-rbd
-Summary: QEMU Ceph/RBD block driver
-Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
-
-%description block-rbd
-This package provides the additional Ceph/RBD block driver for QEMU.
-
-Install this package if you want to access remote Ceph volumes
-using the rbd protocol.
+#%package block-rbd
+#Summary: QEMU Ceph/RBD block driver
+#Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
+#
+#%description block-rbd
+#This package provides the additional Ceph/RBD block driver for QEMU.
+#
+#Install this package if you want to access remote Ceph volumes
+#using the rbd protocol.
%package block-ssh
@@ -1117,12 +1120,14 @@ the Secure Shell (SSH) protocol.
# --build-id option is used for giving info to the debug packages.
buildldflags="VL_LDFLAGS=-Wl,--build-id"
-%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
+#%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
+%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,blkdebug,luks,null-co,nvme,copy-on-read,throttle
%if 0%{have_gluster}
%global block_drivers_list %{block_drivers_list},gluster
%endif
+[ -e /usr/bin/sphinx-build ] || ln -s sphinx-build-3 /usr/bin/sphinx-build
./configure \
--prefix="%{_prefix}" \
--libdir="%{_libdir}" \
@@ -1152,15 +1157,15 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
%else
--disable-numa \
%endif
- --enable-rbd \
+ --disable-rbd \
%if 0%{have_librdma}
--enable-rdma \
%else
--disable-rdma \
%endif
--disable-pvrdma \
- --enable-seccomp \
-%if 0%{have_spice}
+ --disable-seccomp \
+%if %{have_spice}
--enable-spice \
--enable-smartcard \
--enable-virglrenderer \
@@ -1179,7 +1184,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
%else
--disable-usb-redir \
%endif
- --disable-tcmalloc \
+ --enable-tcmalloc \
%ifarch x86_64
--enable-libpmem \
%else
@@ -1193,9 +1198,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
%endif
--python=%{__python3} \
--target-list="%{buildarch}" \
- --block-drv-rw-whitelist=%{block_drivers_list} \
--audio-drv-list= \
- --block-drv-ro-whitelist=vmdk,vhdx,vpc,https,ssh \
--with-coroutine=ucontext \
--tls-priority=NORMAL \
--disable-bluez \
@@ -1262,7 +1265,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
--disable-sanitizers \
--disable-hvf \
--disable-whpx \
- --enable-malloc-trim \
+ --disable-malloc-trim \
--disable-membarrier \
--disable-vhost-crypto \
--disable-libxml2 \
@@ -1308,7 +1311,7 @@ make V=1 %{?_smp_mflags} $buildldflags
cp -a %{kvm_target}-softmmu/qemu-system-%{kvm_target} qemu-kvm
gcc %{SOURCE6} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o ksmctl
-gcc %{SOURCE35} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o udev-kvm-check
+#gcc %{SOURCE35} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o udev-kvm-check
%install
%define _udevdir %(pkg-config --variable=udevdir udev)
@@ -1343,8 +1346,8 @@ mkdir -p $RPM_BUILD_ROOT%{testsdir}/test
mkdir -p $RPM_BUILD_ROOT%{testsdir}/tests/qemu-iotests
mkdir -p $RPM_BUILD_ROOT%{testsdir}/scripts/qmp
-install -p -m 0755 udev-kvm-check $RPM_BUILD_ROOT%{_udevdir}
-install -p -m 0644 %{SOURCE34} $RPM_BUILD_ROOT%{_udevrulesdir}
+#install -p -m 0755 udev-kvm-check $RPM_BUILD_ROOT%{_udevdir}
+#install -p -m 0644 %{SOURCE34} $RPM_BUILD_ROOT%{_udevrulesdir}
install -m 0644 scripts/dump-guest-memory.py \
$RPM_BUILD_ROOT%{_datadir}/%{name}
@@ -1562,6 +1565,8 @@ rm -rf $RPM_BUILD_ROOT%{qemudocdir}/inte
# Remove spec
rm -rf $RPM_BUILD_ROOT%{qemudocdir}/specs
+%global __os_install_post %(echo '%{__os_install_post}' | sed -e 's!/usr/lib[^[:space:]]*/brp-python-bytecompile[[:space:]].*$!!g')
+
%check
export DIFF=diff; make check V=1
@@ -1645,8 +1650,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%config(noreplace) %{_sysconfdir}/sysconfig/ksm
%{_unitdir}/ksmtuned.service
%{_sbindir}/ksmtuned
-%{_udevdir}/udev-kvm-check
-%{_udevrulesdir}/81-kvm-rhel.rules
+#%{_udevdir}/udev-kvm-check
+#%{_udevrulesdir}/81-kvm-rhel.rules
%ghost %{_sysconfdir}/kvm
%config(noreplace) %{_sysconfdir}/ksmtuned.conf
%dir %{_sysconfdir}/%{name}
@@ -1711,8 +1716,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%{_libexecdir}/vhost-user-gpu
%{_datadir}/%{name}/vhost-user/50-qemu-gpu.json
%endif
-%{_libexecdir}/virtiofsd
-%{_datadir}/%{name}/vhost-user/50-qemu-virtiofsd.json
+#%{_libexecdir}/virtiofsd
+#%{_datadir}/%{name}/vhost-user/50-qemu-virtiofsd.json
%files -n qemu-img
%defattr(-,root,root)
@@ -1748,8 +1753,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%files block-iscsi
%{_libdir}/qemu-kvm/block-iscsi.so
-%files block-rbd
-%{_libdir}/qemu-kvm/block-rbd.so
+#%files block-rbd
+#%{_libdir}/qemu-kvm/block-rbd.so
%files block-ssh
%{_libdir}/qemu-kvm/block-ssh.so


@@ -1,5 +1,5 @@
# Build packages for CentOS 7 inside a container # Build packages for CentOS 7 inside a container
# cd ..; podman build -t vitastor-el7 -v `pwd`/build:/root/build -f rpm/vitastor-el7.Dockerfile . # cd ..; podman build -t vitastor-el7 -v `pwd`/packages:/root/packages -f rpm/vitastor-el7.Dockerfile .
# localedef -i ru_RU -f UTF-8 ru_RU.UTF-8 # localedef -i ru_RU -f UTF-8 ru_RU.UTF-8
FROM centos:7 FROM centos:7
@@ -25,23 +25,23 @@ RUN set -e; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
. /opt/rh/devtoolset-9/enable; \ . /opt/rh/devtoolset-9/enable; \
rpmbuild -ba liburing.spec; \ rpmbuild -ba liburing.spec; \
mkdir -p /root/build/liburing-el7; \ mkdir -p /root/packages/liburing-el7; \
rm -rf /root/build/liburing-el7/*; \ rm -rf /root/packages/liburing-el7/*; \
cp ~/rpmbuild/RPMS/*/liburing* /root/build/liburing-el7/; \ cp ~/rpmbuild/RPMS/*/liburing* /root/packages/liburing-el7/; \
cp ~/rpmbuild/SRPMS/liburing* /root/build/liburing-el7/ cp ~/rpmbuild/SRPMS/liburing* /root/packages/liburing-el7/
RUN rpm -i `ls /root/build/liburing-el7/liburing-*.x86_64.rpm | grep -v debug` RUN rpm -i `ls /root/packages/liburing-el7/liburing-*.x86_64.rpm | grep -v debug`
ADD . /root/vitastor ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.5.1.el7.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.5.5.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \
mkdir -p /root/build/vitastor-el7; \ mkdir -p /root/packages/vitastor-el7; \
rm -rf /root/build/vitastor-el7/*; \ rm -rf /root/packages/vitastor-el7/*; \
cp ~/rpmbuild/RPMS/*/vitastor* /root/build/vitastor-el7/; \ cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el7/; \
cp ~/rpmbuild/SRPMS/vitastor* /root/build/vitastor-el7/ cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el7/


@@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.5.1 Version: 0.5.5
Release: 2%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.0 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.5.1.el7.tar.gz Source0: vitastor-0.5.5.el7.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel
@@ -14,12 +14,14 @@ BuildRequires: rh-nodejs12
BuildRequires: rh-nodejs12-npm BuildRequires: rh-nodejs12-npm
BuildRequires: jerasure-devel BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel BuildRequires: gf-complete-devel
BuildRequires: cmake
Requires: fio = 3.7-1.el7 Requires: fio = 3.7-1.el7
Requires: qemu-kvm = 2.0.0-1.el7.6 Requires: qemu-kvm = 2.0.0-1.el7.6
Requires: rh-nodejs12 Requires: rh-nodejs12
Requires: rh-nodejs12-npm Requires: rh-nodejs12-npm
Requires: liburing >= 0.6 Requires: liburing >= 0.6
Requires: libJerasure2 Requires: libJerasure2
Requires: lpsolve
%description %description
Vitastor is a small, simple and fast clustered block storage (storage for VM drives), Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
@@ -34,12 +36,13 @@ size with configurable redundancy (replication or erasure codes/XOR).
%build %build
. /opt/rh/devtoolset-9/enable . /opt/rh/devtoolset-9/enable
make %{?_smp_mflags} BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %cmake . -DQEMU_PLUGINDIR=qemu-kvm
%make_build
%install %install
rm -rf $RPM_BUILD_ROOT rm -rf $RPM_BUILD_ROOT
%make_install BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %make_install
. /opt/rh/rh-nodejs12/enable . /opt/rh/rh-nodejs12/enable
cd mon cd mon
npm install npm install
@@ -55,7 +58,11 @@ cp -r mon %buildroot/usr/lib/vitastor/mon
%_bindir/vitastor-osd %_bindir/vitastor-osd
%_bindir/vitastor-rm %_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so %_libdir/qemu-kvm/block-vitastor.so
%_libdir/vitastor %_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor /usr/lib/vitastor


@@ -1,5 +1,5 @@
# Build packages for CentOS 8 inside a container # Build packages for CentOS 8 inside a container
# cd ..; podman build -t vitastor-el8 -v `pwd`/build:/root/build -f rpm/vitastor-el8.Dockerfile . # cd ..; podman build -t vitastor-el8 -v `pwd`/packages:/root/packages -f rpm/vitastor-el8.Dockerfile .
FROM centos:8 FROM centos:8
@@ -13,8 +13,8 @@ RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='vitastor
RUN dnf download --source fio RUN dnf download --source fio
RUN rpm --nomd5 -i qemu*.src.rpm RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec fio.spec RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -23,23 +23,23 @@ RUN set -e; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
. /opt/rh/gcc-toolset-9/enable; \ . /opt/rh/gcc-toolset-9/enable; \
rpmbuild -ba liburing.spec; \ rpmbuild -ba liburing.spec; \
mkdir -p /root/build/liburing-el8; \ mkdir -p /root/packages/liburing-el8; \
rm -rf /root/build/liburing-el8/*; \ rm -rf /root/packages/liburing-el8/*; \
cp ~/rpmbuild/RPMS/*/liburing* /root/build/liburing-el8/; \ cp ~/rpmbuild/RPMS/*/liburing* /root/packages/liburing-el8/; \
cp ~/rpmbuild/SRPMS/liburing* /root/build/liburing-el8/ cp ~/rpmbuild/SRPMS/liburing* /root/packages/liburing-el8/
RUN rpm -i `ls /root/build/liburing-el7/liburing-*.x86_64.rpm | grep -v debug` RUN rpm -i `ls /root/packages/liburing-el7/liburing-*.x86_64.rpm | grep -v debug`
ADD . /root/vitastor ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.5.1.el8.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.5.5.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \
mkdir -p /root/build/vitastor-el8; \ mkdir -p /root/packages/vitastor-el8; \
rm -rf /root/build/vitastor-el8/*; \ rm -rf /root/packages/vitastor-el8/*; \
cp ~/rpmbuild/RPMS/*/vitastor* /root/build/vitastor-el8/; \ cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el8/; \
cp ~/rpmbuild/SRPMS/vitastor* /root/build/vitastor-el8/ cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el8/


@@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.5.1 Version: 0.5.5
Release: 2%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.0 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.5.1.el8.tar.gz Source0: vitastor-0.5.5.el8.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel
@@ -13,11 +13,13 @@ BuildRequires: gcc-toolset-9-gcc-c++
BuildRequires: nodejs >= 10 BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel BuildRequires: gf-complete-devel
BuildRequires: cmake
Requires: fio = 3.7-3.el8 Requires: fio = 3.7-3.el8
Requires: qemu-kvm = 4.2.0-29.el8.6 Requires: qemu-kvm = 4.2.0-29.el8.6
Requires: nodejs >= 10 Requires: nodejs >= 10
Requires: liburing >= 0.6 Requires: liburing >= 0.6
Requires: libJerasure2 Requires: libJerasure2
Requires: lpsolve
%description %description
Vitastor is a small, simple and fast clustered block storage (storage for VM drives), Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
@@ -32,12 +34,13 @@ size with configurable redundancy (replication or erasure codes/XOR).
%build %build
. /opt/rh/gcc-toolset-9/enable . /opt/rh/gcc-toolset-9/enable
make %{?_smp_mflags} BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %cmake . -DQEMU_PLUGINDIR=qemu-kvm
%make_build
%install %install
rm -rf $RPM_BUILD_ROOT rm -rf $RPM_BUILD_ROOT
%make_install BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %make_install
cd mon cd mon
npm install npm install
cd .. cd ..
@@ -52,7 +55,11 @@ cp -r mon %buildroot/usr/lib/vitastor
%_bindir/vitastor-osd %_bindir/vitastor-osd
%_bindir/vitastor-rm %_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so %_libdir/qemu-kvm/block-vitastor.so
%_libdir/vitastor %_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor /usr/lib/vitastor

src/CMakeLists.txt (new file, +188)

@@ -0,0 +1,188 @@
cmake_minimum_required(VERSION 2.8)
project(vitastor)
include(GNUInstallDirs)
set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)")
set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer")
if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
if(EXISTS "/etc/debian_version")
set(CMAKE_INSTALL_LIBDIR "lib/${CMAKE_LIBRARY_ARCHITECTURE}")
endif()
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6-dev")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
add_link_options(-fsanitize=address -fno-omit-frame-pointer)
endif (${WITH_ASAN})
set(CMAKE_BUILD_TYPE RelWithDebInfo)
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_MINSIZEREL "${CMAKE_CXX_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_MINSIZEREL "${CMAKE_CXX_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
find_package(PkgConfig)
pkg_check_modules(LIBURING REQUIRED liburing)
pkg_check_modules(GLIB REQUIRED glib-2.0)
include_directories(
../
/usr/include/jerasure
${LIBURING_INCLUDE_DIRS}
)
# libvitastor_blk.so
add_library(vitastor_blk SHARED
allocator.cpp blockstore.cpp blockstore_impl.cpp blockstore_init.cpp blockstore_open.cpp blockstore_journal.cpp blockstore_read.cpp
blockstore_write.cpp blockstore_sync.cpp blockstore_stable.cpp blockstore_rollback.cpp blockstore_flush.cpp crc32c.c ringloop.cpp
)
target_link_libraries(vitastor_blk
${LIBURING_LIBRARIES}
tcmalloc_minimal
)
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
)
# vitastor-osd
add_executable(vitastor-osd
osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_subops.cpp etcd_state_client.cpp messenger.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
)
target_link_libraries(vitastor-osd
vitastor_blk
Jerasure
)
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
# libvitastor_client.so
add_library(vitastor_client SHARED
cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
)
target_link_libraries(vitastor_client
tcmalloc_minimal
${LIBURING_LIBRARIES}
)
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
# vitastor-nbd
add_executable(vitastor-nbd
nbd_proxy.cpp
)
target_link_libraries(vitastor-nbd
vitastor_client
)
# vitastor-rm
add_executable(vitastor-rm
rm_inode.cpp
)
target_link_libraries(vitastor-rm
vitastor_client
)
# vitastor-dump-journal
add_executable(vitastor-dump-journal
dump_journal.cpp crc32c.c
)
# qemu_driver.so
add_library(qemu_proxy STATIC qemu_proxy.cpp)
target_compile_options(qemu_proxy PUBLIC -fPIC)
target_include_directories(qemu_proxy PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_proxy
vitastor_client
)
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_link_libraries(qemu_vitastor
qemu_proxy
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
### Test stubs
# stub_osd, stub_bench, osd_test
add_executable(stub_osd stub_osd.cpp rw_blocking.cpp)
target_link_libraries(stub_osd tcmalloc_minimal)
add_executable(stub_bench stub_bench.cpp rw_blocking.cpp)
target_link_libraries(stub_bench tcmalloc_minimal)
add_executable(osd_test osd_test.cpp rw_blocking.cpp)
target_link_libraries(osd_test tcmalloc_minimal)
# osd_rmw_test
add_executable(osd_rmw_test osd_rmw_test.cpp allocator.cpp)
target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)
# stub_uring_osd
add_executable(stub_uring_osd
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_link_libraries(stub_uring_osd
${LIBURING_LIBRARIES}
tcmalloc_minimal
)
# osd_peering_pg_test
add_executable(osd_peering_pg_test osd_peering_pg_test.cpp osd_peering_pg.cpp)
target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp)
## test_blockstore, test_shit
#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
#target_link_libraries(test_blockstore blockstore)
#add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
#target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)
### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec vitastor_blk vitastor_client LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <stdexcept> #include <stdexcept>
#include "allocator.h" #include "allocator.h"


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "base64.h" #include "base64.h"


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
#include <string> #include <string>


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once


@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -7,6 +7,8 @@ journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
{ {
this->bs = bs; this->bs = bs;
this->flusher_count = flusher_count; this->flusher_count = flusher_count;
this->cur_flusher_count = 1;
this->target_flusher_count = 1;
dequeuing = false; dequeuing = false;
trimming = false; trimming = false;
active_flushers = 0; active_flushers = 0;
@@ -68,10 +70,24 @@ bool journal_flusher_t::is_active()
void journal_flusher_t::loop() void journal_flusher_t::loop()
{ {
Old:

    for (int i = 0; (active_flushers > 0 || dequeuing) && i < flusher_count; i++)
    {
        co[i].loop();
    }

New:

    target_flusher_count = bs->write_iodepth*2;
    if (target_flusher_count <= 0)
        target_flusher_count = 1;
    else if (target_flusher_count > flusher_count)
        target_flusher_count = flusher_count;
    if (target_flusher_count > cur_flusher_count)
        cur_flusher_count = target_flusher_count;
    else if (target_flusher_count < cur_flusher_count)
    {
        while (target_flusher_count < cur_flusher_count)
        {
            if (co[cur_flusher_count-1].wait_state)
                break;
            cur_flusher_count--;
        }
    }
    for (int i = 0; (active_flushers > 0 || dequeuing) && i < cur_flusher_count; i++)
        co[i].loop();
} }
void journal_flusher_t::enqueue_flush(obj_ver_id ov) void journal_flusher_t::enqueue_flush(obj_ver_id ov)
@@ -223,6 +239,7 @@ bool journal_flusher_co::loop()
resume_0: resume_0:
if (!flusher->flush_queue.size() || !flusher->dequeuing) if (!flusher->flush_queue.size() || !flusher->dequeuing)
{ {
stop_flusher:
if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0) if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
{ {
// Attempt forced trim // Attempt forced trim
@@ -327,9 +344,7 @@ resume_0:
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("No older flushes, stopping\n"); printf("No older flushes, stopping\n");
#endif #endif
flusher->dequeuing = false; goto stop_flusher;
wait_state = 0;
return true;
} }
} }
} }
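
The flusher hunk above makes the number of active journal flushers follow the current write iodepth (growing immediately, shrinking only past idle coroutines) instead of always running the configured maximum. Below is a minimal standalone sketch of that scaling rule; FlusherPool and FakeFlusherCo are illustrative stand-ins, not the actual Vitastor types.

```cpp
// Sketch: scale the active flusher count with the write iodepth.
// Grows immediately, shrinks only past flushers that are idle (wait_state == 0).
#include <vector>
#include <cstdio>

struct FakeFlusherCo { int wait_state = 0; };   // stand-in for journal_flusher_co

struct FlusherPool
{
    int flusher_count;                 // configured maximum
    int cur_flusher_count = 1;
    int target_flusher_count = 1;
    std::vector<FakeFlusherCo> co;

    explicit FlusherPool(int max_count): flusher_count(max_count), co(max_count) {}

    void adjust(int write_iodepth)
    {
        target_flusher_count = write_iodepth * 2;
        if (target_flusher_count <= 0)
            target_flusher_count = 1;
        else if (target_flusher_count > flusher_count)
            target_flusher_count = flusher_count;
        if (target_flusher_count > cur_flusher_count)
            cur_flusher_count = target_flusher_count;
        else while (target_flusher_count < cur_flusher_count)
        {
            if (co[cur_flusher_count-1].wait_state)
                break;                 // don't drop a coroutine that is mid-operation
            cur_flusher_count--;
        }
    }
};

int main()
{
    FlusherPool pool(256);
    pool.adjust(4);   // iodepth 4 -> 8 active flushers
    pool.adjust(0);   // idle      -> shrink back towards 1
    printf("active flushers: %d\n", pool.cur_flusher_count);
}
```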

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
struct copy_buffer_t struct copy_buffer_t
{ {
@@ -80,7 +80,7 @@ class journal_flusher_t
{ {
int trim_wanted = 0; int trim_wanted = 0;
bool dequeuing; bool dequeuing;
int flusher_count; int flusher_count, cur_flusher_count, target_flusher_count;
int flusher_start_threshold; int flusher_start_threshold;
journal_flusher_co *co; journal_flusher_co *co;
blockstore_impl_t *bs; blockstore_impl_t *bs;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -287,7 +287,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
else if (PRIV(op)->wait_for == WAIT_JOURNAL_BUFFER) else if (PRIV(op)->wait_for == WAIT_JOURNAL_BUFFER)
{ {
int next = ((journal.cur_sector + 1) % journal.sector_count); int next = ((journal.cur_sector + 1) % journal.sector_count);
if (journal.sector_info[next].usage_count > 0 || if (journal.sector_info[next].flush_count > 0 ||
journal.sector_info[next].dirty) journal.sector_info[next].dirty)
{ {
// do not submit // do not submit

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
@@ -199,7 +199,10 @@ class blockstore_impl_t
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs // Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
int immediate_commit = IMMEDIATE_NONE; int immediate_commit = IMMEDIATE_NONE;
bool inmemory_meta = false; bool inmemory_meta = false;
int flusher_count; // Maximum flusher count
unsigned flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
/******* END OF OPTIONS *******/ /******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer; struct ring_consumer_t ring_consumer;
@@ -226,6 +229,7 @@ class blockstore_impl_t
struct journal_t journal; struct journal_t journal;
journal_flusher_t *flusher; journal_flusher_t *flusher;
int write_iodepth = 0;
bool live = false, queue_stall = false; bool live = false, queue_stall = false;
ring_loop_t *ringloop; ring_loop_t *ringloop;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once

View File

@@ -1,12 +1,12 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs) blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs)
{ {
this->bs = bs; this->bs = bs;
sectors_required = 0; sectors_to_write = 0;
next_pos = bs->journal.next_free; next_pos = bs->journal.next_free;
next_sector = bs->journal.cur_sector; next_sector = bs->journal.cur_sector;
first_sector = -1; first_sector = -1;
@@ -20,23 +20,26 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
int required = entries_required; int required = entries_required;
while (1) while (1)
{ {
int fits = bs->journal.no_same_sector_overwrites && bs->journal.sector_info[next_sector].written int fits = bs->journal.no_same_sector_overwrites && next_pos == bs->journal.next_free && bs->journal.sector_info[next_sector].written
? 0 ? 0
: (bs->journal.block_size - next_in_pos) / size; : (bs->journal.block_size - next_in_pos) / size;
if (fits > 0) if (fits > 0)
{ {
if (fits > required)
{
fits = required;
}
if (first_sector == -1) if (first_sector == -1)
{ {
first_sector = next_sector; first_sector = next_sector;
} }
required -= fits; required -= fits;
next_in_pos += fits * size; next_in_pos += fits * size;
sectors_required++; sectors_to_write++;
} }
else if (bs->journal.sector_info[next_sector].dirty) else if (bs->journal.sector_info[next_sector].dirty)
{ {
// sectors_required is more like "sectors to write" sectors_to_write++;
sectors_required++;
} }
if (required <= 0) if (required <= 0)
{ {
@@ -59,7 +62,7 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
" is too small for a batch of "+std::to_string(entries_required)+" entries of "+std::to_string(size)+" bytes" " is too small for a batch of "+std::to_string(entries_required)+" entries of "+std::to_string(size)+" bytes"
); );
} }
if (bs->journal.sector_info[next_sector].usage_count > 0 || if (bs->journal.sector_info[next_sector].flush_count > 0 ||
bs->journal.sector_info[next_sector].dirty) bs->journal.sector_info[next_sector].dirty)
{ {
// No memory buffer available. Wait for it. // No memory buffer available. Wait for it.
@@ -71,17 +74,18 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
dirty++; dirty++;
used++; used++;
} }
if (bs->journal.sector_info[i].usage_count > 0) if (bs->journal.sector_info[i].flush_count > 0)
{ {
used++; used++;
} }
} }
// In fact, it's even more rare than "ran out of journal space", so print a warning // In fact, it's even more rare than "ran out of journal space", so print a warning
printf( printf(
"Ran out of journal sector buffers: %d/%lu buffers used (%d dirty), next buffer (%ld) is %s and flushed %lu times\n", "Ran out of journal sector buffers: %d/%lu buffers used (%d dirty), next buffer (%ld)"
" is %s and flushed %lu times. Consider increasing \'journal_sector_buffer_count\'\n",
used, bs->journal.sector_count, dirty, next_sector, used, bs->journal.sector_count, dirty, next_sector,
bs->journal.sector_info[next_sector].dirty ? "dirty" : "not dirty", bs->journal.sector_info[next_sector].dirty ? "dirty" : "not dirty",
bs->journal.sector_info[next_sector].usage_count bs->journal.sector_info[next_sector].flush_count
); );
PRIV(op)->wait_for = WAIT_JOURNAL_BUFFER; PRIV(op)->wait_for = WAIT_JOURNAL_BUFFER;
return 0; return 0;
@@ -100,11 +104,8 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
{ {
// No space in the journal. Wait until used_start changes. // No space in the journal. Wait until used_start changes.
printf( printf(
"Ran out of journal space (free space: %lu bytes, sectors to write: %d)\n", "Ran out of journal space (used_start=%08lx, next_free=%08lx, dirty_start=%08lx)\n",
(bs->journal.next_free >= bs->journal.used_start bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start)
: bs->journal.used_start - bs->journal.next_free),
sectors_required
); );
PRIV(op)->wait_for = WAIT_JOURNAL; PRIV(op)->wait_for = WAIT_JOURNAL;
bs->flusher->request_trim(); bs->flusher->request_trim();
@@ -116,22 +117,21 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size) journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size)
{ {
if (journal.block_size - journal.in_sector_pos < size || if (!journal.entry_fits(size))
journal.no_same_sector_overwrites && journal.sector_info[journal.cur_sector].written)
{ {
assert(!journal.sector_info[journal.cur_sector].dirty); assert(!journal.sector_info[journal.cur_sector].dirty);
// Move to the next journal sector // Move to the next journal sector
journal.sector_info[journal.cur_sector].written = false; if (journal.sector_info[journal.cur_sector].flush_count > 0)
if (journal.sector_info[journal.cur_sector].usage_count > 0)
{ {
// Also select next sector buffer in memory // Also select next sector buffer in memory
journal.cur_sector = ((journal.cur_sector + 1) % journal.sector_count); journal.cur_sector = ((journal.cur_sector + 1) % journal.sector_count);
assert(!journal.sector_info[journal.cur_sector].usage_count); assert(!journal.sector_info[journal.cur_sector].flush_count);
} }
else else
{ {
journal.dirty_start = journal.next_free; journal.dirty_start = journal.next_free;
} }
journal.sector_info[journal.cur_sector].written = false;
journal.sector_info[journal.cur_sector].offset = journal.next_free; journal.sector_info[journal.cur_sector].offset = journal.next_free;
journal.in_sector_pos = 0; journal.in_sector_pos = 0;
journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size; journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
@@ -157,7 +157,7 @@ void prepare_journal_sector_write(journal_t & journal, int cur_sector, io_uring_
{ {
journal.sector_info[cur_sector].dirty = false; journal.sector_info[cur_sector].dirty = false;
journal.sector_info[cur_sector].written = true; journal.sector_info[cur_sector].written = true;
journal.sector_info[cur_sector].usage_count++; journal.sector_info[cur_sector].flush_count++;
ring_data_t *data = ((ring_data_t*)sqe->user_data); ring_data_t *data = ((ring_data_t*)sqe->user_data);
data->iov = (struct iovec){ data->iov = (struct iovec){
(journal.inmemory (journal.inmemory
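
The check_available() rewrite above counts journal sector writes ("sectors_to_write") for a batch of fixed-size entries: entries are packed into the current sector buffer, and a new sector is started only when the next entry no longer fits, which is also what the new entry_fits() helper expresses. A simplified standalone sketch of that accounting follows; it ignores ring wrap-around, dirty sectors and the no_same_sector_overwrites mode.

```cpp
// Sketch: count how many journal sector writes a batch of fixed-size entries needs,
// in the spirit of sectors_to_write / entry_fits() from the diff above.
#include <cstdint>
#include <cstdio>

struct MiniJournal
{
    uint32_t block_size;     // journal sector (block) size
    uint32_t in_sector_pos;  // bytes already used in the current sector

    bool entry_fits(uint32_t size) const
    {
        return block_size - in_sector_pos >= size;
    }

    // How many sectors have to be flushed to persist <count> entries of <size> bytes
    int sectors_to_write(int count, uint32_t size)
    {
        int sectors = 0;
        while (count > 0)
        {
            if (!entry_fits(size))
                in_sector_pos = 0;           // move to the next (empty) sector buffer
            int fits = (block_size - in_sector_pos) / size;
            if (fits > count)
                fits = count;
            count -= fits;
            in_sector_pos += fits * size;
            sectors++;                       // this sector is written once, after filling
        }
        return sectors;
    }
};

int main()
{
    MiniJournal j = { 4096, 4000 };
    // 20 entries of 48 bytes: 2 fit into the current sector, the rest go to the next one
    printf("sectors to write: %d\n", j.sectors_to_write(20, 48));  // prints 2
}
```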

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
@@ -133,7 +133,7 @@ inline uint32_t je_crc32(journal_entry *je)
struct journal_sector_info_t struct journal_sector_info_t
{ {
uint64_t offset; uint64_t offset;
uint64_t usage_count; uint64_t flush_count;
bool written; bool written;
bool dirty; bool dirty;
}; };
@@ -170,13 +170,18 @@ struct journal_t
~journal_t(); ~journal_t();
bool trim(); bool trim();
uint64_t get_trim_pos(); uint64_t get_trim_pos();
inline bool entry_fits(int size)
{
return !(block_size - in_sector_pos < size ||
no_same_sector_overwrites && sector_info[cur_sector].written);
}
}; };
struct blockstore_journal_check_t struct blockstore_journal_check_t
{ {
blockstore_impl_t *bs; blockstore_impl_t *bs;
uint64_t next_pos, next_sector, next_in_pos; uint64_t next_pos, next_sector, next_in_pos;
int sectors_required, first_sector; int sectors_to_write, first_sector;
bool right_dir; // writing to the end or the beginning of the ring buffer bool right_dir; // writing to the end or the beginning of the ring buffer
blockstore_journal_check_t(blockstore_impl_t *bs); blockstore_journal_check_t(blockstore_impl_t *bs);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <sys/file.h> #include <sys/file.h>
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -70,6 +70,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10); meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10); bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10); flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
// Validate // Validate
if (!block_size) if (!block_size)
{ {
@@ -83,6 +84,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{ {
flusher_count = 32; flusher_count = 32;
} }
if (!max_write_iodepth)
{
max_write_iodepth = 128;
}
if (!disk_alignment) if (!disk_alignment)
{ {
disk_alignment = 4096; disk_alignment = 4096;
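
The parse_config() hunk above reads the new max_write_iodepth option the same way as the other numeric options, falling back to 128 when it is absent or zero. A small sketch of that parse-with-default pattern; the map type and the helper name are assumptions for illustration.

```cpp
// Sketch: parse numeric options from a string map with fallbacks,
// mirroring how flusher_count / max_write_iodepth get their defaults.
#include <cstdint>
#include <cstdlib>
#include <map>
#include <string>
#include <cstdio>

static uint64_t opt_u64(const std::map<std::string, std::string> & cfg,
                        const std::string & key, uint64_t def)
{
    auto it = cfg.find(key);
    uint64_t v = it == cfg.end() ? 0 : strtoull(it->second.c_str(), NULL, 10);
    return v ? v : def;   // missing or 0 -> default
}

int main()
{
    std::map<std::string, std::string> cfg = { { "flusher_count", "256" } };
    printf("flusher_count=%lu max_write_iodepth=%lu\n",
        opt_u64(cfg, "flusher_count", 32),
        opt_u64(cfg, "max_write_iodepth", 128));
}
```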

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -75,44 +75,35 @@ skip_ov:
return 0; return 0;
} }
// There is sufficient space. Get SQEs // There is sufficient space. Get SQEs
struct io_uring_sqe *sqe[space_check.sectors_required]; struct io_uring_sqe *sqe[space_check.sectors_to_write];
for (i = 0; i < space_check.sectors_required; i++) for (i = 0; i < space_check.sectors_to_write; i++)
{ {
BS_SUBMIT_GET_SQE_DECL(sqe[i]); BS_SUBMIT_GET_SQE_DECL(sqe[i]);
} }
// Prepare and submit journal entries // Prepare and submit journal entries
auto cb = [this, op](ring_data_t *data) { handle_rollback_event(data, op); }; auto cb = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
int s = 0, cur_sector = -1; int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_rollback) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++) for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{ {
if (!journal.entry_fits(sizeof(journal_entry_rollback)) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
cur_sector = journal.cur_sector;
}
journal_entry_rollback *je = (journal_entry_rollback*) journal_entry_rollback *je = (journal_entry_rollback*)
prefill_single_journal_entry(journal, JE_ROLLBACK, sizeof(journal_entry_rollback)); prefill_single_journal_entry(journal, JE_ROLLBACK, sizeof(journal_entry_rollback));
journal.sector_info[journal.cur_sector].dirty = false;
je->oid = v->oid; je->oid = v->oid;
je->version = v->version; je->version = v->version;
je->crc32 = je_crc32((journal_entry*)je); je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32; journal.crc32_last = je->crc32;
if (cur_sector != journal.cur_sector)
{
// Write previous sector. We should write the sector only after filling it,
// because otherwise we'll write a lot more sectors in the "no_same_sector_overwrite" mode
if (cur_sector != -1)
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
else
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
}
} }
if (cur_sector != -1) prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb); assert(s == space_check.sectors_to_write);
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector; PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s; PRIV(op)->pending_ops = s;
PRIV(op)->op_state = 1; PRIV(op)->op_state = 1;
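
The rollback path above (and the analogous stable/sync paths below) now fills a journal sector completely and submits exactly one write for it, instead of potentially rewriting the same sector once per entry, which matters a lot in the no_same_sector_overwrites mode. A toy sketch of that fill-then-flush control flow, with the actual I/O replaced by a callback; all names here are illustrative.

```cpp
// Sketch of "fill a sector completely, then submit exactly one write for it".
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>
#include <cstdio>

struct SectorBatcher
{
    uint32_t block_size, pos = 0;
    std::vector<uint8_t> sector;
    std::function<void(const uint8_t*, uint32_t)> submit_write;  // e.g. an io_uring submission
    int writes_submitted = 0;

    SectorBatcher(uint32_t bs, std::function<void(const uint8_t*, uint32_t)> cb)
        : block_size(bs), sector(bs), submit_write(cb) {}

    void append(const void *entry, uint32_t size)
    {
        if (block_size - pos < size)
            flush();                       // previous sector is full -> write it once
        memcpy(sector.data() + pos, entry, size);
        pos += size;
    }

    void flush()
    {
        if (pos > 0)
        {
            submit_write(sector.data(), pos);
            writes_submitted++;
            pos = 0;
        }
    }
};

int main()
{
    SectorBatcher b(4096, [](const uint8_t*, uint32_t len){ printf("write %u bytes\n", len); });
    uint8_t entry[48] = {};
    for (int i = 0; i < 100; i++)
        b.append(entry, sizeof(entry));
    b.flush();   // final partially filled sector, like the last prepare_journal_sector_write()
    printf("sector writes: %d\n", b.writes_submitted);
}
```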

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -98,45 +98,36 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
return 0; return 0;
} }
// There is sufficient space. Get SQEs // There is sufficient space. Get SQEs
struct io_uring_sqe *sqe[space_check.sectors_required]; struct io_uring_sqe *sqe[space_check.sectors_to_write];
for (i = 0; i < space_check.sectors_required; i++) for (i = 0; i < space_check.sectors_to_write; i++)
{ {
BS_SUBMIT_GET_SQE_DECL(sqe[i]); BS_SUBMIT_GET_SQE_DECL(sqe[i]);
} }
// Prepare and submit journal entries // Prepare and submit journal entries
auto cb = [this, op](ring_data_t *data) { handle_stable_event(data, op); }; auto cb = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
int s = 0, cur_sector = -1; int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_stable) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++) for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{ {
// FIXME: Only stabilize versions that aren't stable yet // FIXME: Only stabilize versions that aren't stable yet
if (!journal.entry_fits(sizeof(journal_entry_stable)) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
cur_sector = journal.cur_sector;
}
journal_entry_stable *je = (journal_entry_stable*) journal_entry_stable *je = (journal_entry_stable*)
prefill_single_journal_entry(journal, JE_STABLE, sizeof(journal_entry_stable)); prefill_single_journal_entry(journal, JE_STABLE, sizeof(journal_entry_stable));
journal.sector_info[journal.cur_sector].dirty = false;
je->oid = v->oid; je->oid = v->oid;
je->version = v->version; je->version = v->version;
je->crc32 = je_crc32((journal_entry*)je); je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32; journal.crc32_last = je->crc32;
if (cur_sector != journal.cur_sector)
{
// Write previous sector. We should write the sector only after filling it,
// because otherwise we'll write a lot more sectors in the "no_same_sector_overwrite" mode
if (cur_sector != -1)
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
else
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
}
} }
if (cur_sector != -1) prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb); assert(s == space_check.sectors_to_write);
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector; PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s; PRIV(op)->pending_ops = s;
PRIV(op)->op_state = 1; PRIV(op)->op_state = 1;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -112,30 +112,29 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
return 0; return 0;
} }
// Get SQEs. Don't bother about merging, submit each journal sector as a separate request // Get SQEs. Don't bother about merging, submit each journal sector as a separate request
struct io_uring_sqe *sqe[space_check.sectors_required]; struct io_uring_sqe *sqe[space_check.sectors_to_write];
for (int i = 0; i < space_check.sectors_required; i++) for (int i = 0; i < space_check.sectors_to_write; i++)
{ {
BS_SUBMIT_GET_SQE_DECL(sqe[i]); BS_SUBMIT_GET_SQE_DECL(sqe[i]);
} }
// Prepare and submit journal entries // Prepare and submit journal entries
auto it = PRIV(op)->sync_big_writes.begin(); auto it = PRIV(op)->sync_big_writes.begin();
int s = 0, cur_sector = -1; int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_big_write) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
while (it != PRIV(op)->sync_big_writes.end()) while (it != PRIV(op)->sync_big_writes.end())
{ {
if (!journal.entry_fits(sizeof(journal_entry_big_write)) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
cur_sector = journal.cur_sector;
}
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry( journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE, journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) sizeof(journal_entry_big_write)
); );
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset; dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.sector_info[journal.cur_sector].dirty = false;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++; journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf( printf(
@@ -152,19 +151,11 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
je->crc32 = je_crc32((journal_entry*)je); je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32; journal.crc32_last = je->crc32;
it++; it++;
if (cur_sector != journal.cur_sector)
{
// Write previous sector. We should write the sector only after filling it,
// because otherwise we'll write a lot more sectors in the "no_same_sector_overwrite" mode
if (cur_sector != -1)
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
else
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
}
} }
if (cur_sector != -1) prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb); assert(s == space_check.sectors_to_write);
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector; PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s; PRIV(op)->pending_ops = s;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT; PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -167,6 +167,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
.version = op->version, .version = op->version,
}, e).first; }, e).first;
} }
if (write_iodepth >= max_write_iodepth)
{
return 0;
}
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE) if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{ {
blockstore_journal_check_t space_check(this); blockstore_journal_check_t space_check(this);
@@ -191,6 +195,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
FINISH_OP(op); FINISH_OP(op);
return 1; return 1;
} }
write_iodepth++;
BS_SUBMIT_GET_SQE(sqe, data); BS_SUBMIT_GET_SQE(sqe, data);
dirty_it->second.location = loc << block_order; dirty_it->second.location = loc << block_order;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED; dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
@@ -243,6 +248,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{ {
return 0; return 0;
} }
write_iodepth++;
// There is sufficient space. Get SQE(s) // There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL; struct io_uring_sqe *sqe1 = NULL;
if (immediate_commit != IMMEDIATE_NONE || if (immediate_commit != IMMEDIATE_NONE ||
@@ -377,7 +383,6 @@ resume_2:
sizeof(journal_entry_big_write) sizeof(journal_entry_big_write)
); );
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset; dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.sector_info[journal.cur_sector].dirty = false;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++; journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf( printf(
@@ -433,6 +438,7 @@ resume_4:
} }
// Acknowledge write // Acknowledge write
op->retval = op->len; op->retval = op->len;
write_iodepth--;
FINISH_OP(op); FINISH_OP(op);
return 1; return 1;
} }
@@ -469,8 +475,8 @@ void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
uint64_t s = PRIV(op)->min_flushed_journal_sector; uint64_t s = PRIV(op)->min_flushed_journal_sector;
while (1) while (1)
{ {
journal.sector_info[s-1].usage_count--; journal.sector_info[s-1].flush_count--;
if (s != (1+journal.cur_sector) && journal.sector_info[s-1].usage_count == 0) if (s != (1+journal.cur_sector) && journal.sector_info[s-1].flush_count == 0)
{ {
// We know for sure that we won't write into this sector anymore // We know for sure that we won't write into this sector anymore
uint64_t new_ds = journal.sector_info[s-1].offset + journal.block_size; uint64_t new_ds = journal.sector_info[s-1].offset + journal.block_size;
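
The write path above introduces a simple in-flight limit: dequeue_write() refuses new writes once write_iodepth reaches max_write_iodepth and decrements the counter when a write is acknowledged. A minimal sketch of such a gate; the names are illustrative.

```cpp
// Sketch: a submission gate like write_iodepth / max_write_iodepth in the diff above.
// Returning 0 ("try again later") without consuming the op is how the queue defers work.
#include <cstdio>

struct WriteGate
{
    unsigned max_write_iodepth = 128;
    unsigned write_iodepth = 0;

    // Called when trying to dequeue a write: 0 = postpone, 1 = accepted
    int try_submit()
    {
        if (write_iodepth >= max_write_iodepth)
            return 0;        // too many writes in flight, leave the op in the queue
        write_iodepth++;     // slot taken until the write is acknowledged
        return 1;
    }

    void on_complete()
    {
        write_iodepth--;     // free the slot
    }
};

int main()
{
    WriteGate g;
    g.max_write_iodepth = 2;
    printf("%d %d %d\n", g.try_submit(), g.try_submit(), g.try_submit());  // 1 1 0
    g.on_complete();
    printf("%d\n", g.try_submit());                                        // 1
}
```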

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <stdexcept> #include <stdexcept>
#include "cluster_client.h" #include "cluster_client.h"
@@ -473,7 +473,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
// Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC // Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)]; auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
uint64_t pg_block_size = bs_block_size * ( uint64_t pg_block_size = bs_block_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_minsize pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
); );
uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size; uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size; uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
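
The cluster client fix above computes the EC stripe width from the number of data chunks (pg_size - parity_chunks) rather than pg_minsize. A short worked sketch of the corrected math with made-up values:

```cpp
// Sketch of the corrected stripe-size math: for EC pools the primary operates on
// (data chunks) * block_size, where data chunks = pg_size - parity_chunks.
#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t bs_block_size = 128*1024;
    const bool replicated = false;
    const uint64_t pg_size = 5, parity_chunks = 2;       // e.g. an EC 3+2 pool

    uint64_t pg_block_size = bs_block_size * (replicated ? 1 : pg_size - parity_chunks);

    uint64_t offset = 500*1024, len = 300*1024;
    uint64_t first_stripe = (offset / pg_block_size) * pg_block_size;
    uint64_t last_stripe = ((offset + len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
    printf("pg_block_size=%lu first=%lu last=%lu\n", pg_block_size, first_stripe, last_stripe);
}
```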

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -8,4 +8,10 @@
// unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v) // unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
// unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v) // unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
#ifdef __cplusplus
extern "C" {
#endif
uint32_t crc32c(uint32_t crc, const void *buf, size_t len); uint32_t crc32c(uint32_t crc, const void *buf, size_t len);
#ifdef __cplusplus
};
#endif
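
With the extern "C" guard added above, the C implementation of crc32c() can be called from C++ translation units without name-mangling problems. A minimal usage sketch, assuming the declaration lives in a header named crc32c.h (the header name is an assumption):

```cpp
// Minimal usage sketch from a C++ file; the extern "C" guard is what lets this
// link against the C implementation of crc32c().
#include <cstdint>
#include <cstdio>
#include "crc32c.h"   // assumed header name

int main()
{
    const char data[] = "journal entry payload";
    uint32_t crc = crc32c(0, data, sizeof(data)-1);
    printf("crc32c = %08x\n", crc);
}
```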

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#define _LARGEFILE64_SOURCE #define _LARGEFILE64_SOURCE
#include <sys/types.h> #include <sys/types.h>
@@ -26,23 +26,32 @@ struct journal_dump_t
uint64_t journal_offset; uint64_t journal_offset;
uint64_t journal_len; uint64_t journal_len;
uint64_t journal_pos; uint64_t journal_pos;
bool all;
bool started;
int fd; int fd;
uint32_t crc32_last;
void dump_block(void *buf); int dump_block(void *buf);
}; };
int main(int argc, char *argv[]) int main(int argc, char *argv[])
{ {
if (argc < 5) journal_dump_t self = { 0 };
int b = 1;
if (argc >= 2 && !strcmp(argv[1], "--all"))
{ {
printf("USAGE: %s <journal_file> <journal_block_size> <offset> <size>\n", argv[0]); self.all = true;
b = 2;
}
if (argc < b+4)
{
printf("USAGE: %s [--all] <journal_file> <journal_block_size> <offset> <size>\n", argv[0]);
return 1; return 1;
} }
journal_dump_t self; self.journal_device = argv[b];
self.journal_device = argv[1]; self.journal_block = strtoul(argv[b+1], NULL, 10);
self.journal_block = strtoul(argv[2], NULL, 10); self.journal_offset = strtoull(argv[b+2], NULL, 10);
self.journal_offset = strtoull(argv[3], NULL, 10); self.journal_len = strtoull(argv[b+3], NULL, 10);
self.journal_len = strtoull(argv[4], NULL, 10);
if (self.journal_block < MEM_ALIGNMENT || (self.journal_block % MEM_ALIGNMENT) || if (self.journal_block < MEM_ALIGNMENT || (self.journal_block % MEM_ALIGNMENT) ||
self.journal_block > 128*1024) self.journal_block > 128*1024)
{ {
@@ -57,30 +66,64 @@ int main(int argc, char *argv[])
} }
void *data = memalign(MEM_ALIGNMENT, self.journal_block); void *data = memalign(MEM_ALIGNMENT, self.journal_block);
self.journal_pos = 0; self.journal_pos = 0;
while (self.journal_pos < self.journal_len) if (self.all)
{
while (self.journal_pos < self.journal_len)
{
int r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
assert(r == self.journal_block);
uint64_t s;
for (s = 0; s < self.journal_block; s += 8)
{
if (*((uint64_t*)(data+s)) != 0)
break;
}
if (s == self.journal_block)
{
printf("offset %08lx: zeroes\n", self.journal_pos);
self.journal_pos += self.journal_block;
}
else if (((journal_entry*)data)->magic == JOURNAL_MAGIC)
{
printf("offset %08lx:\n", self.journal_pos);
self.dump_block(data);
}
else
{
printf("offset %08lx: no magic in the beginning, looks like random data (pattern=%lx)\n", self.journal_pos, *((uint64_t*)data));
self.journal_pos += self.journal_block;
}
}
}
else
{ {
int r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos); int r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
assert(r == self.journal_block); assert(r == self.journal_block);
uint64_t s; journal_entry *je = (journal_entry*)(data);
for (s = 0; s < self.journal_block; s += 8) if (je->magic != JOURNAL_MAGIC || je->type != JE_START || je_crc32(je) != je->crc32)
{ {
if (*((uint64_t*)(data+s)) != 0) printf("offset %08lx: journal superblock is invalid\n", self.journal_pos);
break;
}
if (s == self.journal_block)
{
printf("offset %08lx: zeroes\n", self.journal_pos);
self.journal_pos += self.journal_block;
}
else if (((journal_entry*)data)->magic == JOURNAL_MAGIC)
{
printf("offset %08lx:\n", self.journal_pos);
self.dump_block(data);
} }
else else
{ {
printf("offset %08lx: no magic in the beginning, looks like random data (pattern=%lx)\n", self.journal_pos, *((uint64_t*)data)); printf("offset %08lx:\n", self.journal_pos);
self.journal_pos += self.journal_block; self.dump_block(data);
self.started = false;
self.journal_pos = je->start.journal_start;
while (1)
{
if (self.journal_pos >= self.journal_len)
self.journal_pos = self.journal_block;
r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
assert(r == self.journal_block);
printf("offset %08lx:\n", self.journal_pos);
r = self.dump_block(data);
if (r <= 0)
{
printf("end of the journal\n");
break;
}
}
} }
} }
free(data); free(data);
@@ -88,7 +131,7 @@ int main(int argc, char *argv[])
return 0; return 0;
} }
void journal_dump_t::dump_block(void *buf) int journal_dump_t::dump_block(void *buf)
{ {
uint32_t pos = 0; uint32_t pos = 0;
journal_pos += journal_block; journal_pos += journal_block;
@@ -97,12 +140,19 @@ void journal_dump_t::dump_block(void *buf)
while (pos < journal_block) while (pos < journal_block)
{ {
journal_entry *je = (journal_entry*)(buf + pos); journal_entry *je = (journal_entry*)(buf + pos);
if (je->magic != JOURNAL_MAGIC || je->type < JE_MIN || je->type > JE_MAX) if (je->magic != JOURNAL_MAGIC || je->type < JE_MIN || je->type > JE_MAX ||
!all && started && je->crc32_prev != crc32_last)
{ {
break; break;
} }
const char *crc32_valid = je_crc32(je) == je->crc32 ? "(valid)" : "(invalid)"; bool crc32_valid = je_crc32(je) == je->crc32;
printf("entry % 3d: crc32=%08x %s prev=%08x ", entry, je->crc32, crc32_valid, je->crc32_prev); if (!all && !crc32_valid)
{
break;
}
started = true;
crc32_last = je->crc32;
printf("entry % 3d: crc32=%08x %s prev=%08x ", entry, je->crc32, (crc32_valid ? "(valid)" : "(invalid)"), je->crc32_prev);
if (je->type == JE_START) if (je->type == JE_START)
{ {
printf("je_start start=%08lx\n", je->start.journal_start); printf("je_start start=%08lx\n", je->start.journal_start);
@@ -170,4 +220,5 @@ void journal_dump_t::dump_block(void *buf)
{ {
journal_pos = journal_len; journal_pos = journal_len;
} }
return entry;
} }
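
The dump tool above gains an --all mode that dumps every block, while the default mode now validates the JE_START superblock and then follows the live journal chain, stopping when an entry's crc32_prev no longer matches the previous entry's crc32 or its own checksum is invalid. A toy sketch of that chain-walk rule over in-memory entries; the struct here is a stand-in, not the real journal_entry.

```cpp
// Sketch of the "follow the journal by crc32_prev" rule used in non---all mode.
#include <cstdint>
#include <vector>
#include <cstdio>

struct ToyEntry { uint32_t crc32, crc32_prev; bool crc_valid; };

// Returns how many entries belong to the chain starting after the superblock
static int follow_chain(const std::vector<ToyEntry> & entries)
{
    int n = 0;
    bool started = false;
    uint32_t crc32_last = 0;
    for (auto & je: entries)
    {
        if (started && je.crc32_prev != crc32_last)
            break;               // chain broken: this isn't the next written entry
        if (!je.crc_valid)
            break;               // damaged entry ends the chain
        started = true;
        crc32_last = je.crc32;
        n++;
    }
    return n;
}

int main()
{
    std::vector<ToyEntry> entries = {
        { 0x11, 0x00, true },
        { 0x22, 0x11, true },
        { 0x33, 0x99, true },    // crc32_prev doesn't match -> end of the journal
    };
    printf("entries in chain: %d\n", follow_chain(entries));  // 2
}
```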

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <sys/epoll.h> #include <sys/epoll.h>
#include <sys/poll.h> #include <sys/poll.h>
@@ -84,8 +84,12 @@ void epoll_manager_t::handle_epoll_events()
nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, 0); nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, 0);
for (int i = 0; i < nfds; i++) for (int i = 0; i < nfds; i++)
{ {
auto & cb = epoll_handlers[events[i].data.fd]; auto cb_it = epoll_handlers.find(events[i].data.fd);
cb(events[i].data.fd, events[i].events); if (cb_it != epoll_handlers.end())
{
auto & cb = cb_it->second;
cb(events[i].data.fd, events[i].events);
}
} }
} while (nfds == MAX_EPOLL_EVENTS); } while (nfds == MAX_EPOLL_EVENTS);
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h" #include "osd_ops.h"
#include "pg_states.h" #include "pg_states.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test cluster I/O // FIO engine to test cluster I/O
// //
@@ -93,7 +93,7 @@ static struct fio_option options[] = {
{ {
.name = "cluster_log_level", .name = "cluster_log_level",
.lname = "cluster log level", .lname = "cluster log level",
.type = FIO_OPT_BOOL, .type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, cluster_log), .off1 = offsetof(struct sec_options, cluster_log),
.help = "Set log level for the Vitastor client", .help = "Set log level for the Vitastor client",
.def = "0", .def = "0",
@@ -145,9 +145,7 @@ static void sec_cleanup(struct thread_data *td)
delete bsd->cli; delete bsd->cli;
delete bsd->epmgr; delete bsd->epmgr;
delete bsd->ringloop; delete bsd->ringloop;
bsd->cli = NULL; delete bsd;
bsd->epmgr = NULL;
bsd->ringloop = NULL;
} }
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// FIO engine to test Blockstore // FIO engine to test Blockstore
// //

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test Blockstore through Secondary OSD interface // FIO engine to test Blockstore through Secondary OSD interface
// //
@@ -140,6 +140,7 @@ static void sec_cleanup(struct thread_data *td)
if (bsd) if (bsd)
{ {
close(bsd->connect_fd); close(bsd->connect_fd);
delete bsd;
} }
} }
@@ -312,6 +313,7 @@ static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int
exit(1); exit(1);
} }
io_u* io = it->second; io_u* io = it->second;
bsd->queue.erase(it);
if (io->ddir == DDIR_READ) if (io->ddir == DDIR_READ)
{ {
if (reply.hdr.retval != io->xfer_buflen) if (reply.hdr.retval != io->xfer_buflen)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <netinet/tcp.h> #include <netinet/tcp.h>
#include <sys/epoll.h> #include <sys/epoll.h>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
#include <string> #include <string>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h> #include <unistd.h>
#include <fcntl.h> #include <fcntl.h>
@@ -30,7 +30,7 @@ osd_messenger_t::~osd_messenger_t()
{ {
while (clients.size() > 0) while (clients.size() > 0)
{ {
stop_client(clients.begin()->first); stop_client(clients.begin()->first, true);
} }
} }
@@ -111,7 +111,7 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id) timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{ {
osd_num_t peer_osd = clients.at(peer_fd)->osd_num; osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd); stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EIO); on_connect_peer(peer_osd, -EIO);
return; return;
}); });
@@ -149,7 +149,7 @@ void osd_messenger_t::handle_connect_epoll(int peer_fd)
} }
if (result != 0) if (result != 0)
{ {
stop_client(peer_fd); stop_client(peer_fd, true);
on_connect_peer(peer_osd, -result); on_connect_peer(peer_osd, -result);
return; return;
} }
@@ -171,7 +171,7 @@ void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
{ {
// Stop client // Stop client
printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd); printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
stop_client(peer_fd); stop_client(peer_fd, true);
} }
else if (epoll_events & EPOLLIN) else if (epoll_events & EPOLLIN)
{ {
@@ -309,7 +309,7 @@ void osd_messenger_t::cancel_op(osd_op_t *op)
} }
} }
void osd_messenger_t::stop_client(int peer_fd) void osd_messenger_t::stop_client(int peer_fd, bool force)
{ {
assert(peer_fd != 0); assert(peer_fd != 0);
auto it = clients.find(peer_fd); auto it = clients.find(peer_fd);
@@ -334,6 +334,10 @@ void osd_messenger_t::stop_client(int peer_fd)
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd); printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
} }
} }
else if (!force)
{
return;
}
cl->peer_state = PEER_STOPPED; cl->peer_state = PEER_STOPPED;
clients.erase(it); clients.erase(it);
tfd->set_fd_handler(peer_fd, false, NULL); tfd->set_fd_handler(peer_fd, false, NULL);
@@ -348,7 +352,14 @@ void osd_messenger_t::stop_client(int peer_fd)
} }
if (cl->read_op) if (cl->read_op)
{ {
delete cl->read_op; if (cl->read_op->callback)
{
cancel_op(cl->read_op);
}
else
{
delete cl->read_op;
}
cl->read_op = NULL; cl->read_op = NULL;
} }
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++) for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
@@ -275,7 +275,7 @@ struct osd_messenger_t
public: public:
void connect_peer(uint64_t osd_num, json11::Json peer_state); void connect_peer(uint64_t osd_num, json11::Json peer_state);
void stop_client(int peer_fd); void stop_client(int peer_fd, bool force = false);
void outbox_push(osd_op_t *cur_op); void outbox_push(osd_op_t *cur_op);
std::function<void(osd_op_t*)> exec_op; std::function<void(osd_op_t*)> exec_op;
std::function<void(osd_num_t)> repeer_pgs; std::function<void(osd_num_t)> repeer_pgs;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "messenger.h" #include "messenger.h"
@@ -9,6 +9,10 @@ void osd_messenger_t::read_requests()
{ {
int peer_fd = read_ready_clients[i]; int peer_fd = read_ready_clients[i];
osd_client_t *cl = clients[peer_fd]; osd_client_t *cl = clients[peer_fd];
if (cl->read_msg.msg_iovlen)
{
continue;
}
if (cl->read_remaining < receive_buffer_size) if (cl->read_remaining < receive_buffer_size)
{ {
cl->read_iov.iov_base = cl->in_buf; cl->read_iov.iov_base = cl->in_buf;
@@ -29,6 +33,7 @@ void osd_messenger_t::read_requests()
io_uring_sqe* sqe = ringloop->get_sqe(); io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe) if (!sqe)
{ {
cl->read_msg.msg_iovlen = 0;
read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + i); read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + i);
return; return;
} }
@@ -52,6 +57,7 @@ void osd_messenger_t::read_requests()
bool osd_messenger_t::handle_read(int result, osd_client_t *cl) bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
{ {
bool ret = false; bool ret = false;
cl->read_msg.msg_iovlen = 0;
cl->refs--; cl->refs--;
if (cl->peer_state == PEER_STOPPED) if (cl->peer_state == PEER_STOPPED)
{ {
@@ -160,8 +166,14 @@ bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
{ {
if (cl->read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC) if (cl->read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
return handle_reply_hdr(cl); return handle_reply_hdr(cl);
else else if (cl->read_op->req.hdr.magic == SECONDARY_OSD_OP_MAGIC)
handle_op_hdr(cl); handle_op_hdr(cl);
else
{
printf("Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
stop_client(cl->peer_fd);
return false;
}
} }
else if (cl->read_state == CL_READ_DATA) else if (cl->read_state == CL_READ_DATA)
{ {
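
The read path above uses msg_iovlen as an "already submitted" marker: a client whose previous read SQE is still in flight is skipped, and the marker is cleared when no SQE is available or when the read completes. A simplified sketch of that guard with toy types; the field names are illustrative.

```cpp
// Sketch of the "read already in flight" guard from the diff above.
#include <cstdio>

struct ToyClient { int id; size_t inflight_iovs = 0; };

static bool submit_read(ToyClient & cl, bool sqe_available)
{
    if (cl.inflight_iovs)
        return true;              // a read for this client is already in the ring
    cl.inflight_iovs = 1;         // mark as submitted (msg_iovlen in the real code)
    if (!sqe_available)
    {
        cl.inflight_iovs = 0;     // couldn't get an SQE -> retry later
        return false;
    }
    printf("client %d: read submitted\n", cl.id);
    return true;
}

static void on_read_complete(ToyClient & cl)
{
    cl.inflight_iovs = 0;         // allow the next read
}

int main()
{
    ToyClient cl{ 7 };
    submit_read(cl, true);
    submit_read(cl, true);        // skipped: previous read still in flight
    on_read_complete(cl);
    submit_read(cl, true);        // submitted again
}
```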

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#define _XOPEN_SOURCE #define _XOPEN_SOURCE
#include <limits.h> #include <limits.h>
@@ -46,7 +46,8 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
to_send_list.push_back((iovec){ .iov_base = cur_op->req.buf, .iov_len = OSD_PACKET_SIZE }); to_send_list.push_back((iovec){ .iov_base = cur_op->req.buf, .iov_len = OSD_PACKET_SIZE });
cl->sent_ops[cur_op->req.hdr.id] = cur_op; cl->sent_ops[cur_op->req.hdr.id] = cur_op;
} }
// Pre-defined send_lists to_outbox.push_back(NULL);
// Operation data
if ((cur_op->op_type == OSD_OP_IN if ((cur_op->op_type == OSD_OP_IN
? (cur_op->req.hdr.opcode == OSD_OP_READ || ? (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_READ || cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
@@ -58,17 +59,17 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)) && cur_op->iov.count > 0) cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)) && cur_op->iov.count > 0)
{ {
to_outbox.push_back(NULL);
for (int i = 0; i < cur_op->iov.count; i++) for (int i = 0; i < cur_op->iov.count; i++)
{ {
assert(cur_op->iov.buf[i].iov_base); assert(cur_op->iov.buf[i].iov_base);
to_send_list.push_back(cur_op->iov.buf[i]); to_send_list.push_back(cur_op->iov.buf[i]);
to_outbox.push_back(i == cur_op->iov.count-1 ? cur_op : NULL); to_outbox.push_back(NULL);
} }
} }
else if (cur_op->op_type == OSD_OP_IN)
{ {
to_outbox.push_back(cur_op); // To free it later
to_outbox[to_outbox.size()-1] = cur_op;
} }
if (!ringloop) if (!ringloop)
{ {
@@ -92,6 +93,10 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
void osd_messenger_t::measure_exec(osd_op_t *cur_op) void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{ {
// Measure execution latency // Measure execution latency
if (cur_op->req.hdr.opcode > OSD_OP_MAX)
{
return;
}
timespec tv_end; timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end); clock_gettime(CLOCK_REALTIME, &tv_end);
stats.op_stat_count[cur_op->req.hdr.opcode]++; stats.op_stat_count[cur_op->req.hdr.opcode]++;
@@ -198,11 +203,8 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
{ {
if (cl->outbox[done]) if (cl->outbox[done])
{ {
// Operation fully sent // Reply fully sent
if (cl->outbox[done]->op_type == OSD_OP_IN) delete cl->outbox[done];
{
delete cl->outbox[done];
}
} }
result -= iov.iov_len; result -= iov.iov_len;
done++; done++;
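
The outbox change above keeps to_outbox strictly parallel to to_send_list: every buffer gets a NULL placeholder and the op pointer is stored only in the slot of its last buffer, so handle_send() frees an incoming-op reply exactly once, after its final buffer is confirmed sent. A toy sketch of that parallel-lists bookkeeping; the Reply type stands in for an OSD_OP_IN reply.

```cpp
// Sketch: one "owner" slot per iovec, NULL for buffers that own nothing,
// and the reply pointer only in the slot of its final buffer.
#include <sys/uio.h>
#include <vector>
#include <string>
#include <cstdio>

struct Reply { std::string hdr, data; };      // toy stand-in for an OSD_OP_IN reply

int main()
{
    std::vector<iovec> to_send_list;
    std::vector<Reply*> to_outbox;            // parallel to to_send_list

    Reply *reply = new Reply{ "HDR", "PAYLOAD" };
    to_send_list.push_back({ (void*)reply->hdr.data(), reply->hdr.size() });
    to_outbox.push_back(NULL);                // header alone doesn't own the reply
    to_send_list.push_back({ (void*)reply->data.data(), reply->data.size() });
    to_outbox.push_back(NULL);
    to_outbox.back() = reply;                 // ...but its last buffer does

    // Completion path (like handle_send): free the reply when its last buffer is done
    for (size_t done = 0; done < to_send_list.size(); done++)
    {
        if (to_outbox[done])
        {
            printf("last buffer of a reply sent, freeing it\n");
            delete to_outbox[done];
        }
    }
}
```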

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// Similar to qemu-nbd, but sets timeout and uses io_uring // Similar to qemu-nbd, but sets timeout and uses io_uring
#include <linux/nbd.h> #include <linux/nbd.h>
@@ -111,7 +111,7 @@ public:
{ {
printf( printf(
"Vitastor NBD proxy\n" "Vitastor NBD proxy\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.0)\n\n" "(c) Vitaliy Filippov, 2020 (VNPL-1.1)\n\n"
"USAGE:\n" "USAGE:\n"
" %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n" " %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n"
" %s unmap /dev/nbd0\n" " %s unmap /dev/nbd0\n"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <sys/socket.h> #include <sys/socket.h>
#include <sys/poll.h> #include <sys/poll.h>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd.h" #include "osd.h"
#include "base64.h" #include "base64.h"
@@ -384,6 +384,7 @@ void osd_t::create_osd_state()
{ {
st_cli.load_pgs(); st_cli.load_pgs();
} }
report_statistics();
}); });
} }
@@ -801,11 +802,11 @@ void osd_t::report_pg_states()
if (pg_it->second.state == PG_OFFLINE) if (pg_it->second.state == PG_OFFLINE)
{ {
// Remove offline PGs after reporting their state // Remove offline PGs after reporting their state
this->pgs.erase(pg_it);
if (pg_it->second.scheme == POOL_SCHEME_JERASURE) if (pg_it->second.scheme == POOL_SCHEME_JERASURE)
{ {
use_jerasure(pg_it->second.pg_size, pg_it->second.pg_size-pg_it->second.parity_chunks, false); use_jerasure(pg_it->second.pg_size, pg_it->second.pg_size-pg_it->second.parity_chunks, false);
} }
this->pgs.erase(pg_it);
} }
} }
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd.h" #include "osd.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd.h" #include "osd.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h" #include "osd_ops.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <netinet/tcp.h> #include <netinet/tcp.h>
#include <sys/epoll.h> #include <sys/epoll.h>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <unordered_map> #include <unordered_map>
#include "osd_peering_pg.h" #include "osd_peering_pg.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <map> #include <map>
#include <vector> #include <vector>
@@ -94,7 +94,7 @@ struct pg_t
std::vector<osd_num_t> cur_set; std::vector<osd_num_t> cur_set;
// same thing in state_dict-like format // same thing in state_dict-like format
pg_osd_set_t cur_loc_set; pg_osd_set_t cur_loc_set;
// moved object map. by default, each object is considered to reside on the cur_set. // moved object map. by default, each object is considered to reside on cur_set.
// this map stores all objects that differ. // this map stores all objects that differ.
// it may consume up to ~ (raw storage / object size) * 24 bytes in the worst case scenario // it may consume up to ~ (raw storage / object size) * 24 bytes in the worst case scenario
// which is up to ~192 MB per 1 TB in the worst case scenario // which is up to ~192 MB per 1 TB in the worst case scenario

View File

@@ -1,8 +1,9 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#define _LARGEFILE64_SOURCE #define _LARGEFILE64_SOURCE
#include "malloc_or_die.h"
#include "osd_peering_pg.h" #include "osd_peering_pg.h"
#define STRIPE_SHIFT 12 #define STRIPE_SHIFT 12

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h" #include "osd_primary.h"
@@ -489,7 +489,11 @@ resume_7:
} }
// Remember PG as dirty to drop the connection when PG goes offline // Remember PG as dirty to drop the connection when PG goes offline
// (this is required because of the "lazy sync") // (this is required because of the "lazy sync")
c_cli.clients[cur_op->peer_fd]->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }); auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end())
{
cl_it->second->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
}
dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }); dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
} }
return true; return true;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h" #include "osd_primary.h"
@@ -295,7 +295,7 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
uint64_t version = subop->reply.sec_rw.version; uint64_t version = subop->reply.sec_rw.version;
#ifdef OSD_DEBUG #ifdef OSD_DEBUG
uint64_t peer_osd = c_cli.clients.find(subop->peer_fd) != c_cli.clients.end() uint64_t peer_osd = c_cli.clients.find(subop->peer_fd) != c_cli.clients.end()
? c_cli.clients[subop->peer_fd].osd_num : osd_num; ? c_cli.clients[subop->peer_fd]->osd_num : osd_num;
printf("subop %lu from osd %lu: version = %lu\n", opcode, peer_osd, version); printf("subop %lu from osd %lu: version = %lu\n", opcode, peer_osd, version);
#endif #endif
if (op_data->fact_ver != 0 && op_data->fact_ver != version) if (op_data->fact_ver != 0 && op_data->fact_ver != version)

View File

@@ -1,5 +1,5 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.0 (see README.md for details)
+// License: VNPL-1.1 (see README.md for details)
 #include <stdexcept>
 #include <string.h>
@@ -11,6 +11,8 @@
 #include "osd_rmw.h"
 #include "malloc_or_die.h"
+#define OSD_JERASURE_W 8
 static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & stripe)
 {
     if (stripe.read_end == 0)
@@ -158,7 +160,7 @@ void use_jerasure(int pg_size, int pg_minsize, bool use)
     {
         return;
     }
-    int *matrix = reed_sol_vandermonde_coding_matrix(pg_minsize, pg_size-pg_minsize, 32);
+    int *matrix = reed_sol_vandermonde_coding_matrix(pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W);
     matrices[key] = (reed_sol_matrix_t){
         .refs = 0,
         .data = matrix,
@@ -214,8 +216,8 @@ int* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg
     int *decoding_matrix = dm_ids + pg_minsize;
     if (!dm_ids)
         throw std::bad_alloc();
-    // we always use row_k_ones=1 and w=32
-    if (jerasure_make_decoding_matrix(pg_minsize, pg_size-pg_minsize, 32, matrix->data, erased, decoding_matrix, dm_ids) < 0)
+    // we always use row_k_ones=1 and w=8 (OSD_JERASURE_W)
+    if (jerasure_make_decoding_matrix(pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W, matrix->data, erased, decoding_matrix, dm_ids) < 0)
     {
         free(dm_ids);
         throw std::runtime_error("jerasure_make_decoding_matrix() failed");
@@ -252,7 +254,7 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
     }
     data_ptrs[role] = (char*)stripes[role].read_buf;
     jerasure_matrix_dotprod(
-        pg_minsize, 32, decoding_matrix+(role*pg_minsize), dm_ids, role,
+        pg_minsize, OSD_JERASURE_W, decoding_matrix+(role*pg_minsize), dm_ids, role,
         data_ptrs, data_ptrs+pg_minsize, stripes[role].read_end - stripes[role].read_start
     );
 }
@@ -694,7 +696,7 @@ void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_min
         }
     }
     jerasure_matrix_encode(
-        pg_minsize, pg_size-pg_minsize, 32, matrix->data,
+        pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W, matrix->data,
         (char**)data_ptrs, (char**)data_ptrs+pg_minsize, next_end-pos
     );
     pos = next_end;
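
For reference, a minimal standalone encode with the new word size (a sketch against the stock jerasure API, not Vitastor code; the k/m values, buffer size and header paths are illustrative and may differ between distributions):

    #include <jerasure/jerasure.h>
    #include <jerasure/reed_sol.h>
    #include <stdlib.h>

    #define OSD_JERASURE_W 8   // the word size the diff switches to; GF(2^8) limits k+m to 2^8

    int main()
    {
        const int k = 2, m = 1, size = 4096;   // 2 data chunks + 1 parity chunk (arbitrary example)
        int *matrix = reed_sol_vandermonde_coding_matrix(k, m, OSD_JERASURE_W);
        char *data[2], *coding[1];
        for (int i = 0; i < k; i++)
            data[i] = (char*)calloc(1, size);
        for (int i = 0; i < m; i++)
            coding[i] = (char*)calloc(1, size);
        jerasure_matrix_encode(k, m, OSD_JERASURE_W, matrix, data, coding, size);
        return 0;
    }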

View File

@@ -1,5 +1,5 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.0 (see README.md for details)
+// License: VNPL-1.1 (see README.md for details)
 #pragma once

View File

@@ -1,5 +1,5 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.0 (see README.md for details)
+// License: VNPL-1.1 (see README.md for details)
 #define RMW_DEBUG

View File

@@ -1,5 +1,5 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.0 (see README.md for details)
+// License: VNPL-1.1 (see README.md for details)
 #include "osd.h"

View File

@@ -1,5 +1,5 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.0 (see README.md for details)
+// License: VNPL-1.1 (see README.md for details)
 #include <sys/types.h>
 #include <sys/socket.h>

Some files were not shown because too many files have changed in this diff.