Compare commits

..

78 Commits

Author SHA1 Message Date
7e6e1a5a82 Release 0.5.10
The version seems to be stable after this bunch of fixes :)

- Fix delete & write operation ordering during rebalance to not lose objects in the immediate_commit=off mode
- Fix a possible crash caused by very high iodepths
- Re-distribute PG primaries over OSDs that come up after a short downtime
- Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio
- Fix a journal flushing deadlock which sometimes occurred in the immediate_commit=off mode
- Fix a bug where OSDs could hang if the data device filled up
- Fix an allocator bug where it was unable to allocate up to last (n%64) data device blocks
- Fix monitor crash that occurred on removal of some etcd keys
- Fix a bug where PGs could remain incomplete due to incorrect PG history with just zeroes in osd_sets
2021-03-16 12:48:26 +03:00
435045751d Delete objects only after a SYNC during rebalance in the non-immediate_commit mode
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects
2021-03-16 12:48:26 +03:00
c5fb1d5987 Do not duplicate blockstore operations when io_uring fills up
This bug was leading to OSDs dying with "Assertion `fulfilled == read_op->len' failed"
when testing fio -rw=randread -numjobs=8 -iodepth=128
2021-03-16 12:48:26 +03:00
9f59381bea Re-distribute PG primaries over OSDs that come up after a short downtime 2021-03-16 12:48:26 +03:00
9ac7e75178 Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio 2021-03-16 12:48:26 +03:00
88671cf745 Fix a bug causing all flushers to wait for an fsync without actually trying to do it
This happened because flusher_count became dynamic and fsync_batch() was comparing the number
of flushers currently ready to do an fsync with the maximum number of flushers. Also the number
wasn't rechecked on every loop which was also incorrect.

Now the interrupted_rebalance test passes even without IMMEDIATE_COMMIT=1.
2021-03-13 17:27:29 +03:00
fe1749c427 Fix the multiple_interrupted_rebalance test 2021-03-13 17:19:45 +03:00
ceb9c28de7 Set default log_level before passing config to etcd_state_client 2021-03-13 17:19:45 +03:00
299d7d7c95 Use common macro for get_sqe 2021-03-13 17:19:45 +03:00
d1526b415f Correctly resume writes when OSD is full to return an error 2021-03-13 17:19:45 +03:00
f49fd53d55 Fix a bug where allocator was unable to allocate up to last (n%64) blocks, add tests for it 2021-03-13 02:19:02 +03:00
dd76eda5e5 Test multiple interrupted rebalancings
Currently only passes with immediate_commit=all configuration
(env variable IMMEDIATE_COMMIT=1 for the bash script)
2021-03-12 12:55:44 +03:00
87dbd8fa57 Use empty hash as the default value for some etcd keys in the monitor 2021-03-12 12:40:15 +03:00
b44f49aab2 Ignore zero OSDs in history osd_sets 2021-03-12 12:40:15 +03:00
036555638e Release 0.5.9
- Fix two monitor bugs which led to objects being "logically lost" (physically
  present on some secondary OSDs while primary doesn't know about it) after multiple
  interrupted rebalancings
- Implement "no_recovery" and "no_rebalance" flags
2021-03-11 00:39:10 +03:00
af5155fcd9 Implement "no_recovery" and "no_rebalance" flags 2021-03-11 00:36:31 +03:00
0d2efbecc9 Preserve previous PG history when changing PG distribution
Fixes incorrect PG history in case when a new rebalance is started
before the finish of the previous one which could make primary OSDs unable
to locate some objects on some secondaries.
2021-03-11 00:16:10 +03:00
e62e8b6bae Use real pg configuration instead of the "last clean" one for generating PG history
Basically fixes the bug introduced in 0.5.7 where an rebalance interrupted
by the monitor could result in forgetting objects moved to the new place
2021-03-10 02:01:44 +03:00
c4ba24c305 Do not print ping op latency 2021-03-10 02:01:44 +03:00
19e47a0279 Release 0.5.8
- Add heartbeats (fixes failover in case of network issues or offline nodes)
- Fix a bug where a PG could incorrectly become listed as 'incomplete' if historical osd_sets
  included a set with the the PG's primary OSD as the only alive one
- Use osd_out_time = 10 minutes by default instead of 30 minutes
- Make monitors stick to a single selected etcd URL on start and not try to select random ones
  on every request - this was leading to etcd interaction errors when some etcds were unavailable
2020-03-09 02:38:17 +03:00
bd178ac20f Fix history osd_set check - local OSD is always available! 2021-03-09 02:18:18 +03:00
7006875a24 Make monitor stick to one etcd until the restart 2021-03-09 02:15:38 +03:00
ad577c4aac Add PING operation and timeouts to detect OSD failures when a host goes down 2021-03-09 02:15:38 +03:00
836635c518 Use osd_out_time = 10 minutes by default 2021-03-09 02:15:38 +03:00
88a03f4e98 Release 0.5.7
- Fix multiple bugs leading to OSDs sometimes being unable to correctly activate PGs
  when a lot of PG peering events occurred in a small amount of time
- Fix a bug where OSDs could list incomplete object versions during peering. The bug
  manifested with "local rollback operation failed" messages in OSD logs
- Fix a bug where misplaced chunks for degraded and incomplete objects were not removed
  from extra OSDs during recovery
- Fix incorrect PG history configuration resulting in OSDs being unable to find some
  of the objects after a PG count change
- Simplify block layer write ordering logic
- Avoid extra data move when a lot of OSDs are first stopped for long time and then restarted
- Fix incorrect degraded & misplaced object statistics after a completed rebalance
- Fix incorrect usage of pg_minsize instead of the minimal possible object chunk count in EC pools
2021-03-08 23:37:02 +03:00
2a5036669d Fix PG count change procedure
In previous versions PG histories were calculated incorrectly during
PG count change which led to objects being lost on OSDs not in PG's osd set.
2021-03-08 23:15:58 +03:00
2e0c853180 Make test_change_pg_count check if any objects are lost during the test 2021-03-08 23:15:07 +03:00
e91ff2a9ec Only forget offline PGs if their state is not changed during reporting 2021-03-08 17:04:10 +03:00
086667f568 Do not check PG state key ownership if it doesn't exist yet
This fixes the bug where OSDs were sometimes trying to report updated PG states
infinitely without luck when PGs transitioned from 'starting' to 'peering' too fast
2021-03-08 17:04:10 +03:00
73ce20e246 Add a test for the "reappear after move" case 2021-03-08 17:04:10 +03:00
1be94da437 Check & remove extra chunks for degraded / incomplete objects, too 2021-03-08 17:04:10 +03:00
80e12358a2 Use pg_data_size instead of pg_minsize for object state calculation 2021-03-08 17:04:10 +03:00
36c935ace6 Use std::vector for the blockstore submission queue 2021-03-08 17:04:10 +03:00
0d8b5e2ef9 Remove unused enqueue_op_first() 2021-03-08 17:04:10 +03:00
98f1e2c277 Rework write/sync ordering
Make syncs wait for all previous writes because it's the only way
to make sure that OSDs do not receive incomplete writes in LIST results
during peering when some writes are still in progress.

Also simplify blockstore submission queue logic.
2021-03-08 17:04:10 +03:00
21e7686037 Fix possible "assertion failed: pg.inflight >= 0" error during PG stop 2021-03-08 17:04:10 +03:00
ab21a1908b Check for the dirty PG flag when trying to continue to stop it after sync 2021-03-08 17:04:10 +03:00
30d1ccd43e Fix an infinite loop when discarding list operations during stop_pg() 2021-03-08 17:04:10 +03:00
8bdd6d8d78 Reset PG state when stopping them 2021-03-08 17:04:10 +03:00
09b3e4e789 Fix OSDs being unable to stop PGs that are 'peering', not 'active'
This was sometimes leading to incorrect misplaced and degraded object count statistics
2021-03-08 17:04:10 +03:00
07912fd670 Use history/last_clean_pgs to avoid extra data move when observing a series of changes in the cluster 2021-03-08 17:04:10 +03:00
bc742ccf8c Fix a small memory leak in etcd_state_client 2021-03-08 17:04:10 +03:00
314b20437b Do not break subsequent small writes badly when a big write is canceled 2021-03-08 17:04:10 +03:00
29bac892ad Add .gitignore 2021-03-08 17:04:10 +03:00
cf7547faf3 Fix *.sh build scripts 2021-03-02 02:17:11 +03:00
ab90ed747f Release 0.5.6
- Fix operation statistics
- Fix a rebalance hang introduced in 0.5.5
- Test PG count changes with actual data moving
- Fix a possible 'unexpected pg state: 0' error during PG count change
2021-03-01 16:26:04 +03:00
29d8ac8b1b Do not report statistics for the empty operation 2021-03-01 16:20:57 +03:00
97795ea1b1 Use pg_minsize=2 in the pg_count change test
Also don't check for has_degraded because it's not a bug that objects
are _temporarily_ listed as degraded during PG peering as it's not
required for the new primary to connect to _all_ older peers to start
peering. The test may be improved in the future by temporarily disabling
degraded recovery during it and returning the has_degraded check back.
2021-03-01 16:18:08 +03:00
24e7075f08 Fix monitor's statistics aggregation 2021-02-28 19:51:16 +03:00
6155b23a7e Replace pgs[id] with pgs.at(id) to prevent accidental auto-vivification 2021-02-28 19:36:59 +03:00
7d49706c07 Improve the pg_count change test: add more OSDs and actually move data between them 2021-02-28 19:36:59 +03:00
46e79f3306 Wait for PGs to become clean before stopping them 2021-02-28 19:36:59 +03:00
41fd14e024 Fix deletes not increasing write_iodepth 2021-02-28 19:36:59 +03:00
bb2d9a3afe Release 0.5.5
- Transition to CMake build system
- Fix Monitor being unable to change PG sizes
- Fix PG optimizer not using some OSDs in some cases
- Fix inability to change PG count online
- Improve journal flusher performance
- Add a little better systemd unit generator
- Use w=8 with jerasure (breaking change for EC pools)
2021-02-26 01:59:18 +03:00
e899ed2c25 Make OSDs with 256 flushers (as they are now dynamic) 2021-02-26 01:59:18 +03:00
e21b14b72c Fix rpm specs for building with CMake 2021-02-26 01:59:18 +03:00
5af8eddaa9 Add the remaining build script for Debian 2021-02-26 01:59:18 +03:00
4f5a94c07a Modify instructions for the CMake build 2021-02-26 00:28:57 +03:00
e16b87ecc8 Rename random_combinations() parameter from "unordered" to "ordered" as it's more correct 2021-02-25 23:59:34 +03:00
fcb4aa0a11 Fix Monitor being unable to change PG sizes 2021-02-25 23:59:34 +03:00
12adfa470c Add a test for changing PG size 2021-02-25 23:59:33 +03:00
7f15e0c084 Add a simple test for the PG optimizer 2021-02-25 23:59:33 +03:00
08d4bef419 Fix PG optimizer removing PGs without adding new ones
This happened when the distribution was already valid for the current OSD tree,
but didn't use all OSDs. For example, OSDs 1 2 3 and all PGs equal to [ 1, 2 ]
remained unchanged.
2021-02-25 23:59:33 +03:00
2d73b19a6c Fix online PG count change bugs 2021-02-25 23:59:33 +03:00
69c87009e9 Add a test for changing PG count 2021-02-25 23:59:33 +03:00
c974cb539c Make flusher_count adaptive and limit write iodepth 2021-02-25 23:59:33 +03:00
00e98f64f3 A little better systemd unit generator 2021-02-25 23:59:33 +03:00
91a70dfb1b Add a test for the no_same_sector_overwrites mode 2021-02-25 23:59:33 +03:00
178388ac8c Use packages/ subdir instead of build/ for Docker package builds 2021-02-25 23:59:04 +03:00
bf9a175efc Move C/C++ sources to src subdirectory 2021-02-25 23:59:03 +03:00
08aed962de Use CMake 2021-02-25 23:58:08 +03:00
8c65e890b9 Slightly clean up the build script 2021-02-25 23:56:54 +03:00
8cda70b889 Allow to enable AddressSanitizer with "ASAN=1 make" 2021-02-25 23:55:33 +03:00
61ab22403a Use w=8 with jerasure 2021-02-25 23:55:33 +03:00
16da663a66 Add another test for failure domains 2021-02-25 23:55:33 +03:00
4a2dcf7b6b Update the license to VNPL 1.1
VNPL 1.1 is slightly reworded to make it clear that proprietary software
interacting with Vitastor and providing some kind of service to end users isn't
a "Proxy Program" if it's not specially designed to be used with Vitastor.

For example, Windows OS running in a virtual machine stored in a Vitastor
cluster clearly isn't.
2021-02-25 23:55:33 +03:00
8d48cc56b0 Generate randomly permutated OSD combinations when optimizing for compressed chunks 2021-02-25 23:55:33 +03:00
9f58f01425 Mirror afr.js from /vitalif/ceph-afr-calc 2021-02-25 23:55:33 +03:00
131 changed files with 2148 additions and 1389 deletions

View File

@@ -1,5 +1,6 @@
.git .git
build build
packages
mon/node_modules mon/node_modules
*.o *.o
*.so *.so
@@ -15,3 +16,4 @@ fio
qemu qemu
rpm/*.Dockerfile rpm/*.Dockerfile
debian/*.Dockerfile debian/*.Dockerfile
Dockerfile

18
.gitignore vendored Normal file
View File

@@ -0,0 +1,18 @@
*.o
*.so
package-lock.json
fio
qemu
osd
stub_osd
stub_uring_osd
stub_bench
osd_test
osd_peering_pg_test
dump_journal
nbd_proxy
rm_inode
test_allocator
test_blockstore
test_shit
osd_rmw_test

5
CMakeLists.txt Normal file
View File

@@ -0,0 +1,5 @@
cmake_minimum_required(VERSION 2.8)
project(vitastor)
add_subdirectory(src)

View File

@@ -1,46 +0,0 @@
#!/usr/bin/perl
use strict;
my $deps = {};
for my $line (split /\n/, `grep '^#include "' *.cpp *.h`)
{
if ($line =~ /^([^:]+):\#include "([^"]+)"/s)
{
$deps->{$1}->{$2} = 1;
}
}
my $added;
do
{
$added = 0;
for my $file (keys %$deps)
{
for my $dep (keys %{$deps->{$file}})
{
if ($deps->{$dep})
{
for my $subdep (keys %{$deps->{$dep}})
{
if (!$deps->{$file}->{$subdep})
{
$added = 1;
$deps->{$file}->{$subdep} = 1;
}
}
}
}
}
} while ($added);
for my $file (sort keys %$deps)
{
if ($file =~ /\.cpp$/)
{
my $obj = $file;
$obj =~ s/\.cpp$/.o/s;
print "$obj: $file ".join(" ", sort keys %{$deps->{$file}})."\n";
print "\tg++ \$(CXXFLAGS) -c -o \$\@ \$\<\n";
}
}

195
Makefile
View File

@@ -1,195 +0,0 @@
BINDIR ?= /usr/bin
LIBDIR ?= /usr/lib/x86_64-linux-gnu
QEMU_PLUGINDIR ?= /usr/lib/x86_64-linux-gnu/qemu
BLOCKSTORE_OBJS := allocator.o blockstore.o blockstore_impl.o blockstore_init.o blockstore_open.o blockstore_journal.o blockstore_read.o \
blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o
# -fsanitize=address
CXXFLAGS := -g -O3 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fPIC -fdiagnostics-color=always -I/usr/include/jerasure
all: libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal qemu_driver.so nbd_proxy rm_inode
clean:
rm -f *.o libblockstore.so libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal qemu_driver.so nbd_proxy rm_inode
install: all
mkdir -p $(DESTDIR)$(LIBDIR)/vitastor
install -m 0755 libfio_sec_osd.so $(DESTDIR)$(LIBDIR)/vitastor/
install -m 0755 libfio_cluster.so $(DESTDIR)$(LIBDIR)/vitastor/
install -m 0755 libfio_blockstore.so $(DESTDIR)$(LIBDIR)/vitastor/
install -m 0755 libblockstore.so $(DESTDIR)$(LIBDIR)/vitastor/
mkdir -p $(DESTDIR)$(BINDIR)
install -m 0755 osd $(DESTDIR)$(BINDIR)/vitastor-osd
install -m 0755 dump_journal $(DESTDIR)$(BINDIR)/vitastor-dump-journal
install -m 0755 nbd_proxy $(DESTDIR)$(BINDIR)/vitastor-nbd
install -m 0755 rm_inode $(DESTDIR)$(BINDIR)/vitastor-rm
mkdir -p $(DESTDIR)$(QEMU_PLUGINDIR)
install -m 0755 qemu_driver.so $(DESTDIR)$(QEMU_PLUGINDIR)/block-vitastor.so
dump_journal: dump_journal.cpp crc32c.o blockstore_journal.h
g++ $(CXXFLAGS) -o $@ $< crc32c.o
libblockstore.so: $(BLOCKSTORE_OBJS)
g++ $(CXXFLAGS) -o $@ -shared $(BLOCKSTORE_OBJS) -ltcmalloc_minimal -luring
libfio_blockstore.so: ./libblockstore.so fio_engine.o json11.o
g++ $(CXXFLAGS) -Wl,-rpath,'$(LIBDIR)/vitastor',-rpath,'$$ORIGIN' -shared -o $@ fio_engine.o json11.o libblockstore.so -ltcmalloc_minimal -luring
OSD_OBJS := osd.o osd_secondary.o msgr_receive.o msgr_send.o osd_peering.o osd_flush.o osd_peering_pg.o \
osd_primary.o osd_primary_subops.o etcd_state_client.o messenger.o osd_cluster.o http_client.o osd_ops.o pg_states.o \
osd_rmw.o json11.o base64.o timerfd_manager.o epoll_manager.o
osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
g++ $(CXXFLAGS) -Wl,-rpath,'$(LIBDIR)/vitastor',-rpath,'$$ORIGIN' -o $@ osd_main.cpp $(OSD_OBJS) libblockstore.so -ltcmalloc_minimal -luring -lJerasure
stub_osd: stub_osd.o rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_osd.o rw_blocking.o -ltcmalloc_minimal
osd_rmw_test: osd_rmw_test.o
g++ $(CXXFLAGS) -o $@ osd_rmw_test.o -lJerasure -fsanitize=address
STUB_URING_OSD_OBJS := stub_uring_osd.o epoll_manager.o messenger.o msgr_send.o msgr_receive.o ringloop.o timerfd_manager.o json11.o
stub_uring_osd: $(STUB_URING_OSD_OBJS)
g++ $(CXXFLAGS) -o $@ -ltcmalloc_minimal $(STUB_URING_OSD_OBJS) -luring
stub_bench: stub_bench.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_bench.cpp rw_blocking.o -ltcmalloc_minimal
osd_test: osd_test.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ osd_test.cpp rw_blocking.o -ltcmalloc_minimal
osd_peering_pg_test: osd_peering_pg_test.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o $@ $< osd_peering_pg.o -ltcmalloc_minimal
libfio_sec_osd.so: fio_sec_osd.o rw_blocking.o
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ fio_sec_osd.o rw_blocking.o
FIO_CLUSTER_OBJS := cluster_client.o epoll_manager.o etcd_state_client.o \
messenger.o msgr_send.o msgr_receive.o ringloop.o json11.o http_client.o osd_ops.o pg_states.o timerfd_manager.o base64.o
libfio_cluster.so: fio_cluster.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $< $(FIO_CLUSTER_OBJS) -luring
nbd_proxy: nbd_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -o $@ $< $(FIO_CLUSTER_OBJS) -luring
rm_inode: rm_inode.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -o $@ $< $(FIO_CLUSTER_OBJS) -luring
qemu_driver.o: qemu_driver.c qemu_proxy.h
gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-I qemu/include $(CXXFLAGS) -c -o $@ $<
qemu_driver.so: qemu_driver.o qemu_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $(FIO_CLUSTER_OBJS) qemu_driver.o qemu_proxy.o -luring
test_blockstore: ./libblockstore.so test_blockstore.cpp timerfd_interval.o
g++ $(CXXFLAGS) -Wl,-rpath,'$(LIBDIR)/vitastor',-rpath,'$$ORIGIN' -o test_blockstore test_blockstore.cpp timerfd_interval.o libblockstore.so -ltcmalloc_minimal -luring
test_shit: test_shit.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o test_shit test_shit.cpp -luring -lm
test_allocator: test_allocator.cpp allocator.o
g++ $(CXXFLAGS) -o test_allocator test_allocator.cpp allocator.o
crc32c.o: crc32c.c crc32c.h
g++ $(CXXFLAGS) -c -o $@ $<
json11.o: json11/json11.cpp
g++ $(CXXFLAGS) -c -o json11.o json11/json11.cpp
# Autogenerated
allocator.o: allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
base64.o: base64.cpp base64.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore.o: blockstore.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_flush.o: blockstore_flush.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_impl.o: blockstore_impl.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_init.o: blockstore_init.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_journal.o: blockstore_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_open.o: blockstore_open.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_read.o: blockstore_read.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_rollback.o: blockstore_rollback.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_stable.o: blockstore_stable.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_sync.o: blockstore_sync.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_write.o: blockstore_write.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
cluster_client.o: cluster_client.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
dump_journal.o: dump_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
epoll_manager.o: epoll_manager.cpp epoll_manager.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
etcd_state_client.o: etcd_state_client.cpp base64.h etcd_state_client.h http_client.h json11/json11.hpp object_id.h osd_id.h osd_ops.h pg_states.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_cluster.o: fio_cluster.cpp cluster_client.h epoll_manager.h etcd_state_client.h fio/arch/arch.h fio/fio.h fio/optgroup.h fio_headers.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_engine.o: fio_engine.cpp blockstore.h fio/arch/arch.h fio/fio.h fio/optgroup.h fio_headers.h json11/json11.hpp object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_sec_osd.o: fio_sec_osd.cpp fio/arch/arch.h fio/fio.h fio/optgroup.h fio_headers.h object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
http_client.o: http_client.cpp http_client.h json11/json11.hpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
messenger.o: messenger.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_receive.o: msgr_receive.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_send.o: msgr_send.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
nbd_proxy.o: nbd_proxy.cpp cluster_client.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd.o: osd.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_cluster.o: osd_cluster.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_flush.o: osd_flush.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_main.o: osd_main.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_ops.o: osd_ops.cpp object_id.h osd_id.h osd_ops.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering.o: osd_peering.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg.o: osd_peering_pg.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg_test.o: osd_peering_pg_test.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary.o: osd_primary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary_subops.o: osd_primary_subops.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw.o: osd_rmw.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw_test.o: osd_rmw_test.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.cpp osd_rmw.h test_pattern.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_secondary.o: osd_secondary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_test.o: osd_test.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h test_pattern.h
g++ $(CXXFLAGS) -c -o $@ $<
pg_states.o: pg_states.cpp pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
qemu_proxy.o: qemu_proxy.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h qemu_proxy.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
ringloop.o: ringloop.cpp ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
rm_inode.o: rm_inode.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
rw_blocking.o: rw_blocking.cpp rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_bench.o: stub_bench.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_osd.o: stub_osd.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_uring_osd.o: stub_uring_osd.cpp epoll_manager.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
test_allocator.o: test_allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
test_blockstore.o: test_blockstore.cpp blockstore.h object_id.h ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
test_shit.o: test_shit.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_interval.o: timerfd_interval.cpp ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_manager.o: timerfd_manager.cpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<

View File

@@ -336,9 +336,8 @@ Vitastor with single-thread NBD on the same hardware:
- You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver. - You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
See `qemu-*.*-vitastor.patch`. See `qemu-*.*-vitastor.patch`.
- Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`. - Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
- Build Vitastor with `make -j8`. - Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
- Run `make install` (optionally with `LIBDIR=/usr/lib64 QEMU_PLUGINDIR=/usr/lib64/qemu-kvm` Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
if you're using an RPM-based distro).
## Running ## Running
@@ -349,21 +348,16 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
with lazy fsync, but prepare for inferior single-thread latency. with lazy fsync, but prepare for inferior single-thread latency.
- Get a fast network (at least 10 Gbit/s). - Get a fast network (at least 10 Gbit/s).
- Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`. - Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Start etcd with `--max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision` options. - Check `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh` and
- Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'` put desired values into the variables at the top of these files.
(if all your drives have capacitors). - Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
- Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`. - Create systemd units for your OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`. - You can edit the units and change OSD configuration. Notable configuration variables:
- Calculate offsets for your drives with `node /usr/lib/vitastor/mon/simple-offsets.js --device /dev/sdX`.
- Make systemd units for your OSDs. Look at `/usr/lib/vitastor/mon/make-units.sh` for example.
Notable configuration variables from the example:
- `disable_data_fsync 1` - only safe with server-grade drives with capacitors. - `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
- `immediate_commit all` - use this if all your drives are server-grade. - `immediate_commit all` - use this if all your drives are server-grade.
- `disable_device_lock 1` - only required if you run multiple OSDs on one block device. - `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
- `flusher_count 16` - flusher is a micro-thread that removes old data from the journal. - `flusher_count 256` - flusher is a micro-thread that removes old data from the journal.
More flushers mean more aggressive journal flushing which allows for more throughput You don't have to worry about this parameter anymore, 256 is enough.
but slightly hurts latency under less load. Flushing will probably be improved in the future
because currently high queue depths sometimes lead to performance degradation.
- `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
block size of your SSDs which is 4096 on most drives. block size of your SSDs which is 4096 on most drives.
- `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector. - `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
@@ -374,18 +368,22 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
setting is set, it is also required to raise `journal_sector_buffer_count` setting, which is the setting is set, it is also required to raise `journal_sector_buffer_count` setting, which is the
number of dirty journal sectors that may be written to at the same time. number of dirty journal sectors that may be written to at the same time.
- `systemctl start vitastor.target` everywhere. - `systemctl start vitastor.target` everywhere.
- Start any number of monitors: `node /usr/lib/vitastor/mon/mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5`. - Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives have capacitors).
- Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- At this point, one of the monitors will configure PGs and OSDs will start them. - At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'. - You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
- Run tests with (for example): `fio -thread -ioengine=/usr/lib/x86_64-linux-gnu/vitastor/libfio_cluster.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`. - Run tests with (for example): `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Upload VM disk image with qemu-img (for example): - Upload VM disk image with qemu-img (for example):
``` ```
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img convert -f qcow2 debian10.qcow2 -p qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
-O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
``` ```
Note that the command requires to be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
- Run QEMU with (for example): - Run QEMU with (for example):
``` ```
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-system-x86_64 -enable-kvm -m 1024 qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none -drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0 -vnc 0.0.0.0:0
@@ -421,22 +419,27 @@ Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
All server-side code (OSD, Monitor and so on) is licensed under the terms of All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network ("Proxy Programs"). Proxy Programs may be made public through a computer network and expressly designed to be used in conjunction
not only under the terms of the same license, but also under the terms of any with it ("Proxy Programs"). Proxy Programs may be made public not only under
GPL-Compatible Free Software License, as listed by the Free Software Foundation. the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL. This is a stricter copyleft license than the Affero GPL.
Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.
Basically, you can't use the software in a proprietary environment to provide Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license standing between the user and Vitastor or purchasing a commercial license
from the author 😀. from the author 😀.
Client libraries (cluster_client and so on) are dual-licensed under the same Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio. software like QEMU and fio.
You can find the full text of VNPL-1.0 in the file [VNPL-1.0.txt](VNPL-1.0.txt). You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt). GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

View File

@@ -1,7 +1,7 @@
VITASTOR NETWORK PUBLIC LICENSE VITASTOR NETWORK PUBLIC LICENSE
Version 1, 17 September 2020 Version 1.1, 6 February 2021
Copyright (C) 2020 Vitaliy Filippov <vitalif@yourcmc.ru> Copyright (C) 2021 Vitaliy Filippov <vitalif@yourcmc.ru>
Everyone is permitted to copy and distribute verbatim copies Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed. of this license document, but changing it is not allowed.
@@ -540,12 +540,15 @@ License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction. 13. Remote Network Interaction.
Notwithstanding any other provision of this License, if you provide A "Proxy Program" means a separate program which is specially designed to
any user an opportunity to interact with the covered work directly be used in conjunction with the covered work and interacts with it directly
or indirectly through a computer network, an imitation of such network, or indirectly through any kind of API (application programming interfaces),
or an additional program (hereinafter referred to as a "Proxy Program") a computer network, an imitation of such network, or another Proxy Program
that, in turn, interacts with the covered work through a computer network, itself.
an imitation of such network, or another Proxy Program itself,
Notwithstanding any other provision of this License, if you provide any user
with an opportunity to interact with the covered work through a computer
network, an imitation of such network, or any number of "Proxy Programs",
you must prominently offer that user an opportunity to receive the you must prominently offer that user an opportunity to receive the
Corresponding Source of the covered work and all Proxy Programs from a Corresponding Source of the covered work and all Proxy Programs from a
network server at no charge, through some standard or customary means of network server at no charge, through some standard or customary means of

View File

@@ -1,6 +1,6 @@
#!/bin/bash #!/bin/bash
gcc -E -o fio_headers.i fio_headers.h gcc -I. -E -o fio_headers.i src/fio_headers.h
rm -rf fio-copy rm -rf fio-copy
for i in `grep -Po 'fio/[^"]+' fio_headers.i | sort | uniq`; do for i in `grep -Po 'fio/[^"]+' fio_headers.i | sort | uniq`; do

View File

@@ -5,7 +5,7 @@
#cd b/qemu; make qapi #cd b/qemu; make qapi
gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \ gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-I qemu/include -E -o qemu_driver.i qemu_driver.c -I qemu/include -E -o qemu_driver.i src/qemu_driver.c
rm -rf qemu-copy rm -rf qemu-copy
for i in `grep -Po 'qemu/[^"]+' qemu_driver.i | sort | uniq`; do for i in `grep -Po 'qemu/[^"]+' qemu_driver.i | sort | uniq`; do

7
debian/build-vitastor-bullseye.sh vendored Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
sed 's/$REL/bullseye/g' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

7
debian/build-vitastor-buster.sh vendored Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
sed 's/$REL/buster/g' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

2
debian/changelog vendored
View File

@@ -1,4 +1,4 @@
vitastor (0.5.4-1) unstable; urgency=medium vitastor (0.5.10-1) unstable; urgency=medium
* Bugfixes * Bugfixes

13
debian/copyright vendored
View File

@@ -5,16 +5,17 @@ Source: https://vitastor.io
Files: * Files: *
Copyright: 2019+ Vitaliy Filippov <vitalif@yourcmc.ru> Copyright: 2019+ Vitaliy Filippov <vitalif@yourcmc.ru>
License: Multiple licenses VNPL-1.0 and/or GPL-2.0+ License: Multiple licenses VNPL-1.1 and/or GPL-2.0+
All server-side code (OSD, Monitor and so on) is licensed under the terms of All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network ("Proxy Programs"). Proxy Programs may be made public through a computer network and expressly designed to be used in conjunction
not only under the terms of the same license, but also under the terms of any with it ("Proxy Programs"). Proxy Programs may be made public not only under
GPL-Compatible Free Software License, as listed by the Free Software Foundation. the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL. This is a stricter copyleft license than the Affero GPL.
. .
Client libraries (cluster_client and so on) are dual-licensed under the same Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio. software like QEMU and fio.

2
debian/install vendored
View File

@@ -1,3 +1,3 @@
VNPL-1.0.txt usr/share/doc/vitastor VNPL-1.1.txt usr/share/doc/vitastor
GPL-2.0.txt usr/share/doc/vitastor GPL-2.0.txt usr/share/doc/vitastor
mon usr/lib/vitastor mon usr/lib/vitastor

View File

@@ -1,13 +1,8 @@
# Build patched QEMU for Debian Buster or Bullseye/Sid inside a container # Build patched QEMU for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/build:/root/build -f debian/patched-qemu.Dockerfile . # cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/patched-qemu.Dockerfile .
ARG REL=bullseye
FROM debian:$REL FROM debian:$REL
# again, it doesn't work otherwise
ARG REL=bullseye
WORKDIR /root WORKDIR /root
RUN if [ "$REL" = "buster" ]; then \ RUN if [ "$REL" = "buster" ]; then \
@@ -30,20 +25,20 @@ RUN apt-get --download-only source fio
ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/ ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/
RUN set -e; \ RUN set -e; \
mkdir -p /root/build/qemu-$REL; \ mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/build/qemu-$REL/*; \ rm -rf /root/packages/qemu-$REL/*; \
cd /root/build/qemu-$REL; \ cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \ dpkg-source -x /root/qemu*.dsc; \
if [ -d /root/build/qemu-$REL/qemu-5.0 ]; then \ if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
cp /root/vitastor/qemu-5.0-vitastor.patch /root/build/qemu-$REL/qemu-5.0/debian/patches; \ cp /root/vitastor/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
echo qemu-5.0-vitastor.patch >> /root/build/qemu-$REL/qemu-5.0/debian/patches/series; \ echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
else \ else \
cp /root/vitastor/qemu-5.1-vitastor.patch /root/build/qemu-$REL/qemu-*/debian/patches; \ cp /root/vitastor/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
P=`ls -d /root/build/qemu-$REL/qemu-*/debian/patches`; \ P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
echo qemu-5.1-vitastor.patch >> $P/series; \ echo qemu-5.1-vitastor.patch >> $P/series; \
fi; \ fi; \
cd /root/build/qemu-$REL/qemu-*/; \ cd /root/packages/qemu-$REL/qemu-*/; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor1; \ V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor1; \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Plug Vitastor block driver'; \ DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Plug Vitastor block driver'; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \ DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
rm -rf /root/build/qemu-$REL/qemu-*/ rm -rf /root/packages/qemu-$REL/qemu-*/

View File

@@ -1,13 +1,8 @@
# Build Vitastor packages for Debian Buster or Bullseye/Sid inside a container # Build Vitastor packages for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/build:/root/build -f debian/vitastor.Dockerfile . # cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
ARG REL=bullseye
FROM debian:$REL FROM debian:$REL
# again, it doesn't work otherwise
ARG REL=bullseye
WORKDIR /root WORKDIR /root
RUN if [ "$REL" = "buster" ]; then \ RUN if [ "$REL" = "buster" ]; then \
@@ -27,7 +22,7 @@ RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio RUN apt-get -y build-dep fio
RUN apt-get --download-only source qemu RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio RUN apt-get --download-only source fio
RUN apt-get -y install libjerasure-dev RUN apt-get -y install libjerasure-dev cmake
ADD . /root/vitastor ADD . /root/vitastor
RUN set -e -x; \ RUN set -e -x; \
@@ -35,20 +30,20 @@ RUN set -e -x; \
cd /root/fio-build/; \ cd /root/fio-build/; \
rm -rf /root/fio-build/*; \ rm -rf /root/fio-build/*; \
dpkg-source -x /root/fio*.dsc; \ dpkg-source -x /root/fio*.dsc; \
cd /root/build/qemu-$REL/; \ cd /root/packages/qemu-$REL/; \
rm -rf qemu*/; \ rm -rf qemu*/; \
dpkg-source -x qemu*.dsc; \ dpkg-source -x qemu*.dsc; \
cd /root/build/qemu-$REL/qemu*/; \ cd /root/packages/qemu-$REL/qemu*/; \
debian/rules b/configure-stamp; \ debian/rules b/configure-stamp; \
cd b/qemu; \ cd b/qemu; \
make -j8 qapi; \ make -j8 qapi/qapi-builtin-types.h; \
mkdir -p /root/build/vitastor-$REL; \ mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/build/vitastor-$REL/*; \ rm -rf /root/packages/vitastor-$REL/*; \
cd /root/build/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.5.4; \ cp -r /root/vitastor vitastor-0.5.10; \
ln -s /root/build/qemu-$REL/qemu-*/ vitastor-0.5.4/qemu; \ ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.10/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.5.4/fio; \ ln -s /root/fio-build/fio-*/ vitastor-0.5.10/fio; \
cd vitastor-0.5.4; \ cd vitastor-0.5.10; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \ sh copy-qemu-includes.sh; \
@@ -60,13 +55,13 @@ RUN set -e -x; \
diff -NaurpbB a b > debian/patches/qemu-fio-headers.patch || true; \ diff -NaurpbB a b > debian/patches/qemu-fio-headers.patch || true; \
echo qemu-fio-headers.patch >> debian/patches/series; \ echo qemu-fio-headers.patch >> debian/patches/series; \
rm -rf a b; \ rm -rf a b; \
rm -rf /root/build/qemu-$REL/qemu*/; \ rm -rf /root/packages/qemu-$REL/qemu*/; \
echo "dep:fio=$FIO" > debian/substvars; \ echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \ echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/build/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.4.orig.tar.xz vitastor-0.5.4; \ tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.10.orig.tar.xz vitastor-0.5.10; \
cd vitastor-0.5.4; \ cd vitastor-0.5.10; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \ DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \ DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
rm -rf /root/build/vitastor-$REL/vitastor-*/ rm -rf /root/packages/vitastor-$REL/vitastor-*/

View File

@@ -1,51 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <iostream>
#include <functional>
#include <array>
#include <cstdlib> // for malloc() and free()
using namespace std;
// replace operator new and delete to log allocations
void* operator new(std::size_t n)
{
cout << "Allocating " << n << " bytes" << endl;
return malloc(n);
}
void operator delete(void* p) throw()
{
free(p);
}
class test
{
public:
std::string s;
void a(std::function<void()> & f, const char *str)
{
auto l = [this, str]() { cout << str << " ? " << s << " from this\n"; };
cout << "Assigning lambda3 of size " << sizeof(l) << endl;
f = l;
}
};
int main()
{
std::array<char, 16> arr1;
auto lambda1 = [arr1](){};
cout << "Assigning lambda1 of size " << sizeof(lambda1) << endl;
std::function<void()> f1 = lambda1;
std::array<char, 17> arr2;
auto lambda2 = [arr2](){};
cout << "Assigning lambda2 of size " << sizeof(lambda2) << endl;
std::function<void()> f2 = lambda2;
test t;
std::function<void()> f3;
t.s = "str";
t.a(f3, "huyambda");
f3();
}

View File

@@ -1,22 +1,59 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
module.exports = { module.exports = {
scale_pg_count, scale_pg_count,
}; };
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
{
if (!new_pg_history[new_pg])
{
new_pg_history[new_pg] = {
osd_sets: {},
all_peers: {},
epoch: 0,
};
}
const nh = new_pg_history[new_pg], oh = prev_pg_history[old_pg];
nh.osd_sets[prev_pgs[old_pg].join(' ')] = prev_pgs[old_pg];
if (oh && oh.osd_sets && oh.osd_sets.length)
{
for (const pg of oh.osd_sets)
{
nh.osd_sets[pg.join(' ')] = pg;
}
}
if (oh && oh.all_peers && oh.all_peers.length)
{
for (const osd_num of oh.all_peers)
{
nh.all_peers[osd_num] = Number(osd_num);
}
}
if (oh && oh.epoch)
{
nh.epoch = nh.epoch < oh.epoch ? oh.epoch : nh.epoch;
}
}
function finish_pg_history(merged_history)
{
merged_history.osd_sets = Object.values(merged_history.osd_sets);
merged_history.all_peers = Object.values(merged_history.all_peers);
}
function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count) function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{ {
const old_pg_count = prev_pgs.length; const old_pg_count = prev_pgs.length;
// Add all possibly intersecting PGs into the history of new PGs // Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count)) if (!(new_pg_count % old_pg_count))
{ {
// New PG count is a multiple of the old PG count // New PG count is a multiple of old PG count
const mul = (new_pg_count / old_pg_count);
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
{ {
const old_i = Math.floor(new_pg_count / mul); add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
new_pg_history[i] = JSON.parse(JSON.stringify(prev_pg_history[1+old_i])); finish_pg_history(new_pg_history[i]);
} }
} }
else if (!(old_pg_count % new_pg_count)) else if (!(old_pg_count % new_pg_count))
@@ -25,68 +62,26 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
const mul = (old_pg_count / new_pg_count); const mul = (old_pg_count / new_pg_count);
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
{ {
new_pg_history[i] = {
osd_sets: [],
all_peers: [],
epoch: 0,
};
for (let j = 0; j < mul; j++) for (let j = 0; j < mul; j++)
{ {
new_pg_history[i].osd_sets.push(prev_pgs[i*mul]); add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
const hist = prev_pg_history[1+i*mul+j];
if (hist && hist.osd_sets && hist.osd_sets.length)
{
Array.prototype.push.apply(new_pg_history[i].osd_sets, hist.osd_sets);
}
if (hist && hist.all_peers && hist.all_peers.length)
{
Array.prototype.push.apply(new_pg_history[i].all_peers, hist.all_peers);
}
if (hist && hist.epoch)
{
new_pg_history[i].epoch = new_pg_history[i].epoch < hist.epoch ? hist.epoch : new_pg_history[i].epoch;
}
} }
finish_pg_history(new_pg_history[i]);
} }
} }
else else
{ {
// Any PG may intersect with any PG after non-multiple PG count change // Any PG may intersect with any PG after non-multiple PG count change
// So, merge ALL PGs history // So, merge ALL PGs history
let all_sets = {}; let merged_history = {};
let all_peers = {}; for (let i = 0; i < old_pg_count; i++)
let max_epoch = 0;
for (const pg of prev_pgs)
{ {
all_sets[pg.join(' ')] = pg; add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
} }
for (const pg in prev_pg_history) finish_pg_history(merged_history[1]);
{
const hist = prev_pg_history[pg];
if (hist && hist.osd_sets)
{
for (const pg of hist.osd_sets)
{
all_sets[pg.join(' ')] = pg;
}
}
if (hist && hist.all_peers)
{
for (const osd_num of hist.all_peers)
{
all_peers[osd_num] = Number(osd_num);
}
}
if (hist && hist.epoch)
{
max_epoch = max_epoch < hist.epoch ? hist.epoch : max_epoch;
}
}
all_sets = Object.values(all_sets);
all_peers = Object.values(all_peers);
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
{ {
new_pg_history[i] = { osd_sets: all_sets, all_peers, epoch: max_epoch }; new_pg_history[i] = { ...merged_history[1] };
} }
} }
// Mark history keys for removed PGs as removed // Mark history keys for removed PGs as removed
@@ -94,19 +89,16 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{ {
new_pg_history[i] = null; new_pg_history[i] = null;
} }
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (old_pg_count < new_pg_count) if (old_pg_count < new_pg_count)
{ {
for (let i = new_pg_count-1; i >= 0; i--) for (let i = old_pg_count; i < new_pg_count; i++)
{ {
prev_pgs[i] = prev_pgs[Math.floor(i/new_pg_count*old_pg_count)]; prev_pgs[i] = prev_pgs[i % old_pg_count];
} }
} }
else if (old_pg_count > new_pg_count) else if (old_pg_count > new_pg_count)
{ {
for (let i = 0; i < new_pg_count; i++)
{
prev_pgs[i] = prev_pgs[Math.round(i/new_pg_count*old_pg_count)];
}
prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count); prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
} }
} }

View File

@@ -1,31 +1,16 @@
// Functions to calculate Annualized Failure Rate of your cluster // Functions to calculate Annualized Failure Rate of your cluster
// if you know AFR of your drives, number of drives, expected rebalance time // if you know AFR of your drives, number of drives, expected rebalance time
// and replication factor // and replication factor
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md for details) or AGPL-3.0
// Author: Vitaliy Filippov, 2020+
const { sprintf } = require('sprintf-js');
module.exports = { module.exports = {
cluster_afr_fullmesh, cluster_afr_fullmesh,
failure_rate_fullmesh, failure_rate_fullmesh,
cluster_afr, cluster_afr,
print_cluster_afr,
c_n_k, c_n_k,
}; };
print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
/******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/ /******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/
// Estimate AFR of the cluster // Estimate AFR of the cluster
@@ -56,93 +41,38 @@ function failure_rate_fullmesh(n, a, f)
/******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/ /******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/
// <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s, // <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s,
// <k> replicas, <pgs> unique peer PGs per OSD // <k> replicas, <pgs> unique peer PGs per OSD (~50 for 100 PG-per-OSD in a big cluster)
// //
// For each of n*m drives: P(drive fails in a year) * P(any of its peers fail in <l*365> next days). // For each of n*m drives: P(drive fails in a year) * P(any of its peers fail in <l*365> next days).
// More peers per OSD increase rebalance speed (more drives work together to resilver) if you // More peers per OSD increase rebalance speed (more drives work together to resilver) if you
// let them finish rebalance BEFORE replacing the failed drive. // let them finish rebalance BEFORE replacing the failed drive (degraded_replacement=false).
// At the same time, more peers per OSD increase probability of any of them to fail! // At the same time, more peers per OSD increase probability of any of them to fail!
// osd_rm=true means that failed OSDs' data is rebalanced over all other hosts,
// not over the same host as it's in Ceph by default (dead OSDs are marked 'out').
// //
// Probability of all except one drives in a replica group to fail is (AFR^(k-1)). // Probability of all except one drives in a replica group to fail is (AFR^(k-1)).
// So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)). Interesting but reasonable consequence // So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)). Interesting but reasonable consequence
// is that, with k=2, total failure rate doesn't depend on number of peers per OSD, // is that, with k=2, total failure rate doesn't depend on number of peers per OSD,
// because it gets increased linearly by increased number of peers to fail // because it gets increased linearly by increased number of peers to fail
// and decreased linearly by reduced rebalance time. // and decreased linearly by reduced rebalance time.
function cluster_afr_pgs({ n_hosts, n_drives, afr_drive, capacity, speed, replicas, pgs = 1, degraded_replacement }) function cluster_afr({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec, ec_data, ec_parity, replicas, pgs = 1, osd_rm, degraded_replacement, down_out_interval = 600 })
{ {
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1)); const pg_size = (ec ? ec_data+ec_parity : replicas);
const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365; pgs = Math.min(pgs, (n_hosts-1)*n_drives/(pg_size-1));
return 1 - (1 - afr_drive * (1-(1-(afr_drive*l)**(replicas-1))**pgs)) ** (n_hosts*n_drives); const host_pgs = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(pg_size-1));
} const resilver_disk = n_drives == 1 || osd_rm ? pgs : (n_drives-1);
const disk_heal_time = (down_out_interval + capacity/(degraded_replacement ? 1 : resilver_disk)/speed)/86400/365;
function cluster_afr_pgs_ec({ n_hosts, n_drives, afr_drive, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement }) const host_heal_time = (down_out_interval + n_drives*capacity/pgs/speed)/86400/365;
{ const disk_heal_fail = ((afr_drive+afr_host/n_drives)*disk_heal_time);
const ec_total = ec_data+ec_parity; const host_heal_fail = ((afr_drive+afr_host/n_drives)*host_heal_time);
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1)); const disk_pg_fail = ec
const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365; ? failure_rate_fullmesh(ec_data+ec_parity-1, disk_heal_fail, ec_parity)
return 1 - (1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, afr_drive*l, ec_parity))**pgs)) ** (n_hosts*n_drives); : disk_heal_fail**(replicas-1);
} const host_pg_fail = ec
? failure_rate_fullmesh(ec_data+ec_parity-1, host_heal_fail, ec_parity)
// Same as above, but also take server failures into account : host_heal_fail**(replicas-1);
function cluster_afr_pgs_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, replicas, pgs = 1, degraded_replacement }) return 1 - ((1 - afr_drive * (1-(1-disk_pg_fail)**pgs)) ** (n_hosts*n_drives))
{ * ((1 - afr_host * (1-(1-host_pg_fail)**host_pgs)) ** n_hosts);
let otherhosts = Math.min(pgs, (n_hosts-1)/(replicas-1));
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
let pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(replicas-1));
const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
const lh = n_drives*capacity/pgs/speed/86400/365;
const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
return 1 - ((1 - afr_host * (1-(1-p1**(replicas-1))**pgh)) ** n_hosts) *
((1 - afr_drive * (1-(1-p2**(replicas-1))**pgs)) ** (n_hosts*n_drives));
}
function cluster_afr_pgs_ec_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
{
const ec_total = ec_data+ec_parity;
const otherhosts = Math.min(pgs, (n_hosts-1)/(ec_total-1));
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
const pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(ec_total-1));
const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
const lh = n_drives*capacity/pgs/speed/86400/365;
const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
return 1 - ((1 - afr_host * (1-(1-failure_rate_fullmesh(ec_total-1, p1, ec_parity))**pgh)) ** n_hosts) *
((1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, p2, ec_parity))**pgs)) ** (n_hosts*n_drives));
}
// Wrapper for 4 above functions
function cluster_afr(config)
{
if (config.ec && config.afr_host)
{
return cluster_afr_pgs_ec_hosts(config);
}
else if (config.ec)
{
return cluster_afr_pgs_ec(config);
}
else if (config.afr_host)
{
return cluster_afr_pgs_hosts(config);
}
else
{
return cluster_afr_pgs(config);
}
}
function print_cluster_afr(config)
{
console.log(
`${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
`, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
(config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
(config.ec ? `, EC ${config.ec[0]}+${config.ec[1]}` : `, ${config.replicas} replicas`)+
`, ${config.pgs||1} PG per OSD`+
(config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
);
console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
} }
/******** UTILITY ********/ /******** UTILITY ********/

28
mon/afr_test.js Normal file
View File

@@ -0,0 +1,28 @@
const { sprintf } = require('sprintf-js');
const { cluster_afr } = require('./afr.js');
print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
function print_cluster_afr(config)
{
console.log(
`${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
`, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
(config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
(config.ec ? `, EC ${config.ec_data}+${config.ec_parity}` : `, ${config.replicas} replicas`)+
`, ${config.pgs||1} PG per OSD`+
(config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
);
console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// Data distribution optimizer using linear programming (lp_solve) // Data distribution optimizer using linear programming (lp_solve)
@@ -58,7 +58,7 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
} }
const all_weights = Object.assign({}, ...Object.values(osd_tree)); const all_weights = Object.assign({}, ...Object.values(osd_tree));
const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0); const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations)); const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1));
const pg_per_osd = {}; const pg_per_osd = {};
for (const pg of all_pgs) for (const pg of all_pgs)
{ {
@@ -249,7 +249,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
} }
} }
// Get all combinations // Get all combinations
let all_pgs = random_combinations(osd_tree, pg_size, max_combinations); let all_pgs = random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1);
add_valid_previous(osd_tree, prev_weights, all_pgs); add_valid_previous(osd_tree, prev_weights, all_pgs);
all_pgs = Object.values(all_pgs); all_pgs = Object.values(all_pgs);
const pg_per_osd = {}; const pg_per_osd = {};
@@ -275,6 +275,11 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
lp += 'max: '+all_pg_names.map(pg_name => ( lp += 'max: '+all_pg_names.map(pg_name => (
prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}` prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}`
)).join(' + ')+';\n'; )).join(' + ')+';\n';
lp += all_pg_names
.map(pg_name => (prev_weights[pg_name] ? `add_${pg_name} - del_${pg_name}` : `${pg_name}`))
.join(' + ')+' = '+(pg_count
- Object.keys(prev_weights).reduce((a, old_pg_name) => (a + (all_pgs_hash[old_pg_name] ? prev_weights[old_pg_name] : 0)), 0)
)+';\n';
for (const osd in pg_per_osd) for (const osd in pg_per_osd)
{ {
if (osd !== NO_OSD) if (osd !== NO_OSD)
@@ -488,7 +493,8 @@ function extract_osds(osd_tree, levels, osd_level, osds = {})
return osds; return osds;
} }
function random_combinations(osd_tree, pg_size, count) // ordered = don't treat (x,y) and (y,x) as equal
function random_combinations(osd_tree, pg_size, count, ordered)
{ {
let seed = 0x5f020e43; let seed = 0x5f020e43;
let rng = () => let rng = () =>
@@ -516,25 +522,47 @@ function random_combinations(osd_tree, pg_size, count)
pg.push(osds[cur_hosts[next_host]][next_osd]); pg.push(osds[cur_hosts[next_host]][next_osd]);
cur_hosts.splice(next_host, 1); cur_hosts.splice(next_host, 1);
} }
while (pg.length < pg_size) const cyclic_pgs = [ pg ];
if (ordered)
{ {
pg.push(NO_OSD); for (let i = 1; i < pg.size; i++)
{
cyclic_pgs.push([ ...pg.slice(i), ...pg.slice(0, i) ]);
}
}
for (const pg of cyclic_pgs)
{
while (pg.length < pg_size)
{
pg.push(NO_OSD);
}
r['pg_'+pg.join('_')] = pg;
} }
r['pg_'+pg.join('_')] = pg;
} }
} }
// Generate purely random combinations // Generate purely random combinations
restart: while (count > 0) while (count > 0)
{ {
let host_idx = []; let host_idx = [];
for (let i = 0; i < pg_size && i < hosts.length; i++) const cur_hosts = [ ...hosts.map((h, i) => i) ];
const max_hosts = pg_size < hosts.length ? pg_size : hosts.length;
if (ordered)
{ {
let start = i > 0 ? host_idx[i-1]+1 : 0; for (let i = 0; i < max_hosts; i++)
if (start >= hosts.length)
{ {
continue restart; const r = rng() % cur_hosts.length;
host_idx[i] = cur_hosts[r];
cur_hosts.splice(r, 1);
}
}
else
{
for (let i = 0; i < max_hosts; i++)
{
const r = rng() % (cur_hosts.length - (max_hosts - i - 1));
host_idx[i] = cur_hosts[r];
cur_hosts.splice(0, r+1);
} }
host_idx[i] = start + rng() % (hosts.length-start);
} }
let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]); let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]);
while (pg.length < pg_size) while (pg.length < pg_size)

76
mon/make-osd.sh Executable file
View File

@@ -0,0 +1,76 @@
#!/bin/bash
# Very simple systemd unit generator for vitastor-osd services
# Not the final solution yet, mostly for tests
# Copyright (c) Vitaliy Filippov, 2019+
# License: MIT
# USAGE: ./make-osd.sh /dev/disk/by-partuuid/xxx [ /dev/disk/by-partuuid/yyy]...
IP_SUBSTR="10.200.1."
ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
set -e -x
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
[ "$IP" != "" ] || exit 1
ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
D=`dirname $0`
# Create OSDs on all passed devices
OSD_NUM=1
for DEV in $*; do
# Ugly :) -> node.js rework pending
while true; do
ST=$(etcdctl --endpoints="$ETCD_MON" get --print-value-only /vitastor/osd/stats/$OSD_NUM)
if [ "$ST" = "" ]; then
break
fi
OSD_NUM=$((OSD_NUM+1))
done
etcdctl --endpoints="$ETCD_MON" put /vitastor/osd/stats/$OSD_NUM '{}'
echo Creating OSD $OSD_NUM on $DEV
OPT=`node $D/simple-offsets.js --device $DEV --format options | tr '\n' ' '`
META=`echo $OPT | grep -Po '(?<=data_offset )\d+'`
dd if=/dev/zero of=$DEV bs=1048576 count=$(((META+1048575)/1048576)) oflag=direct
cat >/etc/systemd/system/vitastor-osd$OSD_NUM.service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$OSD_NUM
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/usr/bin/vitastor-osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $OSD_NUM \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 256 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
$OPT
WorkingDirectory=/
ExecStartPre=+chown vitastor:vitastor $DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$OSD_NUM
done

138
mon/make-units.sh Normal file → Executable file
View File

@@ -1,19 +1,25 @@
#!/bin/bash #!/bin/bash
# Example startup script generator # Very simple systemd unit generator for etcd & vitastor-mon services
# Of course this isn't a production solution yet, this is just for tests # Not the final solution yet, mostly for tests
# Copyright (c) Vitaliy Filippov, 2019+ # Copyright (c) Vitaliy Filippov, 2019+
# License: MIT # License: MIT
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.broadcast == "10.115.0.255") | .local'` # USAGE: ./make-units.sh
IP_SUBSTR="10.200.1."
ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
# determine IP
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
[ "$IP" != "" ] || exit 1 [ "$IP" != "" ] || exit 1
ETCD_NUM=${ETCD_HOSTS/$IP*/}
[ "$ETCD_NUM" != "$ETCD_HOSTS" ] || exit 1
ETCD_NUM=$(echo $ETCD_NUM | tr -d -c , | wc -c)
BASE=${IP/*./} # etcd
BASE=$((BASE-10))
useradd etcd useradd etcd
mkdir -p /var/lib/etcd$BASE.etcd mkdir -p /var/lib/etcd$ETCD_NUM.etcd
cat >/etc/systemd/system/etcd.service <<EOF cat >/etc/systemd/system/etcd.service <<EOF
[Unit] [Unit]
Description=etcd for vitastor Description=etcd for vitastor
@@ -22,19 +28,18 @@ Wants=network-online.target local-fs.target time-sync.target
[Service] [Service]
Restart=always Restart=always
ExecStart=/usr/local/bin/etcd -name etcd$BASE --data-dir /var/lib/etcd$BASE.etcd \\ ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_NUM.etcd \\
--advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\ --advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
--initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\ --initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
--initial-cluster-token vitastor-etcd-1 --initial-cluster etcd0=http://10.115.0.10:2380,etcd1=http://10.115.0.11:2380,etcd2=http://10.115.0.12:2380,etcd3=http://10.115.0.13:2380 \\ --initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
--initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision --initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
WorkingDirectory=/var/lib/etcd$BASE.etcd WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
ExecStartPre=+chown -R etcd /var/lib/etcd$BASE.etcd ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
User=etcd User=etcd
PrivateTmp=false PrivateTmp=false
TasksMax=infinity TasksMax=infinity
Restart=always Restart=always
StartLimitInterval=0 StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10 RestartSec=10
[Install] [Install]
@@ -48,9 +53,7 @@ systemctl start etcd
useradd vitastor useradd vitastor
chmod 755 /root chmod 755 /root
BASE=${IP/*./} # Vitastor target
BASE=$(((BASE-10)*12))
cat >/etc/systemd/system/vitastor.target <<EOF cat >/etc/systemd/system/vitastor.target <<EOF
[Unit] [Unit]
Description=vitastor target Description=vitastor target
@@ -58,116 +61,25 @@ Description=vitastor target
WantedBy=multi-user.target WantedBy=multi-user.target
EOF EOF
i=1 # Monitor unit
for DEV in `ls /dev/disk/by-id/ | grep ata-INTEL_SSDSC2KB`; do ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2)) cat >/etc/systemd/system/vitastor-mon.service <<EOF
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2)) seek=$((1920377991168/1048576))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
[Unit] [Unit]
Description=Vitastor object storage daemon osd.$((BASE+i)) Description=Vitastor monitor
After=network-online.target local-fs.target time-sync.target After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service] [Service]
LimitNOFILE=1048576 Restart=always
LimitNPROC=1048576 ExecStart=node /usr/lib/vitastor/mon/mon-main.js --etcd_url '$ETCD_MON' --etcd_prefix '/vitastor' --etcd_start_timeout 5
LimitMEMLOCK=infinity WorkingDirectory=/
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--disable_device_lock 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 0 \\
--meta_offset 16777216 \\
--data_offset 427814912 \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
User=vitastor User=vitastor
PrivateTmp=false PrivateTmp=false
TasksMax=infinity TasksMax=infinity
Restart=always Restart=always
StartLimitInterval=0 StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10 RestartSec=10
[Install] [Install]
WantedBy=vitastor.target WantedBy=vitastor.target
EOF EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$((BASE+i))
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 1920377991168 \\
--meta_offset $((1920377991168+16777216)) \\
--data_offset $((1920377991168+427814912)) \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
done
exit
node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5
podman run -d --network host --restart always -v /var/lib/etcd0.etcd:/etcd0.etcd --name etcd quay.io/coreos/etcd:v3.4.13 etcd -name etcd0 \
-advertise-client-urls http://10.115.0.10:2379 -listen-client-urls http://10.115.0.10:2379 \
-initial-advertise-peer-urls http://10.115.0.10:2380 -listen-peer-urls http://10.115.0.10:2380 \
-initial-cluster-token vitastor-etcd-1 -initial-cluster etcd0=http://10.115.0.10:2380,etcd1=http://10.115.0.11:2380,etcd2=http://10.115.0.12:2380,etcd3=http://10.115.0.13:2380 \
-initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/global '{"immediate_commit":"all"}'
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":48,"failure_domain":"host"}}'
#let pgs = {};
#for (let n = 0; n < 48; n++) { let i = n/2 | 0; pgs[1+n] = { osd_set: [ (1+i%12+(i/12 | 0)*24), (1+12+i%12+(i/12 | 0)*24) ], primary: (1+(n%2)*12+i%12+(i/12 | 0)*24) }; };
#console.log(JSON.stringify({ items: { 1: pgs } }));
#etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pgs ...
# --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
# --data_offset 427814912 \\
# --disk_alignment 4096 --journal_block_size 512 --meta_block_size 512 \\
# --data_offset 433434624 \\

View File

@@ -1,7 +1,7 @@
#!/usr/bin/node #!/usr/bin/node
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const Mon = require('./mon.js'); const Mon = require('./mon.js');

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const http = require('http'); const http = require('http');
const crypto = require('crypto'); const crypto = require('crypto');
@@ -10,6 +10,14 @@ const stableStringify = require('./stable-stringify.js');
const PGUtil = require('./PGUtil.js'); const PGUtil = require('./PGUtil.js');
// FIXME document all etcd keys and config variables in the form of JSON schema or similar // FIXME document all etcd keys and config variables in the form of JSON schema or similar
const etcd_nonempty_keys = {
'config/global': 1,
'config/node_placement': 1,
'config/pools': 1,
'config/pgs': 1,
'history/last_clean_pgs': 1,
'stats': 1,
};
const etcd_allow = new RegExp('^'+[ const etcd_allow = new RegExp('^'+[
'config/global', 'config/global',
'config/node_placement', 'config/node_placement',
@@ -22,6 +30,7 @@ const etcd_allow = new RegExp('^'+[
'pg/state/[1-9]\\d*/[1-9]\\d*', 'pg/state/[1-9]\\d*/[1-9]\\d*',
'pg/stats/[1-9]\\d*/[1-9]\\d*', 'pg/stats/[1-9]\\d*/[1-9]\\d*',
'pg/history/[1-9]\\d*/[1-9]\\d*', 'pg/history/[1-9]\\d*/[1-9]\\d*',
'history/last_clean_pgs',
'stats', 'stats',
].join('$|^')+'$'); ].join('$|^')+'$');
@@ -34,7 +43,7 @@ const etcd_tree = {
etcd_mon_retries: 5, // min: 0 etcd_mon_retries: 5, // min: 0
mon_change_timeout: 1000, // ms. min: 100 mon_change_timeout: 1000, // ms. min: 100
mon_stats_timeout: 1000, // ms. min: 100 mon_stats_timeout: 1000, // ms. min: 100
osd_out_time: 1800, // seconds. min: 0 osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... }, placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
// client and osd // client and osd
use_sync_send_recv: false, use_sync_send_recv: false,
@@ -46,6 +55,8 @@ const etcd_tree = {
client_dirty_limit: 33554432, client_dirty_limit: 33554432,
peer_connect_interval: 5, // seconds. min: 1 peer_connect_interval: 5, // seconds. min: 1
peer_connect_timeout: 5, // seconds. min: 1 peer_connect_timeout: 5, // seconds. min: 1
osd_idle_timeout: 5, // seconds. min: 1
osd_ping_timeout: 5, // seconds. min: 1
up_wait_retry_interval: 500, // ms. min: 50 up_wait_retry_interval: 500, // ms. min: 50
// osd // osd
etcd_report_interval: 30, // min: 10 etcd_report_interval: 30, // min: 10
@@ -55,8 +66,12 @@ const etcd_tree = {
autosync_interval: 5, autosync_interval: 5,
client_queue_depth: 128, // unused client_queue_depth: 128, // unused
recovery_queue_depth: 4, recovery_queue_depth: 4,
recovery_sync_batch: 16,
readonly: false, readonly: false,
no_recovery: false,
no_rebalance: false,
print_stats_interval: 3, print_stats_interval: 3,
slow_log_interval: 10,
// blockstore - fixed in superblock // blockstore - fixed in superblock
block_size, block_size,
disk_alignment, disk_alignment,
@@ -76,6 +91,7 @@ const etcd_tree = {
disable_meta_fsync, disable_meta_fsync,
disable_device_lock, disable_device_lock,
// blockstore - configurable // blockstore - configurable
max_write_iodepth,
flusher_count, flusher_count,
inmemory_metadata, inmemory_metadata,
inmemory_journal, inmemory_journal,
@@ -213,6 +229,9 @@ const etcd_tree = {
incomplete: uint64_t, incomplete: uint64_t,
}, */ }, */
}, },
history: {
last_clean_pgs: {},
},
}; };
// FIXME Split into several files // FIXME Split into several files
@@ -291,7 +310,7 @@ class Mon
this.config.osd_out_time = Number(this.config.osd_out_time) || 0; this.config.osd_out_time = Number(this.config.osd_out_time) || 0;
if (!this.config.osd_out_time) if (!this.config.osd_out_time)
{ {
this.config.osd_out_time = 30*60; // 30 minutes by default this.config.osd_out_time = 600; // 10 minutes by default
} }
} }
@@ -313,8 +332,14 @@ class Mon
ok(false); ok(false);
}, this.config.etcd_mon_timeout); }, this.config.etcd_mon_timeout);
this.ws = new WebSocket(base+'/watch'); this.ws = new WebSocket(base+'/watch');
const fail = () =>
{
ok(false);
};
this.ws.on('error', fail);
this.ws.on('open', () => this.ws.on('open', () =>
{ {
this.ws.removeListener('error', fail);
if (timer_id) if (timer_id)
clearTimeout(timer_id); clearTimeout(timer_id);
ok(true); ok(true);
@@ -359,7 +384,7 @@ class Mon
} }
else else
{ {
let stats_changed = false, changed = false; let stats_changed = false, changed = false, pg_states_changed = false;
if (this.verbose) if (this.verbose)
{ {
console.log('Revision '+data.result.header.revision+' events: '); console.log('Revision '+data.result.header.revision+' events: ');
@@ -373,15 +398,23 @@ class Mon
{ {
stats_changed = true; stats_changed = true;
} }
else if (key.substr(0, 10) == '/pg/state/')
{
pg_states_changed = true;
}
else if (key != '/stats') else if (key != '/stats')
{ {
changed = true; changed = true;
} }
if (this.verbose) if (this.verbose)
{ {
console.log(e); console.log(JSON.stringify(e));
} }
} }
if (pg_states_changed)
{
this.save_last_clean().catch(console.error);
}
if (stats_changed) if (stats_changed)
{ {
this.schedule_update_stats(); this.schedule_update_stats();
@@ -394,10 +427,46 @@ class Mon
}); });
} }
async save_last_clean()
{
// last_clean_pgs is used to avoid extra data move when observing a series of changes in the cluster
for (const pool_id in this.state.config.pools)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
continue;
}
for (let pg_num = 1; pg_num <= pool_cfg.pg_count; pg_num++)
{
if (!this.state.pg.state[pool_id] ||
!this.state.pg.state[pool_id][pg_num] ||
!(this.state.pg.state[pool_id][pg_num].state instanceof Array))
{
// Unclean
return;
}
let st = this.state.pg.state[pool_id][pg_num].state.join(',');
if (st != 'active' && st != 'active,left_on_dead' && st != 'left_on_dead,active')
{
// Unclean
return;
}
}
}
this.state.history.last_clean_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
await this.etcd_call('/kv/txn', {
success: [ { requestPut: {
key: b64(this.etcd_prefix+'/history/last_clean_pgs'),
value: b64(JSON.stringify(this.state.history.last_clean_pgs))
} } ],
}, this.etcd_start_timeout, 0);
}
async get_lease() async get_lease()
{ {
const max_ttl = this.config.etcd_mon_ttl + this.config.etcd_mon_timeout/1000*this.config.etcd_mon_retries; const max_ttl = this.config.etcd_mon_ttl + this.config.etcd_mon_timeout/1000*this.config.etcd_mon_retries;
const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries); const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, -1);
this.etcd_lease_id = res.ID; this.etcd_lease_id = res.ID;
setInterval(async () => setInterval(async () =>
{ {
@@ -573,29 +642,51 @@ class Mon
return !has_online; return !has_online;
} }
reset_rng()
{
this.seed = 0x5f020e43;
}
rng()
{
this.seed ^= this.seed << 13;
this.seed ^= this.seed >> 17;
this.seed ^= this.seed << 5;
return this.seed + 2147483648;
}
pick_primary(pool_id, osd_set, up_osds)
{
let alive_set;
if (this.state.config.pools[pool_id].scheme === 'replicated')
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
const pg_data_size = (this.state.config.pools[pool_id].pg_size||0) -
(this.state.config.pools[pool_id].parity_chunks||0);
alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
if (!alive_set.length)
return 0;
return alive_set[this.rng() % alive_set.length];
}
save_new_pgs_txn(request, pool_id, up_osds, prev_pgs, new_pgs, pg_history) save_new_pgs_txn(request, pool_id, up_osds, prev_pgs, new_pgs, pg_history)
{ {
const replicated = new_pgs.length && this.state.config.pools[pool_id].scheme === 'replicated';
const pg_minsize = new_pgs.length && this.state.config.pools[pool_id].pg_minsize;
const pg_items = {}; const pg_items = {};
this.reset_rng();
new_pgs.map((osd_set, i) => new_pgs.map((osd_set, i) =>
{ {
osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num); osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
let alive_set;
if (replicated)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
alive_set = osd_set.slice(0, pg_minsize).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
pg_items[i+1] = { pg_items[i+1] = {
osd_set, osd_set,
primary: alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0, primary: this.pick_primary(pool_id, osd_set, up_osds),
}; };
if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' ')) if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' ') &&
prev_pgs[i].filter(osd_num => osd_num).length > 0)
{ {
pg_history[i] = pg_history[i] || {}; pg_history[i] = pg_history[i] || {};
pg_history[i].osd_sets = pg_history[i].osd_sets || []; pg_history[i].osd_sets = pg_history[i].osd_sets || [];
@@ -791,12 +882,25 @@ class Mon
pool_tree = pool_tree ? pool_tree.children : []; pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd'); pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags); this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
const prev_pgs = []; // These are for the purpose of building history.osd_sets
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{})||{}) const real_prev_pgs = [];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{ {
prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set; real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
} }
const pg_history = []; // And these are for the purpose of minimizing data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
}
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
const old_pg_count = prev_pgs.length; const old_pg_count = prev_pgs.length;
let optimize_result; let optimize_result;
if (old_pg_count > 0) if (old_pg_count > 0)
@@ -809,7 +913,20 @@ class Mon
this.schedule_recheck(); this.schedule_recheck();
return; return;
} }
PGUtil.scale_pg_count(prev_pgs, this.state.pg.history[pool_id]||{}, pg_history, pool_cfg.pg_count); const new_pg_history = [];
PGUtil.scale_pg_count(prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
pg_history = new_pg_history;
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
while (pg.length > pool_cfg.pg_size)
{
pg.pop();
}
} }
optimize_result = await LPOptimizer.optimize_change({ optimize_result = await LPOptimizer.optimize_change({
prev_pgs, prev_pgs,
@@ -835,16 +952,21 @@ class Mon
`PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+ `PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}` ` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
); );
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
} }
LPOptimizer.print_change_stats(optimize_result); LPOptimizer.print_change_stats(optimize_result);
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, prev_pgs, optimize_result.int_pgs, pg_history); this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history);
} }
this.state.config.pgs.hash = tree_hash; this.state.config.pgs.hash = tree_hash;
await this.save_pg_config(); await this.save_pg_config(etcd_request);
} }
else else
{ {
// Nothing changed, but we still want to check for down OSDs // Nothing changed, but we still want to recheck the distribution of primaries
let changed = false; let changed = false;
for (const pool_id in this.state.config.pools) for (const pool_id in this.state.config.pools)
{ {
@@ -854,22 +976,13 @@ class Mon
continue; continue;
} }
const replicated = pool_cfg.scheme === 'replicated'; const replicated = pool_cfg.scheme === 'replicated';
for (const pg_num in ((this.state.config.pgs.items||{})[pool_id]||{})||{}) this.reset_rng();
for (let pg_num = 1; pg_num <= pool_cfg.pg_count; pg_num++)
{ {
const pg_cfg = this.state.config.pgs.items[pool_id][pg_num]; const pg_cfg = this.state.config.pgs.items[pool_id][pg_num];
if (!Number(pg_cfg.primary) || !up_osds[pg_cfg.primary]) if (pg_cfg)
{ {
let alive_set; const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds);
if (replicated)
alive_set = pg_cfg.osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
alive_set = pg_cfg.osd_set.slice(0, pool_cfg.pg_minsize).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = pg_cfg.osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
const new_primary = alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0;
if (pg_cfg.primary != new_primary) if (pg_cfg.primary != new_primary)
{ {
console.log( console.log(
@@ -963,21 +1076,21 @@ class Mon
for (const op in st.op_stats||{}) for (const op in st.op_stats||{})
{ {
op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n }; op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n };
op_stats[op].count += BigInt(st.op_stats.count||0); op_stats[op].count += BigInt(st.op_stats[op].count||0);
op_stats[op].usec += BigInt(st.op_stats.usec||0); op_stats[op].usec += BigInt(st.op_stats[op].usec||0);
op_stats[op].bytes += BigInt(st.op_stats.bytes||0); op_stats[op].bytes += BigInt(st.op_stats[op].bytes||0);
} }
for (const op in st.subop_stats||{}) for (const op in st.subop_stats||{})
{ {
subop_stats[op] = subop_stats[op] || { count: 0n, usec: 0n }; subop_stats[op] = subop_stats[op] || { count: 0n, usec: 0n };
subop_stats[op].count += BigInt(st.subop_stats.count||0); subop_stats[op].count += BigInt(st.subop_stats[op].count||0);
subop_stats[op].usec += BigInt(st.subop_stats.usec||0); subop_stats[op].usec += BigInt(st.subop_stats[op].usec||0);
} }
for (const op in st.recovery_stats||{}) for (const op in st.recovery_stats||{})
{ {
recovery_stats[op] = recovery_stats[op] || { count: 0n, bytes: 0n }; recovery_stats[op] = recovery_stats[op] || { count: 0n, bytes: 0n };
recovery_stats[op].count += BigInt(st.recovery_stats.count||0); recovery_stats[op].count += BigInt(st.recovery_stats[op].count||0);
recovery_stats[op].bytes += BigInt(st.recovery_stats.bytes||0); recovery_stats[op].bytes += BigInt(st.recovery_stats[op].bytes||0);
} }
} }
for (const op in op_stats) for (const op in op_stats)
@@ -1032,11 +1145,14 @@ class Mon
for (const pg_num in this.state.pg.stats[pool_id]) for (const pg_num in this.state.pg.stats[pool_id])
{ {
const st = this.state.pg.stats[pool_id][pg_num]; const st = this.state.pg.stats[pool_id][pg_num];
for (const k in object_counts) if (st)
{ {
if (st[k+'_count']) for (const k in object_counts)
{ {
object_counts[k] += BigInt(st[k+'_count']); if (st[k+'_count'])
{
object_counts[k] += BigInt(st[k+'_count']);
}
} }
} }
} }
@@ -1111,16 +1227,20 @@ class Mon
console.log('Bad value in etcd: '+kv.key+' = '+kv.value); console.log('Bad value in etcd: '+kv.key+' = '+kv.value);
return; return;
} }
key = key.split('/'); let key_parts = key.split('/');
let cur = this.state; let cur = this.state;
for (let i = 0; i < key.length-1; i++) for (let i = 0; i < key_parts.length-1; i++)
{ {
cur = (cur[key[i]] = cur[key[i]] || {}); cur = (cur[key_parts[i]] = cur[key_parts[i]] || {});
} }
cur[key[key.length-1]] = kv.value; if (etcd_nonempty_keys[key])
if (key.join('/') === 'config/global') {
// Do not clear these to null
kv.value = kv.value || {};
}
cur[key_parts[key_parts.length-1]] = kv.value;
if (key === 'config/global')
{ {
this.state.config.global = this.state.config.global || {};
this.config = this.state.config.global; this.config = this.state.config.global;
this.check_config(); this.check_config();
for (const osd_num in this.state.osd.stats) for (const osd_num in this.state.osd.stats)
@@ -1131,7 +1251,7 @@ class Mon
); );
} }
} }
else if (key.join('/') === 'config/pools') else if (key === 'config/pools')
{ {
for (const pool_id in this.state.config.pools) for (const pool_id in this.state.config.pools)
{ {
@@ -1140,7 +1260,7 @@ class Mon
this.validate_pool_cfg(pool_id, pool_cfg, true); this.validate_pool_cfg(pool_id, pool_cfg, true);
} }
} }
else if (key[0] === 'osd' && key[1] === 'stats') else if (key_parts[0] === 'osd' && key_parts[1] === 'stats')
{ {
// Recheck PGs <osd_out_time> later // Recheck PGs <osd_out_time> later
this.schedule_next_recheck_at( this.schedule_next_recheck_at(
@@ -1172,6 +1292,11 @@ class Mon
console.error('etcd returned error: '+res.json.error); console.error('etcd returned error: '+res.json.error);
break; break;
} }
if (this.etcd_urls.length > 1)
{
// Stick to the same etcd for the rest of calls
this.etcd_urls = [ base ];
}
return res.json; return res.json;
} }
retry++; retry++;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// Interesting real-world example coming from Ceph with EC and compression enabled. // Interesting real-world example coming from Ceph with EC and compression enabled.
// EC parity chunks can't be compressed as efficiently as data chunks, // EC parity chunks can't be compressed as efficiently as data chunks,

View File

@@ -0,0 +1,25 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js');
async function run()
{
const osd_tree = { a: { 1: 1 }, b: { 2: 1 }, c: { 3: 1 } };
let res;
console.log('16 PGs, size=3');
res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16 });
LPOptimizer.print_change_stats(res, false);
console.log('\nReduce PG size to 2');
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs.map(pg => pg.slice(0, 2)), osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemove OSD 3');
delete osd_tree['c'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
}
run().catch(console.error);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js'); const LPOptimizer = require('./lp-optimizer.js');

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js'); const LPOptimizer = require('./lp-optimizer.js');

View File

@@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'` QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.5.4/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.4$(rpm --eval '%dist').tar.gz * tar --transform 's#^#vitastor-0.5.10/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.10$(rpm --eval '%dist').tar.gz *

View File

@@ -1,5 +1,5 @@
# Build packages for CentOS 8 inside a container # Build packages for CentOS 8 inside a container
# cd ..; podman build -t qemu-el8 -v `pwd`/build:/root/build -f rpm/qemu-el8.Dockerfile . # cd ..; podman build -t qemu-el8 -v `pwd`/packages:/root/packages -f rpm/qemu-el8.Dockerfile .
FROM centos:8 FROM centos:8
@@ -14,8 +14,8 @@ RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-k
ADD qemu-*-vitastor.patch /root/vitastor/ ADD qemu-*-vitastor.patch /root/vitastor/
RUN set -e; \ RUN set -e; \
mkdir -p /root/build/qemu-el8; \ mkdir -p /root/packages/qemu-el8; \
rm -rf /root/build/qemu-el8/*; \ rm -rf /root/packages/qemu-el8/*; \
rpm --nomd5 -i /root/qemu*.src.rpm; \ rpm --nomd5 -i /root/qemu*.src.rpm; \
cd ~/rpmbuild/SPECS; \ cd ~/rpmbuild/SPECS; \
PN=$(grep ^Patch qemu-kvm.spec | tail -n1 | perl -pe 's/Patch(\d+).*/$1/'); \ PN=$(grep ^Patch qemu-kvm.spec | tail -n1 | perl -pe 's/Patch(\d+).*/$1/'); \
@@ -27,5 +27,5 @@ RUN set -e; \
perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \ perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \ cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
rpmbuild --nocheck -ba qemu-kvm.spec; \ rpmbuild --nocheck -ba qemu-kvm.spec; \
cp ~/rpmbuild/RPMS/*/*qemu* /root/build/qemu-el8/; \ cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
cp ~/rpmbuild/SRPMS/*qemu* /root/build/qemu-el8/ cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/

View File

@@ -1,5 +1,5 @@
# Build packages for CentOS 7 inside a container # Build packages for CentOS 7 inside a container
# cd ..; podman build -t vitastor-el7 -v `pwd`/build:/root/build -f rpm/vitastor-el7.Dockerfile . # cd ..; podman build -t vitastor-el7 -v `pwd`/packages:/root/packages -f rpm/vitastor-el7.Dockerfile .
# localedef -i ru_RU -f UTF-8 ru_RU.UTF-8 # localedef -i ru_RU -f UTF-8 ru_RU.UTF-8
FROM centos:7 FROM centos:7
@@ -25,23 +25,23 @@ RUN set -e; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
. /opt/rh/devtoolset-9/enable; \ . /opt/rh/devtoolset-9/enable; \
rpmbuild -ba liburing.spec; \ rpmbuild -ba liburing.spec; \
mkdir -p /root/build/liburing-el7; \ mkdir -p /root/packages/liburing-el7; \
rm -rf /root/build/liburing-el7/*; \ rm -rf /root/packages/liburing-el7/*; \
cp ~/rpmbuild/RPMS/*/liburing* /root/build/liburing-el7/; \ cp ~/rpmbuild/RPMS/*/liburing* /root/packages/liburing-el7/; \
cp ~/rpmbuild/SRPMS/liburing* /root/build/liburing-el7/ cp ~/rpmbuild/SRPMS/liburing* /root/packages/liburing-el7/
RUN rpm -i `ls /root/build/liburing-el7/liburing-*.x86_64.rpm | grep -v debug` RUN rpm -i `ls /root/packages/liburing-el7/liburing-*.x86_64.rpm | grep -v debug`
ADD . /root/vitastor ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.5.4.el7.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.5.10.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \
mkdir -p /root/build/vitastor-el7; \ mkdir -p /root/packages/vitastor-el7; \
rm -rf /root/build/vitastor-el7/*; \ rm -rf /root/packages/vitastor-el7/*; \
cp ~/rpmbuild/RPMS/*/vitastor* /root/build/vitastor-el7/; \ cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el7/; \
cp ~/rpmbuild/SRPMS/vitastor* /root/build/vitastor-el7/ cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el7/

View File

@@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.5.4 Version: 0.5.10
Release: 2%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.0 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.5.4.el7.tar.gz Source0: vitastor-0.5.10.el7.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel
@@ -14,6 +14,7 @@ BuildRequires: rh-nodejs12
BuildRequires: rh-nodejs12-npm BuildRequires: rh-nodejs12-npm
BuildRequires: jerasure-devel BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel BuildRequires: gf-complete-devel
BuildRequires: cmake
Requires: fio = 3.7-1.el7 Requires: fio = 3.7-1.el7
Requires: qemu-kvm = 2.0.0-1.el7.6 Requires: qemu-kvm = 2.0.0-1.el7.6
Requires: rh-nodejs12 Requires: rh-nodejs12
@@ -35,12 +36,13 @@ size with configurable redundancy (replication or erasure codes/XOR).
%build %build
. /opt/rh/devtoolset-9/enable . /opt/rh/devtoolset-9/enable
make %{?_smp_mflags} BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %cmake . -DQEMU_PLUGINDIR=qemu-kvm
%make_build
%install %install
rm -rf $RPM_BUILD_ROOT rm -rf $RPM_BUILD_ROOT
%make_install BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %make_install
. /opt/rh/rh-nodejs12/enable . /opt/rh/rh-nodejs12/enable
cd mon cd mon
npm install npm install
@@ -56,7 +58,11 @@ cp -r mon %buildroot/usr/lib/vitastor/mon
%_bindir/vitastor-osd %_bindir/vitastor-osd
%_bindir/vitastor-rm %_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so %_libdir/qemu-kvm/block-vitastor.so
%_libdir/vitastor %_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor /usr/lib/vitastor

View File

@@ -1,5 +1,5 @@
# Build packages for CentOS 8 inside a container # Build packages for CentOS 8 inside a container
# cd ..; podman build -t vitastor-el8 -v `pwd`/build:/root/build -f rpm/vitastor-el8.Dockerfile . # cd ..; podman build -t vitastor-el8 -v `pwd`/packages:/root/packages -f rpm/vitastor-el8.Dockerfile .
FROM centos:8 FROM centos:8
@@ -13,8 +13,8 @@ RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='vitastor
RUN dnf download --source fio RUN dnf download --source fio
RUN rpm --nomd5 -i qemu*.src.rpm RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec fio.spec RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -23,23 +23,23 @@ RUN set -e; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
. /opt/rh/gcc-toolset-9/enable; \ . /opt/rh/gcc-toolset-9/enable; \
rpmbuild -ba liburing.spec; \ rpmbuild -ba liburing.spec; \
mkdir -p /root/build/liburing-el8; \ mkdir -p /root/packages/liburing-el8; \
rm -rf /root/build/liburing-el8/*; \ rm -rf /root/packages/liburing-el8/*; \
cp ~/rpmbuild/RPMS/*/liburing* /root/build/liburing-el8/; \ cp ~/rpmbuild/RPMS/*/liburing* /root/packages/liburing-el8/; \
cp ~/rpmbuild/SRPMS/liburing* /root/build/liburing-el8/ cp ~/rpmbuild/SRPMS/liburing* /root/packages/liburing-el8/
RUN rpm -i `ls /root/build/liburing-el7/liburing-*.x86_64.rpm | grep -v debug` RUN rpm -i `ls /root/packages/liburing-el7/liburing-*.x86_64.rpm | grep -v debug`
ADD . /root/vitastor ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.5.4.el8.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.5.10.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \
mkdir -p /root/build/vitastor-el8; \ mkdir -p /root/packages/vitastor-el8; \
rm -rf /root/build/vitastor-el8/*; \ rm -rf /root/packages/vitastor-el8/*; \
cp ~/rpmbuild/RPMS/*/vitastor* /root/build/vitastor-el8/; \ cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el8/; \
cp ~/rpmbuild/SRPMS/vitastor* /root/build/vitastor-el8/ cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el8/

View File

@@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.5.4 Version: 0.5.10
Release: 2%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.0 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.5.4.el8.tar.gz Source0: vitastor-0.5.10.el8.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel
@@ -13,6 +13,7 @@ BuildRequires: gcc-toolset-9-gcc-c++
BuildRequires: nodejs >= 10 BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel BuildRequires: gf-complete-devel
BuildRequires: cmake
Requires: fio = 3.7-3.el8 Requires: fio = 3.7-3.el8
Requires: qemu-kvm = 4.2.0-29.el8.6 Requires: qemu-kvm = 4.2.0-29.el8.6
Requires: nodejs >= 10 Requires: nodejs >= 10
@@ -33,12 +34,13 @@ size with configurable redundancy (replication or erasure codes/XOR).
%build %build
. /opt/rh/gcc-toolset-9/enable . /opt/rh/gcc-toolset-9/enable
make %{?_smp_mflags} BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %cmake . -DQEMU_PLUGINDIR=qemu-kvm
%make_build
%install %install
rm -rf $RPM_BUILD_ROOT rm -rf $RPM_BUILD_ROOT
%make_install BINDIR=%_bindir LIBDIR=%_libdir QEMU_PLUGINDIR=%_libdir/qemu-kvm %make_install
cd mon cd mon
npm install npm install
cd .. cd ..
@@ -53,7 +55,11 @@ cp -r mon %buildroot/usr/lib/vitastor
%_bindir/vitastor-osd %_bindir/vitastor-osd
%_bindir/vitastor-rm %_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so %_libdir/qemu-kvm/block-vitastor.so
%_libdir/vitastor %_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor /usr/lib/vitastor

View File

@@ -1,77 +0,0 @@
#!/bin/bash
if [ ! "$BASH_VERSION" ] ; then
echo "Use bash to run this script ($0)" 1>&2
exit 1
fi
format_error()
{
echo $(echo -n -e "\033[1;31m")$1$(echo -n -e "\033[m")
$ETCDCTL get --prefix /vitastor > ./testdata/etcd-dump.txt
exit 1
}
format_green()
{
echo $(echo -n -e "\033[1;32m")$1$(echo -n -e "\033[m")
}
set -e -x
trap 'kill -9 $(jobs -p)' EXIT
ETCD=${ETCD:-etcd}
ETCD_PORT=${ETCD_PORT:-12379}
rm -rf ./testdata
mkdir -p ./testdata
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
$ETCD -name etcd_test --data-dir ./testdata/etcd \
--advertise-client-urls http://127.0.0.1:$ETCD_PORT --listen-client-urls http://127.0.0.1:$ETCD_PORT \
--initial-advertise-peer-urls http://127.0.0.1:$((ETCD_PORT+1)) --listen-peer-urls http://127.0.0.1:$((ETCD_PORT+1)) \
--max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision &>./testdata/etcd.log &
ETCD_PID=$!
ETCD_URL=127.0.0.1:$ETCD_PORT/v3
ETCDCTL="${ETCD}ctl --endpoints=http://$ETCD_URL"
./osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
OSD1_PID=$!
./osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
OSD2_PID=$!
./osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
OSD3_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
format_error "FAILED: 1 PG NOT CONFIGURED"
fi
if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
format_error "FAILED: 1 PG NOT UP"
fi
echo leak:fio >> testdata/lsan-suppress.txt
echo leak:tcmalloc >> testdata/lsan-suppress.txt
echo leak:ceph >> testdata/lsan-suppress.txt
echo leak:librbd >> testdata/lsan-suppress.txt
echo leak:_M_mutate >> testdata/lsan-suppress.txt
echo leak:_M_assign >> testdata/lsan-suppress.txt
#LSAN_OPTIONS=suppressions=`pwd`/testdata/lsan-suppress.txt LD_PRELOAD=libasan.so.5 \
# fio -thread -name=test -ioengine=./libfio_sec_osd.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M
LSAN_OPTIONS=suppressions=`pwd`/testdata/lsan-suppress.txt LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=./libfio_cluster.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=1G -cluster_log_level=10
format_green OK

188
src/CMakeLists.txt Normal file
View File

@@ -0,0 +1,188 @@
cmake_minimum_required(VERSION 2.8)
project(vitastor)
include(GNUInstallDirs)
set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)")
set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer")
if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
if(EXISTS "/etc/debian_version")
set(CMAKE_INSTALL_LIBDIR "lib/${CMAKE_LIBRARY_ARCHITECTURE}")
endif()
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6-dev")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
add_link_options(-fsanitize=address -fno-omit-frame-pointer)
endif (${WITH_ASAN})
set(CMAKE_BUILD_TYPE RelWithDebInfo)
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_MINSIZEREL "${CMAKE_CXX_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_MINSIZEREL "${CMAKE_CXX_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
find_package(PkgConfig)
pkg_check_modules(LIBURING REQUIRED liburing)
pkg_check_modules(GLIB REQUIRED glib-2.0)
include_directories(
../
/usr/include/jerasure
${LIBURING_INCLUDE_DIRS}
)
# libvitastor_blk.so
add_library(vitastor_blk SHARED
allocator.cpp blockstore.cpp blockstore_impl.cpp blockstore_init.cpp blockstore_open.cpp blockstore_journal.cpp blockstore_read.cpp
blockstore_write.cpp blockstore_sync.cpp blockstore_stable.cpp blockstore_rollback.cpp blockstore_flush.cpp crc32c.c ringloop.cpp
)
target_link_libraries(vitastor_blk
${LIBURING_LIBRARIES}
tcmalloc_minimal
)
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
)
# vitastor-osd
add_executable(vitastor-osd
osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_subops.cpp etcd_state_client.cpp messenger.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
)
target_link_libraries(vitastor-osd
vitastor_blk
Jerasure
)
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
# libvitastor_client.so
add_library(vitastor_client SHARED
cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
)
target_link_libraries(vitastor_client
tcmalloc_minimal
${LIBURING_LIBRARIES}
)
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
# vitastor-nbd
add_executable(vitastor-nbd
nbd_proxy.cpp
)
target_link_libraries(vitastor-nbd
vitastor_client
)
# vitastor-rm
add_executable(vitastor-rm
rm_inode.cpp
)
target_link_libraries(vitastor-rm
vitastor_client
)
# vitastor-dump-journal
add_executable(vitastor-dump-journal
dump_journal.cpp crc32c.c
)
# qemu_driver.so
add_library(qemu_proxy STATIC qemu_proxy.cpp)
target_compile_options(qemu_proxy PUBLIC -fPIC)
target_include_directories(qemu_proxy PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_proxy
vitastor_client
)
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_link_libraries(qemu_vitastor
qemu_proxy
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
### Test stubs
# stub_osd, stub_bench, osd_test
add_executable(stub_osd stub_osd.cpp rw_blocking.cpp)
target_link_libraries(stub_osd tcmalloc_minimal)
add_executable(stub_bench stub_bench.cpp rw_blocking.cpp)
target_link_libraries(stub_bench tcmalloc_minimal)
add_executable(osd_test osd_test.cpp rw_blocking.cpp)
target_link_libraries(osd_test tcmalloc_minimal)
# osd_rmw_test
add_executable(osd_rmw_test osd_rmw_test.cpp allocator.cpp)
target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)
# stub_uring_osd
add_executable(stub_uring_osd
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_link_libraries(stub_uring_osd
${LIBURING_LIBRARIES}
tcmalloc_minimal
)
# osd_peering_pg_test
add_executable(osd_peering_pg_test osd_peering_pg_test.cpp osd_peering_pg.cpp)
target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp)
## test_blockstore, test_shit
#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
#target_link_libraries(test_blockstore blockstore)
#add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
#target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)
### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec vitastor_blk vitastor_client LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <stdexcept> #include <stdexcept>
#include "allocator.h" #include "allocator.h"
@@ -13,19 +13,19 @@ allocator::allocator(uint64_t blocks)
{ {
throw std::invalid_argument("blocks"); throw std::invalid_argument("blocks");
} }
uint64_t p2 = 1, total = 1; uint64_t p2 = 1;
total = 0;
while (p2 * 64 < blocks) while (p2 * 64 < blocks)
{ {
p2 = p2 * 64;
total += p2; total += p2;
p2 = p2 * 64;
} }
total -= p2;
total += (blocks+63) / 64; total += (blocks+63) / 64;
mask = new uint64_t[2 + total]; mask = new uint64_t[total];
size = free = blocks; size = free = blocks;
last_one_mask = (blocks % 64) == 0 last_one_mask = (blocks % 64) == 0
? UINT64_MAX ? UINT64_MAX
: ~(UINT64_MAX << (64 - blocks % 64)); : ((1l << (blocks % 64)) - 1);
for (uint64_t i = 0; i < total; i++) for (uint64_t i = 0; i < total; i++)
{ {
mask[i] = 0; mask[i] = 0;
@@ -99,6 +99,10 @@ uint64_t allocator::find_free()
uint64_t p2 = 1, offset = 0, addr = 0, f, i; uint64_t p2 = 1, offset = 0, addr = 0, f, i;
while (p2 < size) while (p2 < size)
{ {
if (offset+addr >= total)
{
return UINT64_MAX;
}
uint64_t m = mask[offset + addr]; uint64_t m = mask[offset + addr];
for (i = 0, f = 1; i < 64; i++, f <<= 1) for (i = 0, f = 1; i < 64; i++, f <<= 1)
{ {
@@ -113,11 +117,6 @@ uint64_t allocator::find_free()
return UINT64_MAX; return UINT64_MAX;
} }
addr = (addr * 64) | i; addr = (addr * 64) | i;
if (addr >= size)
{
// No space
return UINT64_MAX;
}
offset += p2; offset += p2;
p2 = p2 * 64; p2 = p2 * 64;
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
@@ -8,6 +8,7 @@
// Hierarchical bitmap allocator // Hierarchical bitmap allocator
class allocator class allocator
{ {
uint64_t total;
uint64_t size; uint64_t size;
uint64_t free; uint64_t free;
uint64_t last_one_mask; uint64_t last_one_mask;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "base64.h" #include "base64.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
#include <string> #include <string>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -35,12 +35,7 @@ bool blockstore_t::is_safe_to_stop()
void blockstore_t::enqueue_op(blockstore_op_t *op) void blockstore_t::enqueue_op(blockstore_op_t *op)
{ {
impl->enqueue_op(op, false); impl->enqueue_op(op);
}
void blockstore_t::enqueue_op_first(blockstore_op_t *op)
{
impl->enqueue_op(op, true);
} }
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes() std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
@@ -175,10 +175,6 @@ public:
// Submission // Submission
void enqueue_op(blockstore_op_t *op); void enqueue_op(blockstore_op_t *op);
// Insert operation into the beginning of the queue
// Intended for the OSD syncer "thread" to be able to stabilize something when the journal is full
void enqueue_op_first(blockstore_op_t *op);
// Unstable writes are added here (map of object_id -> version) // Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes(); std::unordered_map<object_id, uint64_t> & get_unstable_writes();

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -7,6 +7,8 @@ journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
{ {
this->bs = bs; this->bs = bs;
this->flusher_count = flusher_count; this->flusher_count = flusher_count;
this->cur_flusher_count = 1;
this->target_flusher_count = 1;
dequeuing = false; dequeuing = false;
trimming = false; trimming = false;
active_flushers = 0; active_flushers = 0;
@@ -68,10 +70,24 @@ bool journal_flusher_t::is_active()
void journal_flusher_t::loop() void journal_flusher_t::loop()
{ {
for (int i = 0; (active_flushers > 0 || dequeuing) && i < flusher_count; i++) target_flusher_count = bs->write_iodepth*2;
if (target_flusher_count <= 0)
target_flusher_count = 1;
else if (target_flusher_count > flusher_count)
target_flusher_count = flusher_count;
if (target_flusher_count > cur_flusher_count)
cur_flusher_count = target_flusher_count;
else if (target_flusher_count < cur_flusher_count)
{ {
co[i].loop(); while (target_flusher_count < cur_flusher_count)
{
if (co[cur_flusher_count-1].wait_state)
break;
cur_flusher_count--;
}
} }
for (int i = 0; (active_flushers > 0 || dequeuing) && i < cur_flusher_count; i++)
co[i].loop();
} }
void journal_flusher_t::enqueue_flush(obj_ver_id ov) void journal_flusher_t::enqueue_flush(obj_ver_id ov)
@@ -807,31 +823,34 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
sync_found: sync_found:
cur_sync->ready_count++; cur_sync->ready_count++;
flusher->syncing_flushers++; flusher->syncing_flushers++;
if (flusher->syncing_flushers >= flusher->flusher_count || !flusher->flush_queue.size()) resume_1:
if (!cur_sync->state)
{ {
// Sync batch is ready. Do it. if (flusher->syncing_flushers >= flusher->cur_flusher_count || !flusher->flush_queue.size())
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1;
wait_count++;
resume_1:
if (wait_count > 0)
{ {
// Sync batch is ready. Do it.
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1;
wait_count++;
resume_2:
if (wait_count > 0)
{
wait_state = 2;
return false;
}
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
else
{
// Wait until someone else sends and completes a sync.
wait_state = 1; wait_state = 1;
return false; return false;
} }
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
// Wait until someone else sends and completes a sync.
resume_2:
if (!cur_sync->state)
{
wait_state = 2;
return false;
} }
flusher->syncing_flushers--; flusher->syncing_flushers--;
cur_sync->ready_count--; cur_sync->ready_count--;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
struct copy_buffer_t struct copy_buffer_t
{ {
@@ -80,7 +80,7 @@ class journal_flusher_t
{ {
int trim_wanted = 0; int trim_wanted = 0;
bool dequeuing; bool dequeuing;
int flusher_count; int flusher_count, cur_flusher_count, target_flusher_count;
int flusher_start_threshold; int flusher_start_threshold;
journal_flusher_co *co; journal_flusher_co *co;
blockstore_impl_t *bs; blockstore_impl_t *bs;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -101,26 +101,14 @@ void blockstore_impl_t::loop()
{ {
// try to submit ops // try to submit ops
unsigned initial_ring_space = ringloop->space_left(); unsigned initial_ring_space = ringloop->space_left();
// FIXME: rework this "sync polling" // has_writes == 0 - no writes before the current queue item
auto cur_sync = in_progress_syncs.begin(); // has_writes == 1 - some writes in progress
while (cur_sync != in_progress_syncs.end()) // has_writes == 2 - tried to submit some writes, but failed
int has_writes = 0, op_idx = 0, new_idx = 0;
for (; op_idx < submit_queue.size(); op_idx++, new_idx++)
{ {
if (continue_sync(*cur_sync) != 2) auto op = submit_queue[op_idx];
{ submit_queue[new_idx] = op;
// List is unmodified
cur_sync++;
}
else
{
cur_sync = in_progress_syncs.begin();
}
}
auto cur = submit_queue.begin();
int has_writes = 0;
while (cur != submit_queue.end())
{
auto op_ptr = cur;
auto op = *(cur++);
// FIXME: This needs some simplification // FIXME: This needs some simplification
// Writes should not block reads if the ring is not full and reads don't depend on them // Writes should not block reads if the ring is not full and reads don't depend on them
// In all other cases we should stop submission // In all other cases we should stop submission
@@ -142,10 +130,13 @@ void blockstore_impl_t::loop()
} }
unsigned ring_space = ringloop->space_left(); unsigned ring_space = ringloop->space_left();
unsigned prev_sqe_pos = ringloop->save(); unsigned prev_sqe_pos = ringloop->save();
bool dequeue_op = false; // 0 = can't submit
// 1 = in progress
// 2 = can be removed from queue
int wr_st = 0;
if (op->opcode == BS_OP_READ) if (op->opcode == BS_OP_READ)
{ {
dequeue_op = dequeue_read(op); wr_st = dequeue_read(op);
} }
else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE) else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE)
{ {
@@ -154,8 +145,8 @@ void blockstore_impl_t::loop()
// Some writes already could not be submitted // Some writes already could not be submitted
continue; continue;
} }
dequeue_op = dequeue_write(op); wr_st = dequeue_write(op);
has_writes = dequeue_op ? 1 : 2; has_writes = wr_st > 0 ? 1 : 2;
} }
else if (op->opcode == BS_OP_DELETE) else if (op->opcode == BS_OP_DELETE)
{ {
@@ -164,8 +155,8 @@ void blockstore_impl_t::loop()
// Some writes already could not be submitted // Some writes already could not be submitted
continue; continue;
} }
dequeue_op = dequeue_del(op); wr_st = dequeue_del(op);
has_writes = dequeue_op ? 1 : 2; has_writes = wr_st > 0 ? 1 : 2;
} }
else if (op->opcode == BS_OP_SYNC) else if (op->opcode == BS_OP_SYNC)
{ {
@@ -178,29 +169,31 @@ void blockstore_impl_t::loop()
// Can't submit SYNC before previous writes // Can't submit SYNC before previous writes
continue; continue;
} }
dequeue_op = dequeue_sync(op); wr_st = continue_sync(op, false);
if (wr_st != 2)
{
has_writes = wr_st > 0 ? 1 : 2;
}
} }
else if (op->opcode == BS_OP_STABLE) else if (op->opcode == BS_OP_STABLE)
{ {
dequeue_op = dequeue_stable(op); wr_st = dequeue_stable(op);
} }
else if (op->opcode == BS_OP_ROLLBACK) else if (op->opcode == BS_OP_ROLLBACK)
{ {
dequeue_op = dequeue_rollback(op); wr_st = dequeue_rollback(op);
} }
else if (op->opcode == BS_OP_LIST) else if (op->opcode == BS_OP_LIST)
{ {
// LIST doesn't need to be blocked by previous modifications, // LIST doesn't need to be blocked by previous modifications
// it only needs to include all in-progress writes as they're guaranteed
// to be readable and stabilizable/rollbackable by subsequent operations
process_list(op); process_list(op);
dequeue_op = true; wr_st = 2;
} }
if (dequeue_op) if (wr_st == 2)
{ {
submit_queue.erase(op_ptr); new_idx--;
} }
else if (wr_st == 0)
{ {
ringloop->restore(prev_sqe_pos); ringloop->restore(prev_sqe_pos);
if (PRIV(op)->wait_for == WAIT_SQE) if (PRIV(op)->wait_for == WAIT_SQE)
@@ -211,6 +204,14 @@ void blockstore_impl_t::loop()
} }
} }
} }
if (op_idx != new_idx)
{
while (op_idx < submit_queue.size())
{
submit_queue[new_idx++] = submit_queue[op_idx++];
}
submit_queue.resize(new_idx);
}
if (!readonly) if (!readonly)
{ {
flusher->loop(); flusher->loop();
@@ -233,7 +234,7 @@ bool blockstore_impl_t::is_safe_to_stop()
{ {
// It's safe to stop blockstore when there are no in-flight operations, // It's safe to stop blockstore when there are no in-flight operations,
// no in-progress syncs and flusher isn't doing anything // no in-progress syncs and flusher isn't doing anything
if (submit_queue.size() > 0 || in_progress_syncs.size() > 0 || !readonly && flusher->is_active()) if (submit_queue.size() > 0 || !readonly && flusher->is_active())
{ {
return false; return false;
} }
@@ -300,7 +301,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
} }
else if (PRIV(op)->wait_for == WAIT_FREE) else if (PRIV(op)->wait_for == WAIT_FREE)
{ {
if (!data_alloc->get_free_count() && !flusher->is_active()) if (!data_alloc->get_free_count() && flusher->is_active())
{ {
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Still waiting for free space on the data device\n"); printf("Still waiting for free space on the data device\n");
@@ -315,7 +316,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
} }
} }
void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first) void blockstore_impl_t::enqueue_op(blockstore_op_t *op)
{ {
if (op->opcode < BS_OP_MIN || op->opcode > BS_OP_MAX || if (op->opcode < BS_OP_MIN || op->opcode > BS_OP_MAX ||
((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE) && ( ((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE) && (
@@ -323,8 +324,7 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
op->len > block_size-op->offset || op->len > block_size-op->offset ||
(op->len % disk_alignment) (op->len % disk_alignment)
)) || )) ||
readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST || readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST)
first && (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE))
{ {
// Basic verification not passed // Basic verification not passed
op->retval = -EINVAL; op->retval = -EINVAL;
@@ -374,25 +374,12 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
std::function<void (blockstore_op_t*)>(op->callback)(op); std::function<void (blockstore_op_t*)>(op->callback)(op);
return; return;
} }
if (op->opcode == BS_OP_SYNC && immediate_commit == IMMEDIATE_ALL)
{
op->retval = 0;
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
// Call constructor without allocating memory. We'll call destructor before returning op back // Call constructor without allocating memory. We'll call destructor before returning op back
new ((void*)op->private_data) blockstore_op_private_t; new ((void*)op->private_data) blockstore_op_private_t;
PRIV(op)->wait_for = 0; PRIV(op)->wait_for = 0;
PRIV(op)->op_state = 0; PRIV(op)->op_state = 0;
PRIV(op)->pending_ops = 0; PRIV(op)->pending_ops = 0;
if (!first) submit_queue.push_back(op);
{
submit_queue.push_back(op);
}
else
{
submit_queue.push_front(op);
}
ringloop->wakeup(); ringloop->wakeup();
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
@@ -160,8 +160,6 @@ struct blockstore_op_private_t
// Sync // Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes; std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
int sync_small_checked, sync_big_checked; int sync_small_checked, sync_big_checked;
std::list<blockstore_op_t*>::iterator in_progress_ptr;
int prev_sync_count;
}; };
// https://github.com/algorithm-ninja/cpp-btree // https://github.com/algorithm-ninja/cpp-btree
@@ -199,7 +197,10 @@ class blockstore_impl_t
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs // Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
int immediate_commit = IMMEDIATE_NONE; int immediate_commit = IMMEDIATE_NONE;
bool inmemory_meta = false; bool inmemory_meta = false;
int flusher_count; // Maximum flusher count
unsigned flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
/******* END OF OPTIONS *******/ /******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer; struct ring_consumer_t ring_consumer;
@@ -207,9 +208,8 @@ class blockstore_impl_t
blockstore_clean_db_t clean_db; blockstore_clean_db_t clean_db;
uint8_t *clean_bitmap = NULL; uint8_t *clean_bitmap = NULL;
blockstore_dirty_db_t dirty_db; blockstore_dirty_db_t dirty_db;
std::list<blockstore_op_t*> submit_queue; // FIXME: funny thing is that vector is better here std::vector<blockstore_op_t*> submit_queue;
std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes; std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
std::list<blockstore_op_t*> in_progress_syncs; // ...and probably here, too
allocator *data_alloc = NULL; allocator *data_alloc = NULL;
uint8_t *zero_object; uint8_t *zero_object;
@@ -226,6 +226,7 @@ class blockstore_impl_t
struct journal_t journal; struct journal_t journal;
journal_flusher_t *flusher; journal_flusher_t *flusher;
int write_iodepth = 0;
bool live = false, queue_stall = false; bool live = false, queue_stall = false;
ring_loop_t *ringloop; ring_loop_t *ringloop;
@@ -267,6 +268,7 @@ class blockstore_impl_t
// Write // Write
bool enqueue_write(blockstore_op_t *op); bool enqueue_write(blockstore_op_t *op);
void cancel_all_writes(blockstore_op_t *op, blockstore_dirty_db_t::iterator dirty_it, int retval);
int dequeue_write(blockstore_op_t *op); int dequeue_write(blockstore_op_t *op);
int dequeue_del(blockstore_op_t *op); int dequeue_del(blockstore_op_t *op);
int continue_write(blockstore_op_t *op); int continue_write(blockstore_op_t *op);
@@ -274,11 +276,9 @@ class blockstore_impl_t
void handle_write_event(ring_data_t *data, blockstore_op_t *op); void handle_write_event(ring_data_t *data, blockstore_op_t *op);
// Sync // Sync
int dequeue_sync(blockstore_op_t *op); int continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync);
void handle_sync_event(ring_data_t *data, blockstore_op_t *op); void handle_sync_event(ring_data_t *data, blockstore_op_t *op);
int continue_sync(blockstore_op_t *op); void ack_sync(blockstore_op_t *op);
void ack_one_sync(blockstore_op_t *op);
int ack_sync(blockstore_op_t *op);
// Stabilize // Stabilize
int dequeue_stable(blockstore_op_t *op); int dequeue_stable(blockstore_op_t *op);
@@ -318,7 +318,7 @@ public:
bool is_stalled(); bool is_stalled();
// Submission // Submission
void enqueue_op(blockstore_op_t *op, bool first = false); void enqueue_op(blockstore_op_t *op);
// Unstable writes are added here (map of object_id -> version) // Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> unstable_writes; std::unordered_map<object_id, uint64_t> unstable_writes;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <sys/file.h> #include <sys/file.h>
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -70,6 +70,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10); meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10); bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10); flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
// Validate // Validate
if (!block_size) if (!block_size)
{ {
@@ -83,6 +84,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{ {
flusher_count = 32; flusher_count = 32;
} }
if (!max_write_iodepth)
{
max_write_iodepth = 128;
}
if (!disk_alignment) if (!disk_alignment)
{ {
disk_alignment = 4096; disk_alignment = 4096;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -112,7 +112,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
read_op->version = 0; read_op->version = 0;
read_op->retval = read_op->len; read_op->retval = read_op->len;
FINISH_OP(read_op); FINISH_OP(read_op);
return 1; return 2;
} }
uint64_t fulfilled = 0; uint64_t fulfilled = 0;
PRIV(read_op)->pending_ops = 0; PRIV(read_op)->pending_ops = 0;
@@ -191,8 +191,8 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (bmp_end > bmp_start) if (bmp_end > bmp_start)
{ {
// fill with zeroes // fill with zeroes
fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity, assert(fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0); bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
} }
bmp_start = bmp_end; bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size) while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
@@ -218,7 +218,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
else if (fulfilled < read_op->len) else if (fulfilled < read_op->len)
{ {
// fill remaining parts with zeroes // fill remaining parts with zeroes
fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0); assert(fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
} }
assert(fulfilled == read_op->len); assert(fulfilled == read_op->len);
read_op->version = result_version; read_op->version = result_version;
@@ -232,10 +232,10 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
} }
read_op->retval = read_op->len; read_op->retval = read_op->len;
FINISH_OP(read_op); FINISH_OP(read_op);
return 1; return 2;
} }
read_op->retval = 0; read_op->retval = 0;
return 1; return 2;
} }
void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op) void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -50,7 +50,7 @@ skip_ov:
{ {
op->retval = -EBUSY; op->retval = -EBUSY;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
if (dirty_it == dirty_db.begin()) if (dirty_it == dirty_db.begin())
{ {
@@ -66,7 +66,7 @@ skip_ov:
// Already rolled back // Already rolled back
op->retval = 0; op->retval = 0;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
// Check journal space // Check journal space
blockstore_journal_check_t space_check(this); blockstore_journal_check_t space_check(this);
@@ -126,11 +126,8 @@ resume_2:
resume_3: resume_3:
if (!disable_journal_fsync) if (!disable_journal_fsync)
{ {
io_uring_sqe *sqe = get_sqe(); io_uring_sqe *sqe;
if (!sqe) BS_SUBMIT_GET_SQE_DECL(sqe);
{
return 0;
}
ring_data_t *data = ((ring_data_t*)sqe->user_data); ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 }; data->iov = { 0 };
@@ -151,7 +148,7 @@ resume_5:
// Acknowledge op // Acknowledge op
op->retval = 0; op->retval = 0;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov) void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
@@ -216,10 +213,7 @@ void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t
if (PRIV(op)->pending_ops == 0) if (PRIV(op)->pending_ops == 0)
{ {
PRIV(op)->op_state++; PRIV(op)->op_state++;
if (!continue_rollback(op)) ringloop->wakeup();
{
submit_queue.push_front(op);
}
} }
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -60,7 +60,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// No such object version // No such object version
op->retval = -ENOENT; op->retval = -ENOENT;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
else else
{ {
@@ -77,7 +77,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// Object not synced yet. Caller must sync it first // Object not synced yet. Caller must sync it first
op->retval = -EBUSY; op->retval = -EBUSY;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
else if (!IS_STABLE(dirty_it->second.state)) else if (!IS_STABLE(dirty_it->second.state))
{ {
@@ -89,7 +89,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// Already stable // Already stable
op->retval = 0; op->retval = 0;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
// Check journal space // Check journal space
blockstore_journal_check_t space_check(this); blockstore_journal_check_t space_check(this);
@@ -150,11 +150,8 @@ resume_2:
resume_3: resume_3:
if (!disable_journal_fsync) if (!disable_journal_fsync)
{ {
io_uring_sqe *sqe = get_sqe(); io_uring_sqe *sqe;
if (!sqe) BS_SUBMIT_GET_SQE_DECL(sqe);
{
return 0;
}
ring_data_t *data = ((ring_data_t*)sqe->user_data); ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 }; data->iov = { 0 };
@@ -176,7 +173,7 @@ resume_5:
// Acknowledge op // Acknowledge op
op->retval = 0; op->retval = 0;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
void blockstore_impl_t::mark_stable(const obj_ver_id & v) void blockstore_impl_t::mark_stable(const obj_ver_id & v)
@@ -228,9 +225,6 @@ void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *
if (PRIV(op)->pending_ops == 0) if (PRIV(op)->pending_ops == 0)
{ {
PRIV(op)->op_state++; PRIV(op)->op_state++;
if (!continue_stable(op)) ringloop->wakeup();
{
submit_queue.push_front(op);
}
} }
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -12,8 +12,15 @@
#define SYNC_JOURNAL_SYNC_SENT 7 #define SYNC_JOURNAL_SYNC_SENT 7
#define SYNC_DONE 8 #define SYNC_DONE 8
int blockstore_impl_t::dequeue_sync(blockstore_op_t *op) int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync)
{ {
if (immediate_commit == IMMEDIATE_ALL)
{
// We can return immediately because sync is only dequeued after all previous writes
op->retval = 0;
FINISH_OP(op);
return 2;
}
if (PRIV(op)->op_state == 0) if (PRIV(op)->op_state == 0)
{ {
stop_sync_submitted = false; stop_sync_submitted = false;
@@ -29,34 +36,15 @@ int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
PRIV(op)->op_state = SYNC_HAS_SMALL; PRIV(op)->op_state = SYNC_HAS_SMALL;
else else
PRIV(op)->op_state = SYNC_DONE; PRIV(op)->op_state = SYNC_DONE;
// Always add sync to in_progress_syncs because we clear unsynced_big_writes and unsynced_small_writes
PRIV(op)->prev_sync_count = in_progress_syncs.size();
PRIV(op)->in_progress_ptr = in_progress_syncs.insert(in_progress_syncs.end(), op);
} }
continue_sync(op);
// Always dequeue because we always add syncs to in_progress_syncs
return 1;
}
int blockstore_impl_t::continue_sync(blockstore_op_t *op)
{
auto cb = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
if (PRIV(op)->op_state == SYNC_HAS_SMALL) if (PRIV(op)->op_state == SYNC_HAS_SMALL)
{ {
// No big writes, just fsync the journal // No big writes, just fsync the journal
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
if (journal.sector_info[journal.cur_sector].dirty) if (journal.sector_info[journal.cur_sector].dirty)
{ {
// Write out the last journal sector if it happens to be dirty // Write out the last journal sector if it happens to be dirty
BS_SUBMIT_GET_ONLY_SQE(sqe); BS_SUBMIT_GET_ONLY_SQE(sqe);
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb); prepare_journal_sector_write(journal, journal.cur_sector, sqe, [this, op](ring_data_t *data) { handle_sync_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector; PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1; PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT; PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
@@ -69,21 +57,13 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
} }
if (PRIV(op)->op_state == SYNC_HAS_BIG) if (PRIV(op)->op_state == SYNC_HAS_BIG)
{ {
for (; PRIV(op)->sync_big_checked < PRIV(op)->sync_big_writes.size(); PRIV(op)->sync_big_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_big_writes[PRIV(op)->sync_big_checked]].state))
{
// Wait for big inflight writes to complete
return 0;
}
}
// 1st step: fsync data // 1st step: fsync data
if (!disable_data_fsync) if (!disable_data_fsync)
{ {
BS_SUBMIT_GET_SQE(sqe, data); BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, data_fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, data_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 }; data->iov = { 0 };
data->callback = cb; data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0; PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1; PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_DATA_SYNC_SENT; PRIV(op)->op_state = SYNC_DATA_SYNC_SENT;
@@ -96,14 +76,6 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
} }
if (PRIV(op)->op_state == SYNC_DATA_SYNC_DONE) if (PRIV(op)->op_state == SYNC_DATA_SYNC_DONE)
{ {
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
// 2nd step: Data device is synced, prepare & write journal entries // 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers // Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this); blockstore_journal_check_t space_check(this);
@@ -127,7 +99,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
{ {
if (cur_sector == -1) if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector; PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb); prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
cur_sector = journal.cur_sector; cur_sector = journal.cur_sector;
} }
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry( journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
@@ -152,7 +124,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
journal.crc32_last = je->crc32; journal.crc32_last = je->crc32;
it++; it++;
} }
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb); prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
assert(s == space_check.sectors_to_write); assert(s == space_check.sectors_to_write);
if (cur_sector == -1) if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector; PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
@@ -168,7 +140,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
BS_SUBMIT_GET_SQE(sqe, data); BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC); my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 }; data->iov = { 0 };
data->callback = cb; data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->pending_ops = 1; PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT; PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
return 1; return 1;
@@ -178,9 +150,10 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
PRIV(op)->op_state = SYNC_DONE; PRIV(op)->op_state = SYNC_DONE;
} }
} }
if (PRIV(op)->op_state == SYNC_DONE) if (PRIV(op)->op_state == SYNC_DONE && !queue_has_in_progress_sync)
{ {
return ack_sync(op); ack_sync(op);
return 2;
} }
return 1; return 1;
} }
@@ -212,42 +185,16 @@ void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op
else if (PRIV(op)->op_state == SYNC_JOURNAL_SYNC_SENT) else if (PRIV(op)->op_state == SYNC_JOURNAL_SYNC_SENT)
{ {
PRIV(op)->op_state = SYNC_DONE; PRIV(op)->op_state = SYNC_DONE;
ack_sync(op);
} }
else else
{ {
throw std::runtime_error("BUG: unexpected sync op state"); throw std::runtime_error("BUG: unexpected sync op state");
} }
ringloop->wakeup();
} }
} }
int blockstore_impl_t::ack_sync(blockstore_op_t *op) void blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
if (PRIV(op)->op_state == SYNC_DONE && PRIV(op)->prev_sync_count == 0)
{
// Remove dependency of subsequent syncs
auto it = PRIV(op)->in_progress_ptr;
int done_syncs = 1;
++it;
// Acknowledge sync
ack_one_sync(op);
while (it != in_progress_syncs.end())
{
auto & next_sync = *it++;
PRIV(next_sync)->prev_sync_count -= done_syncs;
if (PRIV(next_sync)->prev_sync_count == 0 && PRIV(next_sync)->op_state == SYNC_DONE)
{
done_syncs++;
// Acknowledge next_sync
ack_one_sync(next_sync);
}
}
return 2;
}
return 0;
}
void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
{ {
// Handle states // Handle states
for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++) for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
@@ -295,7 +242,6 @@ void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
} }
} }
} }
in_progress_syncs.erase(PRIV(op)->in_progress_ptr);
op->retval = 0; op->retval = 0;
FINISH_OP(op); FINISH_OP(op);
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h" #include "blockstore_impl.h"
@@ -124,6 +124,29 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
return true; return true;
} }
void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_db_t::iterator dirty_it, int retval)
{
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
dirty_db.erase(dirty_it++);
}
bool found = false;
for (auto other_op: submit_queue)
{
if (!found && other_op == op)
found = true;
else if (found && other_op->oid == op->oid &&
(other_op->opcode == BS_OP_WRITE || other_op->opcode == BS_OP_WRITE_STABLE))
{
// Mark operations to cancel them
PRIV(other_op)->real_version = UINT64_MAX;
other_op->retval = retval;
}
}
op->retval = retval;
FINISH_OP(op);
}
// First step of the write algorithm: dequeue operation and submit initial write(s) // First step of the write algorithm: dequeue operation and submit initial write(s)
int blockstore_impl_t::dequeue_write(blockstore_op_t *op) int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{ {
@@ -143,6 +166,12 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
} }
if (PRIV(op)->real_version != 0) if (PRIV(op)->real_version != 0)
{ {
if (PRIV(op)->real_version == UINT64_MAX)
{
// This is the flag value used to cancel operations
FINISH_OP(op);
return 2;
}
// Restore original low version number for unblocked operations // Restore original low version number for unblocked operations
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version); printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
@@ -152,11 +181,9 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version) if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version)
{ {
// Original version is still invalid // Original version is still invalid
// FIXME Oops. Successive small writes will currently break in an unexpected way. Fix it // All subsequent writes to the same object must be canceled too
dirty_db.erase(dirty_it); cancel_all_writes(op, dirty_it, -EEXIST);
op->retval = -EEXIST; return 2;
FINISH_OP(op);
return 1;
} }
op->version = PRIV(op)->real_version; op->version = PRIV(op)->real_version;
PRIV(op)->real_version = 0; PRIV(op)->real_version = 0;
@@ -167,6 +194,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
.version = op->version, .version = op->version,
}, e).first; }, e).first;
} }
if (write_iodepth >= max_write_iodepth)
{
return 0;
}
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE) if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{ {
blockstore_journal_check_t space_check(this); blockstore_journal_check_t space_check(this);
@@ -185,12 +216,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
PRIV(op)->wait_for = WAIT_FREE; PRIV(op)->wait_for = WAIT_FREE;
return 0; return 0;
} }
// FIXME Oops. Successive small writes will currently break in an unexpected way. Fix it cancel_all_writes(op, dirty_it, -ENOSPC);
dirty_db.erase(dirty_it); return 2;
op->retval = -ENOSPC;
FINISH_OP(op);
return 1;
} }
write_iodepth++;
BS_SUBMIT_GET_SQE(sqe, data); BS_SUBMIT_GET_SQE(sqe, data);
dirty_it->second.location = loc << block_order; dirty_it->second.location = loc << block_order;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED; dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
@@ -243,6 +272,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{ {
return 0; return 0;
} }
write_iodepth++;
// There is sufficient space. Get SQE(s) // There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL; struct io_uring_sqe *sqe1 = NULL;
if (immediate_commit != IMMEDIATE_NONE || if (immediate_commit != IMMEDIATE_NONE ||
@@ -340,7 +370,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (!PRIV(op)->pending_ops) if (!PRIV(op)->pending_ops)
{ {
PRIV(op)->op_state = 4; PRIV(op)->op_state = 4;
continue_write(op); return continue_write(op);
} }
else else
{ {
@@ -354,24 +384,24 @@ int blockstore_impl_t::continue_write(blockstore_op_t *op)
{ {
io_uring_sqe *sqe = NULL; io_uring_sqe *sqe = NULL;
journal_entry_big_write *je; journal_entry_big_write *je;
int op_state = PRIV(op)->op_state;
if (op_state != 2 && op_state != 4)
{
// In progress
return 1;
}
auto dirty_it = dirty_db.find((obj_ver_id){ auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid, .oid = op->oid,
.version = op->version, .version = op->version,
}); });
assert(dirty_it != dirty_db.end()); assert(dirty_it != dirty_db.end());
if (PRIV(op)->op_state == 2) if (op_state == 2)
goto resume_2; goto resume_2;
else if (PRIV(op)->op_state == 4) else if (op_state == 4)
goto resume_4; goto resume_4;
else
return 1;
resume_2: resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry // Only for the immediate_commit mode: prepare and submit big_write journal entry
sqe = get_sqe(); BS_SUBMIT_GET_SQE_DECL(sqe);
if (!sqe)
{
return 0;
}
je = (journal_entry_big_write*)prefill_single_journal_entry( je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE, journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) sizeof(journal_entry_big_write)
@@ -432,8 +462,9 @@ resume_4:
} }
// Acknowledge write // Acknowledge write
op->retval = op->len; op->retval = op->len;
write_iodepth--;
FINISH_OP(op); FINISH_OP(op);
return 1; return 2;
} }
void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *op) void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *op)
@@ -452,10 +483,7 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
{ {
release_journal_sectors(op); release_journal_sectors(op);
PRIV(op)->op_state++; PRIV(op)->op_state++;
if (!continue_write(op)) ringloop->wakeup();
{
submit_queue.push_front(op);
}
} }
} }
@@ -493,6 +521,10 @@ void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
int blockstore_impl_t::dequeue_del(blockstore_op_t *op) int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
{ {
if (PRIV(op)->op_state)
{
return continue_write(op);
}
auto dirty_it = dirty_db.find((obj_ver_id){ auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid, .oid = op->oid,
.version = op->version, .version = op->version,
@@ -503,6 +535,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
{ {
return 0; return 0;
} }
write_iodepth++;
io_uring_sqe *sqe = NULL; io_uring_sqe *sqe = NULL;
if (immediate_commit != IMMEDIATE_NONE || if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) && (journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
@@ -561,7 +594,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
if (!PRIV(op)->pending_ops) if (!PRIV(op)->pending_ops)
{ {
PRIV(op)->op_state = 4; PRIV(op)->op_state = 4;
continue_write(op); return continue_write(op);
} }
else else
{ {

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <stdexcept> #include <stdexcept>
#include "cluster_client.h" #include "cluster_client.h"
@@ -8,13 +8,11 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
{ {
this->ringloop = ringloop; this->ringloop = ringloop;
this->tfd = tfd; this->tfd = tfd;
this->config = config;
log_level = config["log_level"].int64_value();
msgr.osd_num = 0; msgr.osd_num = 0;
msgr.tfd = tfd; msgr.tfd = tfd;
msgr.ringloop = ringloop; msgr.ringloop = ringloop;
msgr.log_level = log_level;
msgr.repeer_pgs = [this](osd_num_t peer_osd) msgr.repeer_pgs = [this](osd_num_t peer_osd)
{ {
if (msgr.osd_peer_fds.find(peer_osd) != msgr.osd_peer_fds.end()) if (msgr.osd_peer_fds.find(peer_osd) != msgr.osd_peer_fds.end())
@@ -67,8 +65,7 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
msgr.stop_client(op->peer_fd); msgr.stop_client(op->peer_fd);
delete op; delete op;
}; };
msgr.use_sync_send_recv = config["use_sync_send_recv"].bool_value() || msgr.init();
config["use_sync_send_recv"].uint64_value();
st_cli.tfd = tfd; st_cli.tfd = tfd;
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); }; st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@@ -185,16 +182,8 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
{ {
up_wait_retry_interval = 50; up_wait_retry_interval = 50;
} }
msgr.peer_connect_interval = config["peer_connect_interval"].uint64_value(); msgr.parse_config(config);
if (!msgr.peer_connect_interval) msgr.parse_config(this->config);
{
msgr.peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
msgr.peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!msgr.peer_connect_timeout)
{
msgr.peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
st_cli.load_pgs(); st_cli.load_pgs();
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
@@ -82,6 +82,7 @@ class cluster_client_t
public: public:
etcd_state_client_t st_cli; etcd_state_client_t st_cli;
osd_messenger_t msgr; osd_messenger_t msgr;
json11::Json config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config); cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
~cluster_client_t(); ~cluster_client_t();

View File

@@ -8,4 +8,10 @@
// unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v) // unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
// unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v) // unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
#ifdef __cplusplus
extern "C" {
#endif
uint32_t crc32c(uint32_t crc, const void *buf, size_t len); uint32_t crc32c(uint32_t crc, const void *buf, size_t len);
#ifdef __cplusplus
};
#endif

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#define _LARGEFILE64_SOURCE #define _LARGEFILE64_SOURCE
#include <sys/types.h> #include <sys/types.h>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <sys/epoll.h> #include <sys/epoll.h>
#include <sys/poll.h> #include <sys/poll.h>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h" #include "osd_ops.h"
#include "pg_states.h" #include "pg_states.h"
@@ -7,6 +7,16 @@
#include "http_client.h" #include "http_client.h"
#include "base64.h" #include "base64.h"
etcd_state_client_t::~etcd_state_client_t()
{
etcd_watches_initialised = -1;
if (etcd_watch_ws)
{
etcd_watch_ws->close();
etcd_watch_ws = NULL;
}
}
json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json) json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
{ {
json_kv_t kv; json_kv_t kv;
@@ -46,6 +56,23 @@ void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int t
http_request_json(tfd, etcd_address, req, timeout, callback); http_request_json(tfd, etcd_address, req, timeout, callback);
} }
void etcd_state_client_t::add_etcd_url(std::string addr)
{
if (addr.length() > 0)
{
if (strtolower(addr.substr(0, 7)) == "http://")
addr = addr.substr(7);
else if (strtolower(addr.substr(0, 8)) == "https://")
{
printf("HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
exit(1);
}
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
}
void etcd_state_client_t::parse_config(json11::Json & config) void etcd_state_client_t::parse_config(json11::Json & config)
{ {
this->etcd_addresses.clear(); this->etcd_addresses.clear();
@@ -55,13 +82,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
while (1) while (1)
{ {
int pos = ea.find(','); int pos = ea.find(',');
std::string addr = pos >= 0 ? ea.substr(0, pos) : ea; add_etcd_url(pos >= 0 ? ea.substr(0, pos) : ea);
if (addr.length() > 0)
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
if (pos >= 0) if (pos >= 0)
ea = ea.substr(pos+1); ea = ea.substr(pos+1);
else else
@@ -72,13 +93,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
{ {
for (auto & ea: config["etcd_address"].array_items()) for (auto & ea: config["etcd_address"].array_items())
{ {
std::string addr = ea.string_value(); add_etcd_url(ea.string_value());
if (addr != "")
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
} }
} }
this->etcd_prefix = config["etcd_prefix"].string_value(); this->etcd_prefix = config["etcd_prefix"].string_value();
@@ -160,7 +175,7 @@ void etcd_state_client_t::start_etcd_watcher()
start_etcd_watcher(); start_etcd_watcher();
}); });
} }
else else if (etcd_watches_initialised > 0)
{ {
// Connection was live, retry immediately // Connection was live, retry immediately
start_etcd_watcher(); start_etcd_watcher();

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
@@ -54,6 +54,9 @@ struct pool_config_t
struct etcd_state_client_t struct etcd_state_client_t
{ {
protected:
void add_etcd_url(std::string);
public:
std::vector<std::string> etcd_addresses; std::vector<std::string> etcd_addresses;
std::string etcd_prefix; std::string etcd_prefix;
int log_level = 0; int log_level = 0;
@@ -81,4 +84,5 @@ struct etcd_state_client_t
void load_pgs(); void load_pgs();
void parse_state(const std::string & key, const json11::Json & value); void parse_state(const std::string & key, const json11::Json & value);
void parse_config(json11::Json & config); void parse_config(json11::Json & config);
~etcd_state_client_t();
}; };

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test cluster I/O // FIO engine to test cluster I/O
// //
@@ -117,8 +117,15 @@ static struct fio_option options[] = {
static int sec_setup(struct thread_data *td) static int sec_setup(struct thread_data *td)
{ {
sec_options *o = (sec_options*)td->eo;
sec_data *bsd; sec_data *bsd;
if (!o->etcd_host)
{
td_verror(td, EINVAL, "etcd address is missing");
return 1;
}
bsd = new sec_data; bsd = new sec_data;
if (!bsd) if (!bsd)
{ {

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// FIO engine to test Blockstore // FIO engine to test Blockstore
// //

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test Blockstore through Secondary OSD interface // FIO engine to test Blockstore through Secondary OSD interface
// //

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <netinet/tcp.h> #include <netinet/tcp.h>
#include <sys/epoll.h> #include <sys/epoll.h>
@@ -22,7 +22,6 @@
#define READ_BUFFER_SIZE 9000 #define READ_BUFFER_SIZE 9000
static int extract_port(std::string & host); static int extract_port(std::string & host);
static std::string strtolower(const std::string & in);
static std::string trim(const std::string & in); static std::string trim(const std::string & in);
static std::string ws_format_frame(int type, uint64_t size); static std::string ws_format_frame(int type, uint64_t size);
static bool ws_parse_frame(std::string & buf, int & type, std::string & res); static bool ws_parse_frame(std::string & buf, int & type, std::string & res);
@@ -673,7 +672,7 @@ static int extract_port(std::string & host)
return port; return port;
} }
static std::string strtolower(const std::string & in) std::string strtolower(const std::string & in)
{ {
std::string s = in; std::string s = in;
for (int i = 0; i < s.length(); i++) for (int i = 0; i < s.length(); i++)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
#include <string> #include <string>
@@ -49,6 +49,8 @@ std::vector<std::string> getifaddr_list(bool include_v6 = false);
uint64_t stoull_full(const std::string & str, int base = 10); uint64_t stoull_full(const std::string & str, int base = 10);
std::string strtolower(const std::string & in);
void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request, void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
const http_options_t & options, std::function<void(const http_response_t *response)> callback); const http_options_t & options, std::function<void(const http_response_t *response)> callback);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h> #include <unistd.h>
#include <fcntl.h> #include <fcntl.h>
@@ -26,14 +26,106 @@ osd_op_t::~osd_op_t()
} }
} }
void osd_messenger_t::init()
{
keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
{
for (auto cl_it = clients.begin(); cl_it != clients.end();)
{
auto cl = (cl_it++)->second;
if (!cl->osd_num)
{
// Do not run keepalive on regular clients
continue;
}
if (cl->ping_time_remaining > 0)
{
cl->ping_time_remaining--;
if (!cl->ping_time_remaining)
{
// Ping timed out, stop the client
stop_client(cl->peer_fd, true);
}
}
else if (cl->idle_time_remaining > 0)
{
cl->idle_time_remaining--;
if (!cl->idle_time_remaining)
{
// Connection is idle for <osd_idle_time>, send ping
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl->peer_fd;
op->req = (osd_any_op_t){
.hdr = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = this->next_subop_id++,
.opcode = OSD_OP_PING,
},
};
op->callback = [this, cl](osd_op_t *op)
{
int fail_fd = (op->reply.hdr.retval != 0 ? op->peer_fd : -1);
cl->ping_time_remaining = 0;
delete op;
if (fail_fd >= 0)
{
stop_client(fail_fd, true);
}
};
outbox_push(op);
cl->ping_time_remaining = osd_ping_timeout;
cl->idle_time_remaining = osd_idle_timeout;
}
}
else
{
cl->idle_time_remaining = osd_idle_timeout;
}
}
});
}
osd_messenger_t::~osd_messenger_t() osd_messenger_t::~osd_messenger_t()
{ {
if (keepalive_timer_id >= 0)
{
tfd->clear_timer(keepalive_timer_id);
keepalive_timer_id = -1;
}
while (clients.size() > 0) while (clients.size() > 0)
{ {
stop_client(clients.begin()->first, true); stop_client(clients.begin()->first, true);
} }
} }
void osd_messenger_t::parse_config(const json11::Json & config)
{
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!this->peer_connect_interval)
{
this->peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
this->peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!this->peer_connect_timeout)
{
this->peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
this->osd_idle_timeout = config["osd_idle_timeout"].uint64_value();
if (!this->osd_idle_timeout)
{
this->osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->osd_ping_timeout = config["osd_ping_timeout"].uint64_value();
if (!this->osd_ping_timeout)
{
this->osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->log_level = config["log_level"].uint64_value();
}
void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state) void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
{ {
if (wanted_peers.find(peer_osd) == wanted_peers.end()) if (wanted_peers.find(peer_osd) == wanted_peers.end())

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
@@ -34,6 +34,7 @@
#define DEFAULT_PEER_CONNECT_INTERVAL 5 #define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5 #define DEFAULT_PEER_CONNECT_TIMEOUT 5
#define DEFAULT_OSD_PING_TIMEOUT 5
// Kind of a vector with small-list-optimisation // Kind of a vector with small-list-optimisation
struct osd_op_buf_list_t struct osd_op_buf_list_t
@@ -198,6 +199,8 @@ struct osd_client_t
int peer_fd; int peer_fd;
int peer_state; int peer_state;
int connect_timeout_id = -1; int connect_timeout_id = -1;
int ping_time_remaining = 0;
int idle_time_remaining = 0;
osd_num_t osd_num = 0; osd_num_t osd_num = 0;
void *in_buf = NULL; void *in_buf = NULL;
@@ -251,6 +254,7 @@ struct osd_messenger_t
{ {
timerfd_manager_t *tfd; timerfd_manager_t *tfd;
ring_loop_t *ringloop; ring_loop_t *ringloop;
int keepalive_timer_id = -1;
// osd_num_t is only for logging and asserts // osd_num_t is only for logging and asserts
osd_num_t osd_num; osd_num_t osd_num;
@@ -258,6 +262,8 @@ struct osd_messenger_t
int receive_buffer_size = 64*1024; int receive_buffer_size = 64*1024;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL; int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT; int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
int osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
int osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
int log_level = 0; int log_level = 0;
bool use_sync_send_recv = false; bool use_sync_send_recv = false;
@@ -274,6 +280,8 @@ struct osd_messenger_t
osd_op_stats_t stats; osd_op_stats_t stats;
public: public:
void init();
void parse_config(const json11::Json & config);
void connect_peer(uint64_t osd_num, json11::Json peer_state); void connect_peer(uint64_t osd_num, json11::Json peer_state);
void stop_client(int peer_fd, bool force = false); void stop_client(int peer_fd, bool force = false);
void outbox_push(osd_op_t *cur_op); void outbox_push(osd_op_t *cur_op);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "messenger.h" #include "messenger.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#define _XOPEN_SOURCE #define _XOPEN_SOURCE
#include <limits.h> #include <limits.h>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// Similar to qemu-nbd, but sets timeout and uses io_uring // Similar to qemu-nbd, but sets timeout and uses io_uring
#include <linux/nbd.h> #include <linux/nbd.h>
@@ -111,7 +111,7 @@ public:
{ {
printf( printf(
"Vitastor NBD proxy\n" "Vitastor NBD proxy\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.0)\n\n" "(c) Vitaliy Filippov, 2020 (VNPL-1.1)\n\n"
"USAGE:\n" "USAGE:\n"
" %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n" " %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n"
" %s unmap /dev/nbd0\n" " %s unmap /dev/nbd0\n"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <sys/socket.h> #include <sys/socket.h>
#include <sys/poll.h> #include <sys/poll.h>
@@ -37,6 +37,7 @@ osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringlo
c_cli.ringloop = this->ringloop; c_cli.ringloop = this->ringloop;
c_cli.exec_op = [this](osd_op_t *op) { exec_op(op); }; c_cli.exec_op = [this](osd_op_t *op) { exec_op(op); };
c_cli.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); }; c_cli.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
c_cli.init();
init_cluster(); init_cluster();
@@ -55,6 +56,7 @@ void osd_t::parse_config(blockstore_config_t & config)
{ {
if (config.find("log_level") == config.end()) if (config.find("log_level") == config.end())
config["log_level"] = "1"; config["log_level"] = "1";
log_level = strtoull(config["log_level"].c_str(), NULL, 10);
// Initial startup configuration // Initial startup configuration
json11::Json json_config = json11::Json(config); json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config); st_cli.parse_config(json_config);
@@ -66,6 +68,8 @@ void osd_t::parse_config(blockstore_config_t & config)
throw std::runtime_error("osd_num is required in the configuration"); throw std::runtime_error("osd_num is required in the configuration");
c_cli.osd_num = osd_num; c_cli.osd_num = osd_num;
run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no"; run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no";
no_rebalance = config["no_rebalance"] == "true" || config["no_rebalance"] == "1" || config["no_rebalance"] == "yes";
no_recovery = config["no_recovery"] == "true" || config["no_recovery"] == "1" || config["no_recovery"] == "yes";
// Cluster configuration // Cluster configuration
bind_address = config["bind_address"]; bind_address = config["bind_address"];
if (bind_address == "") if (bind_address == "")
@@ -92,6 +96,9 @@ void osd_t::parse_config(blockstore_config_t & config)
recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10); recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE) if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sync_batch = strtoull(config["recovery_sync_batch"].c_str(), NULL, 10);
if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE)
recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes") if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes")
readonly = true; readonly = true;
print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10); print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
@@ -100,14 +107,7 @@ void osd_t::parse_config(blockstore_config_t & config)
slow_log_interval = strtoull(config["slow_log_interval"].c_str(), NULL, 10); slow_log_interval = strtoull(config["slow_log_interval"].c_str(), NULL, 10);
if (!slow_log_interval) if (!slow_log_interval)
slow_log_interval = 10; slow_log_interval = 10;
c_cli.peer_connect_interval = strtoull(config["peer_connect_interval"].c_str(), NULL, 10); c_cli.parse_config(json_config);
if (!c_cli.peer_connect_interval)
c_cli.peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
c_cli.peer_connect_timeout = strtoull(config["peer_connect_timeout"].c_str(), NULL, 10);
if (!c_cli.peer_connect_timeout)
c_cli.peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
log_level = strtoull(config["log_level"].c_str(), NULL, 10);
c_cli.log_level = log_level;
} }
void osd_t::bind_socket() void osd_t::bind_socket()
@@ -211,6 +211,12 @@ void osd_t::exec_op(osd_op_t *cur_op)
finish_op(cur_op, -EINVAL); finish_op(cur_op, -EINVAL);
return; return;
} }
if (cur_op->req.hdr.opcode == OSD_OP_PING)
{
// Pong
finish_op(cur_op, 0);
return;
}
if (readonly && if (readonly &&
cur_op->req.hdr.opcode != OSD_OP_SEC_READ && cur_op->req.hdr.opcode != OSD_OP_SEC_READ &&
cur_op->req.hdr.opcode != OSD_OP_SEC_LIST && cur_op->req.hdr.opcode != OSD_OP_SEC_LIST &&
@@ -261,9 +267,9 @@ void osd_t::reset_stats()
void osd_t::print_stats() void osd_t::print_stats()
{ {
for (int i = 0; i <= OSD_OP_MAX; i++) for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{ {
if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i]) if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i] && i != OSD_OP_PING)
{ {
uint64_t avg = (c_cli.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(c_cli.stats.op_stat_count[i] - prev_stats.op_stat_count[i]); uint64_t avg = (c_cli.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(c_cli.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
uint64_t bw = (c_cli.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval; uint64_t bw = (c_cli.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
@@ -284,7 +290,7 @@ void osd_t::print_stats()
prev_stats.op_stat_bytes[i] = c_cli.stats.op_stat_bytes[i]; prev_stats.op_stat_bytes[i] = c_cli.stats.op_stat_bytes[i];
} }
} }
for (int i = 0; i <= OSD_OP_MAX; i++) for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{ {
if (c_cli.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i]) if (c_cli.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
{ {

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
@@ -37,6 +37,7 @@
#define DEFAULT_AUTOSYNC_INTERVAL 5 #define DEFAULT_AUTOSYNC_INTERVAL 5
#define MAX_RECOVERY_QUEUE 2048 #define MAX_RECOVERY_QUEUE 2048
#define DEFAULT_RECOVERY_QUEUE 4 #define DEFAULT_RECOVERY_QUEUE 4
#define DEFAULT_RECOVERY_BATCH 16
//#define OSD_STUB //#define OSD_STUB
@@ -64,6 +65,8 @@ class osd_t
bool readonly = false; bool readonly = false;
osd_num_t osd_num = 1; // OSD numbers start with 1 osd_num_t osd_num = 1; // OSD numbers start with 1
bool run_primary = false; bool run_primary = false;
bool no_rebalance = false;
bool no_recovery = false;
std::string bind_address; std::string bind_address;
int bind_port, listen_backlog; int bind_port, listen_backlog;
// FIXME: Implement client queue depth limit // FIXME: Implement client queue depth limit
@@ -74,6 +77,7 @@ class osd_t
int immediate_commit = IMMEDIATE_NONE; int immediate_commit = IMMEDIATE_NONE;
int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // sync every 5 seconds int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // sync every 5 seconds
int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
int log_level = 0; int log_level = 0;
// cluster state // cluster state
@@ -95,9 +99,11 @@ class osd_t
std::map<pool_pg_num_t, pg_t> pgs; std::map<pool_pg_num_t, pg_t> pgs;
std::set<pool_pg_num_t> dirty_pgs; std::set<pool_pg_num_t> dirty_pgs;
std::set<osd_num_t> dirty_osds; std::set<osd_num_t> dirty_osds;
int copies_to_delete_after_sync_count = 0;
uint64_t misplaced_objects = 0, degraded_objects = 0, incomplete_objects = 0; uint64_t misplaced_objects = 0, degraded_objects = 0, incomplete_objects = 0;
int peering_state = 0; int peering_state = 0;
std::map<object_id, osd_recovery_op_t> recovery_ops; std::map<object_id, osd_recovery_op_t> recovery_ops;
int recovery_done = 0;
osd_op_t *autosync_op = NULL; osd_op_t *autosync_op = NULL;
// Unstable writes // Unstable writes
@@ -160,6 +166,7 @@ class osd_t
void submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps); void submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);
void discard_list_subop(osd_op_t *list_op); void discard_list_subop(osd_op_t *list_op);
bool stop_pg(pg_t & pg); bool stop_pg(pg_t & pg);
void reset_pg(pg_t & pg);
void finish_stop_pg(pg_t & pg); void finish_stop_pg(pg_t & pg);
// flushing, recovery and backfill // flushing, recovery and backfill
@@ -198,6 +205,7 @@ class osd_t
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval); void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op); void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set); void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
void submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count);
void submit_primary_sync_subops(osd_op_t *cur_op); void submit_primary_sync_subops(osd_op_t *cur_op);
void submit_primary_stab_subops(osd_op_t *cur_op); void submit_primary_stab_subops(osd_op_t *cur_op);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd.h" #include "osd.h"
#include "base64.h" #include "base64.h"
@@ -37,7 +37,7 @@ void osd_t::init_cluster()
.pg_cursize = 0, .pg_cursize = 0,
.pg_size = 3, .pg_size = 3,
.pg_minsize = 2, .pg_minsize = 2,
.parity_chunks = 1, .pg_data_size = 2,
.pool_id = 1, .pool_id = 1,
.pg_num = 1, .pg_num = 1,
.target_set = { 1, 2, 3 }, .target_set = { 1, 2, 3 },
@@ -142,7 +142,7 @@ json11::Json osd_t::get_statistics()
} }
st["host"] = self_state["host"]; st["host"] = self_state["host"];
json11::Json::object op_stats, subop_stats; json11::Json::object op_stats, subop_stats;
for (int i = 0; i <= OSD_OP_MAX; i++) for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{ {
op_stats[osd_op_names[i]] = json11::Json::object { op_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.op_stat_count[i] }, { "count", c_cli.stats.op_stat_count[i] },
@@ -150,7 +150,7 @@ json11::Json osd_t::get_statistics()
{ "bytes", c_cli.stats.op_stat_bytes[i] }, { "bytes", c_cli.stats.op_stat_bytes[i] },
}; };
} }
for (int i = 0; i <= OSD_OP_MAX; i++) for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{ {
subop_stats[osd_op_names[i]] = json11::Json::object { subop_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.subop_stat_count[i] }, { "count", c_cli.stats.subop_stat_count[i] },
@@ -593,7 +593,10 @@ void osd_t::apply_pg_config()
} }
else else
{ {
throw std::runtime_error("Unexpected PG "+std::to_string(pg_num)+" state: "+std::to_string(pg_it->second.state)); throw std::runtime_error(
"Unexpected PG "+std::to_string(pool_id)+"/"+std::to_string(pg_num)+
" state: "+std::to_string(pg_it->second.state)
);
} }
} }
auto & pg = this->pgs[{ .pool_id = pool_id, .pg_num = pg_num }]; auto & pg = this->pgs[{ .pool_id = pool_id, .pg_num = pg_num }];
@@ -603,7 +606,8 @@ void osd_t::apply_pg_config()
.pg_cursize = 0, .pg_cursize = 0,
.pg_size = pool_item.second.pg_size, .pg_size = pool_item.second.pg_size,
.pg_minsize = pool_item.second.pg_minsize, .pg_minsize = pool_item.second.pg_minsize,
.parity_chunks = pool_item.second.parity_chunks, .pg_data_size = pg.scheme == POOL_SCHEME_REPLICATED
? 1 : pool_item.second.pg_size - pool_item.second.parity_chunks,
.pool_id = pool_id, .pool_id = pool_id,
.pg_num = pg_num, .pg_num = pg_num,
.reported_epoch = pg_cfg.epoch, .reported_epoch = pg_cfg.epoch,
@@ -613,7 +617,7 @@ void osd_t::apply_pg_config()
}; };
if (pg.scheme == POOL_SCHEME_JERASURE) if (pg.scheme == POOL_SCHEME_JERASURE)
{ {
use_jerasure(pg.pg_size, pg.pg_size-pg.parity_chunks, true); use_jerasure(pg.pg_size, pg.pg_data_size, true);
} }
this->pg_state_dirty.insert({ .pool_id = pool_id, .pg_num = pg_num }); this->pg_state_dirty.insert({ .pool_id = pool_id, .pg_num = pg_num });
pg.print_state(); pg.print_state();
@@ -661,7 +665,21 @@ void osd_t::report_pg_states()
auto & pg = pg_it->second; auto & pg = pg_it->second;
reporting_pgs.push_back({ *it, pg.history_changed }); reporting_pgs.push_back({ *it, pg.history_changed });
std::string state_key_base64 = base64_encode(st_cli.etcd_prefix+"/pg/state/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num)); std::string state_key_base64 = base64_encode(st_cli.etcd_prefix+"/pg/state/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num));
if (pg.state == PG_STARTING) bool pg_state_exists = false;
if (pg.state != PG_STARTING)
{
auto pool_it = st_cli.pool_config.find(pg.pool_id);
if (pool_it != st_cli.pool_config.end())
{
auto pg_it = pool_it->second.pg_config.find(pg.pg_num);
if (pg_it != pool_it->second.pg_config.end() &&
pg_it->second.cur_state != 0)
{
pg_state_exists = true;
}
}
}
if (!pg_state_exists)
{ {
// Check that the PG key does not exist // Check that the PG key does not exist
// Failed check indicates an unsuccessful PG lock attempt in this case // Failed check indicates an unsuccessful PG lock attempt in this case
@@ -673,9 +691,7 @@ void osd_t::report_pg_states()
} }
else else
{ {
// Check that the key is ours // Check that the key is ours if it already exists
// Failed check indicates success for OFFLINE pgs (PG lock is already deleted)
// and an unexpected race condition for started pgs (PG lock is held by someone else)
checks.push_back(json11::Json::object { checks.push_back(json11::Json::object {
{ "target", "LEASE" }, { "target", "LEASE" },
{ "lease", etcd_lease_id }, { "lease", etcd_lease_id },
@@ -797,17 +813,16 @@ void osd_t::report_pg_states()
for (auto pp: reporting_pgs) for (auto pp: reporting_pgs)
{ {
auto pg_it = this->pgs.find(pp.first); auto pg_it = this->pgs.find(pp.first);
if (pg_it != this->pgs.end()) if (pg_it != this->pgs.end() &&
pg_it->second.state == PG_OFFLINE &&
pg_state_dirty.find(pp.first) == pg_state_dirty.end())
{ {
if (pg_it->second.state == PG_OFFLINE) // Forget offline PGs after reporting their state
if (pg_it->second.scheme == POOL_SCHEME_JERASURE)
{ {
// Remove offline PGs after reporting their state use_jerasure(pg_it->second.pg_size, pg_it->second.pg_data_size, false);
this->pgs.erase(pg_it);
if (pg_it->second.scheme == POOL_SCHEME_JERASURE)
{
use_jerasure(pg_it->second.pg_size, pg_it->second.pg_size-pg_it->second.parity_chunks, false);
}
} }
this->pgs.erase(pg_it);
} }
} }
// Push other PG state updates, if any // Push other PG state updates, if any

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd.h" #include "osd.h"
@@ -95,7 +95,7 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
{ {
// This flush batch is done // This flush batch is done
std::vector<osd_op_t*> continue_ops; std::vector<osd_op_t*> continue_ops;
auto & pg = pgs[pg_id]; auto & pg = pgs.at(pg_id);
auto it = pg.flush_actions.begin(), prev_it = it; auto it = pg.flush_actions.begin(), prev_it = it;
auto erase_start = it; auto erase_start = it;
while (1) while (1)
@@ -209,32 +209,38 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
bool osd_t::pick_next_recovery(osd_recovery_op_t &op) bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
{ {
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++) if (!no_recovery)
{ {
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_DEGRADED)) == (PG_ACTIVE | PG_HAS_DEGRADED)) for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{ {
for (auto obj_it = pg_it->second.degraded_objects.begin(); obj_it != pg_it->second.degraded_objects.end(); obj_it++) if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_DEGRADED)) == (PG_ACTIVE | PG_HAS_DEGRADED))
{ {
if (recovery_ops.find(obj_it->first) == recovery_ops.end()) for (auto obj_it = pg_it->second.degraded_objects.begin(); obj_it != pg_it->second.degraded_objects.end(); obj_it++)
{ {
op.degraded = true; if (recovery_ops.find(obj_it->first) == recovery_ops.end())
op.oid = obj_it->first; {
return true; op.degraded = true;
op.oid = obj_it->first;
return true;
}
} }
} }
} }
} }
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++) if (!no_rebalance)
{ {
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED)) for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{ {
for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++) if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
{ {
if (recovery_ops.find(obj_it->first) == recovery_ops.end()) for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++)
{ {
op.degraded = false; if (recovery_ops.find(obj_it->first) == recovery_ops.end())
op.oid = obj_it->first; {
return true; op.degraded = false;
op.oid = obj_it->first;
return true;
}
} }
} }
} }
@@ -264,7 +270,6 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
} }
op->osd_op->callback = [this, op](osd_op_t *osd_op) op->osd_op->callback = [this, op](osd_op_t *osd_op)
{ {
// Don't sync the write, it will be synced by our regular sync coroutine
if (osd_op->reply.hdr.retval < 0) if (osd_op->reply.hdr.retval < 0)
{ {
// Error recovering object // Error recovering object
@@ -286,6 +291,17 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
op->osd_op = NULL; op->osd_op = NULL;
recovery_ops.erase(op->oid); recovery_ops.erase(op->oid);
delete osd_op; delete osd_op;
if (immediate_commit != IMMEDIATE_ALL)
{
recovery_done++;
if (recovery_done >= recovery_sync_batch)
{
// Force sync every <recovery_sync_batch> operations
// This is required not to pile up an excessive amount of delete operations
autosync();
recovery_done = 0;
}
}
continue_recovery(); continue_recovery();
}; };
exec_op(op->osd_op); exec_op(op->osd_op);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd.h" #include "osd.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h" #include "osd_ops.h"
@@ -19,4 +19,5 @@ const char* osd_op_names[] = {
"primary_write", "primary_write",
"primary_sync", "primary_sync",
"primary_delete", "primary_delete",
"ping",
}; };

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
@@ -27,7 +27,8 @@
#define OSD_OP_WRITE 12 #define OSD_OP_WRITE 12
#define OSD_OP_SYNC 13 #define OSD_OP_SYNC 13
#define OSD_OP_DELETE 14 #define OSD_OP_DELETE 14
#define OSD_OP_MAX 14 #define OSD_OP_PING 15
#define OSD_OP_MAX 15
// Alignment & limit for read/write operations // Alignment & limit for read/write operations
#ifndef MEM_ALIGNMENT #ifndef MEM_ALIGNMENT
#define MEM_ALIGNMENT 512 #define MEM_ALIGNMENT 512

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <netinet/tcp.h> #include <netinet/tcp.h>
#include <sys/epoll.h> #include <sys/epoll.h>
@@ -98,15 +98,13 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
} }
} }
// Repeer on each connect/disconnect peer event // Reset PG state (when peering or stopping)
void osd_t::start_pg_peering(pg_t & pg) void osd_t::reset_pg(pg_t & pg)
{ {
pg.state = PG_PEERING;
this->peering_state |= OSD_PEERING_PGS;
report_pg_state(pg);
// Reset PG state
pg.cur_peers.clear(); pg.cur_peers.clear();
pg.state_dict.clear(); pg.state_dict.clear();
copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
pg.copies_to_delete_after_sync.clear();
incomplete_objects -= pg.incomplete_objects.size(); incomplete_objects -= pg.incomplete_objects.size();
misplaced_objects -= pg.misplaced_objects.size(); misplaced_objects -= pg.misplaced_objects.size();
degraded_objects -= pg.degraded_objects.size(); degraded_objects -= pg.degraded_objects.size();
@@ -135,6 +133,15 @@ void osd_t::start_pg_peering(pg_t & pg)
it++; it++;
} }
dirty_pgs.erase({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }); dirty_pgs.erase({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
}
// Repeer on each connect/disconnect peer event
void osd_t::start_pg_peering(pg_t & pg)
{
pg.state = PG_PEERING;
this->peering_state |= OSD_PEERING_PGS;
reset_pg(pg);
report_pg_state(pg);
// Drop connections of clients who have this PG in dirty_pgs // Drop connections of clients who have this PG in dirty_pgs
if (immediate_commit != IMMEDIATE_ALL) if (immediate_commit != IMMEDIATE_ALL)
{ {
@@ -175,13 +182,18 @@ void osd_t::start_pg_peering(pg_t & pg)
// (PG history is kept up to the latest active+clean state) // (PG history is kept up to the latest active+clean state)
for (auto & history_set: pg.target_history) for (auto & history_set: pg.target_history)
{ {
bool found = false; bool found = true;
for (auto history_osd: history_set) for (auto history_osd: history_set)
{ {
if (history_osd != 0 && c_cli.osd_peer_fds.find(history_osd) != c_cli.osd_peer_fds.end()) if (history_osd != 0)
{ {
found = true; found = false;
break; if (history_osd == this->osd_num ||
c_cli.osd_peer_fds.find(history_osd) != c_cli.osd_peer_fds.end())
{
found = true;
break;
}
} }
} }
if (!found) if (!found)
@@ -454,11 +466,11 @@ bool osd_t::stop_pg(pg_t & pg)
if (pg.peering_state) if (pg.peering_state)
{ {
// Stop peering // Stop peering
for (auto it = pg.peering_state->list_ops.begin(); it != pg.peering_state->list_ops.end();) for (auto it = pg.peering_state->list_ops.begin(); it != pg.peering_state->list_ops.end(); it++)
{ {
discard_list_subop(it->second); discard_list_subop(it->second);
} }
for (auto it = pg.peering_state->list_results.begin(); it != pg.peering_state->list_results.end();) for (auto it = pg.peering_state->list_results.begin(); it != pg.peering_state->list_results.end(); it++)
{ {
if (it->second.buf) if (it->second.buf)
{ {
@@ -468,12 +480,19 @@ bool osd_t::stop_pg(pg_t & pg)
delete pg.peering_state; delete pg.peering_state;
pg.peering_state = NULL; pg.peering_state = NULL;
} }
if (!(pg.state & PG_ACTIVE)) if (pg.state & (PG_STOPPING | PG_OFFLINE))
{ {
return false; return false;
} }
if (!(pg.state & PG_ACTIVE))
{
finish_stop_pg(pg);
return true;
}
pg.state = pg.state & ~PG_ACTIVE | PG_STOPPING; pg.state = pg.state & ~PG_ACTIVE | PG_STOPPING;
if (pg.inflight == 0 && !pg.flush_batch) if (pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all PG's unstable writes or wait for it to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
{ {
finish_stop_pg(pg); finish_stop_pg(pg);
} }
@@ -487,6 +506,7 @@ bool osd_t::stop_pg(pg_t & pg)
void osd_t::finish_stop_pg(pg_t & pg) void osd_t::finish_stop_pg(pg_t & pg)
{ {
pg.state = PG_OFFLINE; pg.state = PG_OFFLINE;
reset_pg(pg);
report_pg_state(pg); report_pg_state(pg);
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <unordered_map> #include <unordered_map>
#include "osd_peering_pg.h" #include "osd_peering_pg.h"
@@ -108,7 +108,7 @@ void pg_obj_state_check_t::start_object()
void pg_obj_state_check_t::handle_version() void pg_obj_state_check_t::handle_version()
{ {
if (!target_ver && last_ver != list[list_pos].version && (n_stable > 0 || n_roles >= pg->pg_minsize)) if (!target_ver && last_ver != list[list_pos].version && (n_stable > 0 || n_roles >= pg->pg_data_size))
{ {
// Version is either stable or recoverable // Version is either stable or recoverable
target_ver = last_ver; target_ver = last_ver;
@@ -171,7 +171,7 @@ void pg_obj_state_check_t::handle_version()
void pg_obj_state_check_t::finish_object() void pg_obj_state_check_t::finish_object()
{ {
if (!target_ver && (n_stable > 0 || n_roles >= pg->pg_minsize)) if (!target_ver && (n_stable > 0 || n_roles >= pg->pg_data_size))
{ {
// Version is either stable or recoverable // Version is either stable or recoverable
target_ver = last_ver; target_ver = last_ver;
@@ -233,7 +233,7 @@ void pg_obj_state_check_t::finish_object()
{ {
return; return;
} }
if (!replicated && n_roles < pg->pg_minsize) if (!replicated && n_roles < pg->pg_data_size)
{ {
if (log_level > 1) if (log_level > 1)
{ {

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <map> #include <map>
#include <vector> #include <vector>
@@ -56,6 +56,13 @@ struct obj_piece_id_t
uint64_t osd_num; uint64_t osd_num;
}; };
struct obj_ver_osd_t
{
uint64_t osd_num;
object_id oid;
uint64_t version;
};
struct flush_action_t struct flush_action_t
{ {
bool rollback = false, make_stable = false; bool rollback = false, make_stable = false;
@@ -75,7 +82,7 @@ struct pg_t
{ {
int state = 0; int state = 0;
uint64_t scheme = 0; uint64_t scheme = 0;
uint64_t pg_cursize = 0, pg_size = 0, pg_minsize = 0, parity_chunks = 0; uint64_t pg_cursize = 0, pg_size = 0, pg_minsize = 0, pg_data_size = 0;
pool_id_t pool_id = 0; pool_id_t pool_id = 0;
pg_num_t pg_num = 0; pg_num_t pg_num = 0;
uint64_t clean_count = 0, total_count = 0; uint64_t clean_count = 0, total_count = 0;
@@ -101,6 +108,7 @@ struct pg_t
std::map<pg_osd_set_t, pg_osd_set_state_t> state_dict; std::map<pg_osd_set_t, pg_osd_set_state_t> state_dict;
btree::btree_map<object_id, pg_osd_set_state_t*> incomplete_objects, misplaced_objects, degraded_objects; btree::btree_map<object_id, pg_osd_set_state_t*> incomplete_objects, misplaced_objects, degraded_objects;
std::map<obj_piece_id_t, flush_action_t> flush_actions; std::map<obj_piece_id_t, flush_action_t> flush_actions;
std::vector<obj_ver_osd_t> copies_to_delete_after_sync;
btree::btree_map<object_id, uint64_t> ver_override; btree::btree_map<object_id, uint64_t> ver_override;
pg_peering_state_t *peering_state = NULL; pg_peering_state_t *peering_state = NULL;
pg_flush_batch_t *flush_batch = NULL; pg_flush_batch_t *flush_batch = NULL;

View File

@@ -1,8 +1,9 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#define _LARGEFILE64_SOURCE #define _LARGEFILE64_SOURCE
#include "malloc_or_die.h"
#include "osd_peering_pg.h" #include "osd_peering_pg.h"
#define STRIPE_SHIFT 12 #define STRIPE_SHIFT 12

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h" #include "osd_primary.h"
@@ -103,7 +103,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
if (op_data->st == 1) goto resume_1; if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2; else if (op_data->st == 2) goto resume_2;
{ {
auto & pg = pgs[{ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num }]; auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
for (int role = 0; role < op_data->pg_data_size; role++) for (int role = 0; role < op_data->pg_data_size; role++)
{ {
op_data->stripes[role].read_start = op_data->stripes[role].req_start; op_data->stripes[role].read_start = op_data->stripes[role].req_start;
@@ -211,7 +211,7 @@ void osd_t::continue_primary_write(osd_op_t *cur_op)
return; return;
} }
osd_primary_op_data_t *op_data = cur_op->op_data; osd_primary_op_data_t *op_data = cur_op->op_data;
auto & pg = pgs[{ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num }]; auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
if (op_data->st == 1) goto resume_1; if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2; else if (op_data->st == 2) goto resume_2;
else if (op_data->st == 3) goto resume_3; else if (op_data->st == 3) goto resume_3;
@@ -365,9 +365,34 @@ resume_7:
recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start; recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
} }
} }
if (op_data->object_state->state & OBJ_MISPLACED) // Any kind of a non-clean object can have extra chunks, because we don't record objects
// as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
if (immediate_commit != IMMEDIATE_ALL)
{
// We can't remove extra chunks yet if fsyncs are explicit, because
// new copies may not be committed to stable storage yet
// We can only remove extra chunks after a successful SYNC for this PG
for (auto & chunk: op_data->object_state->osd_set)
{
// Check is the same as in submit_primary_del_subops()
if (op_data->scheme == POOL_SCHEME_REPLICATED
? !contains_osd(pg.cur_set.data(), pg.pg_size, chunk.osd_num)
: (chunk.osd_num != pg.cur_set[chunk.role]))
{
pg.copies_to_delete_after_sync.push_back((obj_ver_osd_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | (op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role),
},
.version = op_data->fact_ver,
});
copies_to_delete_after_sync_count++;
}
}
}
else
{ {
// Remove extra chunks
submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set); submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set);
if (op_data->n_subops > 0) if (op_data->n_subops > 0)
{ {
@@ -391,19 +416,19 @@ continue_others:
// Remove version override // Remove version override
pg.ver_override.erase(op_data->oid); pg.ver_override.erase(op_data->oid);
object_id oid = op_data->oid; object_id oid = op_data->oid;
// Remove the operation from queue before calling finish_op so it doesn't see the completed operation in queue
auto next_it = pg.write_queue.find(oid);
if (next_it != pg.write_queue.end() && next_it->second == cur_op)
{
pg.write_queue.erase(next_it++);
}
// finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :)
finish_op(cur_op, cur_op->reply.hdr.retval); finish_op(cur_op, cur_op->reply.hdr.retval);
// Continue other write operations to the same object // Continue other write operations to the same object
auto next_it = pg.write_queue.find(oid); if (next_it != pg.write_queue.end() && next_it->first == oid)
auto this_it = next_it;
if (this_it != pg.write_queue.end() && this_it->second == cur_op)
{ {
next_it++; osd_op_t *next_op = next_it->second;
pg.write_queue.erase(this_it); continue_primary_write(next_op);
if (next_it != pg.write_queue.end() && next_it->first == oid)
{
osd_op_t *next_op = next_it->second;
continue_primary_write(next_op);
}
} }
} }
@@ -513,6 +538,8 @@ void osd_t::continue_primary_sync(osd_op_t *cur_op)
else if (op_data->st == 4) goto resume_4; else if (op_data->st == 4) goto resume_4;
else if (op_data->st == 5) goto resume_5; else if (op_data->st == 5) goto resume_5;
else if (op_data->st == 6) goto resume_6; else if (op_data->st == 6) goto resume_6;
else if (op_data->st == 7) goto resume_7;
else if (op_data->st == 8) goto resume_8;
assert(op_data->st == 0); assert(op_data->st == 0);
if (syncs_in_progress.size() > 0) if (syncs_in_progress.size() > 0)
{ {
@@ -574,15 +601,38 @@ resume_2:
this->unstable_writes.clear(); this->unstable_writes.clear();
} }
{ {
void *dirty_buf = malloc_or_die(sizeof(pool_pg_num_t)*dirty_pgs.size() + sizeof(osd_num_t)*dirty_osds.size()); void *dirty_buf = malloc_or_die(
sizeof(pool_pg_num_t)*dirty_pgs.size() +
sizeof(osd_num_t)*dirty_osds.size() +
sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
);
op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf; op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size()); op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
op_data->dirty_pg_count = dirty_pgs.size(); op_data->dirty_pg_count = dirty_pgs.size();
op_data->dirty_osd_count = dirty_osds.size(); op_data->dirty_osd_count = dirty_osds.size();
if (this->copies_to_delete_after_sync_count)
{
op_data->copies_to_delete_count = 0;
op_data->copies_to_delete = (obj_ver_osd_t*)(op_data->dirty_osds + op_data->dirty_osd_count);
for (auto dirty_pg_num: dirty_pgs)
{
auto & pg = pgs.at(dirty_pg_num);
assert(pg.copies_to_delete_after_sync.size() <= this->copies_to_delete_after_sync_count);
memcpy(
op_data->copies_to_delete + op_data->copies_to_delete_count,
pg.copies_to_delete_after_sync.data(),
sizeof(obj_ver_osd_t)*pg.copies_to_delete_after_sync.size()
);
op_data->copies_to_delete_count += pg.copies_to_delete_after_sync.size();
this->copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
pg.copies_to_delete_after_sync.clear();
}
assert(this->copies_to_delete_after_sync_count == 0);
}
int dpg = 0; int dpg = 0;
for (auto dirty_pg_num: dirty_pgs) for (auto dirty_pg_num: dirty_pgs)
{ {
pgs[dirty_pg_num].inflight++; pgs.at(dirty_pg_num).inflight++;
op_data->dirty_pgs[dpg++] = dirty_pg_num; op_data->dirty_pgs[dpg++] = dirty_pg_num;
} }
dirty_pgs.clear(); dirty_pgs.clear();
@@ -639,7 +689,7 @@ resume_6:
.pool_id = INODE_POOL(w.oid.inode), .pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size), .pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
}; };
if (pgs[wpg].state & PG_ACTIVE) if (pgs.at(wpg).state & PG_ACTIVE)
{ {
uint64_t & dest = this->unstable_writes[(osd_object_id_t){ uint64_t & dest = this->unstable_writes[(osd_object_id_t){
.osd_num = unstable_osd.osd_num, .osd_num = unstable_osd.osd_num,
@@ -651,12 +701,44 @@ resume_6:
} }
} }
} }
if (op_data->copies_to_delete)
{
// Return 'copies to delete' back into respective PGs
for (int i = 0; i < op_data->copies_to_delete_count; i++)
{
auto & w = op_data->copies_to_delete[i];
auto & pg = pgs.at((pool_pg_num_t){
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
});
if (pg.state & PG_ACTIVE)
{
pg.copies_to_delete_after_sync.push_back(w);
copies_to_delete_after_sync_count++;
}
}
}
}
else if (op_data->copies_to_delete)
{
// Actually delete copies which we wanted to delete
submit_primary_del_batch(cur_op, op_data->copies_to_delete, op_data->copies_to_delete_count);
resume_7:
op_data->st = 7;
return;
resume_8:
if (op_data->errors > 0)
{
goto resume_6;
}
} }
for (int i = 0; i < op_data->dirty_pg_count; i++) for (int i = 0; i < op_data->dirty_pg_count; i++)
{ {
auto & pg = pgs.at(op_data->dirty_pgs[i]); auto & pg = pgs.at(op_data->dirty_pgs[i]);
pg.inflight--; pg.inflight--;
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch) if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all PG's unstable writes or wait for it to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
{ {
finish_stop_pg(pg); finish_stop_pg(pg);
} }
@@ -750,7 +832,7 @@ void osd_t::continue_primary_del(osd_op_t *cur_op)
return; return;
} }
osd_primary_op_data_t *op_data = cur_op->op_data; osd_primary_op_data_t *op_data = cur_op->op_data;
auto & pg = pgs[{ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num }]; auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
if (op_data->st == 1) goto resume_1; if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2; else if (op_data->st == 2) goto resume_2;
else if (op_data->st == 3) goto resume_3; else if (op_data->st == 3) goto resume_3;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once
@@ -38,4 +38,8 @@ struct osd_primary_op_data_t
osd_num_t *dirty_osds = NULL; osd_num_t *dirty_osds = NULL;
int dirty_osd_count = 0; int dirty_osd_count = 0;
obj_ver_id *unstable_writes = NULL; obj_ver_id *unstable_writes = NULL;
obj_ver_osd_t *copies_to_delete = NULL;
int copies_to_delete_count = 0;
}; };
bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h" #include "osd_primary.h"
@@ -40,10 +40,12 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
{ {
if (cur_op->op_data->pg_num > 0) if (cur_op->op_data->pg_num > 0)
{ {
auto & pg = pgs[{ .pool_id = INODE_POOL(cur_op->op_data->oid.inode), .pg_num = cur_op->op_data->pg_num }]; auto & pg = pgs.at({ .pool_id = INODE_POOL(cur_op->op_data->oid.inode), .pg_num = cur_op->op_data->pg_num });
pg.inflight--; pg.inflight--;
assert(pg.inflight >= 0); assert(pg.inflight >= 0);
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch) if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all PG's unstable writes or wait for it to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
{ {
finish_stop_pg(pg); finish_stop_pg(pg);
} }
@@ -353,7 +355,7 @@ void osd_t::cancel_primary_write(osd_op_t *cur_op)
} }
} }
static bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num) bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num)
{ {
for (uint64_t i = 0; i < size; i++) for (uint64_t i = 0; i < size; i++)
{ {
@@ -369,78 +371,82 @@ void osd_t::submit_primary_del_subops(osd_op_t *cur_op, osd_num_t *cur_set, uint
{ {
osd_primary_op_data_t *op_data = cur_op->op_data; osd_primary_op_data_t *op_data = cur_op->op_data;
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED; bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
int extra_chunks = 0; obj_ver_osd_t extra_chunks[loc_set.size()];
// ordered comparison for EC/XOR, unordered for replicated pools int chunks_to_del = 0;
for (auto & chunk: loc_set) for (auto & chunk: loc_set)
{ {
if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role])) // ordered comparison for EC/XOR, unordered for replicated pools
if (!cur_set || (rep
? !contains_osd(cur_set, set_size, chunk.osd_num)
: (chunk.osd_num != cur_set[chunk.role])))
{ {
extra_chunks++; extra_chunks[chunks_to_del++] = (obj_ver_osd_t){
.osd_num = chunk.osd_num,
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | (rep ? 0 : chunk.role),
},
// Same version as write
.version = op_data->fact_ver,
};
} }
} }
op_data->n_subops = extra_chunks; submit_primary_del_batch(cur_op, extra_chunks, chunks_to_del);
}
void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
op_data->n_subops = chunks_to_delete_count;
op_data->done = op_data->errors = 0; op_data->done = op_data->errors = 0;
if (!extra_chunks) if (!op_data->n_subops)
{ {
return; return;
} }
osd_op_t *subops = new osd_op_t[extra_chunks]; osd_op_t *subops = new osd_op_t[chunks_to_delete_count];
op_data->subops = subops; op_data->subops = subops;
int i = 0; for (int i = 0; i < chunks_to_delete_count; i++)
for (auto & chunk: loc_set)
{ {
if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role])) auto & chunk = chunks_to_delete[i];
if (chunk.osd_num == this->osd_num)
{ {
int stripe_num = op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role; clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
if (chunk.osd_num == this->osd_num) subops[i].op_type = (uint64_t)cur_op;
{ subops[i].bs_op = new blockstore_op_t({
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin); .opcode = BS_OP_DELETE,
subops[i].op_type = (uint64_t)cur_op; .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
subops[i].bs_op = new blockstore_op_t({
.opcode = BS_OP_DELETE,
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
// Same version as write
.version = op_data->fact_ver,
});
bs->enqueue_op(subops[i].bs_op);
}
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
subops[i].req.sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_DELETE,
},
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
// Same version as write
.version = op_data->fact_ver,
};
subops[i].callback = [cur_op, this](osd_op_t *subop)
{ {
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1; handle_primary_bs_subop(subop);
handle_primary_subop(subop, cur_op); },
if (fail_fd >= 0) .oid = chunk.oid,
{ .version = chunk.version,
// delete operation failed, drop the connection });
c_cli.stop_client(fail_fd); bs->enqueue_op(subops[i].bs_op);
} }
}; else
c_cli.outbox_push(&subops[i]); {
} subops[i].op_type = OSD_OP_OUT;
i++; subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
subops[i].req.sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_DELETE,
},
.oid = chunk.oid,
.version = chunk.version,
};
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// delete operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
} }
} }
} }

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <stdexcept> #include <stdexcept>
#include <string.h> #include <string.h>
@@ -11,7 +11,7 @@
#include "osd_rmw.h" #include "osd_rmw.h"
#include "malloc_or_die.h" #include "malloc_or_die.h"
#define OSD_JERASURE_W 32 #define OSD_JERASURE_W 8
static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & stripe) static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & stripe)
{ {

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#pragma once #pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#define RMW_DEBUG #define RMW_DEBUG

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include "osd.h" #include "osd.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
#include <sys/types.h> #include <sys/types.h>
#include <sys/socket.h> #include <sys/socket.h>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details) // License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "pg_states.h" #include "pg_states.h"

Some files were not shown because too many files have changed in this diff Show More