Compare commits

...

173 Commits

Author SHA1 Message Date
7e6e1a5a82 Release 0.5.10
The version seems to be stable after this bunch of fixes :)

- Fix delete & write operation ordering during rebalance to not lose objects in the immediate_commit=off mode
- Fix a possible crash caused by very high iodepths
- Re-distribute PG primaries over OSDs that come up after a short downtime
- Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio
- Fix a journal flushing deadlock which sometimes occurred in the immediate_commit=off mode
- Fix a bug where OSDs could hang if the data device filled up
- Fix an allocator bug where it was unable to allocate up to last (n%64) data device blocks
- Fix monitor crash that occurred on removal of some etcd keys
- Fix a bug where PGs could remain incomplete due to incorrect PG history with just zeroes in osd_sets
2021-03-16 12:48:26 +03:00
435045751d Delete objects only after a SYNC during rebalance in the non-immediate_commit mode
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects
2021-03-16 12:48:26 +03:00
c5fb1d5987 Do not duplicate blockstore operations when io_uring fills up
This bug was leading to OSDs dying with "Assertion `fulfilled == read_op->len' failed"
when testing fio -rw=randread -numjobs=8 -iodepth=128
2021-03-16 12:48:26 +03:00
9f59381bea Re-distribute PG primaries over OSDs that come up after a short downtime 2021-03-16 12:48:26 +03:00
9ac7e75178 Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio 2021-03-16 12:48:26 +03:00
88671cf745 Fix a bug causing all flushers to wait for an fsync without actually trying to do it
This happened because flusher_count became dynamic and fsync_batch() was comparing the number
of flushers currently ready to do an fsync with the maximum number of flushers. Also the number
wasn't rechecked on every loop which was also incorrect.

Now the interrupted_rebalance test passes even without IMMEDIATE_COMMIT=1.
2021-03-13 17:27:29 +03:00
fe1749c427 Fix the multiple_interrupted_rebalance test 2021-03-13 17:19:45 +03:00
ceb9c28de7 Set default log_level before passing config to etcd_state_client 2021-03-13 17:19:45 +03:00
299d7d7c95 Use common macro for get_sqe 2021-03-13 17:19:45 +03:00
d1526b415f Correctly resume writes when OSD is full to return an error 2021-03-13 17:19:45 +03:00
f49fd53d55 Fix a bug where allocator was unable to allocate up to last (n%64) blocks, add tests for it 2021-03-13 02:19:02 +03:00
dd76eda5e5 Test multiple interrupted rebalancings
Currently only passes with immediate_commit=all configuration
(env variable IMMEDIATE_COMMIT=1 for the bash script)
2021-03-12 12:55:44 +03:00
87dbd8fa57 Use empty hash as the default value for some etcd keys in the monitor 2021-03-12 12:40:15 +03:00
b44f49aab2 Ignore zero OSDs in history osd_sets 2021-03-12 12:40:15 +03:00
036555638e Release 0.5.9
- Fix two monitor bugs which led to objects being "logically lost" (physically
  present on some secondary OSDs while primary doesn't know about it) after multiple
  interrupted rebalancings
- Implement "no_recovery" and "no_rebalance" flags
2021-03-11 00:39:10 +03:00
af5155fcd9 Implement "no_recovery" and "no_rebalance" flags 2021-03-11 00:36:31 +03:00
0d2efbecc9 Preserve previous PG history when changing PG distribution
Fixes incorrect PG history in case when a new rebalance is started
before the finish of the previous one which could make primary OSDs unable
to locate some objects on some secondaries.
2021-03-11 00:16:10 +03:00
e62e8b6bae Use real pg configuration instead of the "last clean" one for generating PG history
Basically fixes the bug introduced in 0.5.7 where an rebalance interrupted
by the monitor could result in forgetting objects moved to the new place
2021-03-10 02:01:44 +03:00
c4ba24c305 Do not print ping op latency 2021-03-10 02:01:44 +03:00
19e47a0279 Release 0.5.8
- Add heartbeats (fixes failover in case of network issues or offline nodes)
- Fix a bug where a PG could incorrectly become listed as 'incomplete' if historical osd_sets
  included a set with the the PG's primary OSD as the only alive one
- Use osd_out_time = 10 minutes by default instead of 30 minutes
- Make monitors stick to a single selected etcd URL on start and not try to select random ones
  on every request - this was leading to etcd interaction errors when some etcds were unavailable
2020-03-09 02:38:17 +03:00
bd178ac20f Fix history osd_set check - local OSD is always available! 2021-03-09 02:18:18 +03:00
7006875a24 Make monitor stick to one etcd until the restart 2021-03-09 02:15:38 +03:00
ad577c4aac Add PING operation and timeouts to detect OSD failures when a host goes down 2021-03-09 02:15:38 +03:00
836635c518 Use osd_out_time = 10 minutes by default 2021-03-09 02:15:38 +03:00
88a03f4e98 Release 0.5.7
- Fix multiple bugs leading to OSDs sometimes being unable to correctly activate PGs
  when a lot of PG peering events occurred in a small amount of time
- Fix a bug where OSDs could list incomplete object versions during peering. The bug
  manifested with "local rollback operation failed" messages in OSD logs
- Fix a bug where misplaced chunks for degraded and incomplete objects were not removed
  from extra OSDs during recovery
- Fix incorrect PG history configuration resulting in OSDs being unable to find some
  of the objects after a PG count change
- Simplify block layer write ordering logic
- Avoid extra data move when a lot of OSDs are first stopped for long time and then restarted
- Fix incorrect degraded & misplaced object statistics after a completed rebalance
- Fix incorrect usage of pg_minsize instead of the minimal possible object chunk count in EC pools
2021-03-08 23:37:02 +03:00
2a5036669d Fix PG count change procedure
In previous versions PG histories were calculated incorrectly during
PG count change which led to objects being lost on OSDs not in PG's osd set.
2021-03-08 23:15:58 +03:00
2e0c853180 Make test_change_pg_count check if any objects are lost during the test 2021-03-08 23:15:07 +03:00
e91ff2a9ec Only forget offline PGs if their state is not changed during reporting 2021-03-08 17:04:10 +03:00
086667f568 Do not check PG state key ownership if it doesn't exist yet
This fixes the bug where OSDs were sometimes trying to report updated PG states
infinitely without luck when PGs transitioned from 'starting' to 'peering' too fast
2021-03-08 17:04:10 +03:00
73ce20e246 Add a test for the "reappear after move" case 2021-03-08 17:04:10 +03:00
1be94da437 Check & remove extra chunks for degraded / incomplete objects, too 2021-03-08 17:04:10 +03:00
80e12358a2 Use pg_data_size instead of pg_minsize for object state calculation 2021-03-08 17:04:10 +03:00
36c935ace6 Use std::vector for the blockstore submission queue 2021-03-08 17:04:10 +03:00
0d8b5e2ef9 Remove unused enqueue_op_first() 2021-03-08 17:04:10 +03:00
98f1e2c277 Rework write/sync ordering
Make syncs wait for all previous writes because it's the only way
to make sure that OSDs do not receive incomplete writes in LIST results
during peering when some writes are still in progress.

Also simplify blockstore submission queue logic.
2021-03-08 17:04:10 +03:00
21e7686037 Fix possible "assertion failed: pg.inflight >= 0" error during PG stop 2021-03-08 17:04:10 +03:00
ab21a1908b Check for the dirty PG flag when trying to continue to stop it after sync 2021-03-08 17:04:10 +03:00
30d1ccd43e Fix an infinite loop when discarding list operations during stop_pg() 2021-03-08 17:04:10 +03:00
8bdd6d8d78 Reset PG state when stopping them 2021-03-08 17:04:10 +03:00
09b3e4e789 Fix OSDs being unable to stop PGs that are 'peering', not 'active'
This was sometimes leading to incorrect misplaced and degraded object count statistics
2021-03-08 17:04:10 +03:00
07912fd670 Use history/last_clean_pgs to avoid extra data move when observing a series of changes in the cluster 2021-03-08 17:04:10 +03:00
bc742ccf8c Fix a small memory leak in etcd_state_client 2021-03-08 17:04:10 +03:00
314b20437b Do not break subsequent small writes badly when a big write is canceled 2021-03-08 17:04:10 +03:00
29bac892ad Add .gitignore 2021-03-08 17:04:10 +03:00
cf7547faf3 Fix *.sh build scripts 2021-03-02 02:17:11 +03:00
ab90ed747f Release 0.5.6
- Fix operation statistics
- Fix a rebalance hang introduced in 0.5.5
- Test PG count changes with actual data moving
- Fix a possible 'unexpected pg state: 0' error during PG count change
2021-03-01 16:26:04 +03:00
29d8ac8b1b Do not report statistics for the empty operation 2021-03-01 16:20:57 +03:00
97795ea1b1 Use pg_minsize=2 in the pg_count change test
Also don't check for has_degraded because it's not a bug that objects
are _temporarily_ listed as degraded during PG peering as it's not
required for the new primary to connect to _all_ older peers to start
peering. The test may be improved in the future by temporarily disabling
degraded recovery during it and returning the has_degraded check back.
2021-03-01 16:18:08 +03:00
24e7075f08 Fix monitor's statistics aggregation 2021-02-28 19:51:16 +03:00
6155b23a7e Replace pgs[id] with pgs.at(id) to prevent accidental auto-vivification 2021-02-28 19:36:59 +03:00
7d49706c07 Improve the pg_count change test: add more OSDs and actually move data between them 2021-02-28 19:36:59 +03:00
46e79f3306 Wait for PGs to become clean before stopping them 2021-02-28 19:36:59 +03:00
41fd14e024 Fix deletes not increasing write_iodepth 2021-02-28 19:36:59 +03:00
bb2d9a3afe Release 0.5.5
- Transition to CMake build system
- Fix Monitor being unable to change PG sizes
- Fix PG optimizer not using some OSDs in some cases
- Fix inability to change PG count online
- Improve journal flusher performance
- Add a little better systemd unit generator
- Use w=8 with jerasure (breaking change for EC pools)
2021-02-26 01:59:18 +03:00
e899ed2c25 Make OSDs with 256 flushers (as they are now dynamic) 2021-02-26 01:59:18 +03:00
e21b14b72c Fix rpm specs for building with CMake 2021-02-26 01:59:18 +03:00
5af8eddaa9 Add the remaining build script for Debian 2021-02-26 01:59:18 +03:00
4f5a94c07a Modify instructions for the CMake build 2021-02-26 00:28:57 +03:00
e16b87ecc8 Rename random_combinations() parameter from "unordered" to "ordered" as it's more correct 2021-02-25 23:59:34 +03:00
fcb4aa0a11 Fix Monitor being unable to change PG sizes 2021-02-25 23:59:34 +03:00
12adfa470c Add a test for changing PG size 2021-02-25 23:59:33 +03:00
7f15e0c084 Add a simple test for the PG optimizer 2021-02-25 23:59:33 +03:00
08d4bef419 Fix PG optimizer removing PGs without adding new ones
This happened when the distribution was already valid for the current OSD tree,
but didn't use all OSDs. For example, OSDs 1 2 3 and all PGs equal to [ 1, 2 ]
remained unchanged.
2021-02-25 23:59:33 +03:00
2d73b19a6c Fix online PG count change bugs 2021-02-25 23:59:33 +03:00
69c87009e9 Add a test for changing PG count 2021-02-25 23:59:33 +03:00
c974cb539c Make flusher_count adaptive and limit write iodepth 2021-02-25 23:59:33 +03:00
00e98f64f3 A little better systemd unit generator 2021-02-25 23:59:33 +03:00
91a70dfb1b Add a test for the no_same_sector_overwrites mode 2021-02-25 23:59:33 +03:00
178388ac8c Use packages/ subdir instead of build/ for Docker package builds 2021-02-25 23:59:04 +03:00
bf9a175efc Move C/C++ sources to src subdirectory 2021-02-25 23:59:03 +03:00
08aed962de Use CMake 2021-02-25 23:58:08 +03:00
8c65e890b9 Slightly clean up the build script 2021-02-25 23:56:54 +03:00
8cda70b889 Allow to enable AddressSanitizer with "ASAN=1 make" 2021-02-25 23:55:33 +03:00
61ab22403a Use w=8 with jerasure 2021-02-25 23:55:33 +03:00
16da663a66 Add another test for failure domains 2021-02-25 23:55:33 +03:00
4a2dcf7b6b Update the license to VNPL 1.1
VNPL 1.1 is slightly reworded to make it clear that proprietary software
interacting with Vitastor and providing some kind of service to end users isn't
a "Proxy Program" if it's not specially designed to be used with Vitastor.

For example, Windows OS running in a virtual machine stored in a Vitastor
cluster clearly isn't.
2021-02-25 23:55:33 +03:00
8d48cc56b0 Generate randomly permutated OSD combinations when optimizing for compressed chunks 2021-02-25 23:55:33 +03:00
9f58f01425 Mirror afr.js from /vitalif/ceph-afr-calc 2021-02-25 23:55:33 +03:00
b9e7d31aa1 Release v0.5.4
- Fix a rare hang, more or less reproducible with very slow drives
- Fix a hang with the no_same_sector_overwrites mode
2021-02-24 01:40:30 +03:00
2d9f09dcb6 Attempt forced trim when stopping an overrun flusher
Fixes a rare hang happening in the event of journal space running out without
new work to do for flushers except the current sector.
The hang could be reproduced more or less consistently with very slow drives.
2021-02-24 01:33:01 +03:00
7cc59260c5 Fix no_same_sector_overwrites related bug 2021-02-23 18:50:51 +03:00
ca0a11ec85 Release 0.5.3 2021-02-03 00:38:57 +03:00
51c0b5afee Whitelist more leaks 2021-02-02 02:05:41 +03:00
e1e01d042e Rename sector_info.usage_count to flush_count 2021-02-02 01:32:23 +03:00
534a4a657e Rename space_check.sectors_required to sectors_to_write 2021-02-02 01:30:23 +03:00
9b5d8b9ad4 Fix multiple-sector journal writes, add assertions to not miss any SQEs 2021-02-02 01:29:11 +03:00
e66ed47515 Clear SQEs before returning them to the caller to prevent erroneous double submissions 2021-02-02 01:26:54 +03:00
036c6d4c42 Add a simple test case 2021-02-01 19:43:10 +03:00
4cb79a3bf8 Allow to calculate simple-offsets for files 2021-02-01 19:43:10 +03:00
3bf53754c2 Fix several I/O bugs 2021-02-01 19:43:10 +03:00
6023cac361 Do not stop clients before they are connected 2021-02-01 19:31:10 +03:00
915d04c446 Allow empty global configuration, report OSD statistics faster 2021-02-01 19:31:10 +03:00
21e06ea40d Fix memory leaks in fio engines 2021-02-01 19:31:10 +03:00
9ef7f865b0 Fix incorrect calls to prepare_journal_sector_write() when flushing multiple sectors 2021-02-01 19:31:10 +03:00
9dd20a31aa Do not use pg_minsize in the client code! 2021-02-01 19:31:10 +03:00
28be049909 Dump only actual part of the journal by default 2021-01-01 23:04:30 +03:00
78fbaacf1f External jerasure's w into defines
In fact, w=8 looks better than w=32, so it may be changed in the future
2020-12-31 19:15:22 +03:00
1526c5a213 Add lp_solve into dependencies 2020-12-31 01:32:31 +03:00
c7cc414c90 Skip removed descriptors in epoll (this is possible in real clusters) 2020-12-30 17:04:18 +03:00
f4ea313707 Fix cl->read_op being freed without calling the completion callback 2020-12-30 16:55:54 +03:00
b88b76f316 Parallel usage of multiple network interfaces was a sick fantasy 2020-12-30 00:05:17 +03:00
4a17a61d1f Make rm_inode work with incomplete and degraded objects, allow to wait before deleting objects 2020-12-28 16:38:08 +03:00
ccabbbfbcb For reference: include a spec patch for building QEMU 4.2 or CentOS 7 2020-12-06 15:43:38 +03:00
26dac57083 State that jerasure is now supported 2020-12-06 15:25:48 +03:00
44a53d8352 Huh. Fix rpath for packages 2020-12-05 20:16:39 +03:00
9d80bd2d98 Build with jerasure, split some build scripts 2020-12-05 19:02:23 +03:00
322a38a144 Fix non-preserved real_pg_count leading to inability to change pools online 2020-12-04 23:46:48 +03:00
1018764c91 Fix write->delete->write bugs, add & fix some debugging output 2020-12-04 23:21:58 +03:00
a45e0e5e67 Use custom decoding instead of just jerasure_matrix_decode()
- Cache the decoding matrix
- Don't do unnecessary erasures->erased conversion during decoding
- Avoid extra memory allocations during decoding
- Don't always reconstruct coding chunks
- Reconstruct chunks one-by-one, without overlapping ranges
2020-12-04 17:43:48 +03:00
44656fbf67 Allow writes with low version numbers after a delete 2020-12-04 11:54:41 +03:00
089f138e0c Allow situations where the journal contains a big_write(v1) after delete(v2) and v1 < v2
Fixes a crash in the following scenario:
- client issues a delete request (object version is at least 2)
- OSD has time to flush it to the metadata, but doesn't have time to move the journal start pointer on disk
- client overwrites the same object and it gets the version number 1 again
- OSD is restarted and sees delete(v=2), big_write(v=1) in the journal
- dirty_db sequence gets broken and OSD crashes with assert("Writes and deletes shouldn't happen at the same time")
2020-12-04 11:47:27 +03:00
bcc8e697f9 Delete PGs when deleting pools
(All OSD crash with "Online PG count change not allowed" if you try to delete an active pool though)
2020-12-04 11:47:27 +03:00
a4c46ba745 Add jerasure EC support (reed_sol_van, others are slower) (not tested yet) 2020-12-04 11:47:27 +03:00
5596ad8997 Use custom QEMU build for CentOS 7 2020-12-04 11:47:05 +03:00
59c29b0cee Fix RPATH for CentOS builds, add additional repos into the CentOS installation instructions 2020-12-04 11:47:04 +03:00
959089b919 Enable progress_notify=true for etcd watches 2020-11-17 16:29:42 +03:00
d3e7749616 Final fixes for packaging 2020-11-10 23:33:07 +03:00
b56f8820ec Container packaging for Debian 11 Bullseye, CentOS 7 and CentOS 8 2020-11-10 00:02:53 +03:00
4bd2bd48eb Build Vitastor packages, too 2020-11-09 14:41:39 +03:00
a3fc9f8d7d Add a Dockerfile to build patched QEMU for Debian (Buster) 2020-11-09 02:30:41 +03:00
530975aed7 Make it also build with GCC 8 and on Debian Buster 2020-11-09 00:07:07 +03:00
1446aad107 Simple patch for qemu-kvm .spec 2020-11-08 02:14:53 +03:00
46479e2456 Add RPM build scripts for CentOS 8 2020-11-08 01:55:17 +03:00
e41bee72a5 Lower node.js requirement to 10.x 2020-11-08 01:54:12 +03:00
2e0f223ddb Add RPM build scripts for CentOS 7 2020-11-07 01:52:10 +03:00
3be7bc29d8 Make it build with QEMU 2.0, too
Also begin to work on rpms
2020-11-06 20:05:00 +03:00
0c43ff9daf Add scripts to copy fio and qemu includes to the source package 2020-11-06 18:40:42 +03:00
64d471cf53 Add simple Debian packaging 2020-11-06 18:40:42 +03:00
809b2ad8cd Add install target 2020-11-06 01:12:22 +03:00
550d4af151 Rename test.cpp to test_shit.cpp (random shit) 2020-11-06 01:12:22 +03:00
cf0f23ab8e Add patches for QEMU QAPI IDL 2020-11-04 23:30:51 +03:00
a516fefa8c Add qemu_module_dummy and qemu_stamp_xxx to qemu_driver.c 2020-11-04 23:10:29 +03:00
3b7279b376 Add Ceph EC 2+1 test results 2020-11-01 14:13:35 +03:00
824ea507d0 Do not try to push more segments than IOV_MAX at once as it leads to EMSGSIZE 2020-10-30 01:25:43 +03:00
23ea409081 Fix "can't get SQE, will fall out of sync with EPOLLET" when overflowing the ring
OSDs shouldn't crash or hang with long iodepths anymore
2020-10-30 01:06:36 +03:00
2ccb75974b Fix a rare crash caused by a stopped client still being in write_ready_clients 2020-10-30 01:04:58 +03:00
6561d4e040 Validate pool ID before executing the operation
Reply -EPIPE for non-existing pools because we assume that it means
that pool config isn't loaded yet. Previously OSD crashed on such operations
2020-10-30 01:02:46 +03:00
1eda7f529d Note about Linux 5.8+ 2020-10-28 19:17:22 +03:00
0a174bb313 Return success for already finished rollback operations
There was a FIXME and I actually hit it during tests :)
2020-10-24 18:46:19 +03:00
720985e4c7 Fix NULL rmw buffer after the latest changes and add a testcase for it 2020-10-24 18:29:19 +03:00
4872f617a4 Clear connect timeout in stop_client() to stop races during disconnections 2020-10-24 10:37:16 +03:00
e8ac08be14 Allow to overwrite incomplete objects or parts of objects to recover them 2020-10-24 02:14:41 +03:00
660c2412fb Improve debugging output for incomplete/degraded 2020-10-24 01:28:47 +03:00
faa5e1436f Attempt journal trim even without new flushes
This is the new guaranteed unblocking method which replaces old trims
in init and rollback, and also fixes a possible stall when just several
writes in the beginning of the journal are flushed without triggering
a subsequent trim.
2020-10-24 01:28:47 +03:00
5fbe36198a Fix journal trimming
1) Update journal's used_start in memory only after updating journal superblock.
Doing the opposite is incorrect because part of the journal will be lost if writers
overwrite its old beginning.

2) Sync journal device after updating the superblock.

3) Do not trim in rollback and init because trimming there would also require
updating the superblock. And the only reason to trim in both those places was
to unblock writers. And a guaranteed unblocking method will follow in the next
commit :)
2020-10-24 01:08:33 +03:00
99c45bb5ed Fix debugging output during journal loading 2020-10-24 01:08:33 +03:00
701eb79422 Stabilize writes before deleting extra chunks to not stall peer journals 2020-10-23 22:45:05 +03:00
220bda0667 Fix possible buffer over(under)flow when handling LIST 2020-10-23 02:17:44 +03:00
1e8f0328e0 Cancel outbound operations after re-peering PGs 2020-10-22 22:54:38 +00:00
f011e0c675 Do not block stabilize by list and list by write 2020-10-22 22:13:40 +00:00
1a694c387e Print slow ops in log 2020-10-20 23:41:23 +00:00
738ad5af79 Fix infinite looping in continue_recovery_op() when pg_cancel_write_queue() is called 2020-10-20 22:23:15 +00:00
9abf3c17c9 Correct fix for "Pool %u PG %u configuration is invalid" during startup
Establish watcher connection after loading PGs
2020-10-20 21:09:14 +00:00
d2b901aa09 Fix default auto-created failure domains 2020-10-20 21:07:40 +00:00
befff09370 Fix possible crash due to uninitialized ring_data_t in ringloop 2020-10-20 10:44:38 +03:00
d1645551d4 Implement write batching
Also fix possible race condition which could in theory lead to "command out of sync"
and a buffer overflow that could happen on incorrect server response.
2020-10-20 03:29:17 +03:00
7cb561f95a Add etcd to the example service generator 2020-10-20 01:50:56 +03:00
ae480196e2 Add a note about etcd bug, fix simple-offsets.js cmdline 2020-10-19 17:05:45 +03:00
398c86f943 Improve PG-related log messages 2020-10-18 12:17:22 +00:00
bec5f921a6 Fix buffer overflows in the no_same_sector_overwrites mode 2020-10-17 23:30:16 +00:00
5335c8de8e Do not use unordered_map for list_ops/list_results 2020-10-17 23:30:16 +00:00
c696a82083 Replace assert with if + error message (may happen on metadata corruption) 2020-10-17 23:30:16 +00:00
900171586b XOR 2+1 test results 2020-10-17 14:58:08 +03:00
70612e5df0 Do not handle change events before loading config 2020-10-17 11:18:39 +00:00
d952c24979 Use timeout in rw callback 2020-10-17 11:00:55 +00:00
776fe954a5 Fix crashes on multiple OSD reconnects
Identify clients by pointers instead of peer_fd as peer may be dropped
and reconnected between callbacks

Yeah maybe I need some Rust, but ... maybe in the future :)
2020-10-17 10:53:04 +00:00
9350656af6 Fix osd tags 2020-10-16 23:28:48 +00:00
ece14a7d65 Hide "Connected with..." client messages by default 2020-10-11 02:22:46 +03:00
be5f314c32 Change notes about gcc requirement to 9+, fio to 3.16+ 2020-10-11 02:00:39 +03:00
15dba96375 Implement inode removal tool. Removes multiple objects from multiple OSDs in parallel 2020-10-10 01:08:19 +03:00
3d05aa9362 Make it build with GCC 10, fio 3.20+ (atomics...) and QEMU 5.1 2020-10-06 02:35:11 +03:00
94efb54feb Implement OSD tags (device classes), fix pool failure_domain configuration 2020-10-04 17:31:50 +03:00
aa2a0ee00f Do not group adjacent stripes by default as it's pointless on SSDs 2020-10-02 10:17:54 +03:00
143 changed files with 6695 additions and 2614 deletions

19
.dockerignore Normal file
View File

@@ -0,0 +1,19 @@
.git
build
packages
mon/node_modules
*.o
*.so
osd
stub_osd
stub_uring_osd
stub_bench
osd_test
dump_journal
nbd_proxy
rm_inode
fio
qemu
rpm/*.Dockerfile
debian/*.Dockerfile
Dockerfile

18
.gitignore vendored Normal file
View File

@@ -0,0 +1,18 @@
*.o
*.so
package-lock.json
fio
qemu
osd
stub_osd
stub_uring_osd
stub_bench
osd_test
osd_peering_pg_test
dump_journal
nbd_proxy
rm_inode
test_allocator
test_blockstore
test_shit
osd_rmw_test

5
CMakeLists.txt Normal file
View File

@@ -0,0 +1,5 @@
cmake_minimum_required(VERSION 2.8)
project(vitastor)
add_subdirectory(src)

View File

@@ -1,46 +0,0 @@
#!/usr/bin/perl
use strict;
my $deps = {};
for my $line (split /\n/, `grep '^#include "' *.cpp *.h`)
{
if ($line =~ /^([^:]+):\#include "([^"]+)"/s)
{
$deps->{$1}->{$2} = 1;
}
}
my $added;
do
{
$added = 0;
for my $file (keys %$deps)
{
for my $dep (keys %{$deps->{$file}})
{
if ($deps->{$dep})
{
for my $subdep (keys %{$deps->{$dep}})
{
if (!$deps->{$file}->{$subdep})
{
$added = 1;
$deps->{$file}->{$subdep} = 1;
}
}
}
}
}
} while ($added);
for my $file (sort keys %$deps)
{
if ($file =~ /\.cpp$/)
{
my $obj = $file;
$obj =~ s/\.cpp$/.o/s;
print "$obj: $file ".join(" ", sort keys %{$deps->{$file}})."\n";
print "\tg++ \$(CXXFLAGS) -c -o \$\@ \$\<\n";
}
}

169
Makefile
View File

@@ -1,169 +0,0 @@
BLOCKSTORE_OBJS := allocator.o blockstore.o blockstore_impl.o blockstore_init.o blockstore_open.o blockstore_journal.o blockstore_read.o \
blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o
# -fsanitize=address
CXXFLAGS := -g -O3 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fPIC -fdiagnostics-color=always
all: libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal qemu_driver.so nbd_proxy
clean:
rm -f *.o
dump_journal: dump_journal.cpp crc32c.o blockstore_journal.h
g++ $(CXXFLAGS) -o $@ $< crc32c.o
libblockstore.so: $(BLOCKSTORE_OBJS)
g++ $(CXXFLAGS) -o $@ -shared $(BLOCKSTORE_OBJS) -ltcmalloc_minimal -luring
libfio_blockstore.so: ./libblockstore.so fio_engine.o json11.o
g++ $(CXXFLAGS) -shared -o $@ fio_engine.o json11.o ./libblockstore.so -ltcmalloc_minimal -luring
OSD_OBJS := osd.o osd_secondary.o msgr_receive.o msgr_send.o osd_peering.o osd_flush.o osd_peering_pg.o \
osd_primary.o osd_primary_subops.o etcd_state_client.o messenger.o osd_cluster.o http_client.o osd_ops.o pg_states.o \
osd_rmw.o json11.o base64.o timerfd_manager.o epoll_manager.o
osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
g++ $(CXXFLAGS) -o $@ osd_main.cpp $(OSD_OBJS) ./libblockstore.so -ltcmalloc_minimal -luring
stub_osd: stub_osd.o rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_osd.o rw_blocking.o -ltcmalloc_minimal
STUB_URING_OSD_OBJS := stub_uring_osd.o epoll_manager.o messenger.o msgr_send.o msgr_receive.o ringloop.o timerfd_manager.o json11.o
stub_uring_osd: $(STUB_URING_OSD_OBJS)
g++ $(CXXFLAGS) -o $@ -ltcmalloc_minimal $(STUB_URING_OSD_OBJS) -luring
stub_bench: stub_bench.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_bench.cpp rw_blocking.o -ltcmalloc_minimal
osd_test: osd_test.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ osd_test.cpp rw_blocking.o -ltcmalloc_minimal
osd_peering_pg_test: osd_peering_pg_test.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o $@ $< osd_peering_pg.o -ltcmalloc_minimal
libfio_sec_osd.so: fio_sec_osd.o rw_blocking.o
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ fio_sec_osd.o rw_blocking.o
FIO_CLUSTER_OBJS := cluster_client.o epoll_manager.o etcd_state_client.o \
messenger.o msgr_send.o msgr_receive.o ringloop.o json11.o http_client.o osd_ops.o pg_states.o timerfd_manager.o base64.o
libfio_cluster.so: fio_cluster.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $< $(FIO_CLUSTER_OBJS) -luring
nbd_proxy: nbd_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -o $@ $< $(FIO_CLUSTER_OBJS) -luring
qemu_driver.o: qemu_driver.c qemu_proxy.h
gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-I qemu/include $(CXXFLAGS) -c -o $@ $<
qemu_driver.so: qemu_driver.o qemu_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $< $(FIO_CLUSTER_OBJS) qemu_driver.o qemu_proxy.o -luring
test_blockstore: ./libblockstore.so test_blockstore.cpp timerfd_interval.o
g++ $(CXXFLAGS) -o test_blockstore test_blockstore.cpp timerfd_interval.o ./libblockstore.so -ltcmalloc_minimal -luring
test: test.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o test test.cpp osd_peering_pg.o -luring -lm
test_allocator: test_allocator.cpp allocator.o
g++ $(CXXFLAGS) -o test_allocator test_allocator.cpp allocator.o
crc32c.o: crc32c.c crc32c.h
g++ $(CXXFLAGS) -c -o $@ $<
json11.o: json11/json11.cpp
g++ $(CXXFLAGS) -c -o json11.o json11/json11.cpp
# Autogenerated
allocator.o: allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
base64.o: base64.cpp base64.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore.o: blockstore.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_flush.o: blockstore_flush.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_impl.o: blockstore_impl.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_init.o: blockstore_init.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_journal.o: blockstore_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_open.o: blockstore_open.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_read.o: blockstore_read.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_rollback.o: blockstore_rollback.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_stable.o: blockstore_stable.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_sync.o: blockstore_sync.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_write.o: blockstore_write.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
cluster_client.o: cluster_client.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
dump_journal.o: dump_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
epoll_manager.o: epoll_manager.cpp epoll_manager.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
etcd_state_client.o: etcd_state_client.cpp base64.h etcd_state_client.h http_client.h json11/json11.hpp object_id.h osd_id.h osd_ops.h pg_states.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_cluster.o: fio_cluster.cpp cluster_client.h epoll_manager.h etcd_state_client.h fio/fio.h fio/optgroup.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_engine.o: fio_engine.cpp blockstore.h fio/fio.h fio/optgroup.h json11/json11.hpp object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_sec_osd.o: fio_sec_osd.cpp fio/fio.h fio/optgroup.h object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
http_client.o: http_client.cpp http_client.h json11/json11.hpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
messenger.o: messenger.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_receive.o: msgr_receive.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_send.o: msgr_send.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
nbd_proxy.o: nbd_proxy.cpp cluster_client.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd.o: osd.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_cluster.o: osd_cluster.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_flush.o: osd_flush.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_main.o: osd_main.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_ops.o: osd_ops.cpp object_id.h osd_id.h osd_ops.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering.o: osd_peering.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg.o: osd_peering_pg.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg_test.o: osd_peering_pg_test.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary.o: osd_primary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary_subops.o: osd_primary_subops.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw.o: osd_rmw.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw_test.o: osd_rmw_test.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.cpp osd_rmw.h test_pattern.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_secondary.o: osd_secondary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_test.o: osd_test.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h test_pattern.h
g++ $(CXXFLAGS) -c -o $@ $<
pg_states.o: pg_states.cpp pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
qemu_proxy.o: qemu_proxy.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h qemu_proxy.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
ringloop.o: ringloop.cpp ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
rw_blocking.o: rw_blocking.cpp rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_bench.o: stub_bench.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_osd.o: stub_osd.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_uring_osd.o: stub_uring_osd.cpp epoll_manager.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
test.o: test.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
test_allocator.o: test_allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
test_blockstore.o: test_blockstore.cpp blockstore.h object_id.h ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_interval.o: timerfd_interval.cpp ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_manager.o: timerfd_manager.cpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<

178
README.md
View File

@@ -16,12 +16,13 @@ breaking changes in the future. However, the following is implemented:
- Basic part: highly-available block storage with symmetric clustering and no SPOF
- Performance ;-D
- Two redundancy schemes: Replication and XOR n+1 (simplest case of EC)
- Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
based on jerasure library with any number of data and parity drives in a group
- Configuration via simple JSON data structures in etcd
- Automatic data distribution over OSDs, with support for:
- Mathematical optimization for better uniformity and less data movement
- Multiple pools
- Placement tree
- Placement tree, OSD selection by tags (device classes) and placement root
- Configurable failure domains
- Recovery of degraded blocks
- Rebalancing (data movement between OSDs)
@@ -31,24 +32,24 @@ breaking changes in the future. However, the following is implemented:
- QEMU driver (built out-of-tree)
- Loadable fio engine for benchmarks (also built out-of-tree)
- NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm)
- Packaging for Debian and CentOS
## Roadmap
- Packaging for Debian and, probably, CentOS too
- OSD creation tool (OSDs currently have to be created by hand)
- Inode deletion tool (currently you can't delete anything :))
- Other administrative tools
- Per-inode I/O and space usage statistics
- jerasure EC support with any number of data and parity drives in a group
- Parallel usage of multiple network interfaces
- Proxmox and OpenNebula plugins
- iSCSI proxy
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Operation timeouts and better failure detection
- Scrubbing without checksums (verification of replicas)
- Checksums
- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
- RDMA and NVDIMM support
- Web GUI
- Compression (possibly)
- Read caching using system page cache (possibly)
@@ -80,7 +81,7 @@ Architectural differences from Ceph:
per drive you should run multiple OSDs each on a different partition of the drive.
Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
- Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
and data store block size which is 128 KB by default. With 128 KB blocks, metadata should occupy
and data store block size which is 128 KB by default. With 128 KB blocks metadata should occupy
around 512 MB per 1 TB (which is still less than Ceph wants). Journal doesn't have to be big,
the example test below was conducted with only 16 MB journal. A big journal is probably even
harmful as dirty write metadata also take some memory.
@@ -152,9 +153,9 @@ I use the following 6 commands with small variations to benchmark any storage:
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Linear read:
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Random write latency (this hurts storages the most):
- Random write latency (T1Q1, this hurts storages the most):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Random read latency:
- Random read latency (T1Q1):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Parallel write iops (use numjobs if a single CPU core is insufficient to saturate the load):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
@@ -205,7 +206,7 @@ Hardware configuration: 4 nodes, each with:
CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.
All of the results below apply to 4 KB blocks.
All of the results below apply to 4 KB blocks and random access (unless indicated otherwise).
Raw drive performance:
- T1Q1 write ~27000 iops (~0.037ms latency)
@@ -232,6 +233,8 @@ Vitastor:
- T1Q1 read: 6838 iops (0.145ms latency)
- T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
- T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
- Linear write (4M T1Q32): 2800 MB/s
- Linear read (4M T1Q32): 1500 MB/s
T8Q64 read test was conducted over 1 larger inode (3.2T) from all hosts (every host was running 2 instances of fio).
Vitastor has no performance penalties related to running multiple clients over a single inode.
@@ -245,6 +248,24 @@ Vitastor was configured with: `--disable_data_fsync true --immediate_commit all
--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
--journal_size 16777216`.
### EC/XOR 2+1
Vitastor:
- T1Q1 write: 2808 iops (~0.355ms latency)
- T1Q1 read: 6190 iops (~0.16ms latency)
- T2Q64 write: 85500 iops, total CPU usage by OSDs about 3.4 virtual cores on each node
- T8Q64 read: 812000 iops, total CPU usage by OSDs about 4.7 virtual cores on each node
- Linear write (4M T1Q32): 3200 MB/s
- Linear read (4M T1Q32): 1800 MB/s
Ceph:
- T1Q1 write: 730 iops (~1.37ms latency)
- T1Q1 read: 1500 iops with cold cache (~0.66ms latency), 2300 iops after 2 minute metadata cache warmup (~0.435ms latency)
- T4Q128 write (4 RBD images): 45300 iops, total CPU usage by OSDs about 30 virtual cores on each node
- T8Q64 read (4 RBD images): 278600 iops, total CPU usage by OSDs about 40 virtual cores on each node
- Linear write (4M T1Q32): 1950 MB/s before preallocation, 2500 MB/s after preallocation
- Linear read (4M T1Q32): 2400 MB/s
### NBD
NBD is currently required to mount Vitastor via kernel, but it imposes additional overhead
@@ -256,19 +277,50 @@ Vitastor with single-thread NBD on the same hardware:
- T1Q1 read: 5518 iops (0.18ms latency)
- T1Q128 write: 94400 iops
- T1Q128 read: 103000 iops
- Linear write (4M T1Q128): 1266 MB/s (compared to 2600 MB/s via fio)
- Linear read (4M T1Q128): 975 MB/s (compared to 1400 MB/s via fio)
- Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
- Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)
## Building
## Installation
- Install Linux kernel 5.4 or newer for io_uring support.
### Debian
- Trust Vitastor package signing key:
`wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
- Add Vitastor package repository to your /etc/apt/sources.list:
- Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
- Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- For Debian 10 (Buster) also enable backports repository:
`deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64`
### CentOS
- Add Vitastor package repository:
- CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
- CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
- Enable EPEL: `yum/dnf install epel-release`
- Enable additional CentOS repositories:
- CentOS 7: `yum install centos-release-scl`
- CentOS 8: `dnf install centos-release-advanced-virtualization`
- Enable elrepo-kernel:
- CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
- CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
- Install packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
### Building from Source
- Install Linux kernel 5.4 or newer, for io_uring support. 5.8 or later is highly recommended because
there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd.
- Install node.js 12 or newer.
- Install gcc and g++ 9.x.
- Install etcd. Attention: you need a fixed version from here: https://github.com/vitalif/etcd/,
branch release-3.4, because there is a bug in upstream etcd which makes Vitastor OSDs fail to
move PGs out of "starting" state if you have at least around ~500 PGs or so. The custom build
will be unnecessary when etcd merges the fix: https://github.com/etcd-io/etcd/pull/12402.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
- Install QEMU 4.x or 5.x, get its source, begin to build it, stop the build and copy headers:
- Install QEMU 3.0+, get its source, begin to build it, stop the build and copy headers:
- `<qemu>/include` &rarr; `<vitastor>/qemu/include`
- Debian:
* Use qemu packages from the main repository
@@ -278,13 +330,14 @@ Vitastor with single-thread NBD on the same hardware:
* Use qemu packages from the Advanced-Virtualization repository. To enable it, run
`yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
* `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
* `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
* For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
* For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` and `qapi` are required because they contain generated headers
- Install fio 3.16, get its source and symlink it into `<vitastor>/fio`. It doesn't currently
build with fio 3.20 or newer due to the conflicts between g++ and gcc's atomics. This will
be fixed in the future.
- Build Vitastor with `make -j8`.
- Copy binaries somewhere.
- You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
See `qemu-*.*-vitastor.patch`.
- Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
- Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
## Running
@@ -295,56 +348,56 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
with lazy fsync, but prepare for inferior single-thread latency.
- Get a fast network (at least 10 Gbit/s).
- Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Install etcd with `--max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision` options.
- Create global configuration in etcd: `etcdctl put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives have capacitors).
- Create pool configuration in etcd: `etcdctl put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
- Calculate offsets for your drives with `node ./mon/simple-offsets.js /dev/sdX`.
- Make systemd units for your OSDs. Look at `./mon/make-units.sh` for example.
Notable configuration variables from the example:
- Check `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh` and
put desired values into the variables at the top of these files.
- Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
- Create systemd units for your OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- You can edit the units and change OSD configuration. Notable configuration variables:
- `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
- `immediate_commit all` - use this if all your drives are server-grade.
- `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
- `flusher_count 16` - flusher is a micro-thread that removes old data from the journal.
More flushers mean more aggressive journal flushing which allows for more throughput
but slightly hurts latency under less load. Flushing will probably be improved in the future
because currently high queue depths sometimes lead to performance degradation.
- `flusher_count 256` - flusher is a micro-thread that removes old data from the journal.
You don't have to worry about this parameter anymore, 256 is enough.
- `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
block size of your SSDs which is 4096 on most drives.
- `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
Some SSDs (like Intel D3-4510) don't like such overwrites so they benefit from this setting.
When this setting is set, it is also required to raise `journal_sector_buffer_count` setting,
which is the number of dirty journal sectors that may be written to at the same time.
Most (99%) SSDs don't need this option. But Intel D3-4510 does because it doesn't like when you
overwrite the same sector twice in a short period of time. The setting forces Vitastor to never
overwrite the same journal sector twice in a row which makes D3-4510 almost happy. Not totally
happy, because overwrites of the same block can still happen in the metadata area... When this
setting is set, it is also required to raise `journal_sector_buffer_count` setting, which is the
number of dirty journal sectors that may be written to at the same time.
- `systemctl start vitastor.target` everywhere.
- Start any number of monitors: `cd mon; node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5`.
- Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives have capacitors).
- Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl get --prefix /vitastor/pg/state`. All PGs should become 'active'.
- Run tests with (for example): `fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
- Run tests with (for example): `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Upload VM disk image with qemu-img (for example):
```
LD_PRELOAD=./qemu_driver.so qemu-img convert -f qcow2 debian10.qcow2 -p
-O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```
Note that the command requires to be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
- Run QEMU with (for example):
```
LD_PRELOAD=./qemu_driver.so qemu-system-x86_64 -enable-kvm -m 1024
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
- Remove inode with (for example):
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
## Known Problems
- OSDs may currently crash with "can't get SQE, will fall out of sync with EPOLLET"
if you try to load them with very long iodepths because io_uring queue (ring) is limited
and OSDs don't check if it fills up.
- Object deletion requests may currently lead to unfound objects on crashes because
proper handling of deletions in a cluster requires a "three-phase cleanup process"
and it's currently not implemented. In fact, even though deletion requests are
implemented, there's no user tool to delete anything from the cluster yet :).
Of course I'll create such tool, but its first implementation will be vulnerable to this issue.
It's not a big deal though, because you'll be able to just repeat the deletion request
in this case.
- Object deletion requests may currently lead to 'incomplete' objects if your OSDs crash during
deletion because proper handling of object cleanup in a cluster should be "three-phase"
and it's currently not implemented. Just to repeat the removal again in this case.
## Implementation Principles
@@ -366,22 +419,27 @@ Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network ("Proxy Programs"). Proxy Programs may be made public
not only under the terms of the same license, but also under the terms of any
GPL-Compatible Free Software License, as listed by the Free Software Foundation.
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.
Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.
Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.
Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.
You can find the full text of VNPL-1.0 in the file [VNPL-1.0.txt](VNPL-1.0.txt).
You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

View File

@@ -1,7 +1,7 @@
VITASTOR NETWORK PUBLIC LICENSE
Version 1, 17 September 2020
Version 1.1, 6 February 2021
Copyright (C) 2020 Vitaliy Filippov <vitalif@yourcmc.ru>
Copyright (C) 2021 Vitaliy Filippov <vitalif@yourcmc.ru>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@@ -540,12 +540,15 @@ License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction.
Notwithstanding any other provision of this License, if you provide
any user an opportunity to interact with the covered work directly
or indirectly through a computer network, an imitation of such network,
or an additional program (hereinafter referred to as a "Proxy Program")
that, in turn, interacts with the covered work through a computer network,
an imitation of such network, or another Proxy Program itself,
A "Proxy Program" means a separate program which is specially designed to
be used in conjunction with the covered work and interacts with it directly
or indirectly through any kind of API (application programming interfaces),
a computer network, an imitation of such network, or another Proxy Program
itself.
Notwithstanding any other provision of this License, if you provide any user
with an opportunity to interact with the covered work through a computer
network, an imitation of such network, or any number of "Proxy Programs",
you must prominently offer that user an opportunity to receive the
Corresponding Source of the covered work and all Proxy Programs from a
network server at no charge, through some standard or customary means of

13
copy-fio-includes.sh Executable file
View File

@@ -0,0 +1,13 @@
#!/bin/bash
gcc -I. -E -o fio_headers.i src/fio_headers.h
rm -rf fio-copy
for i in `grep -Po 'fio/[^"]+' fio_headers.i | sort | uniq`; do
j=${i##fio/}
p=$(dirname $j)
mkdir -p fio-copy/$p
cp $i fio-copy/$j
done
rm fio_headers.i

18
copy-qemu-includes.sh Executable file
View File

@@ -0,0 +1,18 @@
#!/bin/bash
#cd qemu
#debian/rules b/configure-stamp
#cd b/qemu; make qapi
gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-I qemu/include -E -o qemu_driver.i src/qemu_driver.c
rm -rf qemu-copy
for i in `grep -Po 'qemu/[^"]+' qemu_driver.i | sort | uniq`; do
j=${i##qemu/}
p=$(dirname $j)
mkdir -p qemu-copy/$p
cp $i qemu-copy/$j
done
rm qemu_driver.i

7
debian/build-vitastor-bullseye.sh vendored Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
sed 's/$REL/bullseye/g' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

7
debian/build-vitastor-buster.sh vendored Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
sed 's/$REL/buster/g' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

17
debian/changelog vendored Normal file
View File

@@ -0,0 +1,17 @@
vitastor (0.5.10-1) unstable; urgency=medium
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Tue, 02 Feb 2021 23:01:24 +0300
vitastor (0.5.1-1) unstable; urgency=medium
* Add jerasure support
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sat, 05 Dec 2020 17:02:26 +0300
vitastor (0.5-1) unstable; urgency=medium
* First packaging for Debian
-- Vitaliy Filippov <vitalif@yourcmc.ru> Thu, 05 Nov 2020 02:20:59 +0300

1
debian/compat vendored Normal file
View File

@@ -0,0 +1 @@
13

17
debian/control vendored Normal file
View File

@@ -0,0 +1,17 @@
Source: vitastor
Section: admin
Priority: optional
Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru>
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev
Standards-Version: 4.5.0
Homepage: https://vitastor.io/
Rules-Requires-Root: no
Package: vitastor
Architecture: amd64
Depends: ${shlibs:Depends}, ${misc:Depends}, fio (= ${dep:fio}), qemu (= ${dep:qemu}), nodejs (>= 10), node-sprintf-js, node-ws (>= 7), libjerasure2, lp-solve
Description: Vitastor, a fast software-defined clustered block storage
Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
architecturally similar to Ceph which means strong consistency, primary-replication,
symmetric clustering and automatic data distribution over any number of drives of any
size with configurable redundancy (replication or erasure codes/XOR).

21
debian/copyright vendored Normal file
View File

@@ -0,0 +1,21 @@
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: vitastor
Upstream-Contact: Vitaliy Filippov <vitalif@yourcmc.ru>
Source: https://vitastor.io
Files: *
Copyright: 2019+ Vitaliy Filippov <vitalif@yourcmc.ru>
License: Multiple licenses VNPL-1.1 and/or GPL-2.0+
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.
.
Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.

3
debian/install vendored Normal file
View File

@@ -0,0 +1,3 @@
VNPL-1.1.txt usr/share/doc/vitastor
GPL-2.0.txt usr/share/doc/vitastor
mon usr/lib/vitastor

44
debian/patched-qemu.Dockerfile vendored Normal file
View File

@@ -0,0 +1,44 @@
# Build patched QEMU for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/patched-qemu.Dockerfile .
FROM debian:$REL
WORKDIR /root
RUN if [ "$REL" = "buster" ]; then \
echo 'deb http://deb.debian.org/debian buster-backports main' >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
echo 'APT::Install-Suggests false;' >> /etc/apt/apt.conf
RUN apt-get update
RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/
RUN set -e; \
mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \
if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
cp /root/vitastor/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
else \
cp /root/vitastor/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
echo qemu-5.1-vitastor.patch >> $P/series; \
fi; \
cd /root/packages/qemu-$REL/qemu-*/; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor1; \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Plug Vitastor block driver'; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
rm -rf /root/packages/qemu-$REL/qemu-*/

9
debian/rules vendored Executable file
View File

@@ -0,0 +1,9 @@
#!/usr/bin/make -f
export DH_VERBOSE = 1
%:
dh $@
override_dh_installdeb:
cat debian/substvars >> debian/vitastor.substvars
dh_installdeb

1
debian/source/format vendored Normal file
View File

@@ -0,0 +1 @@
3.0 (quilt)

2
debian/substvars vendored Normal file
View File

@@ -0,0 +1,2 @@
dep:fio=3.16-1
dep:qemu=1:5.1+dfsg-4+vitastor1

67
debian/vitastor.Dockerfile vendored Normal file
View File

@@ -0,0 +1,67 @@
# Build Vitastor packages for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
FROM debian:$REL
WORKDIR /root
RUN if [ "$REL" = "buster" ]; then \
echo 'deb http://deb.debian.org/debian buster-backports main' >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
echo 'APT::Install-Suggests false;' >> /etc/apt/apt.conf
RUN apt-get update
RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
RUN apt-get -y install libjerasure-dev cmake
ADD . /root/vitastor
RUN set -e -x; \
mkdir -p /root/fio-build/; \
cd /root/fio-build/; \
rm -rf /root/fio-build/*; \
dpkg-source -x /root/fio*.dsc; \
cd /root/packages/qemu-$REL/; \
rm -rf qemu*/; \
dpkg-source -x qemu*.dsc; \
cd /root/packages/qemu-$REL/qemu*/; \
debian/rules b/configure-stamp; \
cd b/qemu; \
make -j8 qapi/qapi-builtin-types.h; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.5.10; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.10/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.5.10/fio; \
cd vitastor-0.5.10; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \
sh copy-fio-includes.sh; \
rm qemu fio; \
mkdir -p a b debian/patches; \
mv qemu-copy b/qemu; \
mv fio-copy b/fio; \
diff -NaurpbB a b > debian/patches/qemu-fio-headers.patch || true; \
echo qemu-fio-headers.patch >> debian/patches/series; \
rm -rf a b; \
rm -rf /root/packages/qemu-$REL/qemu*/; \
echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.10.orig.tar.xz vitastor-0.5.10; \
cd vitastor-0.5.10; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
rm -rf /root/packages/vitastor-$REL/vitastor-*/

View File

@@ -1,51 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <iostream>
#include <functional>
#include <array>
#include <cstdlib> // for malloc() and free()
using namespace std;
// replace operator new and delete to log allocations
void* operator new(std::size_t n)
{
cout << "Allocating " << n << " bytes" << endl;
return malloc(n);
}
void operator delete(void* p) throw()
{
free(p);
}
class test
{
public:
std::string s;
void a(std::function<void()> & f, const char *str)
{
auto l = [this, str]() { cout << str << " ? " << s << " from this\n"; };
cout << "Assigning lambda3 of size " << sizeof(l) << endl;
f = l;
}
};
int main()
{
std::array<char, 16> arr1;
auto lambda1 = [arr1](){};
cout << "Assigning lambda1 of size " << sizeof(lambda1) << endl;
std::function<void()> f1 = lambda1;
std::array<char, 17> arr2;
auto lambda2 = [arr2](){};
cout << "Assigning lambda2 of size " << sizeof(lambda2) << endl;
std::function<void()> f2 = lambda2;
test t;
std::function<void()> f3;
t.s = "str";
t.a(f3, "huyambda");
f3();
}

View File

@@ -1,22 +1,59 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
module.exports = {
scale_pg_count,
};
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
{
if (!new_pg_history[new_pg])
{
new_pg_history[new_pg] = {
osd_sets: {},
all_peers: {},
epoch: 0,
};
}
const nh = new_pg_history[new_pg], oh = prev_pg_history[old_pg];
nh.osd_sets[prev_pgs[old_pg].join(' ')] = prev_pgs[old_pg];
if (oh && oh.osd_sets && oh.osd_sets.length)
{
for (const pg of oh.osd_sets)
{
nh.osd_sets[pg.join(' ')] = pg;
}
}
if (oh && oh.all_peers && oh.all_peers.length)
{
for (const osd_num of oh.all_peers)
{
nh.all_peers[osd_num] = Number(osd_num);
}
}
if (oh && oh.epoch)
{
nh.epoch = nh.epoch < oh.epoch ? oh.epoch : nh.epoch;
}
}
function finish_pg_history(merged_history)
{
merged_history.osd_sets = Object.values(merged_history.osd_sets);
merged_history.all_peers = Object.values(merged_history.all_peers);
}
function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{
const old_pg_count = prev_pgs.length;
// Add all possibly intersecting PGs into the history of new PGs
// Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count))
{
// New PG count is a multiple of the old PG count
const mul = (new_pg_count / old_pg_count);
// New PG count is a multiple of old PG count
for (let i = 0; i < new_pg_count; i++)
{
const old_i = Math.floor(new_pg_count / mul);
new_pg_history[i] = JSON.parse(JSON.stringify(prev_pg_history[1+old_i]));
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
finish_pg_history(new_pg_history[i]);
}
}
else if (!(old_pg_count % new_pg_count))
@@ -25,68 +62,26 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
const mul = (old_pg_count / new_pg_count);
for (let i = 0; i < new_pg_count; i++)
{
new_pg_history[i] = {
osd_sets: [],
all_peers: [],
epoch: 0,
};
for (let j = 0; j < mul; j++)
{
new_pg_history[i].osd_sets.push(prev_pgs[i*mul]);
const hist = prev_pg_history[1+i*mul+j];
if (hist && hist.osd_sets && hist.osd_sets.length)
{
Array.prototype.push.apply(new_pg_history[i].osd_sets, hist.osd_sets);
}
if (hist && hist.all_peers && hist.all_peers.length)
{
Array.prototype.push.apply(new_pg_history[i].all_peers, hist.all_peers);
}
if (hist && hist.epoch)
{
new_pg_history[i].epoch = new_pg_history[i].epoch < hist.epoch ? hist.epoch : new_pg_history[i].epoch;
}
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
}
finish_pg_history(new_pg_history[i]);
}
}
else
{
// Any PG may intersect with any PG after non-multiple PG count change
// So, merge ALL PGs history
let all_sets = {};
let all_peers = {};
let max_epoch = 0;
for (const pg of prev_pgs)
let merged_history = {};
for (let i = 0; i < old_pg_count; i++)
{
all_sets[pg.join(' ')] = pg;
add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
}
for (const pg in prev_pg_history)
{
const hist = prev_pg_history[pg];
if (hist && hist.osd_sets)
{
for (const pg of hist.osd_sets)
{
all_sets[pg.join(' ')] = pg;
}
}
if (hist && hist.all_peers)
{
for (const osd_num of hist.all_peers)
{
all_peers[osd_num] = Number(osd_num);
}
}
if (hist && hist.epoch)
{
max_epoch = max_epoch < hist.epoch ? hist.epoch : max_epoch;
}
}
all_sets = Object.values(all_sets);
all_peers = Object.values(all_peers);
finish_pg_history(merged_history[1]);
for (let i = 0; i < new_pg_count; i++)
{
new_pg_history[i] = { osd_sets: all_sets, all_peers, epoch: max_epoch };
new_pg_history[i] = { ...merged_history[1] };
}
}
// Mark history keys for removed PGs as removed
@@ -94,19 +89,16 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{
new_pg_history[i] = null;
}
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (old_pg_count < new_pg_count)
{
for (let i = new_pg_count-1; i >= 0; i--)
for (let i = old_pg_count; i < new_pg_count; i++)
{
prev_pgs[i] = prev_pgs[Math.floor(i/new_pg_count*old_pg_count)];
prev_pgs[i] = prev_pgs[i % old_pg_count];
}
}
else if (old_pg_count > new_pg_count)
{
for (let i = 0; i < new_pg_count; i++)
{
prev_pgs[i] = prev_pgs[Math.round(i/new_pg_count*old_pg_count)];
}
prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
}
}

View File

@@ -1,31 +1,16 @@
// Functions to calculate Annualized Failure Rate of your cluster
// if you know AFR of your drives, number of drives, expected rebalance time
// and replication factor
// License: VNPL-1.0 (see README.md for details)
const { sprintf } = require('sprintf-js');
// License: VNPL-1.1 (see https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md for details) or AGPL-3.0
// Author: Vitaliy Filippov, 2020+
module.exports = {
cluster_afr_fullmesh,
failure_rate_fullmesh,
cluster_afr,
print_cluster_afr,
c_n_k,
};
print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
/******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/
// Estimate AFR of the cluster
@@ -56,93 +41,38 @@ function failure_rate_fullmesh(n, a, f)
/******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/
// <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s,
// <k> replicas, <pgs> unique peer PGs per OSD
// <k> replicas, <pgs> unique peer PGs per OSD (~50 for 100 PG-per-OSD in a big cluster)
//
// For each of n*m drives: P(drive fails in a year) * P(any of its peers fail in <l*365> next days).
// More peers per OSD increase rebalance speed (more drives work together to resilver) if you
// let them finish rebalance BEFORE replacing the failed drive.
// let them finish rebalance BEFORE replacing the failed drive (degraded_replacement=false).
// At the same time, more peers per OSD increase probability of any of them to fail!
// osd_rm=true means that failed OSDs' data is rebalanced over all other hosts,
// not over the same host as it's in Ceph by default (dead OSDs are marked 'out').
//
// Probability of all except one drives in a replica group to fail is (AFR^(k-1)).
// So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)). Interesting but reasonable consequence
// is that, with k=2, total failure rate doesn't depend on number of peers per OSD,
// because it gets increased linearly by increased number of peers to fail
// and decreased linearly by reduced rebalance time.
function cluster_afr_pgs({ n_hosts, n_drives, afr_drive, capacity, speed, replicas, pgs = 1, degraded_replacement })
function cluster_afr({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec, ec_data, ec_parity, replicas, pgs = 1, osd_rm, degraded_replacement, down_out_interval = 600 })
{
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
return 1 - (1 - afr_drive * (1-(1-(afr_drive*l)**(replicas-1))**pgs)) ** (n_hosts*n_drives);
}
function cluster_afr_pgs_ec({ n_hosts, n_drives, afr_drive, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
{
const ec_total = ec_data+ec_parity;
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
return 1 - (1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, afr_drive*l, ec_parity))**pgs)) ** (n_hosts*n_drives);
}
// Same as above, but also take server failures into account
function cluster_afr_pgs_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, replicas, pgs = 1, degraded_replacement })
{
let otherhosts = Math.min(pgs, (n_hosts-1)/(replicas-1));
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
let pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(replicas-1));
const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
const lh = n_drives*capacity/pgs/speed/86400/365;
const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
return 1 - ((1 - afr_host * (1-(1-p1**(replicas-1))**pgh)) ** n_hosts) *
((1 - afr_drive * (1-(1-p2**(replicas-1))**pgs)) ** (n_hosts*n_drives));
}
function cluster_afr_pgs_ec_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
{
const ec_total = ec_data+ec_parity;
const otherhosts = Math.min(pgs, (n_hosts-1)/(ec_total-1));
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
const pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(ec_total-1));
const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
const lh = n_drives*capacity/pgs/speed/86400/365;
const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
return 1 - ((1 - afr_host * (1-(1-failure_rate_fullmesh(ec_total-1, p1, ec_parity))**pgh)) ** n_hosts) *
((1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, p2, ec_parity))**pgs)) ** (n_hosts*n_drives));
}
// Wrapper for 4 above functions
function cluster_afr(config)
{
if (config.ec && config.afr_host)
{
return cluster_afr_pgs_ec_hosts(config);
}
else if (config.ec)
{
return cluster_afr_pgs_ec(config);
}
else if (config.afr_host)
{
return cluster_afr_pgs_hosts(config);
}
else
{
return cluster_afr_pgs(config);
}
}
function print_cluster_afr(config)
{
console.log(
`${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
`, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
(config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
(config.ec ? `, EC ${config.ec[0]}+${config.ec[1]}` : `, ${config.replicas} replicas`)+
`, ${config.pgs||1} PG per OSD`+
(config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
);
console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
const pg_size = (ec ? ec_data+ec_parity : replicas);
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(pg_size-1));
const host_pgs = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(pg_size-1));
const resilver_disk = n_drives == 1 || osd_rm ? pgs : (n_drives-1);
const disk_heal_time = (down_out_interval + capacity/(degraded_replacement ? 1 : resilver_disk)/speed)/86400/365;
const host_heal_time = (down_out_interval + n_drives*capacity/pgs/speed)/86400/365;
const disk_heal_fail = ((afr_drive+afr_host/n_drives)*disk_heal_time);
const host_heal_fail = ((afr_drive+afr_host/n_drives)*host_heal_time);
const disk_pg_fail = ec
? failure_rate_fullmesh(ec_data+ec_parity-1, disk_heal_fail, ec_parity)
: disk_heal_fail**(replicas-1);
const host_pg_fail = ec
? failure_rate_fullmesh(ec_data+ec_parity-1, host_heal_fail, ec_parity)
: host_heal_fail**(replicas-1);
return 1 - ((1 - afr_drive * (1-(1-disk_pg_fail)**pgs)) ** (n_hosts*n_drives))
* ((1 - afr_host * (1-(1-host_pg_fail)**host_pgs)) ** n_hosts);
}
/******** UTILITY ********/

28
mon/afr_test.js Normal file
View File

@@ -0,0 +1,28 @@
const { sprintf } = require('sprintf-js');
const { cluster_afr } = require('./afr.js');
print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: true, ec_data: 2, ec_parity: 1 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
function print_cluster_afr(config)
{
console.log(
`${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
`, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
(config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
(config.ec ? `, EC ${config.ec_data}+${config.ec_parity}` : `, ${config.replicas} replicas`)+
`, ${config.pgs||1} PG per OSD`+
(config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
);
console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
// Data distribution optimizer using linear programming (lp_solve)
@@ -58,7 +58,7 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
}
const all_weights = Object.assign({}, ...Object.values(osd_tree));
const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations));
const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1));
const pg_per_osd = {};
for (const pg of all_pgs)
{
@@ -249,7 +249,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
}
}
// Get all combinations
let all_pgs = random_combinations(osd_tree, pg_size, max_combinations);
let all_pgs = random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1);
add_valid_previous(osd_tree, prev_weights, all_pgs);
all_pgs = Object.values(all_pgs);
const pg_per_osd = {};
@@ -275,6 +275,11 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
lp += 'max: '+all_pg_names.map(pg_name => (
prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}`
)).join(' + ')+';\n';
lp += all_pg_names
.map(pg_name => (prev_weights[pg_name] ? `add_${pg_name} - del_${pg_name}` : `${pg_name}`))
.join(' + ')+' = '+(pg_count
- Object.keys(prev_weights).reduce((a, old_pg_name) => (a + (all_pgs_hash[old_pg_name] ? prev_weights[old_pg_name] : 0)), 0)
)+';\n';
for (const osd in pg_per_osd)
{
if (osd !== NO_OSD)
@@ -488,7 +493,8 @@ function extract_osds(osd_tree, levels, osd_level, osds = {})
return osds;
}
function random_combinations(osd_tree, pg_size, count)
// ordered = don't treat (x,y) and (y,x) as equal
function random_combinations(osd_tree, pg_size, count, ordered)
{
let seed = 0x5f020e43;
let rng = () =>
@@ -516,25 +522,47 @@ function random_combinations(osd_tree, pg_size, count)
pg.push(osds[cur_hosts[next_host]][next_osd]);
cur_hosts.splice(next_host, 1);
}
while (pg.length < pg_size)
const cyclic_pgs = [ pg ];
if (ordered)
{
pg.push(NO_OSD);
for (let i = 1; i < pg.size; i++)
{
cyclic_pgs.push([ ...pg.slice(i), ...pg.slice(0, i) ]);
}
}
for (const pg of cyclic_pgs)
{
while (pg.length < pg_size)
{
pg.push(NO_OSD);
}
r['pg_'+pg.join('_')] = pg;
}
r['pg_'+pg.join('_')] = pg;
}
}
// Generate purely random combinations
restart: while (count > 0)
while (count > 0)
{
let host_idx = [];
for (let i = 0; i < pg_size && i < hosts.length; i++)
const cur_hosts = [ ...hosts.map((h, i) => i) ];
const max_hosts = pg_size < hosts.length ? pg_size : hosts.length;
if (ordered)
{
let start = i > 0 ? host_idx[i-1]+1 : 0;
if (start >= hosts.length)
for (let i = 0; i < max_hosts; i++)
{
continue restart;
const r = rng() % cur_hosts.length;
host_idx[i] = cur_hosts[r];
cur_hosts.splice(r, 1);
}
}
else
{
for (let i = 0; i < max_hosts; i++)
{
const r = rng() % (cur_hosts.length - (max_hosts - i - 1));
host_idx[i] = cur_hosts[r];
cur_hosts.splice(0, r+1);
}
host_idx[i] = start + rng() % (hosts.length-start);
}
let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]);
while (pg.length < pg_size)

76
mon/make-osd.sh Executable file
View File

@@ -0,0 +1,76 @@
#!/bin/bash
# Very simple systemd unit generator for vitastor-osd services
# Not the final solution yet, mostly for tests
# Copyright (c) Vitaliy Filippov, 2019+
# License: MIT
# USAGE: ./make-osd.sh /dev/disk/by-partuuid/xxx [ /dev/disk/by-partuuid/yyy]...
IP_SUBSTR="10.200.1."
ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
set -e -x
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
[ "$IP" != "" ] || exit 1
ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
D=`dirname $0`
# Create OSDs on all passed devices
OSD_NUM=1
for DEV in $*; do
# Ugly :) -> node.js rework pending
while true; do
ST=$(etcdctl --endpoints="$ETCD_MON" get --print-value-only /vitastor/osd/stats/$OSD_NUM)
if [ "$ST" = "" ]; then
break
fi
OSD_NUM=$((OSD_NUM+1))
done
etcdctl --endpoints="$ETCD_MON" put /vitastor/osd/stats/$OSD_NUM '{}'
echo Creating OSD $OSD_NUM on $DEV
OPT=`node $D/simple-offsets.js --device $DEV --format options | tr '\n' ' '`
META=`echo $OPT | grep -Po '(?<=data_offset )\d+'`
dd if=/dev/zero of=$DEV bs=1048576 count=$(((META+1048575)/1048576)) oflag=direct
cat >/etc/systemd/system/vitastor-osd$OSD_NUM.service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$OSD_NUM
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/usr/bin/vitastor-osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $OSD_NUM \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 256 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
$OPT
WorkingDirectory=/
ExecStartPre=+chown vitastor:vitastor $DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$OSD_NUM
done

157
mon/make-units.sh Normal file → Executable file
View File

@@ -1,19 +1,59 @@
#!/bin/bash
# Example startup script generator
# Of course this isn't a production solution yet, this is just for tests
# Very simple systemd unit generator for etcd & vitastor-mon services
# Not the final solution yet, mostly for tests
# Copyright (c) Vitaliy Filippov, 2019+
# License: MIT
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.broadcast == "10.115.0.255") | .local'`
# USAGE: ./make-units.sh
IP_SUBSTR="10.200.1."
ETCD_HOSTS="etcd0=http://10.200.1.10:2380,etcd1=http://10.200.1.11:2380,etcd2=http://10.200.1.12:2380"
# determine IP
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.local | startswith("'$IP_SUBSTR'")) | .local'`
[ "$IP" != "" ] || exit 1
ETCD_NUM=${ETCD_HOSTS/$IP*/}
[ "$ETCD_NUM" != "$ETCD_HOSTS" ] || exit 1
ETCD_NUM=$(echo $ETCD_NUM | tr -d -c , | wc -c)
# etcd
useradd etcd
mkdir -p /var/lib/etcd$ETCD_NUM.etcd
cat >/etc/systemd/system/etcd.service <<EOF
[Unit]
Description=etcd for vitastor
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
[Service]
Restart=always
ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_NUM.etcd \\
--advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
--initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
--initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
--initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
User=etcd
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=local.target
EOF
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd
useradd vitastor
chmod 755 /root
BASE=${IP/*./}
BASE=$(((BASE-10)*12))
# Vitastor target
cat >/etc/systemd/system/vitastor.target <<EOF
[Unit]
Description=vitastor target
@@ -21,116 +61,25 @@ Description=vitastor target
WantedBy=multi-user.target
EOF
i=1
for DEV in `ls /dev/disk/by-id/ | grep ata-INTEL_SSDSC2KB`; do
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2))
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2)) seek=$((1920377991168/1048576))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
# Monitor unit
ETCD_MON=$(echo $ETCD_HOSTS | perl -pe 's/:2380/:2379/g; s/etcd\d*=//g;')
cat >/etc/systemd/system/vitastor-mon.service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$((BASE+i))
Description=Vitastor monitor
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--disable_device_lock 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 0 \\
--meta_offset 16777216 \\
--data_offset 427814912 \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
Restart=always
ExecStart=node /usr/lib/vitastor/mon/mon-main.js --etcd_url '$ETCD_MON' --etcd_prefix '/vitastor' --etcd_start_timeout 5
WorkingDirectory=/
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$((BASE+i))
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 1920377991168 \\
--meta_offset $((1920377991168+16777216)) \\
--data_offset $((1920377991168+427814912)) \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
done
exit
node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5
podman run -d --network host --restart always -v /var/lib/etcd0.etcd:/etcd0.etcd --name etcd quay.io/coreos/etcd:v3.4.13 etcd -name etcd0 \
-advertise-client-urls http://10.115.0.10:2379 -listen-client-urls http://10.115.0.10:2379 \
-initial-advertise-peer-urls http://10.115.0.10:2380 -listen-peer-urls http://10.115.0.10:2380 \
-initial-cluster-token vitastor-etcd-1 -initial-cluster etcd0=http://10.115.0.10:2380,etcd1=http://10.115.0.11:2380,etcd2=http://10.115.0.12:2380,etcd3=http://10.115.0.13:2380 \
-initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/global '{"immediate_commit":"all"}'
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":48,"failure_domain":"host"}}'
#let pgs = {};
#for (let n = 0; n < 48; n++) { let i = n/2 | 0; pgs[1+n] = { osd_set: [ (1+i%12+(i/12 | 0)*24), (1+12+i%12+(i/12 | 0)*24) ], primary: (1+(n%2)*12+i%12+(i/12 | 0)*24) }; };
#console.log(JSON.stringify({ items: { 1: pgs } }));
#etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pgs ...
# --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
# --data_offset 427814912 \\
# --disk_alignment 4096 --journal_block_size 512 --meta_block_size 512 \\
# --data_offset 433434624 \\

2
mon/mon-main.js Normal file → Executable file
View File

@@ -1,7 +1,7 @@
#!/usr/bin/node
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
const Mon = require('./mon.js');

File diff suppressed because it is too large Load Diff

View File

@@ -4,6 +4,7 @@
// Simple tool to calculate journal and metadata offsets for a single device
// Will be replaced by smarter tools in the future
const fs = require('fs').promises;
const child_process = require('child_process');
async function run()
@@ -15,6 +16,7 @@ async function run()
device_block_size: 4096,
journal_offset: 0,
device_size: 0,
format: 'text',
};
for (let i = 2; i < process.argv.length; i++)
{
@@ -24,7 +26,22 @@ async function run()
i++;
}
}
const device_size = Number(options.device_size || await system("blockdev --getsize64 "+options.device));
if (!options.device)
{
process.stderr.write('USAGE: nodejs '+process.argv[1]+' --device /dev/sdXXX\n');
process.exit(1);
}
options.device_size = Number(options.device_size);
let device_size = options.device_size;
if (!device_size)
{
const st = await fs.stat(options.device);
options.device_block_size = st.blksize;
if (st.isBlockDevice())
device_size = Number(await system("/sbin/blockdev --getsize64 "+options.device))
else
device_size = st.size;
}
if (!device_size)
{
process.stderr.write('Failed to get device size\n');
@@ -32,25 +49,45 @@ async function run()
}
options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size;
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + options.object_size/options.bitmap_granularity/8));
const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size;
const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
: Math.round(meta_size/1024/1024*100)/100+" MB");
process.stdout.write(
`Metadata size: ${meta_size_fmt}\n`+
`Options for the OSD:\n`+
` --journal_offset ${options.journal_offset}\n`+
` --meta_offset ${meta_offset}\n`+
` --data_offset ${data_offset}\n`+
(options.device_size ? ` --data_size ${device_size-data_offset}\n` : '')
);
if (options.format == 'text' || options.format == 'options')
{
if (options.format == 'text')
{
process.stderr.write(
`Metadata size: ${meta_size_fmt}\n`+
`Options for the OSD:\n`
);
}
process.stdout.write(
` --data_device ${options.device}\n`+
` --journal_offset ${options.journal_offset}\n`+
` --meta_offset ${meta_offset}\n`+
` --data_offset ${data_offset}\n`+
(options.device_size ? ` --data_size ${device_size-data_offset}\n` : '')
);
}
else if (options.format == 'env')
{
process.stdout.write(
`journal_offset=${options.journal_offset}\n`+
`meta_offset=${meta_offset}\n`+
`data_offset=${data_offset}\n`+
`data_size=${device_size-data_offset}\n`
);
}
else
process.stdout.write('Unknown format: '+options.format);
}
function system(cmd)
{
return new Promise((ok, no) => child_process.exec(cmd, { maxBuffer: 64*1024*1024 }, (err, stdout, stderr) => (err ? no(err) : ok(stdout))));
return new Promise((ok, no) => child_process.exec(cmd, { maxBuffer: 64*1024*1024 }, (err, stdout, stderr) => (err ? no(err.message) : ok(stdout))));
}
run().catch(console.error);
run().catch(err => { console.error(err); process.exit(1); });

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
// Interesting real-world example coming from Ceph with EC and compression enabled.
// EC parity chunks can't be compressed as efficiently as data chunks,

View File

@@ -0,0 +1,25 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js');
async function run()
{
const osd_tree = { a: { 1: 1 }, b: { 2: 1 }, c: { 3: 1 } };
let res;
console.log('16 PGs, size=3');
res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16 });
LPOptimizer.print_change_stats(res, false);
console.log('\nReduce PG size to 2');
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs.map(pg => pg.slice(0, 2)), osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemove OSD 3');
delete osd_tree['c'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
}
run().catch(console.error);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js');

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js');

View File

@@ -1,186 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "messenger.h"
void osd_messenger_t::outbox_push(osd_op_t *cur_op)
{
assert(cur_op->peer_fd);
auto & cl = clients.at(cur_op->peer_fd);
if (cur_op->op_type == OSD_OP_OUT)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
}
else
{
// Check that operation actually belongs to this client
bool found = false;
for (auto it = cl.received_ops.begin(); it != cl.received_ops.end(); it++)
{
if (*it == cur_op)
{
found = true;
cl.received_ops.erase(it, it+1);
break;
}
}
if (!found)
{
delete cur_op;
return;
}
}
cl.outbox.push_back(cur_op);
if (!ringloop)
{
while (cl.write_op || cl.outbox.size())
{
try_send(cl);
}
}
else if (cl.write_op || cl.outbox.size() > 1 || !try_send(cl))
{
if (cl.write_state == 0)
{
cl.write_state = CL_WRITE_READY;
write_ready_clients.push_back(cur_op->peer_fd);
}
ringloop->wakeup();
}
}
bool osd_messenger_t::try_send(osd_client_t & cl)
{
int peer_fd = cl.peer_fd;
if (!cl.write_op)
{
// pick next command
cl.write_op = cl.outbox.front();
cl.outbox.pop_front();
cl.write_state = CL_WRITE_REPLY;
if (cl.write_op->op_type == OSD_OP_IN)
{
// Measure execution latency
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
stats.op_stat_count[cl.write_op->req.hdr.opcode]++;
if (!stats.op_stat_count[cl.write_op->req.hdr.opcode])
{
stats.op_stat_count[cl.write_op->req.hdr.opcode]++;
stats.op_stat_sum[cl.write_op->req.hdr.opcode] = 0;
stats.op_stat_bytes[cl.write_op->req.hdr.opcode] = 0;
}
stats.op_stat_sum[cl.write_op->req.hdr.opcode] += (
(tv_end.tv_sec - cl.write_op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - cl.write_op->tv_begin.tv_nsec)/1000
);
if (cl.write_op->req.hdr.opcode == OSD_OP_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_WRITE)
{
stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.rw.len;
}
else if (cl.write_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.sec_rw.len;
}
cl.send_list.push_back(cl.write_op->reply.buf, OSD_PACKET_SIZE);
if (cl.write_op->req.hdr.opcode == OSD_OP_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_LIST ||
cl.write_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
{
cl.send_list.append(cl.write_op->iov);
}
}
else
{
cl.send_list.push_back(cl.write_op->req.buf, OSD_PACKET_SIZE);
if (cl.write_op->req.hdr.opcode == OSD_OP_WRITE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
{
cl.send_list.append(cl.write_op->iov);
}
}
}
cl.write_msg.msg_iov = cl.send_list.get_iovec();
cl.write_msg.msg_iovlen = cl.send_list.get_size();
if (ringloop && !use_sync_send_recv)
{
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
return false;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, peer_fd](ring_data_t *data) { handle_send(data->res, peer_fd); };
my_uring_prep_sendmsg(sqe, peer_fd, &cl.write_msg, 0);
}
else
{
int result = sendmsg(peer_fd, &cl.write_msg, MSG_NOSIGNAL);
if (result < 0)
{
result = -errno;
}
handle_send(result, peer_fd);
}
return true;
}
void osd_messenger_t::send_replies()
{
for (int i = 0; i < write_ready_clients.size(); i++)
{
int peer_fd = write_ready_clients[i];
if (!try_send(clients[peer_fd]))
{
write_ready_clients.erase(write_ready_clients.begin(), write_ready_clients.begin() + i);
return;
}
}
write_ready_clients.clear();
}
void osd_messenger_t::handle_send(int result, int peer_fd)
{
auto cl_it = clients.find(peer_fd);
if (cl_it != clients.end())
{
auto & cl = cl_it->second;
if (result < 0 && result != -EAGAIN)
{
// this is a client socket, so don't panic. just disconnect it
printf("Client %d socket write error: %d (%s). Disconnecting client\n", peer_fd, -result, strerror(-result));
stop_client(peer_fd);
return;
}
if (result >= 0)
{
cl.send_list.eat(result);
if (cl.send_list.done >= cl.send_list.count)
{
// Done
cl.send_list.reset();
if (cl.write_op->op_type == OSD_OP_IN)
{
delete cl.write_op;
}
else
{
cl.sent_ops[cl.write_op->req.hdr.id] = cl.write_op;
}
cl.write_op = NULL;
cl.write_state = cl.outbox.size() > 0 ? CL_WRITE_READY : 0;
}
}
if (cl.write_state != 0)
{
write_ready_clients.push_back(peer_fd);
}
}
}

View File

@@ -1,363 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <string.h>
#include "osd_rmw.cpp"
#include "test_pattern.h"
void dump_stripes(osd_rmw_stripe_t *stripes, int pg_size);
void test1();
void test4();
void test5();
void test6();
void test7();
void test8();
void test9();
/***
Cases:
1. split(offset=128K-4K, len=8K)
= [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 0 ] ]
2. read(offset=128K-4K, len=8K, osd_set=[1,0,3])
= { read: [ [ 0, 128K ], [ 0, 4K ], [ 0, 4K ] ] }
3. cover_read(0, 128K, { req: [ 128K-4K, 4K ] })
= { read: [ 0, 128K-4K ] }
4. write(offset=128K-4K, len=8K, osd_set=[1,0,3])
= {
read: [ [ 0, 128K ], [ 4K, 128K ], [ 4K, 128K ] ],
write: [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read0, read1, read2 ],
}
+ check write2 buffer
5. write(offset=0, len=128K+64K, osd_set=[1,0,3])
= {
req: [ [ 0, 128K ], [ 0, 64K ], [ 0, 0 ] ],
read: [ [ 64K, 128K ], [ 64K, 128K ], [ 64K, 128K ] ],
write: [ [ 0, 128K ], [ 0, 64K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read0, read1, read2 ],
}
6. write(offset=0, len=128K+64K, osd_set=[1,2,3])
= {
req: [ [ 0, 128K ], [ 0, 64K ], [ 0, 0 ] ],
read: [ [ 0, 0 ], [ 64K, 128K ], [ 0, 0 ] ],
write: [ [ 0, 128K ], [ 0, 64K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read1 ],
}
7. calc_rmw(offset=128K-4K, len=8K, osd_set=[1,0,3], write_set=[1,2,3])
= {
read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 128K ] ],
write: [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read0, read1, read2 ],
}
then, after calc_rmw_parity_xor(): {
write: [ [ 128K-4K, 128K ], [ 0, 128K ], [ 0, 128K ] ],
write1==read1,
}
+ check write1 buffer
+ check write2 buffer
8. calc_rmw(offset=0, len=128K+4K, osd_set=[0,2,3], write_set=[1,2,3])
= {
read: [ [ 0, 0 ], [ 4K, 128K ], [ 0, 0 ] ],
write: [ [ 0, 128K ], [ 0, 4K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read1 ],
}
+ check write2 buffer
9. object recovery case:
calc_rmw(offset=0, len=0, read_osd_set=[0,2,3], write_osd_set=[1,2,3])
= {
read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 128K ] ],
write: [ [ 0, 0 ], [ 0, 0 ], [ 0, 0 ] ],
input buffer: NULL,
rmw buffer: [ read0, read1, read2 ],
}
then, after calc_rmw_parity_xor(): {
write: [ [ 0, 128K ], [ 0, 0 ], [ 0, 0 ] ],
write0==read0,
}
+ check write0 buffer
***/
int main(int narg, char *args[])
{
// Test 1
test1();
// Test 4
test4();
// Test 5
test5();
// Test 6
test6();
// Test 7
test7();
// Test 8
test8();
// Test 9
test9();
// End
printf("all ok\n");
return 0;
}
void dump_stripes(osd_rmw_stripe_t *stripes, int pg_size)
{
printf("request");
for (int i = 0; i < pg_size; i++)
{
printf(" {%uK-%uK}", stripes[i].req_start/1024, stripes[i].req_end/1024);
}
printf("\n");
printf("read");
for (int i = 0; i < pg_size; i++)
{
printf(" {%uK-%uK}", stripes[i].read_start/1024, stripes[i].read_end/1024);
}
printf("\n");
printf("write");
for (int i = 0; i < pg_size; i++)
{
printf(" {%uK-%uK}", stripes[i].write_start/1024, stripes[i].write_end/1024);
}
printf("\n");
}
void test1()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 1.1
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
assert(stripes[0].req_start == 128*1024-4096 && stripes[0].req_end == 128*1024);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 4096);
assert(stripes[2].req_end == 0);
// Test 1.2
for (int i = 0; i < 3; i++)
{
stripes[i].read_start = stripes[i].req_start;
stripes[i].read_end = stripes[i].req_end;
}
assert(extend_missing_stripes(stripes, osd_set, 2, 3) == 0);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 4096);
// Test 1.3
stripes[0] = { .req_start = 128*1024-4096, .req_end = 128*1024 };
cover_read(0, 128*1024, stripes[0]);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096);
}
void test4()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 4.1
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
void* write_buf = malloc(8192);
void* rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 4096 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == rmw_buf+128*1024);
assert(stripes[1].read_buf == rmw_buf+128*1024*2);
assert(stripes[2].read_buf == rmw_buf+128*1024*3-4096);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == write_buf+4096);
assert(stripes[2].write_buf == rmw_buf);
// Test 4.2
set_pattern(write_buf, 8192, PATTERN0);
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
set_pattern(stripes[1].read_buf, 128*1024-4096, UINT64_MAX); // didn't read it, it's missing
set_pattern(stripes[2].read_buf, 128*1024-4096, 0); // old parity = 0
calc_rmw_parity_xor(stripes, 3, osd_set, osd_set, 128*1024);
check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
free(rmw_buf);
free(write_buf);
}
void test5()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 5.1
split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 64*1024);
assert(stripes[2].req_end == 0);
// Test 5.2
void *write_buf = malloc(64*1024*3);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
assert(stripes[0].read_start == 64*1024 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 64*1024 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == rmw_buf+128*1024);
assert(stripes[1].read_buf == rmw_buf+64*3*1024);
assert(stripes[2].read_buf == rmw_buf+64*4*1024);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == write_buf+128*1024);
assert(stripes[2].write_buf == rmw_buf);
free(rmw_buf);
free(write_buf);
}
void test6()
{
osd_num_t osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 6.1
split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
void *write_buf = malloc(64*1024*3);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, osd_set, 128*1024);
assert(stripes[0].read_end == 0);
assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_end == 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == 0);
assert(stripes[1].read_buf == rmw_buf+128*1024);
assert(stripes[2].read_buf == 0);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == write_buf+128*1024);
assert(stripes[2].write_buf == rmw_buf);
free(rmw_buf);
free(write_buf);
}
void test7()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 7.1
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
void *write_buf = malloc(8192);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == rmw_buf+128*1024);
assert(stripes[1].read_buf == rmw_buf+128*1024*2);
assert(stripes[2].read_buf == rmw_buf+128*1024*3);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == write_buf+4096);
assert(stripes[2].write_buf == rmw_buf);
// Test 7.2
set_pattern(write_buf, 8192, PATTERN0);
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
set_pattern(stripes[1].read_buf, 128*1024, UINT64_MAX); // didn't read it, it's missing
set_pattern(stripes[2].read_buf, 128*1024, 0); // old parity = 0
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[1].write_buf == stripes[1].read_buf);
check_pattern(stripes[1].write_buf, 4096, PATTERN0);
check_pattern(stripes[1].write_buf+4096, 128*1024-4096, PATTERN1);
check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
free(rmw_buf);
free(write_buf);
}
void test8()
{
osd_num_t osd_set[3] = { 0, 2, 3 };
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 8.1
split_stripes(2, 128*1024, 0, 128*1024+4096, stripes);
void *write_buf = malloc(128*1024+4096);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 0);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == NULL);
assert(stripes[1].read_buf == rmw_buf+128*1024);
assert(stripes[2].read_buf == NULL);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == write_buf+128*1024);
assert(stripes[2].write_buf == rmw_buf);
// Test 8.2
set_pattern(write_buf, 128*1024+4096, PATTERN0);
set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN1);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); // recheck again
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096); // recheck again
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again
assert(stripes[0].write_buf == write_buf); // recheck again
assert(stripes[1].write_buf == write_buf+128*1024); // recheck again
assert(stripes[2].write_buf == rmw_buf); // recheck again
check_pattern(stripes[2].write_buf, 4096, 0); // new parity
check_pattern(stripes[2].write_buf+4096, 128*1024-4096, PATTERN0^PATTERN1); // new parity
free(rmw_buf);
free(write_buf);
}
void test9()
{
osd_num_t osd_set[3] = { 0, 2, 3 };
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 9.0
split_stripes(2, 128*1024, 64*1024, 0, stripes);
assert(stripes[0].req_start == 0 && stripes[0].req_end == 0);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 0);
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 9.1
void *write_buf = NULL;
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
assert(stripes[0].read_buf == rmw_buf);
assert(stripes[1].read_buf == rmw_buf+128*1024);
assert(stripes[2].read_buf == rmw_buf+128*1024*2);
assert(stripes[0].write_buf == NULL);
assert(stripes[1].write_buf == NULL);
assert(stripes[2].write_buf == NULL);
// Test 9.2
set_pattern(stripes[1].read_buf, 128*1024, 0);
set_pattern(stripes[2].read_buf, 128*1024, PATTERN1);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
assert(stripes[0].write_buf == rmw_buf);
assert(stripes[1].write_buf == NULL);
assert(stripes[2].write_buf == NULL);
check_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
check_pattern(stripes[0].write_buf, 128*1024, PATTERN1);
free(rmw_buf);
}

84
qemu-3.1-vitastor.patch Normal file
View File

@@ -0,0 +1,84 @@
Index: qemu-3.1+dfsg/qapi/block-core.json
===================================================================
--- qemu-3.1+dfsg.orig/qapi/block-core.json
+++ qemu-3.1+dfsg/qapi/block-core.json
@@ -2617,7 +2617,7 @@
##
{ 'enum': 'BlockdevDriver',
'data': [ 'blkdebug', 'blklogwrites', 'blkverify', 'bochs', 'cloop',
- 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps', 'gluster',
+ 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps', 'gluster', 'vitastor',
'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks',
'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog',
@@ -3367,6 +3367,24 @@
'*tag': 'str' } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -3713,6 +3731,7 @@
'rbd': 'BlockdevOptionsRbd',
'replication':'BlockdevOptionsReplication',
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -4158,6 +4177,17 @@
'*block-state-zero': 'bool' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVpcSubformat:
#
# @dynamic: Growing image file
@@ -4212,6 +4242,7 @@
'qed': 'BlockdevCreateOptionsQed',
'rbd': 'BlockdevCreateOptionsRbd',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
Index: qemu-3.1+dfsg/scripts/modules/module_block.py
===================================================================
--- qemu-3.1+dfsg.orig/scripts/modules/module_block.py
+++ qemu-3.1+dfsg/scripts/modules/module_block.py
@@ -88,6 +88,7 @@ def print_bottom(fheader):
output_file = sys.argv[1]
with open(output_file, 'w') as fheader:
print_top(fheader)
+ add_module(fheader, "vitastor", "vitastor", "vitastor")
for filename in sys.argv[2:]:
if os.path.isfile(filename):

84
qemu-4.2-vitastor.patch Normal file
View File

@@ -0,0 +1,84 @@
Index: qemu/qapi/block-core.json
===================================================================
--- qemu.orig/qapi/block-core.json 2020-11-07 22:57:38.932613674 +0000
+++ qemu.orig/qapi/block-core.json 2020-11-07 22:59:49.890722862 +0000
@@ -2907,7 +2907,7 @@
'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
- 'sheepdog',
+ 'sheepdog', 'vitastor',
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3725,6 +3725,24 @@
'*tag': 'str' } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4084,6 +4102,7 @@
'replication': { 'type': 'BlockdevOptionsReplication',
'if': 'defined(CONFIG_REPLICATION)' },
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -4461,6 +4480,17 @@
'*cluster-size' : 'size' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -4722,6 +4752,7 @@
'qed': 'BlockdevCreateOptionsQed',
'rbd': 'BlockdevCreateOptionsRbd',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
Index: qemu/scripts/modules/module_block.py
===================================================================
--- qemu.orig/scripts/modules/module_block.py 2020-11-07 22:57:38.936613739 +0000
+++ qemu/scripts/modules/module_block.py 2020-11-07 22:59:49.890722862 +0000
@@ -86,6 +86,7 @@ def print_bottom(fheader):
output_file = sys.argv[1]
with open(output_file, 'w') as fheader:
print_top(fheader)
+ add_module(fheader, "vitastor", "vitastor", "vitastor")
for filename in sys.argv[2:]:
if os.path.isfile(filename):

84
qemu-5.0-vitastor.patch Normal file
View File

@@ -0,0 +1,84 @@
Index: qemu/qapi/block-core.json
===================================================================
--- qemu.orig/qapi/block-core.json
+++ qemu/qapi/block-core.json
@@ -2798,7 +2798,7 @@
'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
- 'sheepdog',
+ 'sheepdog', 'vitastor',
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3635,6 +3635,24 @@
'*tag': 'str' } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -3995,6 +4013,7 @@
'replication': { 'type': 'BlockdevOptionsReplication',
'if': 'defined(CONFIG_REPLICATION)' },
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -4365,6 +4384,17 @@
'*cluster-size' : 'size' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -4626,6 +4656,7 @@
'qed': 'BlockdevCreateOptionsQed',
'rbd': 'BlockdevCreateOptionsRbd',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
Index: qemu/scripts/modules/module_block.py
===================================================================
--- qemu.orig/scripts/modules/module_block.py
+++ qemu/scripts/modules/module_block.py
@@ -85,6 +85,7 @@ def print_bottom(fheader):
output_file = sys.argv[1]
with open(output_file, 'w') as fheader:
print_top(fheader)
+ add_module(fheader, "vitastor", "vitastor", "vitastor")
for filename in sys.argv[2:]:
if os.path.isfile(filename):

84
qemu-5.1-vitastor.patch Normal file
View File

@@ -0,0 +1,84 @@
Index: qemu-5.1+dfsg/qapi/block-core.json
===================================================================
--- qemu-5.1+dfsg.orig/qapi/block-core.json
+++ qemu-5.1+dfsg/qapi/block-core.json
@@ -2807,7 +2807,7 @@
'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
- 'sheepdog',
+ 'sheepdog', 'vitastor',
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
##
@@ -3644,6 +3644,24 @@
'*tag': 'str' } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -3988,6 +4006,7 @@
'replication': { 'type': 'BlockdevOptionsReplication',
'if': 'defined(CONFIG_REPLICATION)' },
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -4376,6 +4395,17 @@
'*cluster-size' : 'size' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -4637,6 +4667,7 @@
'qed': 'BlockdevCreateOptionsQed',
'rbd': 'BlockdevCreateOptionsRbd',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
Index: qemu-5.1+dfsg/scripts/modules/module_block.py
===================================================================
--- qemu-5.1+dfsg.orig/scripts/modules/module_block.py
+++ qemu-5.1+dfsg/scripts/modules/module_block.py
@@ -86,6 +86,7 @@ if __name__ == '__main__':
output_file = sys.argv[1]
with open(output_file, 'w') as fheader:
print_top(fheader)
+ add_module(fheader, "vitastor", "vitastor", "vitastor")
for filename in sys.argv[2:]:
if os.path.isfile(filename):

51
rpm/build-tarball.sh Executable file
View File

@@ -0,0 +1,51 @@
#!/bin/bash
# Vitastor depends on QEMU and FIO headers, but QEMU and FIO don't have -devel packages
# So we have to copy their headers into the source tarball
set -e
VITASTOR=$(dirname $0)
VITASTOR=$(realpath "$VITASTOR/..")
if [ -d /opt/rh/gcc-toolset-9 ]; then
# CentOS 8
EL=8
. /opt/rh/gcc-toolset-9/enable
else
# CentOS 7
EL=7
. /opt/rh/devtoolset-9/enable
fi
cd ~/rpmbuild/SPECS
rpmbuild -bp fio.spec
perl -i -pe 's/^make V=1/exit 0; make V=1/' qemu*.spec
rpmbuild -bc qemu*.spec
perl -i -pe 's/^exit 0; make V=1/make V=1/' qemu*.spec
cd ~/rpmbuild/BUILD/qemu*/
rm -rf $VITASTOR/qemu $VITASTOR/fio
mkdir -p $VITASTOR/qemu/b/qemu
make -j8 config-host.h
cp config-host.h $VITASTOR/qemu/b/qemu
cp -r include $VITASTOR/qemu
if [ -f qapi-schema.json ]; then
# QEMU 2.0
make qapi-types.h
cp qapi-types.h $VITASTOR/qemu/b/qemu
else
# QEMU 3.0+
make qapi
cp -r qapi $VITASTOR/qemu/b/qemu
fi
cd $VITASTOR
sh copy-qemu-includes.sh
rm -rf qemu
mv qemu-copy qemu
ln -s ~/rpmbuild/BUILD/fio*/ fio
sh copy-fio-includes.sh
rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.5.10/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.10$(rpm --eval '%dist').tar.gz *

31
rpm/qemu-el8.Dockerfile Normal file
View File

@@ -0,0 +1,31 @@
# Build packages for CentOS 8 inside a container
# cd ..; podman build -t qemu-el8 -v `pwd`/packages:/root/packages -f rpm/qemu-el8.Dockerfile .
FROM centos:8
WORKDIR /root
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN dnf -y install centos-release-advanced-virtualization epel-release dnf-plugins-core rpm-build
RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='centos-advanced-virtualization-source' --source qemu-kvm
RUN rpm --nomd5 -i qemu*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec
ADD qemu-*-vitastor.patch /root/vitastor/
RUN set -e; \
mkdir -p /root/packages/qemu-el8; \
rm -rf /root/packages/qemu-el8/*; \
rpm --nomd5 -i /root/qemu*.src.rpm; \
cd ~/rpmbuild/SPECS; \
PN=$(grep ^Patch qemu-kvm.spec | tail -n1 | perl -pe 's/Patch(\d+).*/$1/'); \
csplit qemu-kvm.spec "/^Patch$PN/"; \
cat xx00 > qemu-kvm.spec; \
head -n 1 xx01 >> qemu-kvm.spec; \
echo "Patch$((PN+1)): qemu-4.2-vitastor.patch" >> qemu-kvm.spec; \
tail -n +2 xx01 >> qemu-kvm.spec; \
perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
rpmbuild --nocheck -ba qemu-kvm.spec; \
cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/

257
rpm/qemu-kvm-el7.spec.patch Normal file
View File

@@ -0,0 +1,257 @@
--- qemu-kvm.spec.orig 2020-11-09 23:41:03.000000000 +0000
+++ qemu-kvm.spec 2020-12-06 10:44:24.207640963 +0000
@@ -2,7 +2,7 @@
%global SLOF_gittagcommit 899d9883
%global have_usbredir 1
-%global have_spice 1
+%global have_spice 0
%global have_opengl 1
%global have_fdt 0
%global have_gluster 1
@@ -56,7 +56,7 @@ Requires: %{name}-block-curl = %{epoch}:
Requires: %{name}-block-gluster = %{epoch}:%{version}-%{release} \
%endif \
Requires: %{name}-block-iscsi = %{epoch}:%{version}-%{release} \
-Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
+#Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
Requires: %{name}-block-ssh = %{epoch}:%{version}-%{release}
# Macro to properly setup RHEL/RHEV conflict handling
@@ -67,7 +67,7 @@ Obsoletes: %1-rhev
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 4.2.0
-Release: 29.vitastor%{?dist}.6
+Release: 30.vitastor%{?dist}.6
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY
@@ -99,8 +99,8 @@ Source30: kvm-s390x.conf
Source31: kvm-x86.conf
Source32: qemu-pr-helper.service
Source33: qemu-pr-helper.socket
-Source34: 81-kvm-rhel.rules
-Source35: udev-kvm-check.c
+#Source34: 81-kvm-rhel.rules
+#Source35: udev-kvm-check.c
Source36: README.tests
@@ -825,7 +825,9 @@ Patch331: kvm-Drop-bogus-IPv6-messages.p
Patch333: kvm-virtiofsd-Whitelist-fchmod.patch
# For bz#1883869 - virtiofsd core dump in KATA Container [rhel-8.2.1.z]
Patch334: kvm-virtiofsd-avoid-proc-self-fd-tempdir.patch
-Patch335: qemu-4.2-vitastor.patch
+Patch335: qemu-use-sphinx-1.2.patch
+Patch336: qemu-config-tcmalloc-warning.patch
+Patch337: qemu-4.2-vitastor.patch
BuildRequires: wget
BuildRequires: rpm-build
@@ -842,7 +844,8 @@ BuildRequires: pciutils-devel
BuildRequires: libiscsi-devel
BuildRequires: ncurses-devel
BuildRequires: libattr-devel
-BuildRequires: libusbx-devel >= 1.0.22
+BuildRequires: gperftools-devel
+BuildRequires: libusbx-devel >= 1.0.21
%if %{have_usbredir}
BuildRequires: usbredir-devel >= 0.7.1
%endif
@@ -856,12 +859,12 @@ BuildRequires: virglrenderer-devel
# For smartcard NSS support
BuildRequires: nss-devel
%endif
-BuildRequires: libseccomp-devel >= 2.4.0
+#Requires: libseccomp >= 2.4.0
# For network block driver
BuildRequires: libcurl-devel
BuildRequires: libssh-devel
-BuildRequires: librados-devel
-BuildRequires: librbd-devel
+#BuildRequires: librados-devel
+#BuildRequires: librbd-devel
%if %{have_gluster}
# For gluster block driver
BuildRequires: glusterfs-api-devel
@@ -955,25 +958,25 @@ hardware for a full system such as a PC
%package -n qemu-kvm-core
Summary: qemu-kvm core components
+Requires: gperftools-libs
Requires: qemu-img = %{epoch}:%{version}-%{release}
%ifarch %{ix86} x86_64
Requires: seabios-bin >= 1.10.2-1
Requires: sgabios-bin
-Requires: edk2-ovmf
%endif
%ifarch aarch64
Requires: edk2-aarch64
%endif
%ifnarch aarch64 s390x
-Requires: seavgabios-bin >= 1.12.0-3
-Requires: ipxe-roms-qemu >= 20170123-1
+Requires: seavgabios-bin >= 1.11.0-1
+Requires: ipxe-roms-qemu >= 20181214-1
+Requires: /usr/share/ipxe.efi
%endif
%ifarch %{power64}
Requires: SLOF >= %{SLOF_gittagdate}-1.git%{SLOF_gittagcommit}
%endif
Requires: %{name}-common = %{epoch}:%{version}-%{release}
-Requires: libseccomp >= 2.4.0
# For compressed guest memory dumps
Requires: lzo snappy
%if %{have_kvm_setup}
@@ -1085,15 +1088,15 @@ This package provides the additional iSC
Install this package if you want to access iSCSI volumes.
-%package block-rbd
-Summary: QEMU Ceph/RBD block driver
-Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
-
-%description block-rbd
-This package provides the additional Ceph/RBD block driver for QEMU.
-
-Install this package if you want to access remote Ceph volumes
-using the rbd protocol.
+#%package block-rbd
+#Summary: QEMU Ceph/RBD block driver
+#Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
+#
+#%description block-rbd
+#This package provides the additional Ceph/RBD block driver for QEMU.
+#
+#Install this package if you want to access remote Ceph volumes
+#using the rbd protocol.
%package block-ssh
@@ -1117,12 +1120,14 @@ the Secure Shell (SSH) protocol.
# --build-id option is used for giving info to the debug packages.
buildldflags="VL_LDFLAGS=-Wl,--build-id"
-%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
+#%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
+%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,blkdebug,luks,null-co,nvme,copy-on-read,throttle
%if 0%{have_gluster}
%global block_drivers_list %{block_drivers_list},gluster
%endif
+[ -e /usr/bin/sphinx-build ] || ln -s sphinx-build-3 /usr/bin/sphinx-build
./configure \
--prefix="%{_prefix}" \
--libdir="%{_libdir}" \
@@ -1152,15 +1157,15 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
%else
--disable-numa \
%endif
- --enable-rbd \
+ --disable-rbd \
%if 0%{have_librdma}
--enable-rdma \
%else
--disable-rdma \
%endif
--disable-pvrdma \
- --enable-seccomp \
-%if 0%{have_spice}
+ --disable-seccomp \
+%if %{have_spice}
--enable-spice \
--enable-smartcard \
--enable-virglrenderer \
@@ -1179,7 +1184,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
%else
--disable-usb-redir \
%endif
- --disable-tcmalloc \
+ --enable-tcmalloc \
%ifarch x86_64
--enable-libpmem \
%else
@@ -1193,9 +1198,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
%endif
--python=%{__python3} \
--target-list="%{buildarch}" \
- --block-drv-rw-whitelist=%{block_drivers_list} \
--audio-drv-list= \
- --block-drv-ro-whitelist=vmdk,vhdx,vpc,https,ssh \
--with-coroutine=ucontext \
--tls-priority=NORMAL \
--disable-bluez \
@@ -1262,7 +1265,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
--disable-sanitizers \
--disable-hvf \
--disable-whpx \
- --enable-malloc-trim \
+ --disable-malloc-trim \
--disable-membarrier \
--disable-vhost-crypto \
--disable-libxml2 \
@@ -1308,7 +1311,7 @@ make V=1 %{?_smp_mflags} $buildldflags
cp -a %{kvm_target}-softmmu/qemu-system-%{kvm_target} qemu-kvm
gcc %{SOURCE6} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o ksmctl
-gcc %{SOURCE35} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o udev-kvm-check
+#gcc %{SOURCE35} $RPM_OPT_FLAGS $RPM_LD_FLAGS -o udev-kvm-check
%install
%define _udevdir %(pkg-config --variable=udevdir udev)
@@ -1343,8 +1346,8 @@ mkdir -p $RPM_BUILD_ROOT%{testsdir}/test
mkdir -p $RPM_BUILD_ROOT%{testsdir}/tests/qemu-iotests
mkdir -p $RPM_BUILD_ROOT%{testsdir}/scripts/qmp
-install -p -m 0755 udev-kvm-check $RPM_BUILD_ROOT%{_udevdir}
-install -p -m 0644 %{SOURCE34} $RPM_BUILD_ROOT%{_udevrulesdir}
+#install -p -m 0755 udev-kvm-check $RPM_BUILD_ROOT%{_udevdir}
+#install -p -m 0644 %{SOURCE34} $RPM_BUILD_ROOT%{_udevrulesdir}
install -m 0644 scripts/dump-guest-memory.py \
$RPM_BUILD_ROOT%{_datadir}/%{name}
@@ -1562,6 +1565,8 @@ rm -rf $RPM_BUILD_ROOT%{qemudocdir}/inte
# Remove spec
rm -rf $RPM_BUILD_ROOT%{qemudocdir}/specs
+%global __os_install_post %(echo '%{__os_install_post}' | sed -e 's!/usr/lib[^[:space:]]*/brp-python-bytecompile[[:space:]].*$!!g')
+
%check
export DIFF=diff; make check V=1
@@ -1645,8 +1650,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%config(noreplace) %{_sysconfdir}/sysconfig/ksm
%{_unitdir}/ksmtuned.service
%{_sbindir}/ksmtuned
-%{_udevdir}/udev-kvm-check
-%{_udevrulesdir}/81-kvm-rhel.rules
+#%{_udevdir}/udev-kvm-check
+#%{_udevrulesdir}/81-kvm-rhel.rules
%ghost %{_sysconfdir}/kvm
%config(noreplace) %{_sysconfdir}/ksmtuned.conf
%dir %{_sysconfdir}/%{name}
@@ -1711,8 +1716,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%{_libexecdir}/vhost-user-gpu
%{_datadir}/%{name}/vhost-user/50-qemu-gpu.json
%endif
-%{_libexecdir}/virtiofsd
-%{_datadir}/%{name}/vhost-user/50-qemu-virtiofsd.json
+#%{_libexecdir}/virtiofsd
+#%{_datadir}/%{name}/vhost-user/50-qemu-virtiofsd.json
%files -n qemu-img
%defattr(-,root,root)
@@ -1748,8 +1753,8 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%files block-iscsi
%{_libdir}/qemu-kvm/block-iscsi.so
-%files block-rbd
-%{_libdir}/qemu-kvm/block-rbd.so
+#%files block-rbd
+#%{_libdir}/qemu-kvm/block-rbd.so
%files block-ssh
%{_libdir}/qemu-kvm/block-ssh.so

29
rpm/qemu-kvm.spec.patch Normal file
View File

@@ -0,0 +1,29 @@
--- qemu-kvm.spec 2020-12-05 13:13:54.388623517 +0000
+++ qemu-kvm.spec 2020-12-05 13:13:58.728696598 +0000
@@ -67,7 +67,7 @@ Obsoletes: %1-rhev
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 4.2.0
-Release: 29%{?dist}.6
+Release: 29.vitastor%{?dist}.6
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY
@@ -825,6 +825,7 @@ Patch331: kvm-Drop-bogus-IPv6-messages.p
Patch333: kvm-virtiofsd-Whitelist-fchmod.patch
# For bz#1883869 - virtiofsd core dump in KATA Container [rhel-8.2.1.z]
Patch334: kvm-virtiofsd-avoid-proc-self-fd-tempdir.patch
+Patch335: qemu-4.2-vitastor.patch
BuildRequires: wget
BuildRequires: rpm-build
@@ -1192,9 +1193,7 @@ buildldflags="VL_LDFLAGS=-Wl,--build-id"
%endif
--python=%{__python3} \
--target-list="%{buildarch}" \
- --block-drv-rw-whitelist=%{block_drivers_list} \
--audio-drv-list= \
- --block-drv-ro-whitelist=vmdk,vhdx,vpc,https,ssh \
--with-coroutine=ucontext \
--tls-priority=NORMAL \
--disable-bluez \

View File

@@ -0,0 +1,47 @@
# Build packages for CentOS 7 inside a container
# cd ..; podman build -t vitastor-el7 -v `pwd`/packages:/root/packages -f rpm/vitastor-el7.Dockerfile .
# localedef -i ru_RU -f UTF-8 ru_RU.UTF-8
FROM centos:7
WORKDIR /root
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN yum -y --enablerepo=extras install centos-release-scl epel-release yum-utils rpm-build
RUN yum -y install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm
RUN yum -y install devtoolset-9-gcc-c++ devtoolset-9-libatomic-devel gperftools-devel qemu-kvm fio rh-nodejs12 jerasure-devel gf-complete-devel
RUN yumdownloader --disablerepo=centos-sclo-rh --source qemu-kvm
RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing fio.spec
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
RUN set -e; \
rpm -i liburing*.src.rpm; \
cd ~/rpmbuild/SPECS/; \
. /opt/rh/devtoolset-9/enable; \
rpmbuild -ba liburing.spec; \
mkdir -p /root/packages/liburing-el7; \
rm -rf /root/packages/liburing-el7/*; \
cp ~/rpmbuild/RPMS/*/liburing* /root/packages/liburing-el7/; \
cp ~/rpmbuild/SRPMS/liburing* /root/packages/liburing-el7/
RUN rpm -i `ls /root/packages/liburing-el7/liburing-*.x86_64.rpm | grep -v debug`
ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.10.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \
mkdir -p /root/packages/vitastor-el7; \
rm -rf /root/packages/vitastor-el7/*; \
cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el7/; \
cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el7/

69
rpm/vitastor-el7.spec Normal file
View File

@@ -0,0 +1,69 @@
Name: vitastor
Version: 0.5.10
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.10.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
BuildRequires: devtoolset-9-gcc-c++
BuildRequires: rh-nodejs12
BuildRequires: rh-nodejs12-npm
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: cmake
Requires: fio = 3.7-1.el7
Requires: qemu-kvm = 2.0.0-1.el7.6
Requires: rh-nodejs12
Requires: rh-nodejs12-npm
Requires: liburing >= 0.6
Requires: libJerasure2
Requires: lpsolve
%description
Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
architecturally similar to Ceph which means strong consistency, primary-replication,
symmetric clustering and automatic data distribution over any number of drives of any
size with configurable redundancy (replication or erasure codes/XOR).
%prep
%setup -q
%build
. /opt/rh/devtoolset-9/enable
%cmake . -DQEMU_PLUGINDIR=qemu-kvm
%make_build
%install
rm -rf $RPM_BUILD_ROOT
%make_install
. /opt/rh/rh-nodejs12/enable
cd mon
npm install
cd ..
mkdir -p %buildroot/usr/lib/vitastor
cp -r mon %buildroot/usr/lib/vitastor/mon
%files
%doc
%_bindir/vitastor-dump-journal
%_bindir/vitastor-nbd
%_bindir/vitastor-osd
%_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor
%changelog

View File

@@ -0,0 +1,45 @@
# Build packages for CentOS 8 inside a container
# cd ..; podman build -t vitastor-el8 -v `pwd`/packages:/root/packages -f rpm/vitastor-el8.Dockerfile .
FROM centos:8
WORKDIR /root
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN dnf -y install centos-release-advanced-virtualization epel-release dnf-plugins-core
RUN yum -y install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm
RUN dnf --enablerepo='centos-advanced-virtualization' -y install gcc-toolset-9 gcc-toolset-9-gcc-c++ gperftools-devel qemu-kvm fio nodejs rpm-build jerasure-devel gf-complete-devel
RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='vitastor' --source qemu-kvm
RUN dnf download --source fio
RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
RUN set -e; \
rpm -i liburing*.src.rpm; \
cd ~/rpmbuild/SPECS/; \
. /opt/rh/gcc-toolset-9/enable; \
rpmbuild -ba liburing.spec; \
mkdir -p /root/packages/liburing-el8; \
rm -rf /root/packages/liburing-el8/*; \
cp ~/rpmbuild/RPMS/*/liburing* /root/packages/liburing-el8/; \
cp ~/rpmbuild/SRPMS/liburing* /root/packages/liburing-el8/
RUN rpm -i `ls /root/packages/liburing-el7/liburing-*.x86_64.rpm | grep -v debug`
ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.10.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \
mkdir -p /root/packages/vitastor-el8; \
rm -rf /root/packages/vitastor-el8/*; \
cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el8/; \
cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el8/

66
rpm/vitastor-el8.spec Normal file
View File

@@ -0,0 +1,66 @@
Name: vitastor
Version: 0.5.10
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.10.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
BuildRequires: gcc-toolset-9-gcc-c++
BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: cmake
Requires: fio = 3.7-3.el8
Requires: qemu-kvm = 4.2.0-29.el8.6
Requires: nodejs >= 10
Requires: liburing >= 0.6
Requires: libJerasure2
Requires: lpsolve
%description
Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
architecturally similar to Ceph which means strong consistency, primary-replication,
symmetric clustering and automatic data distribution over any number of drives of any
size with configurable redundancy (replication or erasure codes/XOR).
%prep
%setup -q
%build
. /opt/rh/gcc-toolset-9/enable
%cmake . -DQEMU_PLUGINDIR=qemu-kvm
%make_build
%install
rm -rf $RPM_BUILD_ROOT
%make_install
cd mon
npm install
cd ..
mkdir -p %buildroot/usr/lib/vitastor
cp -r mon %buildroot/usr/lib/vitastor
%files
%doc
%_bindir/vitastor-dump-journal
%_bindir/vitastor-nbd
%_bindir/vitastor-osd
%_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor
%changelog

188
src/CMakeLists.txt Normal file
View File

@@ -0,0 +1,188 @@
cmake_minimum_required(VERSION 2.8)
project(vitastor)
include(GNUInstallDirs)
set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)")
set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer")
if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
if(EXISTS "/etc/debian_version")
set(CMAKE_INSTALL_LIBDIR "lib/${CMAKE_LIBRARY_ARCHITECTURE}")
endif()
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6-dev")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
add_link_options(-fsanitize=address -fno-omit-frame-pointer)
endif (${WITH_ASAN})
set(CMAKE_BUILD_TYPE RelWithDebInfo)
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_MINSIZEREL "${CMAKE_CXX_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_MINSIZEREL "${CMAKE_CXX_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]O)[12]?" "\\13" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
find_package(PkgConfig)
pkg_check_modules(LIBURING REQUIRED liburing)
pkg_check_modules(GLIB REQUIRED glib-2.0)
include_directories(
../
/usr/include/jerasure
${LIBURING_INCLUDE_DIRS}
)
# libvitastor_blk.so
add_library(vitastor_blk SHARED
allocator.cpp blockstore.cpp blockstore_impl.cpp blockstore_init.cpp blockstore_open.cpp blockstore_journal.cpp blockstore_read.cpp
blockstore_write.cpp blockstore_sync.cpp blockstore_stable.cpp blockstore_rollback.cpp blockstore_flush.cpp crc32c.c ringloop.cpp
)
target_link_libraries(vitastor_blk
${LIBURING_LIBRARIES}
tcmalloc_minimal
)
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
)
# vitastor-osd
add_executable(vitastor-osd
osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_subops.cpp etcd_state_client.cpp messenger.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
)
target_link_libraries(vitastor-osd
vitastor_blk
Jerasure
)
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
# libvitastor_client.so
add_library(vitastor_client SHARED
cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
)
target_link_libraries(vitastor_client
tcmalloc_minimal
${LIBURING_LIBRARIES}
)
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
# vitastor-nbd
add_executable(vitastor-nbd
nbd_proxy.cpp
)
target_link_libraries(vitastor-nbd
vitastor_client
)
# vitastor-rm
add_executable(vitastor-rm
rm_inode.cpp
)
target_link_libraries(vitastor-rm
vitastor_client
)
# vitastor-dump-journal
add_executable(vitastor-dump-journal
dump_journal.cpp crc32c.c
)
# qemu_driver.so
add_library(qemu_proxy STATIC qemu_proxy.cpp)
target_compile_options(qemu_proxy PUBLIC -fPIC)
target_include_directories(qemu_proxy PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_proxy
vitastor_client
)
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_link_libraries(qemu_vitastor
qemu_proxy
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
### Test stubs
# stub_osd, stub_bench, osd_test
add_executable(stub_osd stub_osd.cpp rw_blocking.cpp)
target_link_libraries(stub_osd tcmalloc_minimal)
add_executable(stub_bench stub_bench.cpp rw_blocking.cpp)
target_link_libraries(stub_bench tcmalloc_minimal)
add_executable(osd_test osd_test.cpp rw_blocking.cpp)
target_link_libraries(osd_test tcmalloc_minimal)
# osd_rmw_test
add_executable(osd_rmw_test osd_rmw_test.cpp allocator.cpp)
target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)
# stub_uring_osd
add_executable(stub_uring_osd
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_link_libraries(stub_uring_osd
${LIBURING_LIBRARIES}
tcmalloc_minimal
)
# osd_peering_pg_test
add_executable(osd_peering_pg_test osd_peering_pg_test.cpp osd_peering_pg.cpp)
target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp)
## test_blockstore, test_shit
#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
#target_link_libraries(test_blockstore blockstore)
#add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
#target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)
### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec vitastor_blk vitastor_client LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include <stdexcept>
#include "allocator.h"
@@ -13,19 +13,19 @@ allocator::allocator(uint64_t blocks)
{
throw std::invalid_argument("blocks");
}
uint64_t p2 = 1, total = 1;
uint64_t p2 = 1;
total = 0;
while (p2 * 64 < blocks)
{
p2 = p2 * 64;
total += p2;
p2 = p2 * 64;
}
total -= p2;
total += (blocks+63) / 64;
mask = new uint64_t[2 + total];
mask = new uint64_t[total];
size = free = blocks;
last_one_mask = (blocks % 64) == 0
? UINT64_MAX
: ~(UINT64_MAX << (64 - blocks % 64));
: ((1l << (blocks % 64)) - 1);
for (uint64_t i = 0; i < total; i++)
{
mask[i] = 0;
@@ -99,6 +99,10 @@ uint64_t allocator::find_free()
uint64_t p2 = 1, offset = 0, addr = 0, f, i;
while (p2 < size)
{
if (offset+addr >= total)
{
return UINT64_MAX;
}
uint64_t m = mask[offset + addr];
for (i = 0, f = 1; i < 64; i++, f <<= 1)
{
@@ -113,11 +117,6 @@ uint64_t allocator::find_free()
return UINT64_MAX;
}
addr = (addr * 64) | i;
if (addr >= size)
{
// No space
return UINT64_MAX;
}
offset += p2;
p2 = p2 * 64;
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#pragma once
@@ -8,6 +8,7 @@
// Hierarchical bitmap allocator
class allocator
{
uint64_t total;
uint64_t size;
uint64_t free;
uint64_t last_one_mask;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "base64.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#pragma once
#include <string>

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -35,12 +35,7 @@ bool blockstore_t::is_safe_to_stop()
void blockstore_t::enqueue_op(blockstore_op_t *op)
{
impl->enqueue_op(op, false);
}
void blockstore_t::enqueue_op_first(blockstore_op_t *op)
{
impl->enqueue_op(op, true);
impl->enqueue_op(op);
}
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#pragma once
@@ -9,6 +9,7 @@
#include <stdint.h>
#include <string>
#include <map>
#include <unordered_map>
#include <functional>
@@ -174,10 +175,6 @@ public:
// Submission
void enqueue_op(blockstore_op_t *op);
// Insert operation into the beginning of the queue
// Intended for the OSD syncer "thread" to be able to stabilize something when the journal is full
void enqueue_op_first(blockstore_op_t *op);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes();

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -7,12 +7,17 @@ journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
{
this->bs = bs;
this->flusher_count = flusher_count;
this->cur_flusher_count = 1;
this->target_flusher_count = 1;
dequeuing = false;
trimming = false;
active_flushers = 0;
syncing_flushers = 0;
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval
flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
journal_trim_interval = flusher_start_threshold;
journal_trim_interval = 512;
journal_trim_counter = 0;
trim_wanted = 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
co = new journal_flusher_co[flusher_count];
for (int i = 0; i < flusher_count; i++)
@@ -65,14 +70,31 @@ bool journal_flusher_t::is_active()
void journal_flusher_t::loop()
{
for (int i = 0; (active_flushers > 0 || dequeuing) && i < flusher_count; i++)
target_flusher_count = bs->write_iodepth*2;
if (target_flusher_count <= 0)
target_flusher_count = 1;
else if (target_flusher_count > flusher_count)
target_flusher_count = flusher_count;
if (target_flusher_count > cur_flusher_count)
cur_flusher_count = target_flusher_count;
else if (target_flusher_count < cur_flusher_count)
{
co[i].loop();
while (target_flusher_count < cur_flusher_count)
{
if (co[cur_flusher_count-1].wait_state)
break;
cur_flusher_count--;
}
}
for (int i = 0; (active_flushers > 0 || dequeuing) && i < cur_flusher_count; i++)
co[i].loop();
}
void journal_flusher_t::enqueue_flush(obj_ver_id ov)
{
#ifdef BLOCKSTORE_DEBUG
printf("enqueue_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end())
{
@@ -84,15 +106,18 @@ void journal_flusher_t::enqueue_flush(obj_ver_id ov)
flush_versions[ov.oid] = ov.version;
flush_queue.push_back(ov.oid);
}
if (!dequeuing && flush_queue.size() >= flusher_start_threshold)
if (!dequeuing && (flush_queue.size() >= flusher_start_threshold || trim_wanted > 0))
{
dequeuing = true;
bs->ringloop->wakeup();
}
}
void journal_flusher_t::unshift_flush(obj_ver_id ov)
void journal_flusher_t::unshift_flush(obj_ver_id ov, bool force)
{
#ifdef BLOCKSTORE_DEBUG
printf("unshift_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end())
{
@@ -102,15 +127,38 @@ void journal_flusher_t::unshift_flush(obj_ver_id ov)
else
{
flush_versions[ov.oid] = ov.version;
flush_queue.push_front(ov.oid);
if (!force)
flush_queue.push_front(ov.oid);
}
if (!dequeuing && flush_queue.size() >= flusher_start_threshold)
if (force)
flush_queue.push_front(ov.oid);
if (force || !dequeuing && (flush_queue.size() >= flusher_start_threshold || trim_wanted > 0))
{
dequeuing = true;
bs->ringloop->wakeup();
}
}
void journal_flusher_t::remove_flush(object_id oid)
{
#ifdef BLOCKSTORE_DEBUG
printf("undo_flush %lx:%lx\n", oid.inode, oid.stripe);
#endif
auto v_it = flush_versions.find(oid);
if (v_it != flush_versions.end())
{
flush_versions.erase(v_it);
for (auto q_it = flush_queue.begin(); q_it != flush_queue.end(); q_it++)
{
if (*q_it == oid)
{
flush_queue.erase(q_it);
break;
}
}
}
}
void journal_flusher_t::request_trim()
{
dequeuing = true;
@@ -118,6 +166,16 @@ void journal_flusher_t::request_trim()
bs->ringloop->wakeup();
}
void journal_flusher_t::mark_trim_possible()
{
if (trim_wanted > 0)
{
dequeuing = true;
journal_trim_counter++;
bs->ringloop->wakeup();
}
}
void journal_flusher_t::release_trim()
{
trim_wanted--;
@@ -172,9 +230,22 @@ bool journal_flusher_co::loop()
goto resume_17;
else if (wait_state == 18)
goto resume_18;
else if (wait_state == 19)
goto resume_19;
else if (wait_state == 20)
goto resume_20;
else if (wait_state == 21)
goto resume_21;
resume_0:
if (!flusher->flush_queue.size() || !flusher->dequeuing)
{
stop_flusher:
if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
{
// Attempt forced trim
flusher->active_flushers++;
goto trim_journal;
}
flusher->dequeuing = false;
wait_state = 0;
return true;
@@ -273,9 +344,7 @@ resume_0:
#ifdef BLOCKSTORE_DEBUG
printf("No older flushes, stopping\n");
#endif
flusher->dequeuing = false;
wait_state = 0;
return true;
goto stop_flusher;
}
}
}
@@ -294,12 +363,12 @@ resume_1:
return false;
}
// Writes and deletes shouldn't happen at the same time
assert(!(copy_count > 0 || has_writes) || !has_delete);
if (copy_count == 0 && !has_writes && !has_delete || has_delete && old_clean_loc == UINT64_MAX)
assert(!has_writes || !has_delete);
if (!has_writes && !has_delete || has_delete && old_clean_loc == UINT64_MAX)
{
// Nothing to flush
bs->erase_dirty(dirty_start, std::next(dirty_end), clean_loc);
goto trim_journal;
goto release_oid;
}
if (clean_loc == UINT64_MAX)
{
@@ -418,7 +487,12 @@ resume_1:
else
{
clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
assert(new_entry->oid.inode == 0 || new_entry->oid == cur.oid);
if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
{
printf("Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %lu (%lx:%lx) with %lx:%lx\n",
clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, cur.oid.inode, cur.oid.stripe);
exit(1);
}
new_entry->oid = cur.oid;
new_entry->version = cur.version;
if (!bs->inmemory_meta)
@@ -474,14 +548,35 @@ resume_1:
}
// Update clean_db and dirty_db, free old data locations
update_clean_db();
#ifdef BLOCKSTORE_DEBUG
printf("Flushed %lx:%lx v%lu (%d copies, wr:%d, del:%d), %ld left\n", cur.oid.inode, cur.oid.stripe, cur.version,
copy_count, has_writes, has_delete, flusher->flush_queue.size());
#endif
release_oid:
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it != flusher->sync_to_repeat.end() && repeat_it->second > cur.version)
{
// Requeue version
flusher->unshift_flush({ .oid = cur.oid, .version = repeat_it->second }, false);
}
flusher->sync_to_repeat.erase(repeat_it);
trim_journal:
// Clear unused part of the journal every <journal_trim_interval> flushes
if (!((++flusher->journal_trim_counter) % flusher->journal_trim_interval) || flusher->trim_wanted > 0)
{
flusher->journal_trim_counter = 0;
if (bs->journal.trim())
new_trim_pos = bs->journal.get_trim_pos();
if (new_trim_pos != bs->journal.used_start)
{
// Update journal "superblock"
resume_19:
// Wait for other coroutines trimming the journal, if any
if (flusher->trimming)
{
wait_state = 19;
return false;
}
flusher->trimming = true;
// First update journal "superblock" and only then update <used_start> in memory
await_sqe(12);
*((journal_entry_start*)flusher->journal_superblock) = {
.crc32 = 0,
@@ -489,7 +584,7 @@ resume_1:
.type = JE_START,
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = bs->journal.used_start,
.journal_start = new_trim_pos,
};
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size };
@@ -502,21 +597,28 @@ resume_1:
wait_state = 13;
return false;
}
if (!bs->disable_journal_fsync)
{
await_sqe(20);
my_uring_prep_fsync(sqe, bs->journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = simple_callback_w;
resume_21:
if (wait_count > 0)
{
wait_state = 21;
return false;
}
}
bs->journal.used_start = new_trim_pos;
#ifdef BLOCKSTORE_DEBUG
printf("Journal trimmed to %08lx (next_free=%08lx)\n", bs->journal.used_start, bs->journal.next_free);
#endif
flusher->trimming = false;
}
}
// All done
#ifdef BLOCKSTORE_DEBUG
printf("Flushed %lx:%lx v%lu (%d copies, wr:%d, del:%d), %ld left\n", cur.oid.inode, cur.oid.stripe, cur.version,
copy_count, has_writes, has_delete, flusher->flush_queue.size());
#endif
flusher->active_flushers--;
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it != flusher->sync_to_repeat.end() && repeat_it->second > cur.version)
{
// Requeue version
flusher->unshift_flush({ .oid = cur.oid, .version = repeat_it->second });
}
flusher->sync_to_repeat.erase(repeat_it);
wait_state = 0;
goto resume_0;
}
@@ -544,7 +646,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu state during flush: %d",
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: %d",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
@@ -721,31 +823,34 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
sync_found:
cur_sync->ready_count++;
flusher->syncing_flushers++;
if (flusher->syncing_flushers >= flusher->flusher_count || !flusher->flush_queue.size())
resume_1:
if (!cur_sync->state)
{
// Sync batch is ready. Do it.
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1;
wait_count++;
resume_1:
if (wait_count > 0)
if (flusher->syncing_flushers >= flusher->cur_flusher_count || !flusher->flush_queue.size())
{
// Sync batch is ready. Do it.
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1;
wait_count++;
resume_2:
if (wait_count > 0)
{
wait_state = 2;
return false;
}
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
else
{
// Wait until someone else sends and completes a sync.
wait_state = 1;
return false;
}
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
// Wait until someone else sends and completes a sync.
resume_2:
if (!cur_sync->state)
{
wait_state = 2;
return false;
}
flusher->syncing_flushers--;
cur_sync->ready_count--;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
struct copy_buffer_t
{
@@ -59,6 +59,8 @@ class journal_flusher_co
uint64_t clean_bitmap_offset, clean_bitmap_len;
void *new_clean_bitmap;
uint64_t new_trim_pos;
// local: scan_dirty()
uint64_t offset, end_offset, submit_offset, submit_len;
@@ -78,13 +80,14 @@ class journal_flusher_t
{
int trim_wanted = 0;
bool dequeuing;
int flusher_count;
int flusher_count, cur_flusher_count, target_flusher_count;
int flusher_start_threshold;
journal_flusher_co *co;
blockstore_impl_t *bs;
friend class journal_flusher_co;
int journal_trim_counter, journal_trim_interval;
bool trimming;
void* journal_superblock;
int active_flushers;
@@ -100,8 +103,10 @@ public:
~journal_flusher_t();
void loop();
bool is_active();
void mark_trim_possible();
void request_trim();
void release_trim();
void enqueue_flush(obj_ver_id oid);
void unshift_flush(obj_ver_id oid);
void unshift_flush(obj_ver_id oid, bool force);
void remove_flush(object_id oid);
};

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -101,26 +101,14 @@ void blockstore_impl_t::loop()
{
// try to submit ops
unsigned initial_ring_space = ringloop->space_left();
// FIXME: rework this "sync polling"
auto cur_sync = in_progress_syncs.begin();
while (cur_sync != in_progress_syncs.end())
// has_writes == 0 - no writes before the current queue item
// has_writes == 1 - some writes in progress
// has_writes == 2 - tried to submit some writes, but failed
int has_writes = 0, op_idx = 0, new_idx = 0;
for (; op_idx < submit_queue.size(); op_idx++, new_idx++)
{
if (continue_sync(*cur_sync) != 2)
{
// List is unmodified
cur_sync++;
}
else
{
cur_sync = in_progress_syncs.begin();
}
}
auto cur = submit_queue.begin();
int has_writes = 0;
while (cur != submit_queue.end())
{
auto op_ptr = cur;
auto op = *(cur++);
auto op = submit_queue[op_idx];
submit_queue[new_idx] = op;
// FIXME: This needs some simplification
// Writes should not block reads if the ring is not full and reads don't depend on them
// In all other cases we should stop submission
@@ -142,30 +130,33 @@ void blockstore_impl_t::loop()
}
unsigned ring_space = ringloop->space_left();
unsigned prev_sqe_pos = ringloop->save();
bool dequeue_op = false;
// 0 = can't submit
// 1 = in progress
// 2 = can be removed from queue
int wr_st = 0;
if (op->opcode == BS_OP_READ)
{
dequeue_op = dequeue_read(op);
wr_st = dequeue_read(op);
}
else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE)
{
if (has_writes == 2)
{
// Some writes could not be submitted
break;
// Some writes already could not be submitted
continue;
}
dequeue_op = dequeue_write(op);
has_writes = dequeue_op ? 1 : 2;
wr_st = dequeue_write(op);
has_writes = wr_st > 0 ? 1 : 2;
}
else if (op->opcode == BS_OP_DELETE)
{
if (has_writes == 2)
{
// Some writes could not be submitted
break;
// Some writes already could not be submitted
continue;
}
dequeue_op = dequeue_del(op);
has_writes = dequeue_op ? 1 : 2;
wr_st = dequeue_del(op);
has_writes = wr_st > 0 ? 1 : 2;
}
else if (op->opcode == BS_OP_SYNC)
{
@@ -178,43 +169,31 @@ void blockstore_impl_t::loop()
// Can't submit SYNC before previous writes
continue;
}
dequeue_op = dequeue_sync(op);
wr_st = continue_sync(op, false);
if (wr_st != 2)
{
has_writes = wr_st > 0 ? 1 : 2;
}
}
else if (op->opcode == BS_OP_STABLE)
{
if (has_writes == 2)
{
// Don't submit additional flushes before completing previous LISTs
break;
}
dequeue_op = dequeue_stable(op);
wr_st = dequeue_stable(op);
}
else if (op->opcode == BS_OP_ROLLBACK)
{
if (has_writes == 2)
{
// Don't submit additional flushes before completing previous LISTs
break;
}
dequeue_op = dequeue_rollback(op);
wr_st = dequeue_rollback(op);
}
else if (op->opcode == BS_OP_LIST)
{
// Block LIST operation by previous modifications,
// so it always returns a consistent state snapshot
if (has_writes == 2 || inflight_writes > 0)
has_writes = 2;
else
{
process_list(op);
dequeue_op = true;
}
// LIST doesn't need to be blocked by previous modifications
process_list(op);
wr_st = 2;
}
if (dequeue_op)
if (wr_st == 2)
{
submit_queue.erase(op_ptr);
new_idx--;
}
else
if (wr_st == 0)
{
ringloop->restore(prev_sqe_pos);
if (PRIV(op)->wait_for == WAIT_SQE)
@@ -225,6 +204,14 @@ void blockstore_impl_t::loop()
}
}
}
if (op_idx != new_idx)
{
while (op_idx < submit_queue.size())
{
submit_queue[new_idx++] = submit_queue[op_idx++];
}
submit_queue.resize(new_idx);
}
if (!readonly)
{
flusher->loop();
@@ -247,7 +234,7 @@ bool blockstore_impl_t::is_safe_to_stop()
{
// It's safe to stop blockstore when there are no in-flight operations,
// no in-progress syncs and flusher isn't doing anything
if (submit_queue.size() > 0 || in_progress_syncs.size() > 0 || !readonly && flusher->is_active())
if (submit_queue.size() > 0 || !readonly && flusher->is_active())
{
return false;
}
@@ -301,7 +288,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
else if (PRIV(op)->wait_for == WAIT_JOURNAL_BUFFER)
{
int next = ((journal.cur_sector + 1) % journal.sector_count);
if (journal.sector_info[next].usage_count > 0 ||
if (journal.sector_info[next].flush_count > 0 ||
journal.sector_info[next].dirty)
{
// do not submit
@@ -314,7 +301,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
}
else if (PRIV(op)->wait_for == WAIT_FREE)
{
if (!data_alloc->get_free_count() && !flusher->is_active())
if (!data_alloc->get_free_count() && flusher->is_active())
{
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for free space on the data device\n");
@@ -329,7 +316,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
}
}
void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
void blockstore_impl_t::enqueue_op(blockstore_op_t *op)
{
if (op->opcode < BS_OP_MIN || op->opcode > BS_OP_MAX ||
((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE) && (
@@ -337,8 +324,7 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
op->len > block_size-op->offset ||
(op->len % disk_alignment)
)) ||
readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST ||
first && (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE))
readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST)
{
// Basic verification not passed
op->retval = -EINVAL;
@@ -388,25 +374,12 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
if (op->opcode == BS_OP_SYNC && immediate_commit == IMMEDIATE_ALL)
{
op->retval = 0;
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
// Call constructor without allocating memory. We'll call destructor before returning op back
new ((void*)op->private_data) blockstore_op_private_t;
PRIV(op)->wait_for = 0;
PRIV(op)->op_state = 0;
PRIV(op)->pending_ops = 0;
if (!first)
{
submit_queue.push_back(op);
}
else
{
submit_queue.push_front(op);
}
submit_queue.push_back(op);
ringloop->wakeup();
}
@@ -531,7 +504,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (!replace_stable(dirty_it->first.oid, dirty_it->first.version, 0, clean_stable_count, stable))
{
// Then try to replace the last dirty stable version in the second part of the list
if (stable[stable_count-1].oid == dirty_it->first.oid)
if (stable_count > 0 && stable[stable_count-1].oid == dirty_it->first.oid)
{
stable[stable_count-1].version = dirty_it->first.version;
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#pragma once
@@ -30,12 +30,13 @@
#define BS_ST_BIG_WRITE 0x02
#define BS_ST_DELETE 0x03
#define BS_ST_WAIT_BIG 0x10
#define BS_ST_IN_FLIGHT 0x20
#define BS_ST_SUBMITTED 0x30
#define BS_ST_WRITTEN 0x40
#define BS_ST_SYNCED 0x50
#define BS_ST_STABLE 0x60
#define BS_ST_WAIT_DEL 0x10
#define BS_ST_WAIT_BIG 0x20
#define BS_ST_IN_FLIGHT 0x30
#define BS_ST_SUBMITTED 0x40
#define BS_ST_WRITTEN 0x50
#define BS_ST_SYNCED 0x60
#define BS_ST_STABLE 0x70
#define BS_ST_INSTANT 0x100
@@ -153,12 +154,12 @@ struct blockstore_op_private_t
// Write
struct iovec iov_zerofill[3];
// Warning: must not have a default value here because it's written to before calling constructor in blockstore_write.cpp O_o
uint64_t real_version;
// Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
int sync_small_checked, sync_big_checked;
std::list<blockstore_op_t*>::iterator in_progress_ptr;
int prev_sync_count;
};
// https://github.com/algorithm-ninja/cpp-btree
@@ -196,7 +197,10 @@ class blockstore_impl_t
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
int immediate_commit = IMMEDIATE_NONE;
bool inmemory_meta = false;
int flusher_count;
// Maximum flusher count
unsigned flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
/******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer;
@@ -204,9 +208,8 @@ class blockstore_impl_t
blockstore_clean_db_t clean_db;
uint8_t *clean_bitmap = NULL;
blockstore_dirty_db_t dirty_db;
std::list<blockstore_op_t*> submit_queue; // FIXME: funny thing is that vector is better here
std::vector<blockstore_op_t*> submit_queue;
std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
std::list<blockstore_op_t*> in_progress_syncs; // ...and probably here, too
allocator *data_alloc = NULL;
uint8_t *zero_object;
@@ -223,10 +226,10 @@ class blockstore_impl_t
struct journal_t journal;
journal_flusher_t *flusher;
int write_iodepth = 0;
bool live = false, queue_stall = false;
ring_loop_t *ringloop;
int inflight_writes = 0;
bool stop_sync_submitted;
@@ -265,6 +268,7 @@ class blockstore_impl_t
// Write
bool enqueue_write(blockstore_op_t *op);
void cancel_all_writes(blockstore_op_t *op, blockstore_dirty_db_t::iterator dirty_it, int retval);
int dequeue_write(blockstore_op_t *op);
int dequeue_del(blockstore_op_t *op);
int continue_write(blockstore_op_t *op);
@@ -272,11 +276,9 @@ class blockstore_impl_t
void handle_write_event(ring_data_t *data, blockstore_op_t *op);
// Sync
int dequeue_sync(blockstore_op_t *op);
int continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync);
void handle_sync_event(ring_data_t *data, blockstore_op_t *op);
int continue_sync(blockstore_op_t *op);
void ack_one_sync(blockstore_op_t *op);
int ack_sync(blockstore_op_t *op);
void ack_sync(blockstore_op_t *op);
// Stabilize
int dequeue_stable(blockstore_op_t *op);
@@ -316,7 +318,7 @@ public:
bool is_stalled();
// Submission
void enqueue_op(blockstore_op_t *op, bool first = false);
void enqueue_op(blockstore_op_t *op);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> unstable_writes;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -111,7 +111,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
{
// free the previous block
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i >> block_order);
printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i);
#endif
bs->data_alloc->set(clean_it->second.location >> block_order, false);
}
@@ -399,8 +399,7 @@ resume_1:
}
}
}
// Trim journal on start so we don't stall when all entries are older
bs->journal.trim();
bs->flusher->mark_trim_possible();
bs->journal.dirty_start = bs->journal.next_free;
printf(
"Journal entries loaded: %lu, free journal space: %lu bytes (%08lx..%08lx is used), free blocks: %lu / %lu\n",
@@ -560,9 +559,54 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
printf(
"je_big_write%s oid=%lx:%lx ver=%lu loc=%lu\n",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location >> bs->block_order
);
#endif
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){
.oid = je->big_write.oid,
.version = UINT64_MAX,
});
if (dirty_it != bs->dirty_db.begin() && bs->dirty_db.size() > 0)
{
dirty_it--;
if (dirty_it->first.oid == je->big_write.oid &&
dirty_it->first.version >= je->big_write.version &&
(dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE)
{
// It is allowed to overwrite a deleted object with a
// version number smaller than deletion version number,
// because the presence of a BIG_WRITE entry means that
// its data and metadata are already flushed.
// We don't know if newer versions are flushed, but
// the previous delete definitely is.
// So we flush previous dirty entries, but retain the clean one.
// This feature is required for writes happening shortly
// after deletes.
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != je->big_write.oid)
{
dirty_it++;
break;
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
bs->erase_dirty(
dirty_it, dirty_end,
clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX
);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(je->big_write.oid);
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
if (clean_it == bs->clean_db.end() ||
clean_it->second.version < je->big_write.version)
@@ -585,6 +629,12 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
#endif
bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif
auto & unstab = bs->unstable_writes[ov.oid];
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_BIG_WRITE_INSTANT)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#pragma once

View File

@@ -1,12 +1,12 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs)
{
this->bs = bs;
sectors_required = 0;
sectors_to_write = 0;
next_pos = bs->journal.next_free;
next_sector = bs->journal.cur_sector;
first_sector = -1;
@@ -20,23 +20,26 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
int required = entries_required;
while (1)
{
int fits = bs->journal.no_same_sector_overwrites && bs->journal.sector_info[next_sector].written
int fits = bs->journal.no_same_sector_overwrites && next_pos == bs->journal.next_free && bs->journal.sector_info[next_sector].written
? 0
: (bs->journal.block_size - next_in_pos) / size;
if (fits > 0)
{
if (fits > required)
{
fits = required;
}
if (first_sector == -1)
{
first_sector = next_sector;
}
required -= fits;
next_in_pos += fits * size;
sectors_required++;
sectors_to_write++;
}
else if (bs->journal.sector_info[next_sector].dirty)
{
// sectors_required is more like "sectors to write"
sectors_required++;
sectors_to_write++;
}
if (required <= 0)
{
@@ -59,7 +62,7 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
" is too small for a batch of "+std::to_string(entries_required)+" entries of "+std::to_string(size)+" bytes"
);
}
if (bs->journal.sector_info[next_sector].usage_count > 0 ||
if (bs->journal.sector_info[next_sector].flush_count > 0 ||
bs->journal.sector_info[next_sector].dirty)
{
// No memory buffer available. Wait for it.
@@ -71,17 +74,18 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
dirty++;
used++;
}
if (bs->journal.sector_info[i].usage_count > 0)
if (bs->journal.sector_info[i].flush_count > 0)
{
used++;
}
}
// In fact, it's even more rare than "ran out of journal space", so print a warning
printf(
"Ran out of journal sector buffers: %d/%lu buffers used (%d dirty), next buffer (%ld) is %s and flushed %lu times\n",
"Ran out of journal sector buffers: %d/%lu buffers used (%d dirty), next buffer (%ld)"
" is %s and flushed %lu times. Consider increasing \'journal_sector_buffer_count\'\n",
used, bs->journal.sector_count, dirty, next_sector,
bs->journal.sector_info[next_sector].dirty ? "dirty" : "not dirty",
bs->journal.sector_info[next_sector].usage_count
bs->journal.sector_info[next_sector].flush_count
);
PRIV(op)->wait_for = WAIT_JOURNAL_BUFFER;
return 0;
@@ -100,10 +104,8 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
{
// No space in the journal. Wait until used_start changes.
printf(
"Ran out of journal space (free space: %lu bytes)\n",
(bs->journal.next_free >= bs->journal.used_start
? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start)
: bs->journal.used_start - bs->journal.next_free)
"Ran out of journal space (used_start=%08lx, next_free=%08lx, dirty_start=%08lx)\n",
bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
);
PRIV(op)->wait_for = WAIT_JOURNAL;
bs->flusher->request_trim();
@@ -115,22 +117,21 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size)
{
if (journal.block_size - journal.in_sector_pos < size ||
journal.no_same_sector_overwrites && journal.sector_info[journal.cur_sector].written)
if (!journal.entry_fits(size))
{
assert(!journal.sector_info[journal.cur_sector].dirty);
// Move to the next journal sector
journal.sector_info[journal.cur_sector].written = false;
if (journal.sector_info[journal.cur_sector].usage_count > 0)
if (journal.sector_info[journal.cur_sector].flush_count > 0)
{
// Also select next sector buffer in memory
journal.cur_sector = ((journal.cur_sector + 1) % journal.sector_count);
assert(!journal.sector_info[journal.cur_sector].usage_count);
assert(!journal.sector_info[journal.cur_sector].flush_count);
}
else
{
journal.dirty_start = journal.next_free;
}
journal.sector_info[journal.cur_sector].written = false;
journal.sector_info[journal.cur_sector].offset = journal.next_free;
journal.in_sector_pos = 0;
journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
@@ -156,7 +157,7 @@ void prepare_journal_sector_write(journal_t & journal, int cur_sector, io_uring_
{
journal.sector_info[cur_sector].dirty = false;
journal.sector_info[cur_sector].written = true;
journal.sector_info[cur_sector].usage_count++;
journal.sector_info[cur_sector].flush_count++;
ring_data_t *data = ((ring_data_t*)sqe->user_data);
data->iov = (struct iovec){
(journal.inmemory
@@ -183,7 +184,7 @@ journal_t::~journal_t()
buffer = NULL;
}
bool journal_t::trim()
uint64_t journal_t::get_trim_pos()
{
auto journal_used_it = used_sectors.lower_bound(used_start);
#ifdef BLOCKSTORE_DEBUG
@@ -201,26 +202,19 @@ bool journal_t::trim()
if (journal_used_it == used_sectors.end())
{
// Journal is empty
used_start = next_free;
return next_free;
}
else
{
used_start = journal_used_it->first;
// next_free does not need updating here
// next_free does not need updating during trim
return journal_used_it->first;
}
}
else if (journal_used_it->first > used_start)
{
// Journal is cleared up to <journal_used_it>
used_start = journal_used_it->first;
return journal_used_it->first;
}
else
{
// Can't trim journal
return false;
}
#ifdef BLOCKSTORE_DEBUG
printf("Journal trimmed to %08lx (next_free=%08lx)\n", used_start, next_free);
#endif
return true;
// Can't trim journal
return used_start;
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#pragma once
@@ -10,6 +10,8 @@
#define JOURNAL_BUFFER_SIZE 4*1024*1024
// We reserve some extra space for future stabilize requests during writes
// FIXME: This value should be dynamic i.e. Blockstore ideally shouldn't allow
// writing more than can be stabilized afterwards
#define JOURNAL_STABILIZE_RESERVATION 65536
// Journal entries
@@ -131,7 +133,7 @@ inline uint32_t je_crc32(journal_entry *je)
struct journal_sector_info_t
{
uint64_t offset;
uint64_t usage_count;
uint64_t flush_count;
bool written;
bool dirty;
};
@@ -167,13 +169,19 @@ struct journal_t
~journal_t();
bool trim();
uint64_t get_trim_pos();
inline bool entry_fits(int size)
{
return !(block_size - in_sector_pos < size ||
no_same_sector_overwrites && sector_info[cur_sector].written);
}
};
struct blockstore_journal_check_t
{
blockstore_impl_t *bs;
uint64_t next_pos, next_sector, next_in_pos;
int sectors_required, first_sector;
int sectors_to_write, first_sector;
bool right_dir; // writing to the end or the beginning of the ring buffer
blockstore_journal_check_t(blockstore_impl_t *bs);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include <sys/file.h>
#include "blockstore_impl.h"
@@ -70,6 +70,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
// Validate
if (!block_size)
{
@@ -83,6 +84,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
flusher_count = 32;
}
if (!max_write_iodepth)
{
max_write_iodepth = 128;
}
if (!disk_alignment)
{
disk_alignment = 4096;

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -112,7 +112,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
read_op->version = 0;
read_op->retval = read_op->len;
FINISH_OP(read_op);
return 1;
return 2;
}
uint64_t fulfilled = 0;
PRIV(read_op)->pending_ops = 0;
@@ -191,8 +191,8 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (bmp_end > bmp_start)
{
// fill with zeroes
fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
assert(fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
}
bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
@@ -218,7 +218,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
else if (fulfilled < read_op->len)
{
// fill remaining parts with zeroes
fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
assert(fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
}
assert(fulfilled == read_op->len);
read_op->version = result_version;
@@ -232,10 +232,10 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
}
read_op->retval = read_op->len;
FINISH_OP(read_op);
return 1;
return 2;
}
read_op->retval = 0;
return 1;
return 2;
}
void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -9,10 +9,14 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
{
return continue_rollback(op);
}
obj_ver_id* v;
obj_ver_id *v, *nv;
int i, todo = op->len;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
for (i = 0, v = (obj_ver_id*)op->buf, nv = (obj_ver_id*)op->buf; i < op->len; i++, v++, nv++)
{
if (nv != v)
{
*nv = *v;
}
// Check that there are some versions greater than v->version (which may be zero),
// check that they're unstable, synced, and not currently written to
auto dirty_it = dirty_db.lower_bound((obj_ver_id){
@@ -21,31 +25,32 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
});
if (dirty_it == dirty_db.begin())
{
if (v->version == 0)
{
// Already rolled back
// FIXME Skip this object version
}
bad_op:
op->retval = -ENOENT;
FINISH_OP(op);
return 1;
skip_ov:
// Already rolled back, skip this object version
todo--;
nv--;
continue;
}
else
{
dirty_it--;
if (dirty_it->first.oid != v->oid || dirty_it->first.version < v->version)
{
goto bad_op;
goto skip_ov;
}
while (dirty_it->first.oid == v->oid && dirty_it->first.version > v->version)
{
if (!IS_SYNCED(dirty_it->second.state) ||
if (IS_IN_FLIGHT(dirty_it->second.state))
{
// Object write is still in progress. Wait until the write request completes
return 0;
}
else if (!IS_SYNCED(dirty_it->second.state) ||
IS_STABLE(dirty_it->second.state))
{
op->retval = -EBUSY;
FINISH_OP(op);
return 1;
return 2;
}
if (dirty_it == dirty_db.begin())
{
@@ -55,6 +60,14 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
}
}
}
op->len = todo;
if (!todo)
{
// Already rolled back
op->retval = 0;
FINISH_OP(op);
return 2;
}
// Check journal space
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, todo, sizeof(journal_entry_rollback), 0))
@@ -62,43 +75,38 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
return 0;
}
// There is sufficient space. Get SQEs
struct io_uring_sqe *sqe[space_check.sectors_required];
for (i = 0; i < space_check.sectors_required; i++)
struct io_uring_sqe *sqe[space_check.sectors_to_write];
for (i = 0; i < space_check.sectors_to_write; i++)
{
BS_SUBMIT_GET_SQE_DECL(sqe[i]);
}
// Prepare and submit journal entries
auto cb = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_rollback) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
if (!journal.entry_fits(sizeof(journal_entry_rollback)) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
cur_sector = journal.cur_sector;
}
journal_entry_rollback *je = (journal_entry_rollback*)
prefill_single_journal_entry(journal, JE_ROLLBACK, sizeof(journal_entry_rollback));
journal.sector_info[journal.cur_sector].dirty = false;
je->oid = v->oid;
je->version = v->version;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
assert(s == space_check.sectors_to_write);
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
PRIV(op)->op_state = 1;
inflight_writes++;
return 1;
}
@@ -118,11 +126,8 @@ resume_2:
resume_3:
if (!disable_journal_fsync)
{
io_uring_sqe *sqe = get_sqe();
if (!sqe)
{
return 0;
}
io_uring_sqe *sqe;
BS_SUBMIT_GET_SQE_DECL(sqe);
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
@@ -139,12 +144,11 @@ resume_5:
{
mark_rolled_back(*v);
}
journal.trim();
inflight_writes--;
flusher->mark_trim_possible();
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
return 1;
return 2;
}
void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
@@ -200,7 +204,6 @@ void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t
live = true;
if (data->res != data->iov.iov_len)
{
inflight_writes--;
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
@@ -210,19 +213,44 @@ void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t
if (PRIV(op)->pending_ops == 0)
{
PRIV(op)->op_state++;
if (!continue_rollback(op))
{
submit_queue.push_front(op);
}
ringloop->wakeup();
}
}
void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc)
{
auto dirty_it = dirty_end;
while (dirty_it != dirty_start)
if (dirty_end == dirty_start)
{
return;
}
auto dirty_it = dirty_end;
dirty_it--;
if (IS_DELETE(dirty_it->second.state))
{
object_id oid = dirty_it->first.oid;
#ifdef BLOCKSTORE_DEBUG
printf("Unblock writes-after-delete %lx:%lx v%lx\n", oid.inode, oid.stripe, dirty_it->first.version);
#endif
dirty_it = dirty_end;
// Unblock operations blocked by delete flushing
uint32_t next_state = BS_ST_IN_FLIGHT;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == oid)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_DEL)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | next_state;
if (IS_BIG_WRITE(dirty_it->second.state))
{
next_state = BS_ST_WAIT_BIG;
}
}
dirty_it++;
}
dirty_it = dirty_end;
dirty_it--;
}
while (1)
{
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc)
{
#ifdef BLOCKSTORE_DEBUG
@@ -241,6 +269,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
{
journal.used_sectors.erase(dirty_it->second.journal_sector);
}
if (dirty_it == dirty_start)
{
break;
}
dirty_it--;
}
dirty_db.erase(dirty_start, dirty_end);
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -60,19 +60,24 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// No such object version
op->retval = -ENOENT;
FINISH_OP(op);
return 1;
return 2;
}
else
{
// Already stable
}
}
else if (IS_IN_FLIGHT(dirty_it->second.state))
{
// Object write is still in progress. Wait until the write request completes
return 0;
}
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet. Caller must sync it first
op->retval = -EBUSY;
FINISH_OP(op);
return 1;
return 2;
}
else if (!IS_STABLE(dirty_it->second.state))
{
@@ -84,7 +89,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// Already stable
op->retval = 0;
FINISH_OP(op);
return 1;
return 2;
}
// Check journal space
blockstore_journal_check_t space_check(this);
@@ -93,44 +98,39 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
return 0;
}
// There is sufficient space. Get SQEs
struct io_uring_sqe *sqe[space_check.sectors_required];
for (i = 0; i < space_check.sectors_required; i++)
struct io_uring_sqe *sqe[space_check.sectors_to_write];
for (i = 0; i < space_check.sectors_to_write; i++)
{
BS_SUBMIT_GET_SQE_DECL(sqe[i]);
}
// Prepare and submit journal entries
auto cb = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_stable) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// FIXME: Only stabilize versions that aren't stable yet
if (!journal.entry_fits(sizeof(journal_entry_stable)) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
cur_sector = journal.cur_sector;
}
journal_entry_stable *je = (journal_entry_stable*)
prefill_single_journal_entry(journal, JE_STABLE, sizeof(journal_entry_stable));
journal.sector_info[journal.cur_sector].dirty = false;
je->oid = v->oid;
je->version = v->version;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
assert(s == space_check.sectors_to_write);
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
PRIV(op)->op_state = 1;
inflight_writes++;
return 1;
}
@@ -150,11 +150,8 @@ resume_2:
resume_3:
if (!disable_journal_fsync)
{
io_uring_sqe *sqe = get_sqe();
if (!sqe)
{
return 0;
}
io_uring_sqe *sqe;
BS_SUBMIT_GET_SQE_DECL(sqe);
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
@@ -173,11 +170,10 @@ resume_5:
// Mark all dirty_db entries up to op->version as stable
mark_stable(*v);
}
inflight_writes--;
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
return 1;
return 2;
}
void blockstore_impl_t::mark_stable(const obj_ver_id & v)
@@ -205,9 +201,6 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v)
break;
}
}
#ifdef BLOCKSTORE_DEBUG
printf("enqueue_flush %lx:%lx v%lu\n", v.oid.inode, v.oid.stripe, v.version);
#endif
flusher->enqueue_flush(v);
}
auto unstab_it = unstable_writes.find(v.oid);
@@ -223,7 +216,6 @@ void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *
live = true;
if (data->res != data->iov.iov_len)
{
inflight_writes--;
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
@@ -233,9 +225,6 @@ void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *
if (PRIV(op)->pending_ops == 0)
{
PRIV(op)->op_state++;
if (!continue_stable(op))
{
submit_queue.push_front(op);
}
ringloop->wakeup();
}
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -12,8 +12,15 @@
#define SYNC_JOURNAL_SYNC_SENT 7
#define SYNC_DONE 8
int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync)
{
if (immediate_commit == IMMEDIATE_ALL)
{
// We can return immediately because sync is only dequeued after all previous writes
op->retval = 0;
FINISH_OP(op);
return 2;
}
if (PRIV(op)->op_state == 0)
{
stop_sync_submitted = false;
@@ -29,34 +36,15 @@ int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
PRIV(op)->op_state = SYNC_HAS_SMALL;
else
PRIV(op)->op_state = SYNC_DONE;
// Always add sync to in_progress_syncs because we clear unsynced_big_writes and unsynced_small_writes
PRIV(op)->prev_sync_count = in_progress_syncs.size();
PRIV(op)->in_progress_ptr = in_progress_syncs.insert(in_progress_syncs.end(), op);
}
continue_sync(op);
// Always dequeue because we always add syncs to in_progress_syncs
return 1;
}
int blockstore_impl_t::continue_sync(blockstore_op_t *op)
{
auto cb = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
if (PRIV(op)->op_state == SYNC_HAS_SMALL)
{
// No big writes, just fsync the journal
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
if (journal.sector_info[journal.cur_sector].dirty)
{
// Write out the last journal sector if it happens to be dirty
BS_SUBMIT_GET_ONLY_SQE(sqe);
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
prepare_journal_sector_write(journal, journal.cur_sector, sqe, [this, op](ring_data_t *data) { handle_sync_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
@@ -69,21 +57,13 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
}
if (PRIV(op)->op_state == SYNC_HAS_BIG)
{
for (; PRIV(op)->sync_big_checked < PRIV(op)->sync_big_writes.size(); PRIV(op)->sync_big_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_big_writes[PRIV(op)->sync_big_checked]].state))
{
// Wait for big inflight writes to complete
return 0;
}
}
// 1st step: fsync data
if (!disable_data_fsync)
{
BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, data_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = cb;
data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_DATA_SYNC_SENT;
@@ -96,46 +76,37 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
}
if (PRIV(op)->op_state == SYNC_DATA_SYNC_DONE)
{
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
// 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), 0))
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
// Get SQEs. Don't bother about merging, submit each journal sector as a separate request
struct io_uring_sqe *sqe[space_check.sectors_required];
for (int i = 0; i < space_check.sectors_required; i++)
struct io_uring_sqe *sqe[space_check.sectors_to_write];
for (int i = 0; i < space_check.sectors_to_write; i++)
{
BS_SUBMIT_GET_SQE_DECL(sqe[i]);
}
// Prepare and submit journal entries
auto it = PRIV(op)->sync_big_writes.begin();
int s = 0, cur_sector = -1;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_big_write) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
while (it != PRIV(op)->sync_big_writes.end())
{
if (!journal.entry_fits(sizeof(journal_entry_big_write)) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
cur_sector = journal.cur_sector;
}
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
);
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.sector_info[journal.cur_sector].dirty = false;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
@@ -152,14 +123,11 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
it++;
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
assert(s == space_check.sectors_to_write);
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
@@ -172,7 +140,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = cb;
data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
return 1;
@@ -182,9 +150,10 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
PRIV(op)->op_state = SYNC_DONE;
}
}
if (PRIV(op)->op_state == SYNC_DONE)
if (PRIV(op)->op_state == SYNC_DONE && !queue_has_in_progress_sync)
{
return ack_sync(op);
ack_sync(op);
return 2;
}
return 1;
}
@@ -216,42 +185,16 @@ void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op
else if (PRIV(op)->op_state == SYNC_JOURNAL_SYNC_SENT)
{
PRIV(op)->op_state = SYNC_DONE;
ack_sync(op);
}
else
{
throw std::runtime_error("BUG: unexpected sync op state");
}
ringloop->wakeup();
}
}
int blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
if (PRIV(op)->op_state == SYNC_DONE && PRIV(op)->prev_sync_count == 0)
{
// Remove dependency of subsequent syncs
auto it = PRIV(op)->in_progress_ptr;
int done_syncs = 1;
++it;
// Acknowledge sync
ack_one_sync(op);
while (it != in_progress_syncs.end())
{
auto & next_sync = *it++;
PRIV(next_sync)->prev_sync_count -= done_syncs;
if (PRIV(next_sync)->prev_sync_count == 0 && PRIV(next_sync)->op_state == SYNC_DONE)
{
done_syncs++;
// Acknowledge next_sync
ack_one_sync(next_sync);
}
}
return 2;
}
return 0;
}
void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
void blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
// Handle states
for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
@@ -299,7 +242,6 @@ void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
}
}
}
in_progress_syncs.erase(PRIV(op)->in_progress_ptr);
op->retval = 0;
FINISH_OP(op);
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "blockstore_impl.h"
@@ -7,7 +7,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{
// Check or assign version number
bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
bool is_inflight_big = false;
bool wait_big = false, wait_del = false;
uint64_t version = 1;
if (dirty_db.size() > 0)
{
@@ -21,7 +21,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
found = true;
version = dirty_it->first.version + 1;
deleted = IS_DELETE(dirty_it->second.state);
is_inflight_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
wait_del = ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_DEL);
wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? !IS_SYNCED(dirty_it->second.state)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
}
@@ -38,23 +39,43 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
deleted = true;
}
}
if (op->version == 0)
{
op->version = version;
}
else if (op->version < version)
{
// Invalid version requested
op->retval = -EEXIST;
return false;
}
if (deleted && is_del)
{
// Already deleted
op->retval = 0;
return false;
}
if (is_inflight_big && !is_del && !deleted && op->len < block_size &&
PRIV(op)->real_version = 0;
if (op->version == 0)
{
op->version = version;
}
else if (op->version < version)
{
// Implicit operations must be added like that: DEL [FLUSH] BIG [SYNC] SMALL SMALL
if (deleted || wait_del)
{
// It's allowed to write versions with low numbers over deletes
// However, we have to flush those deletes first as we use version number for ordering
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu over delete (real v%lu) offset=%u len=%u\n", op->oid.inode, op->oid.stripe, version, op->version, op->offset, op->len);
#endif
wait_del = true;
PRIV(op)->real_version = op->version;
op->version = version;
flusher->unshift_flush((obj_ver_id){
.oid = op->oid,
.version = version-1,
}, true);
}
else
{
// Invalid version requested
op->retval = -EEXIST;
return false;
}
}
if (wait_big && !is_del && !deleted && op->len < block_size &&
immediate_commit != IMMEDIATE_ALL)
{
// Issue an additional sync so that the previous big write can reach the journal
@@ -69,22 +90,31 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
#ifdef BLOCKSTORE_DEBUG
if (is_del)
printf("Delete %lx:%lx v%lu\n", op->oid.inode, op->oid.stripe, op->version);
else
else if (!wait_del)
printf("Write %lx:%lx v%lu offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
#endif
// No strict need to add it into dirty_db here, it's just left
// FIXME No strict need to add it into dirty_db here, it's just left
// from the previous implementation where reads waited for writes
uint32_t state;
if (is_del)
state = BS_ST_DELETE | BS_ST_IN_FLIGHT;
else
{
state = (op->len == block_size || deleted ? BS_ST_BIG_WRITE : BS_ST_SMALL_WRITE);
if (wait_del)
state |= BS_ST_WAIT_DEL;
else if (state == BS_ST_SMALL_WRITE && wait_big)
state |= BS_ST_WAIT_BIG;
else
state |= BS_ST_IN_FLIGHT;
if (op->opcode == BS_OP_WRITE_STABLE)
state |= BS_ST_INSTANT;
}
dirty_db.emplace((obj_ver_id){
.oid = op->oid,
.version = op->version,
}, (dirty_entry){
.state = (uint32_t)(
is_del
? (BS_ST_DELETE | BS_ST_IN_FLIGHT)
: (op->opcode == BS_OP_WRITE_STABLE ? BS_ST_INSTANT : 0) | (op->len == block_size || deleted
? (BS_ST_BIG_WRITE | BS_ST_IN_FLIGHT)
: (is_inflight_big ? (BS_ST_SMALL_WRITE | BS_ST_WAIT_BIG) : (BS_ST_SMALL_WRITE | BS_ST_IN_FLIGHT)))
),
.state = state,
.flags = 0,
.location = 0,
.offset = is_del ? 0 : op->offset,
@@ -94,6 +124,29 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
return true;
}
void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_db_t::iterator dirty_it, int retval)
{
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
dirty_db.erase(dirty_it++);
}
bool found = false;
for (auto other_op: submit_queue)
{
if (!found && other_op == op)
found = true;
else if (found && other_op->oid == op->oid &&
(other_op->opcode == BS_OP_WRITE || other_op->opcode == BS_OP_WRITE_STABLE))
{
// Mark operations to cancel them
PRIV(other_op)->real_version = UINT64_MAX;
other_op->retval = retval;
}
}
op->retval = retval;
FINISH_OP(op);
}
// First step of the write algorithm: dequeue operation and submit initial write(s)
int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
@@ -106,12 +159,46 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
.version = op->version,
});
assert(dirty_it != dirty_db.end());
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) < BS_ST_IN_FLIGHT)
{
// Don't dequeue
return 0;
}
else if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
if (PRIV(op)->real_version != 0)
{
if (PRIV(op)->real_version == UINT64_MAX)
{
// This is the flag value used to cancel operations
FINISH_OP(op);
return 2;
}
// Restore original low version number for unblocked operations
#ifdef BLOCKSTORE_DEBUG
printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
#endif
auto prev_it = dirty_it;
prev_it--;
if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version)
{
// Original version is still invalid
// All subsequent writes to the same object must be canceled too
cancel_all_writes(op, dirty_it, -EEXIST);
return 2;
}
op->version = PRIV(op)->real_version;
PRIV(op)->real_version = 0;
dirty_entry e = dirty_it->second;
dirty_db.erase(dirty_it);
dirty_it = dirty_db.emplace((obj_ver_id){
.oid = op->oid,
.version = op->version,
}, e).first;
}
if (write_iodepth >= max_write_iodepth)
{
return 0;
}
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
@@ -129,10 +216,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
PRIV(op)->wait_for = WAIT_FREE;
return 0;
}
op->retval = -ENOSPC;
FINISH_OP(op);
return 1;
cancel_all_writes(op, dirty_it, -ENOSPC);
return 2;
}
write_iodepth++;
BS_SUBMIT_GET_SQE(sqe, data);
dirty_it->second.location = loc << block_order;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
@@ -185,6 +272,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
return 0;
}
write_iodepth++;
// There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
@@ -282,14 +370,13 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
continue_write(op);
return continue_write(op);
}
else
{
PRIV(op)->op_state = 3;
}
}
inflight_writes++;
return 1;
}
@@ -297,30 +384,29 @@ int blockstore_impl_t::continue_write(blockstore_op_t *op)
{
io_uring_sqe *sqe = NULL;
journal_entry_big_write *je;
int op_state = PRIV(op)->op_state;
if (op_state != 2 && op_state != 4)
{
// In progress
return 1;
}
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
if (PRIV(op)->op_state == 2)
if (op_state == 2)
goto resume_2;
else if (PRIV(op)->op_state == 4)
else if (op_state == 4)
goto resume_4;
else
return 1;
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
sqe = get_sqe();
if (!sqe)
{
return 0;
}
BS_SUBMIT_GET_SQE_DECL(sqe);
je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.sector_info[journal.cur_sector].dirty = false;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
@@ -345,7 +431,7 @@ resume_2:
resume_4:
// Switch object state
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = %d\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
printf("Ack write %lx:%lx v%lu = state %x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? (immediate_commit == IMMEDIATE_ALL)
@@ -374,11 +460,11 @@ resume_4:
dirty_it++;
}
}
inflight_writes--;
// Acknowledge write
op->retval = op->len;
write_iodepth--;
FINISH_OP(op);
return 1;
return 2;
}
void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *op)
@@ -386,7 +472,6 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
live = true;
if (data->res != data->iov.iov_len)
{
inflight_writes--;
// FIXME: our state becomes corrupted after a write error. maybe do something better than just die
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
@@ -398,10 +483,7 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
{
release_journal_sectors(op);
PRIV(op)->op_state++;
if (!continue_write(op))
{
submit_queue.push_front(op);
}
ringloop->wakeup();
}
}
@@ -414,8 +496,8 @@ void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
uint64_t s = PRIV(op)->min_flushed_journal_sector;
while (1)
{
journal.sector_info[s-1].usage_count--;
if (s != (1+journal.cur_sector) && journal.sector_info[s-1].usage_count == 0)
journal.sector_info[s-1].flush_count--;
if (s != (1+journal.cur_sector) && journal.sector_info[s-1].flush_count == 0)
{
// We know for sure that we won't write into this sector anymore
uint64_t new_ds = journal.sector_info[s-1].offset + journal.block_size;
@@ -439,16 +521,21 @@ void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
{
if (PRIV(op)->op_state)
{
return continue_write(op);
}
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, 1, sizeof(journal_entry_del), 0))
if (!space_check.check_available(op, 1, sizeof(journal_entry_del), JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
write_iodepth++;
io_uring_sqe *sqe = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
@@ -495,7 +582,10 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
// Remember small write as unsynced
}
else
{
// Remember delete as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
@@ -504,7 +594,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
continue_write(op);
return continue_write(op);
}
else
{

View File

@@ -1,12 +1,14 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <stdexcept>
#include "cluster_client.h"
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{
this->ringloop = ringloop;
this->tfd = tfd;
this->config = config;
msgr.osd_num = 0;
msgr.tfd = tfd;
@@ -63,8 +65,7 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
msgr.stop_client(op->peer_fd);
delete op;
};
msgr.use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
msgr.init();
st_cli.tfd = tfd;
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@@ -72,7 +73,6 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_hook(changes); };
st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
log_level = config["log_level"].int64_value();
st_cli.parse_config(config);
st_cli.load_global_config();
@@ -182,17 +182,8 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
{
up_wait_retry_interval = 50;
}
msgr.peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!msgr.peer_connect_interval)
{
msgr.peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
msgr.peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!msgr.peer_connect_timeout)
{
msgr.peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
st_cli.start_etcd_watcher();
msgr.parse_config(config);
msgr.parse_config(this->config);
st_cli.load_pgs();
}
@@ -202,6 +193,12 @@ void cluster_client_t::on_load_pgs_hook(bool success)
{
pg_counts[pool_item.first] = pool_item.second.real_pg_count;
}
pgs_loaded = true;
for (auto fn: on_ready_hooks)
{
fn();
}
on_ready_hooks.clear();
for (auto op: offline_ops)
{
execute(op);
@@ -253,6 +250,18 @@ void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
}
}
void cluster_client_t::on_ready(std::function<void(void)> fn)
{
if (pgs_loaded)
{
fn();
}
else
{
on_ready_hooks.push_back(fn);
}
}
/**
* How writes are synced when immediate_commit is false
*
@@ -283,7 +292,7 @@ void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
void cluster_client_t::execute(cluster_op_t *op)
{
if (!bs_disk_alignment)
if (!pgs_loaded)
{
// We're offline
offline_ops.push_back(op);
@@ -453,7 +462,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
// Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
uint64_t pg_block_size = bs_block_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_minsize
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
@@ -468,7 +477,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
uint64_t begin = (op->offset < stripe ? stripe : op->offset);
uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
? (stripe + pg_block_size) : (op->offset + op->len);
op->parts[i] = {
op->parts[i] = (cluster_op_part_t){
.parent = op,
.offset = begin,
.len = (uint32_t)(end - begin),
@@ -513,7 +522,7 @@ bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part)
part->osd_num = primary_osd;
part->sent = true;
op->sent_count++;
part->op = {
part->op = (osd_op_t){
.op_type = OSD_OP_OUT,
.peer_fd = peer_fd,
.req = { .rw = {
@@ -674,7 +683,7 @@ void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
assert(peer_it != msgr.osd_peer_fds.end());
part->sent = true;
op->sent_count++;
part->op = {
part->op = (osd_op_t){
.op_type = OSD_OP_OUT,
.peer_fd = peer_it->second,
.req = {
@@ -737,7 +746,7 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
assert(op == cur_sync);
finish_sync();
}
else
else if (!op->up_wait)
{
continue_rw(op);
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
@@ -8,7 +8,6 @@
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_BLOCK_SIZE 128*1024
#define DEFAULT_DISK_ALIGNMENT 4096
#define DEFAULT_BITMAP_GRANULARITY 4096
#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024
@@ -64,8 +63,6 @@ class cluster_client_t
int up_wait_retry_interval = 500; // ms
uint64_t op_id = 1;
etcd_state_client_t st_cli;
osd_messenger_t msgr;
ring_consumer_t consumer;
// operations currently in progress
std::set<cluster_op_t*> cur_ops;
@@ -79,10 +76,18 @@ class cluster_client_t
std::vector<cluster_op_t*> offline_ops;
uint64_t queued_bytes = 0;
bool pgs_loaded = false;
std::vector<std::function<void(void)>> on_ready_hooks;
public:
etcd_state_client_t st_cli;
osd_messenger_t msgr;
json11::Json config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
~cluster_client_t();
void execute(cluster_op_t *op);
void on_ready(std::function<void(void)> fn);
void stop();
protected:

View File

@@ -8,4 +8,10 @@
// unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
// unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
#ifdef __cplusplus
extern "C" {
#endif
uint32_t crc32c(uint32_t crc, const void *buf, size_t len);
#ifdef __cplusplus
};
#endif

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#define _LARGEFILE64_SOURCE
#include <sys/types.h>
@@ -26,23 +26,32 @@ struct journal_dump_t
uint64_t journal_offset;
uint64_t journal_len;
uint64_t journal_pos;
bool all;
bool started;
int fd;
uint32_t crc32_last;
void dump_block(void *buf);
int dump_block(void *buf);
};
int main(int argc, char *argv[])
{
if (argc < 5)
journal_dump_t self = { 0 };
int b = 1;
if (argc >= 2 && !strcmp(argv[1], "--all"))
{
printf("USAGE: %s <journal_file> <journal_block_size> <offset> <size>\n", argv[0]);
self.all = true;
b = 2;
}
if (argc < b+4)
{
printf("USAGE: %s [--all] <journal_file> <journal_block_size> <offset> <size>\n", argv[0]);
return 1;
}
journal_dump_t self;
self.journal_device = argv[1];
self.journal_block = strtoul(argv[2], NULL, 10);
self.journal_offset = strtoull(argv[3], NULL, 10);
self.journal_len = strtoull(argv[4], NULL, 10);
self.journal_device = argv[b];
self.journal_block = strtoul(argv[b+1], NULL, 10);
self.journal_offset = strtoull(argv[b+2], NULL, 10);
self.journal_len = strtoull(argv[b+3], NULL, 10);
if (self.journal_block < MEM_ALIGNMENT || (self.journal_block % MEM_ALIGNMENT) ||
self.journal_block > 128*1024)
{
@@ -57,30 +66,64 @@ int main(int argc, char *argv[])
}
void *data = memalign(MEM_ALIGNMENT, self.journal_block);
self.journal_pos = 0;
while (self.journal_pos < self.journal_len)
if (self.all)
{
while (self.journal_pos < self.journal_len)
{
int r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
assert(r == self.journal_block);
uint64_t s;
for (s = 0; s < self.journal_block; s += 8)
{
if (*((uint64_t*)(data+s)) != 0)
break;
}
if (s == self.journal_block)
{
printf("offset %08lx: zeroes\n", self.journal_pos);
self.journal_pos += self.journal_block;
}
else if (((journal_entry*)data)->magic == JOURNAL_MAGIC)
{
printf("offset %08lx:\n", self.journal_pos);
self.dump_block(data);
}
else
{
printf("offset %08lx: no magic in the beginning, looks like random data (pattern=%lx)\n", self.journal_pos, *((uint64_t*)data));
self.journal_pos += self.journal_block;
}
}
}
else
{
int r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
assert(r == self.journal_block);
uint64_t s;
for (s = 0; s < self.journal_block; s += 8)
journal_entry *je = (journal_entry*)(data);
if (je->magic != JOURNAL_MAGIC || je->type != JE_START || je_crc32(je) != je->crc32)
{
if (*((uint64_t*)(data+s)) != 0)
break;
}
if (s == self.journal_block)
{
printf("offset %08lx: zeroes\n", self.journal_pos);
self.journal_pos += self.journal_block;
}
else if (((journal_entry*)data)->magic == JOURNAL_MAGIC)
{
printf("offset %08lx:\n", self.journal_pos);
self.dump_block(data);
printf("offset %08lx: journal superblock is invalid\n", self.journal_pos);
}
else
{
printf("offset %08lx: no magic in the beginning, looks like random data (pattern=%lx)\n", self.journal_pos, *((uint64_t*)data));
self.journal_pos += self.journal_block;
printf("offset %08lx:\n", self.journal_pos);
self.dump_block(data);
self.started = false;
self.journal_pos = je->start.journal_start;
while (1)
{
if (self.journal_pos >= self.journal_len)
self.journal_pos = self.journal_block;
r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
assert(r == self.journal_block);
printf("offset %08lx:\n", self.journal_pos);
r = self.dump_block(data);
if (r <= 0)
{
printf("end of the journal\n");
break;
}
}
}
}
free(data);
@@ -88,7 +131,7 @@ int main(int argc, char *argv[])
return 0;
}
void journal_dump_t::dump_block(void *buf)
int journal_dump_t::dump_block(void *buf)
{
uint32_t pos = 0;
journal_pos += journal_block;
@@ -97,12 +140,19 @@ void journal_dump_t::dump_block(void *buf)
while (pos < journal_block)
{
journal_entry *je = (journal_entry*)(buf + pos);
if (je->magic != JOURNAL_MAGIC || je->type < JE_MIN || je->type > JE_MAX)
if (je->magic != JOURNAL_MAGIC || je->type < JE_MIN || je->type > JE_MAX ||
!all && started && je->crc32_prev != crc32_last)
{
break;
}
const char *crc32_valid = je_crc32(je) == je->crc32 ? "(valid)" : "(invalid)";
printf("entry % 3d: crc32=%08x %s prev=%08x ", entry, je->crc32, crc32_valid, je->crc32_prev);
bool crc32_valid = je_crc32(je) == je->crc32;
if (!all && !crc32_valid)
{
break;
}
started = true;
crc32_last = je->crc32;
printf("entry % 3d: crc32=%08x %s prev=%08x ", entry, je->crc32, (crc32_valid ? "(valid)" : "(invalid)"), je->crc32_prev);
if (je->type == JE_START)
{
printf("je_start start=%08lx\n", je->start.journal_start);
@@ -170,4 +220,5 @@ void journal_dump_t::dump_block(void *buf)
{
journal_pos = journal_len;
}
return entry;
}

View File

@@ -1,9 +1,10 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <sys/epoll.h>
#include <sys/poll.h>
#include <unistd.h>
#include <stdexcept>
#include "epoll_manager.h"
@@ -83,8 +84,12 @@ void epoll_manager_t::handle_epoll_events()
nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, 0);
for (int i = 0; i < nfds; i++)
{
auto & cb = epoll_handlers[events[i].data.fd];
cb(events[i].data.fd, events[i].events);
auto cb_it = epoll_handlers.find(events[i].data.fd);
if (cb_it != epoll_handlers.end())
{
auto & cb = cb_it->second;
cb(events[i].data.fd, events[i].events);
}
}
} while (nfds == MAX_EPOLL_EVENTS);
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h"
#include "pg_states.h"
@@ -7,6 +7,16 @@
#include "http_client.h"
#include "base64.h"
etcd_state_client_t::~etcd_state_client_t()
{
etcd_watches_initialised = -1;
if (etcd_watch_ws)
{
etcd_watch_ws->close();
etcd_watch_ws = NULL;
}
}
json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
{
json_kv_t kv;
@@ -46,6 +56,23 @@ void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int t
http_request_json(tfd, etcd_address, req, timeout, callback);
}
void etcd_state_client_t::add_etcd_url(std::string addr)
{
if (addr.length() > 0)
{
if (strtolower(addr.substr(0, 7)) == "http://")
addr = addr.substr(7);
else if (strtolower(addr.substr(0, 8)) == "https://")
{
printf("HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
exit(1);
}
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
}
void etcd_state_client_t::parse_config(json11::Json & config)
{
this->etcd_addresses.clear();
@@ -55,13 +82,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
while (1)
{
int pos = ea.find(',');
std::string addr = pos >= 0 ? ea.substr(0, pos) : ea;
if (addr.length() > 0)
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
add_etcd_url(pos >= 0 ? ea.substr(0, pos) : ea);
if (pos >= 0)
ea = ea.substr(pos+1);
else
@@ -72,13 +93,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
{
for (auto & ea: config["etcd_address"].array_items())
{
std::string addr = ea.string_value();
if (addr != "")
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
add_etcd_url(ea.string_value());
}
}
this->etcd_prefix = config["etcd_prefix"].string_value();
@@ -136,7 +151,7 @@ void etcd_state_client_t::start_etcd_watcher()
}
for (auto & kv: changes)
{
if (this->log_level > 0)
if (this->log_level > 3)
{
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
}
@@ -160,7 +175,7 @@ void etcd_state_client_t::start_etcd_watcher()
start_etcd_watcher();
});
}
else
else if (etcd_watches_initialised > 0)
{
// Connection was live, retry immediately
start_etcd_watcher();
@@ -173,6 +188,7 @@ void etcd_state_client_t::start_etcd_watcher()
{ "range_end", base64_encode(etcd_prefix+"/config0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_CONFIG_WATCH_ID },
{ "progress_notify", true },
} }
}).dump());
etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
@@ -181,6 +197,7 @@ void etcd_state_client_t::start_etcd_watcher()
{ "range_end", base64_encode(etcd_prefix+"/osd/state0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_OSD_STATE_WATCH_ID },
{ "progress_notify", true },
} }
}).dump());
etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
@@ -189,6 +206,7 @@ void etcd_state_client_t::start_etcd_watcher()
{ "range_end", base64_encode(etcd_prefix+"/pg/state0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_PG_STATE_WATCH_ID },
{ "progress_notify", true },
} }
}).dump());
etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
@@ -197,6 +215,7 @@ void etcd_state_client_t::start_etcd_watcher()
{ "range_end", base64_encode(etcd_prefix+"/pg/history0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_PG_HISTORY_WATCH_ID },
{ "progress_notify", true },
} }
}).dump());
}
@@ -216,10 +235,6 @@ void etcd_state_client_t::load_global_config()
});
return;
}
if (!etcd_watch_revision)
{
etcd_watch_revision = data["header"]["revision"].uint64_value();
}
json11::Json::object global_config;
if (data["kvs"].array_items().size() > 0)
{
@@ -229,6 +244,11 @@ void etcd_state_client_t::load_global_config()
global_config = kv.value.object_items();
}
}
bs_block_size = global_config["block_size"].uint64_value();
if (!bs_block_size)
{
bs_block_size = DEFAULT_BLOCK_SIZE;
}
on_load_config_hook(global_config);
});
}
@@ -287,6 +307,10 @@ void etcd_state_client_t::load_pgs()
on_load_pgs_hook(false);
return;
}
if (!etcd_watch_revision)
{
etcd_watch_revision = data["header"]["revision"].uint64_value();
}
for (auto & res: data["responses"].array_items())
{
for (auto & kv_json: res["response_range"]["kvs"].array_items())
@@ -296,6 +320,7 @@ void etcd_state_client_t::load_pgs()
}
}
on_load_pgs_hook(true);
start_etcd_watcher();
});
}
@@ -309,65 +334,99 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
for (auto & pool_item: value.object_items())
{
pool_config_t pc;
// ID
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
{
printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
if (pool_item.second["pg_size"].uint64_value() < 1 ||
pool_item.second["scheme"] == "xor" && pool_item.second["pg_size"].uint64_value() < 3)
{
printf("Pool %u has invalid pg_size, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["pg_minsize"].uint64_value() < 1 ||
pool_item.second["pg_minsize"].uint64_value() > pool_item.second["pg_size"].uint64_value() ||
pool_item.second["pg_minsize"].uint64_value() < (pool_item.second["pg_size"].uint64_value() - 1))
{
printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["pg_count"].uint64_value() < 1)
{
printf("Pool %u has invalid pg_count, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["name"].string_value() == "")
pc.id = pool_id;
// Pool Name
pc.name = pool_item.second["name"].string_value();
if (pc.name == "")
{
printf("Pool %u has empty name, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["scheme"] != "replicated" && pool_item.second["scheme"] != "xor")
// Failure Domain
pc.failure_domain = pool_item.second["failure_domain"].string_value();
// Coding Scheme
if (pool_item.second["scheme"] == "replicated")
pc.scheme = POOL_SCHEME_REPLICATED;
else if (pool_item.second["scheme"] == "xor")
pc.scheme = POOL_SCHEME_XOR;
else if (pool_item.second["scheme"] == "jerasure")
pc.scheme = POOL_SCHEME_JERASURE;
else
{
printf("Pool %u has invalid coding scheme (only \"xor\" and \"replicated\" are allowed), skipping pool\n", pool_id);
printf("Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
continue;
}
if (pool_item.second["max_osd_combinations"].uint64_value() > 0 &&
pool_item.second["max_osd_combinations"].uint64_value() < 100)
// PG Size
pc.pg_size = pool_item.second["pg_size"].uint64_value();
if (pc.pg_size < 1 ||
pool_item.second["pg_size"].uint64_value() < 3 &&
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) ||
pool_item.second["pg_size"].uint64_value() > 256)
{
printf("Pool %u has invalid pg_size, skipping pool\n", pool_id);
continue;
}
// Parity Chunks
pc.parity_chunks = pool_item.second["parity_chunks"].uint64_value();
if (pc.scheme == POOL_SCHEME_XOR)
{
if (pc.parity_chunks > 1)
{
printf("Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
continue;
}
pc.parity_chunks = 1;
}
if (pc.scheme == POOL_SCHEME_JERASURE &&
(pc.parity_chunks < 1 || pc.parity_chunks > pc.pg_size-2))
{
printf("Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
continue;
}
// PG MinSize
pc.pg_minsize = pool_item.second["pg_minsize"].uint64_value();
if (pc.pg_minsize < 1 || pc.pg_minsize > pc.pg_size ||
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) &&
pc.pg_minsize < (pc.pg_size-pc.parity_chunks))
{
printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
continue;
}
// PG Count
pc.pg_count = pool_item.second["pg_count"].uint64_value();
if (pc.pg_count < 1)
{
printf("Pool %u has invalid pg_count, skipping pool\n", pool_id);
continue;
}
// Max OSD Combinations
pc.max_osd_combinations = pool_item.second["max_osd_combinations"].uint64_value();
if (!pc.max_osd_combinations)
pc.max_osd_combinations = 10000;
if (pc.max_osd_combinations > 0 && pc.max_osd_combinations < 100)
{
printf("Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
continue;
}
// PG Stripe Size
pc.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
uint64_t min_stripe_size = bs_block_size * (pc.scheme == POOL_SCHEME_REPLICATED ? 1 : (pc.pg_size-pc.parity_chunks));
if (pc.pg_stripe_size < min_stripe_size)
pc.pg_stripe_size = min_stripe_size;
// Save
pc.real_pg_count = this->pool_config[pool_id].real_pg_count;
std::swap(pc.pg_config, this->pool_config[pool_id].pg_config);
std::swap(this->pool_config[pool_id], pc);
auto & parsed_cfg = this->pool_config[pool_id];
parsed_cfg.exists = true;
parsed_cfg.id = pool_id;
parsed_cfg.name = pool_item.second["name"].string_value();
parsed_cfg.scheme = pool_item.second["scheme"] == "replicated" ? POOL_SCHEME_REPLICATED : POOL_SCHEME_XOR;
parsed_cfg.pg_size = pool_item.second["pg_size"].uint64_value();
parsed_cfg.pg_minsize = pool_item.second["pg_minsize"].uint64_value();
parsed_cfg.pg_count = pool_item.second["pg_count"].uint64_value();
parsed_cfg.failure_domain = pool_item.second["failure_domain"].string_value();
parsed_cfg.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
if (!parsed_cfg.pg_stripe_size)
{
parsed_cfg.pg_stripe_size = DEFAULT_PG_STRIPE_SIZE;
}
parsed_cfg.max_osd_combinations = pool_item.second["max_osd_combinations"].uint64_value();
if (!parsed_cfg.max_osd_combinations)
{
parsed_cfg.max_osd_combinations = 10000;
}
for (auto & pg_item: parsed_cfg.pg_config)
{
if (pg_item.second.target_set.size() != parsed_cfg.pg_size)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
@@ -16,7 +16,7 @@
#define ETCD_SLOW_TIMEOUT 5000
#define ETCD_QUICK_TIMEOUT 1000
#define DEFAULT_PG_STRIPE_SIZE 4*1024*1024
#define DEFAULT_BLOCK_SIZE 128*1024
struct json_kv_t
{
@@ -43,7 +43,7 @@ struct pool_config_t
pool_id_t id;
std::string name;
uint64_t scheme;
uint64_t pg_size, pg_minsize;
uint64_t pg_size, pg_minsize, parity_chunks;
uint64_t pg_count;
uint64_t real_pg_count;
std::string failure_domain;
@@ -54,6 +54,9 @@ struct pool_config_t
struct etcd_state_client_t
{
protected:
void add_etcd_url(std::string);
public:
std::vector<std::string> etcd_addresses;
std::string etcd_prefix;
int log_level = 0;
@@ -62,6 +65,7 @@ struct etcd_state_client_t
int etcd_watches_initialised = 0;
uint64_t etcd_watch_revision = 0;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = 0;
std::map<pool_id_t, pool_config_t> pool_config;
std::map<osd_num_t, json11::Json> peer_states;
@@ -80,4 +84,5 @@ struct etcd_state_client_t
void load_pgs();
void parse_state(const std::string & key, const json11::Json & value);
void parse_config(json11::Json & config);
~etcd_state_client_t();
};

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test cluster I/O
//
@@ -28,12 +28,7 @@
#include "epoll_manager.h"
#include "cluster_client.h"
extern "C" {
#define CONFIG_HAVE_GETTID
#define CONFIG_PWRITEV2
#include "fio/fio.h"
#include "fio/optgroup.h"
}
#include "fio_headers.h"
struct sec_data
{
@@ -98,7 +93,7 @@ static struct fio_option options[] = {
{
.name = "cluster_log_level",
.lname = "cluster log level",
.type = FIO_OPT_BOOL,
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, cluster_log),
.help = "Set log level for the Vitastor client",
.def = "0",
@@ -122,8 +117,15 @@ static struct fio_option options[] = {
static int sec_setup(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd;
if (!o->etcd_host)
{
td_verror(td, EINVAL, "etcd address is missing");
return 1;
}
bsd = new sec_data;
if (!bsd)
{
@@ -150,9 +152,7 @@ static void sec_cleanup(struct thread_data *td)
delete bsd->cli;
delete bsd->epmgr;
delete bsd->ringloop;
bsd->cli = NULL;
bsd->epmgr = NULL;
bsd->ringloop = NULL;
delete bsd;
}
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
// FIO engine to test Blockstore
//
@@ -25,12 +25,7 @@
// -bs_config='{"data_device":"./test_data.bin"}' -size=1000M
#include "blockstore.h"
extern "C" {
#define CONFIG_HAVE_GETTID
#define CONFIG_PWRITEV2
#include "fio/fio.h"
#include "fio/optgroup.h"
}
#include "fio_headers.h"
#include "json11/json11.hpp"

16
src/fio_headers.h Normal file
View File

@@ -0,0 +1,16 @@
extern "C" {
// Kill atomics in fio headers
#define _STDATOMIC_H
#include "fio/arch/arch.h"
#undef atomic_load_acquire
#undef atomic_store_release
#define atomic_load_acquire(p) *(p)
#define atomic_store_release(p, v) (*(p)) = (v)
#define CONFIG_HAVE_GETTID
#define CONFIG_SYNC_FILE_RANGE
#define CONFIG_PWRITEV2
#include "fio/fio.h"
#include "fio/optgroup.h"
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test Blockstore through Secondary OSD interface
//
@@ -30,12 +30,7 @@
#include "rw_blocking.h"
#include "osd_ops.h"
extern "C" {
#define CONFIG_HAVE_GETTID
#define CONFIG_PWRITEV2
#include "fio/fio.h"
#include "fio/optgroup.h"
}
#include "fio_headers.h"
struct sec_data
{
@@ -145,6 +140,7 @@ static void sec_cleanup(struct thread_data *td)
if (bsd)
{
close(bsd->connect_fd);
delete bsd;
}
}
@@ -317,6 +313,7 @@ static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int
exit(1);
}
io_u* io = it->second;
bsd->queue.erase(it);
if (io->ddir == DDIR_READ)
{
if (reply.hdr.retval != io->xfer_buflen)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <netinet/tcp.h>
#include <sys/epoll.h>
@@ -13,6 +13,8 @@
#include <fcntl.h>
#include <string.h>
#include <stdexcept>
#include "json11/json11.hpp"
#include "http_client.h"
#include "timerfd_manager.h"
@@ -20,7 +22,6 @@
#define READ_BUFFER_SIZE 9000
static int extract_port(std::string & host);
static std::string strtolower(const std::string & in);
static std::string trim(const std::string & in);
static std::string ws_format_frame(int type, uint64_t size);
static bool ws_parse_frame(std::string & buf, int & type, std::string & res);
@@ -671,7 +672,7 @@ static int extract_port(std::string & host)
return port;
}
static std::string strtolower(const std::string & in)
std::string strtolower(const std::string & in)
{
std::string s = in;
for (int i = 0; i < s.length(); i++)

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <string>
@@ -49,6 +49,8 @@ std::vector<std::string> getifaddr_list(bool include_v6 = false);
uint64_t stoull_full(const std::string & str, int base = 10);
std::string strtolower(const std::string & in);
void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
const http_options_t & options, std::function<void(const http_response_t *response)> callback);

View File

@@ -1,9 +1,10 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <malloc.h>
#include <stdlib.h>
inline void* memalign_or_die(size_t alignment, size_t size)
{

View File

@@ -1,11 +1,12 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/tcp.h>
#include <stdexcept>
#include "messenger.h"
@@ -25,14 +26,106 @@ osd_op_t::~osd_op_t()
}
}
void osd_messenger_t::init()
{
keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
{
for (auto cl_it = clients.begin(); cl_it != clients.end();)
{
auto cl = (cl_it++)->second;
if (!cl->osd_num)
{
// Do not run keepalive on regular clients
continue;
}
if (cl->ping_time_remaining > 0)
{
cl->ping_time_remaining--;
if (!cl->ping_time_remaining)
{
// Ping timed out, stop the client
stop_client(cl->peer_fd, true);
}
}
else if (cl->idle_time_remaining > 0)
{
cl->idle_time_remaining--;
if (!cl->idle_time_remaining)
{
// Connection is idle for <osd_idle_time>, send ping
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl->peer_fd;
op->req = (osd_any_op_t){
.hdr = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = this->next_subop_id++,
.opcode = OSD_OP_PING,
},
};
op->callback = [this, cl](osd_op_t *op)
{
int fail_fd = (op->reply.hdr.retval != 0 ? op->peer_fd : -1);
cl->ping_time_remaining = 0;
delete op;
if (fail_fd >= 0)
{
stop_client(fail_fd, true);
}
};
outbox_push(op);
cl->ping_time_remaining = osd_ping_timeout;
cl->idle_time_remaining = osd_idle_timeout;
}
}
else
{
cl->idle_time_remaining = osd_idle_timeout;
}
}
});
}
osd_messenger_t::~osd_messenger_t()
{
if (keepalive_timer_id >= 0)
{
tfd->clear_timer(keepalive_timer_id);
keepalive_timer_id = -1;
}
while (clients.size() > 0)
{
stop_client(clients.begin()->first);
stop_client(clients.begin()->first, true);
}
}
void osd_messenger_t::parse_config(const json11::Json & config)
{
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!this->peer_connect_interval)
{
this->peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
this->peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!this->peer_connect_timeout)
{
this->peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
this->osd_idle_timeout = config["osd_idle_timeout"].uint64_value();
if (!this->osd_idle_timeout)
{
this->osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->osd_ping_timeout = config["osd_ping_timeout"].uint64_value();
if (!this->osd_ping_timeout)
{
this->osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->log_level = config["log_level"].uint64_value();
}
void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
{
if (wanted_peers.find(peer_osd) == wanted_peers.end())
@@ -80,6 +173,7 @@ void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port)
{
assert(peer_osd != this->osd_num);
struct sockaddr_in addr;
int r;
if ((r = inet_pton(AF_INET, peer_host, &addr.sin_addr)) != 1)
@@ -96,17 +190,6 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
return;
}
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int timeout_id = -1;
if (peer_connect_timeout > 0)
{
timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients[peer_fd].osd_num;
stop_client(peer_fd);
on_connect_peer(peer_osd, -EIO);
return;
});
}
r = connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
if (r < 0 && errno != EINPROGRESS)
{
@@ -114,8 +197,18 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
on_connect_peer(peer_osd, -errno);
return;
}
assert(peer_osd != this->osd_num);
clients[peer_fd] = (osd_client_t){
int timeout_id = -1;
if (peer_connect_timeout > 0)
{
timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EIO);
return;
});
}
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = peer_port,
.peer_fd = peer_fd,
@@ -123,7 +216,7 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
.connect_timeout_id = timeout_id,
.osd_num = peer_osd,
.in_buf = malloc_or_die(receive_buffer_size),
};
});
tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
{
// Either OUT (connected) or HUP
@@ -133,13 +226,13 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
void osd_messenger_t::handle_connect_epoll(int peer_fd)
{
auto & cl = clients[peer_fd];
if (cl.connect_timeout_id >= 0)
auto cl = clients[peer_fd];
if (cl->connect_timeout_id >= 0)
{
tfd->clear_timer(cl.connect_timeout_id);
cl.connect_timeout_id = -1;
tfd->clear_timer(cl->connect_timeout_id);
cl->connect_timeout_id = -1;
}
osd_num_t peer_osd = cl.osd_num;
osd_num_t peer_osd = cl->osd_num;
int result = 0;
socklen_t result_len = sizeof(result);
if (getsockopt(peer_fd, SOL_SOCKET, SO_ERROR, &result, &result_len) < 0)
@@ -148,13 +241,13 @@ void osd_messenger_t::handle_connect_epoll(int peer_fd)
}
if (result != 0)
{
stop_client(peer_fd);
stop_client(peer_fd, true);
on_connect_peer(peer_osd, -result);
return;
}
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
cl.peer_state = PEER_CONNECTED;
cl->peer_state = PEER_CONNECTED;
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{
handle_peer_epoll(peer_fd, epoll_events);
@@ -170,16 +263,16 @@ void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
{
// Stop client
printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
stop_client(peer_fd);
stop_client(peer_fd, true);
}
else if (epoll_events & EPOLLIN)
{
// Mark client as ready (i.e. some data is available)
auto & cl = clients[peer_fd];
cl.read_ready++;
if (cl.read_ready == 1)
auto cl = clients[peer_fd];
cl->read_ready++;
if (cl->read_ready == 1)
{
read_ready_clients.push_back(cl.peer_fd);
read_ready_clients.push_back(cl->peer_fd);
if (ringloop)
ringloop->wakeup();
else
@@ -219,17 +312,20 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
}
return;
}
printf("Connected with peer OSD %lu (fd %d)\n", peer_osd, peer_fd);
if (log_level > 0)
{
printf("[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
}
wanted_peers.erase(peer_osd);
repeer_pgs(peer_osd);
}
void osd_messenger_t::check_peer_config(osd_client_t & cl)
void osd_messenger_t::check_peer_config(osd_client_t *cl)
{
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl.peer_fd;
op->req = {
op->peer_fd = cl->peer_fd;
op->req = (osd_any_op_t){
.show_conf = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
@@ -238,16 +334,15 @@ void osd_messenger_t::check_peer_config(osd_client_t & cl)
},
},
};
op->callback = [this](osd_op_t *op)
op->callback = [this, cl](osd_op_t *op)
{
osd_client_t & cl = clients[op->peer_fd];
std::string json_err;
json11::Json config;
bool err = false;
if (op->reply.hdr.retval < 0)
{
err = true;
printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl.osd_num, op->reply.hdr.retval);
printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
}
else
{
@@ -255,46 +350,37 @@ void osd_messenger_t::check_peer_config(osd_client_t & cl)
if (json_err != "")
{
err = true;
printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl.osd_num, json_err.c_str());
printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
}
else if (config["osd_num"].uint64_value() != cl.osd_num)
else if (config["osd_num"].uint64_value() != cl->osd_num)
{
err = true;
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl.osd_num);
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
}
}
if (err)
{
osd_num_t osd_num = cl.osd_num;
osd_num_t osd_num = cl->osd_num;
stop_client(op->peer_fd);
on_connect_peer(osd_num, -1);
delete op;
return;
}
osd_peer_fds[cl.osd_num] = cl.peer_fd;
on_connect_peer(cl.osd_num, cl.peer_fd);
osd_peer_fds[cl->osd_num] = cl->peer_fd;
on_connect_peer(cl->osd_num, cl->peer_fd);
delete op;
};
outbox_push(op);
}
void osd_messenger_t::cancel_osd_ops(osd_client_t & cl)
void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
{
for (auto p: cl.sent_ops)
for (auto p: cl->sent_ops)
{
cancel_op(p.second);
}
cl.sent_ops.clear();
for (auto op: cl.outbox)
{
cancel_op(op);
}
cl.outbox.clear();
if (cl.write_op)
{
cancel_op(cl.write_op);
cl.write_op = NULL;
}
cl->sent_ops.clear();
cl->outbox.clear();
}
void osd_messenger_t::cancel_op(osd_op_t *op)
@@ -315,7 +401,7 @@ void osd_messenger_t::cancel_op(osd_op_t *op)
}
}
void osd_messenger_t::stop_client(int peer_fd)
void osd_messenger_t::stop_client(int peer_fd, bool force)
{
assert(peer_fd != 0);
auto it = clients.find(peer_fd);
@@ -324,32 +410,49 @@ void osd_messenger_t::stop_client(int peer_fd)
return;
}
uint64_t repeer_osd = 0;
osd_client_t cl = it->second;
if (cl.peer_state == PEER_CONNECTED)
osd_client_t *cl = it->second;
if (cl->peer_state == PEER_CONNECTED)
{
if (cl.osd_num)
if (cl->osd_num)
{
// Reload configuration from etcd when the connection is dropped
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl.osd_num);
repeer_osd = cl.osd_num;
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
repeer_osd = cl->osd_num;
}
else
{
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
else if (!force)
{
return;
}
cl->peer_state = PEER_STOPPED;
clients.erase(it);
tfd->set_fd_handler(peer_fd, false, NULL);
if (cl.osd_num)
if (cl->connect_timeout_id >= 0)
{
osd_peer_fds.erase(cl.osd_num);
// Cancel outbound operations
cancel_osd_ops(cl);
tfd->clear_timer(cl->connect_timeout_id);
cl->connect_timeout_id = -1;
}
if (cl.read_op)
if (cl->osd_num)
{
delete cl.read_op;
cl.read_op = NULL;
osd_peer_fds.erase(cl->osd_num);
}
if (cl->read_op)
{
if (cl->read_op->callback)
{
cancel_op(cl->read_op);
}
else
{
delete cl->read_op;
}
cl->read_op = NULL;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
@@ -367,12 +470,24 @@ void osd_messenger_t::stop_client(int peer_fd)
break;
}
}
free(cl.in_buf);
free(cl->in_buf);
cl->in_buf = NULL;
close(peer_fd);
if (repeer_osd)
{
// First repeer PGs as canceling OSD ops may push new operations
// and we need correct PG states when we do that
repeer_pgs(repeer_osd);
}
if (cl->osd_num)
{
// Cancel outbound operations
cancel_osd_ops(cl);
}
if (cl->refs <= 0)
{
delete cl;
}
}
void osd_messenger_t::accept_connections(int listen_fd)
@@ -390,13 +505,13 @@ void osd_messenger_t::accept_connections(int listen_fd)
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
clients[peer_fd] = {
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = ntohs(addr.sin_port),
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTED,
.in_buf = malloc_or_die(receive_buffer_size),
};
});
// Add FD to epoll
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
@@ -30,9 +30,11 @@
#define PEER_CONNECTING 1
#define PEER_CONNECTED 2
#define PEER_STOPPED 3
#define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5
#define DEFAULT_OSD_PING_TIMEOUT 5
// Kind of a vector with small-list-optimisation
struct osd_op_buf_list_t
@@ -190,11 +192,15 @@ struct osd_op_t
struct osd_client_t
{
int refs = 0;
sockaddr_in peer_addr;
int peer_port;
int peer_fd;
int peer_state;
int connect_timeout_id = -1;
int ping_time_remaining = 0;
int idle_time_remaining = 0;
osd_num_t osd_num = 0;
void *in_buf = NULL;
@@ -202,8 +208,8 @@ struct osd_client_t
// Read state
int read_ready = 0;
osd_op_t *read_op = NULL;
iovec read_iov;
msghdr read_msg;
iovec read_iov = { 0 };
msghdr read_msg = { 0 };
int read_remaining = 0;
int read_state = 0;
osd_op_buf_list_t recv_list;
@@ -212,17 +218,16 @@ struct osd_client_t
std::vector<osd_op_t*> received_ops;
// Outbound operations
std::deque<osd_op_t*> outbox;
std::map<int, osd_op_t*> sent_ops;
std::map<uint64_t, osd_op_t*> sent_ops;
// PGs dirtied by this client's primary-writes
std::set<pool_pg_num_t> dirty_pgs;
// Write state
osd_op_t *write_op = NULL;
msghdr write_msg;
msghdr write_msg = { 0 };
int write_state = 0;
osd_op_buf_list_t send_list;
std::vector<iovec> send_list, next_send_list;
std::vector<osd_op_t*> outbox, next_outbox;
};
struct osd_wanted_peer_t
@@ -249,6 +254,7 @@ struct osd_messenger_t
{
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
int keepalive_timer_id = -1;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
@@ -256,6 +262,8 @@ struct osd_messenger_t
int receive_buffer_size = 64*1024;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
int osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
int osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
int log_level = 0;
bool use_sync_send_recv = false;
@@ -263,7 +271,7 @@ struct osd_messenger_t
std::map<uint64_t, int> osd_peer_fds;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t> clients;
std::map<int, osd_client_t*> clients;
std::vector<int> read_ready_clients;
std::vector<int> write_ready_clients;
std::vector<std::function<void()>> set_immediate;
@@ -272,8 +280,10 @@ struct osd_messenger_t
osd_op_stats_t stats;
public:
void init();
void parse_config(const json11::Json & config);
void connect_peer(uint64_t osd_num, json11::Json peer_state);
void stop_client(int peer_fd);
void stop_client(int peer_fd, bool force = false);
void outbox_push(osd_op_t *cur_op);
std::function<void(osd_op_t*)> exec_op;
std::function<void(osd_num_t)> repeer_pgs;
@@ -288,15 +298,16 @@ protected:
void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
void handle_connect_epoll(int peer_fd);
void on_connect_peer(osd_num_t peer_osd, int peer_fd);
void check_peer_config(osd_client_t & cl);
void cancel_osd_ops(osd_client_t & cl);
void check_peer_config(osd_client_t *cl);
void cancel_osd_ops(osd_client_t *cl);
void cancel_op(osd_op_t *op);
bool try_send(osd_client_t & cl);
void handle_send(int result, int peer_fd);
bool try_send(osd_client_t *cl);
void measure_exec(osd_op_t *cur_op);
void handle_send(int result, osd_client_t *cl);
bool handle_read(int result, int peer_fd);
bool handle_finished_read(osd_client_t & cl);
bool handle_read(int result, osd_client_t *cl);
bool handle_finished_read(osd_client_t *cl);
void handle_op_hdr(osd_client_t *cl);
bool handle_reply_hdr(osd_client_t *cl);
void handle_reply_ready(osd_op_t *op);

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "messenger.h"
@@ -8,135 +8,146 @@ void osd_messenger_t::read_requests()
for (int i = 0; i < read_ready_clients.size(); i++)
{
int peer_fd = read_ready_clients[i];
auto & cl = clients[peer_fd];
if (cl.read_remaining < receive_buffer_size)
osd_client_t *cl = clients[peer_fd];
if (cl->read_msg.msg_iovlen)
{
cl.read_iov.iov_base = cl.in_buf;
cl.read_iov.iov_len = receive_buffer_size;
cl.read_msg.msg_iov = &cl.read_iov;
cl.read_msg.msg_iovlen = 1;
continue;
}
if (cl->read_remaining < receive_buffer_size)
{
cl->read_iov.iov_base = cl->in_buf;
cl->read_iov.iov_len = receive_buffer_size;
cl->read_msg.msg_iov = &cl->read_iov;
cl->read_msg.msg_iovlen = 1;
}
else
{
cl.read_iov.iov_base = 0;
cl.read_iov.iov_len = cl.read_remaining;
cl.read_msg.msg_iov = cl.recv_list.get_iovec();
cl.read_msg.msg_iovlen = cl.recv_list.get_size();
cl->read_iov.iov_base = 0;
cl->read_iov.iov_len = cl->read_remaining;
cl->read_msg.msg_iov = cl->recv_list.get_iovec();
cl->read_msg.msg_iovlen = cl->recv_list.get_size();
}
cl->refs++;
if (ringloop && !use_sync_send_recv)
{
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
cl->read_msg.msg_iovlen = 0;
read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + i);
return;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, peer_fd](ring_data_t *data) { handle_read(data->res, peer_fd); };
my_uring_prep_recvmsg(sqe, peer_fd, &cl.read_msg, 0);
data->callback = [this, cl](ring_data_t *data) { handle_read(data->res, cl); };
my_uring_prep_recvmsg(sqe, peer_fd, &cl->read_msg, 0);
}
else
{
int result = recvmsg(peer_fd, &cl.read_msg, 0);
int result = recvmsg(peer_fd, &cl->read_msg, 0);
if (result < 0)
{
result = -errno;
}
handle_read(result, peer_fd);
handle_read(result, cl);
}
}
read_ready_clients.clear();
}
bool osd_messenger_t::handle_read(int result, int peer_fd)
bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
{
bool ret = false;
auto cl_it = clients.find(peer_fd);
if (cl_it != clients.end())
cl->read_msg.msg_iovlen = 0;
cl->refs--;
if (cl->peer_state == PEER_STOPPED)
{
auto & cl = cl_it->second;
if (result <= 0 && result != -EAGAIN)
if (cl->refs <= 0)
{
// this is a client socket, so don't panic on error. just disconnect it
if (result != 0)
{
printf("Client %d socket read error: %d (%s). Disconnecting client\n", peer_fd, -result, strerror(-result));
}
stop_client(peer_fd);
return false;
delete cl;
}
if (result == -EAGAIN || result < cl.read_iov.iov_len)
return false;
}
if (result <= 0 && result != -EAGAIN)
{
// this is a client socket, so don't panic on error. just disconnect it
if (result != 0)
{
cl.read_ready--;
if (cl.read_ready > 0)
read_ready_clients.push_back(peer_fd);
printf("Client %d socket read error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
}
stop_client(cl->peer_fd);
return false;
}
if (result == -EAGAIN || result < cl->read_iov.iov_len)
{
cl->read_ready--;
if (cl->read_ready > 0)
read_ready_clients.push_back(cl->peer_fd);
}
else
{
read_ready_clients.push_back(cl->peer_fd);
}
if (result > 0)
{
if (cl->read_iov.iov_base == cl->in_buf)
{
// Compose operation(s) from the buffer
int remain = result;
void *curbuf = cl->in_buf;
while (remain > 0)
{
if (!cl->read_op)
{
cl->read_op = new osd_op_t;
cl->read_op->peer_fd = cl->peer_fd;
cl->read_op->op_type = OSD_OP_IN;
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->read_remaining = OSD_PACKET_SIZE;
cl->read_state = CL_READ_HDR;
}
while (cl->recv_list.done < cl->recv_list.count && remain > 0)
{
iovec* cur = cl->recv_list.get_iovec();
if (cur->iov_len > remain)
{
memcpy(cur->iov_base, curbuf, remain);
cl->read_remaining -= remain;
cur->iov_len -= remain;
cur->iov_base += remain;
remain = 0;
}
else
{
memcpy(cur->iov_base, curbuf, cur->iov_len);
curbuf += cur->iov_len;
cl->read_remaining -= cur->iov_len;
remain -= cur->iov_len;
cur->iov_len = 0;
cl->recv_list.done++;
}
}
if (cl->recv_list.done >= cl->recv_list.count)
{
if (!handle_finished_read(cl))
{
goto fin;
}
}
}
}
else
{
read_ready_clients.push_back(peer_fd);
// Long data
cl->read_remaining -= result;
cl->recv_list.eat(result);
if (cl->recv_list.done >= cl->recv_list.count)
{
handle_finished_read(cl);
}
}
if (result > 0)
if (result >= cl->read_iov.iov_len)
{
if (cl.read_iov.iov_base == cl.in_buf)
{
// Compose operation(s) from the buffer
int remain = result;
void *curbuf = cl.in_buf;
while (remain > 0)
{
if (!cl.read_op)
{
cl.read_op = new osd_op_t;
cl.read_op->peer_fd = peer_fd;
cl.read_op->op_type = OSD_OP_IN;
cl.recv_list.push_back(cl.read_op->req.buf, OSD_PACKET_SIZE);
cl.read_remaining = OSD_PACKET_SIZE;
cl.read_state = CL_READ_HDR;
}
while (cl.recv_list.done < cl.recv_list.count && remain > 0)
{
iovec* cur = cl.recv_list.get_iovec();
if (cur->iov_len > remain)
{
memcpy(cur->iov_base, curbuf, remain);
cl.read_remaining -= remain;
cur->iov_len -= remain;
cur->iov_base += remain;
remain = 0;
}
else
{
memcpy(cur->iov_base, curbuf, cur->iov_len);
curbuf += cur->iov_len;
cl.read_remaining -= cur->iov_len;
remain -= cur->iov_len;
cur->iov_len = 0;
cl.recv_list.done++;
}
}
if (cl.recv_list.done >= cl.recv_list.count)
{
if (!handle_finished_read(cl))
{
goto fin;
}
}
}
}
else
{
// Long data
cl.read_remaining -= result;
cl.recv_list.eat(result);
if (cl.recv_list.done >= cl.recv_list.count)
{
handle_finished_read(cl);
}
}
if (result >= cl.read_iov.iov_len)
{
ret = true;
}
ret = true;
}
}
fin:
@@ -148,30 +159,36 @@ fin:
return ret;
}
bool osd_messenger_t::handle_finished_read(osd_client_t & cl)
bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
{
cl.recv_list.reset();
if (cl.read_state == CL_READ_HDR)
cl->recv_list.reset();
if (cl->read_state == CL_READ_HDR)
{
if (cl.read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
return handle_reply_hdr(&cl);
if (cl->read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
return handle_reply_hdr(cl);
else if (cl->read_op->req.hdr.magic == SECONDARY_OSD_OP_MAGIC)
handle_op_hdr(cl);
else
handle_op_hdr(&cl);
{
printf("Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
stop_client(cl->peer_fd);
return false;
}
}
else if (cl.read_state == CL_READ_DATA)
else if (cl->read_state == CL_READ_DATA)
{
// Operation is ready
cl.received_ops.push_back(cl.read_op);
set_immediate.push_back([this, op = cl.read_op]() { exec_op(op); });
cl.read_op = NULL;
cl.read_state = 0;
cl->received_ops.push_back(cl->read_op);
set_immediate.push_back([this, op = cl->read_op]() { exec_op(op); });
cl->read_op = NULL;
cl->read_state = 0;
}
else if (cl.read_state == CL_READ_REPLY_DATA)
else if (cl->read_state == CL_READ_REPLY_DATA)
{
// Reply is ready
handle_reply_ready(cl.read_op);
cl.read_op = NULL;
cl.read_state = 0;
handle_reply_ready(cl->read_op);
cl->read_op = NULL;
cl->read_state = 0;
}
else
{
@@ -247,6 +264,14 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
{
// Read data. In this case we assume that the buffer is preallocated by the caller (!)
assert(op->iov.count > 0);
if (op->reply.hdr.retval != (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len))
{
// Check reply length to not overflow the buffer
printf("Client %d read reply of different length\n", cl->peer_fd);
cl->sent_ops[op->req.hdr.id] = op;
stop_client(cl->peer_fd);
return false;
}
cl->recv_list.append(op->iov);
delete cl->read_op;
cl->read_op = op;

237
src/msgr_send.cpp Normal file
View File

@@ -0,0 +1,237 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#define _XOPEN_SOURCE
#include <limits.h>
#include "messenger.h"
void osd_messenger_t::outbox_push(osd_op_t *cur_op)
{
assert(cur_op->peer_fd);
osd_client_t *cl = clients.at(cur_op->peer_fd);
if (cur_op->op_type == OSD_OP_OUT)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
}
else
{
// Check that operation actually belongs to this client
// FIXME: Review if this is still needed
bool found = false;
for (auto it = cl->received_ops.begin(); it != cl->received_ops.end(); it++)
{
if (*it == cur_op)
{
found = true;
cl->received_ops.erase(it, it+1);
break;
}
}
if (!found)
{
delete cur_op;
return;
}
}
auto & to_send_list = cl->write_msg.msg_iovlen ? cl->next_send_list : cl->send_list;
auto & to_outbox = cl->write_msg.msg_iovlen ? cl->next_outbox : cl->outbox;
if (cur_op->op_type == OSD_OP_IN)
{
measure_exec(cur_op);
to_send_list.push_back((iovec){ .iov_base = cur_op->reply.buf, .iov_len = OSD_PACKET_SIZE });
}
else
{
to_send_list.push_back((iovec){ .iov_base = cur_op->req.buf, .iov_len = OSD_PACKET_SIZE });
cl->sent_ops[cur_op->req.hdr.id] = cur_op;
}
to_outbox.push_back(NULL);
// Operation data
if ((cur_op->op_type == OSD_OP_IN
? (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_LIST ||
cur_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
: (cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)) && cur_op->iov.count > 0)
{
for (int i = 0; i < cur_op->iov.count; i++)
{
assert(cur_op->iov.buf[i].iov_base);
to_send_list.push_back(cur_op->iov.buf[i]);
to_outbox.push_back(NULL);
}
}
if (cur_op->op_type == OSD_OP_IN)
{
// To free it later
to_outbox[to_outbox.size()-1] = cur_op;
}
if (!ringloop)
{
// FIXME: It's worse because it doesn't allow batching
while (cl->outbox.size())
{
try_send(cl);
}
}
else if (cl->write_msg.msg_iovlen > 0 || !try_send(cl))
{
if (cl->write_state == 0)
{
cl->write_state = CL_WRITE_READY;
write_ready_clients.push_back(cur_op->peer_fd);
}
ringloop->wakeup();
}
}
void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{
// Measure execution latency
if (cur_op->req.hdr.opcode > OSD_OP_MAX)
{
return;
}
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
stats.op_stat_count[cur_op->req.hdr.opcode]++;
if (!stats.op_stat_count[cur_op->req.hdr.opcode])
{
stats.op_stat_count[cur_op->req.hdr.opcode]++;
stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
}
stats.op_stat_sum[cur_op->req.hdr.opcode] += (
(tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
if (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE)
{
stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
}
}
bool osd_messenger_t::try_send(osd_client_t *cl)
{
int peer_fd = cl->peer_fd;
if (!cl->send_list.size() || cl->write_msg.msg_iovlen > 0)
{
return true;
}
if (ringloop && !use_sync_send_recv)
{
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
return false;
}
cl->write_msg.msg_iov = cl->send_list.data();
cl->write_msg.msg_iovlen = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
cl->refs++;
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, cl](ring_data_t *data) { handle_send(data->res, cl); };
my_uring_prep_sendmsg(sqe, peer_fd, &cl->write_msg, 0);
}
else
{
cl->write_msg.msg_iov = cl->send_list.data();
cl->write_msg.msg_iovlen = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
cl->refs++;
int result = sendmsg(peer_fd, &cl->write_msg, MSG_NOSIGNAL);
if (result < 0)
{
result = -errno;
}
handle_send(result, cl);
}
return true;
}
void osd_messenger_t::send_replies()
{
for (int i = 0; i < write_ready_clients.size(); i++)
{
int peer_fd = write_ready_clients[i];
auto cl_it = clients.find(peer_fd);
if (cl_it != clients.end() && !try_send(cl_it->second))
{
write_ready_clients.erase(write_ready_clients.begin(), write_ready_clients.begin() + i);
return;
}
}
write_ready_clients.clear();
}
void osd_messenger_t::handle_send(int result, osd_client_t *cl)
{
cl->write_msg.msg_iovlen = 0;
cl->refs--;
if (cl->peer_state == PEER_STOPPED)
{
if (!cl->refs)
{
delete cl;
}
return;
}
if (result < 0 && result != -EAGAIN)
{
// this is a client socket, so don't panic. just disconnect it
printf("Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
stop_client(cl->peer_fd);
return;
}
if (result >= 0)
{
int done = 0;
while (result > 0 && done < cl->send_list.size())
{
iovec & iov = cl->send_list[done];
if (iov.iov_len <= result)
{
if (cl->outbox[done])
{
// Reply fully sent
delete cl->outbox[done];
}
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base += result;
break;
}
}
if (done > 0)
{
cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+done);
cl->outbox.erase(cl->outbox.begin(), cl->outbox.begin()+done);
}
if (cl->next_send_list.size())
{
cl->send_list.insert(cl->send_list.end(), cl->next_send_list.begin(), cl->next_send_list.end());
cl->outbox.insert(cl->outbox.end(), cl->next_outbox.begin(), cl->next_outbox.end());
cl->next_send_list.clear();
cl->next_outbox.clear();
}
cl->write_state = cl->outbox.size() > 0 ? CL_WRITE_READY : 0;
}
if (cl->write_state != 0)
{
write_ready_clients.push_back(cl->peer_fd);
}
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
// Similar to qemu-nbd, but sets timeout and uses io_uring
#include <linux/nbd.h>
@@ -17,6 +17,10 @@
#include "epoll_manager.h"
#include "cluster_client.h"
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0
#endif
const char *exe_name = NULL;
class nbd_proxy
@@ -107,7 +111,7 @@ public:
{
printf(
"Vitastor NBD proxy\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.0 or GNU GPL 2.0+)\n\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.1)\n\n"
"USAGE:\n"
" %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n"
" %s unmap /dev/nbd0\n"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include <sys/socket.h>
#include <sys/poll.h>
@@ -28,11 +28,16 @@ osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringlo
{
print_stats();
});
this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
{
print_slow();
});
c_cli.tfd = this->tfd;
c_cli.ringloop = this->ringloop;
c_cli.exec_op = [this](osd_op_t *op) { exec_op(op); };
c_cli.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
c_cli.init();
init_cluster();
@@ -49,6 +54,9 @@ osd_t::~osd_t()
void osd_t::parse_config(blockstore_config_t & config)
{
if (config.find("log_level") == config.end())
config["log_level"] = "1";
log_level = strtoull(config["log_level"].c_str(), NULL, 10);
// Initial startup configuration
json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config);
@@ -60,6 +68,8 @@ void osd_t::parse_config(blockstore_config_t & config)
throw std::runtime_error("osd_num is required in the configuration");
c_cli.osd_num = osd_num;
run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no";
no_rebalance = config["no_rebalance"] == "true" || config["no_rebalance"] == "1" || config["no_rebalance"] == "yes";
no_recovery = config["no_recovery"] == "true" || config["no_recovery"] == "1" || config["no_recovery"] == "yes";
// Cluster configuration
bind_address = config["bind_address"];
if (bind_address == "")
@@ -86,19 +96,18 @@ void osd_t::parse_config(blockstore_config_t & config)
recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sync_batch = strtoull(config["recovery_sync_batch"].c_str(), NULL, 10);
if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE)
recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes")
readonly = true;
print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
if (!print_stats_interval)
print_stats_interval = 3;
c_cli.peer_connect_interval = strtoull(config["peer_connect_interval"].c_str(), NULL, 10);
if (!c_cli.peer_connect_interval)
c_cli.peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
c_cli.peer_connect_timeout = strtoull(config["peer_connect_timeout"].c_str(), NULL, 10);
if (!c_cli.peer_connect_timeout)
c_cli.peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
log_level = strtoull(config["log_level"].c_str(), NULL, 10);
c_cli.log_level = log_level;
slow_log_interval = strtoull(config["slow_log_interval"].c_str(), NULL, 10);
if (!slow_log_interval)
slow_log_interval = 10;
c_cli.parse_config(json_config);
}
void osd_t::bind_socket()
@@ -202,6 +211,12 @@ void osd_t::exec_op(osd_op_t *cur_op)
finish_op(cur_op, -EINVAL);
return;
}
if (cur_op->req.hdr.opcode == OSD_OP_PING)
{
// Pong
finish_op(cur_op, 0);
return;
}
if (readonly &&
cur_op->req.hdr.opcode != OSD_OP_SEC_READ &&
cur_op->req.hdr.opcode != OSD_OP_SEC_LIST &&
@@ -252,9 +267,9 @@ void osd_t::reset_stats()
void osd_t::print_stats()
{
for (int i = 0; i <= OSD_OP_MAX; i++)
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i])
if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i] && i != OSD_OP_PING)
{
uint64_t avg = (c_cli.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(c_cli.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
uint64_t bw = (c_cli.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
@@ -275,7 +290,7 @@ void osd_t::print_stats()
prev_stats.op_stat_bytes[i] = c_cli.stats.op_stat_bytes[i];
}
}
for (int i = 0; i <= OSD_OP_MAX; i++)
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
if (c_cli.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
{
@@ -313,3 +328,73 @@ void osd_t::print_stats()
printf("[OSD %lu] %lu object(s) misplaced\n", osd_num, misplaced_objects);
}
}
void osd_t::print_slow()
{
char alloc[1024];
timespec now;
clock_gettime(CLOCK_REALTIME, &now);
for (auto & kv: c_cli.clients)
{
for (auto op: kv.second->received_ops)
{
if ((now.tv_sec - op->tv_begin.tv_sec) >= slow_log_interval)
{
int l = sizeof(alloc), n;
char *buf = alloc;
#define bufprintf(s, ...) { n = snprintf(buf, l, s, __VA_ARGS__); n = n < 0 ? 0 : n; buf += n; l -= n; }
bufprintf("[OSD %lu] Slow op", osd_num);
if (kv.second->osd_num)
{
bufprintf(" from peer OSD %lu (client %d)", kv.second->osd_num, kv.second->peer_fd);
}
else
{
bufprintf(" from client %d", kv.second->peer_fd);
}
bufprintf(": %s id=%lu", osd_op_names[op->req.hdr.opcode], op->req.hdr.id);
if (op->req.hdr.opcode == OSD_OP_SEC_READ || op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE || op->req.hdr.opcode == OSD_OP_SEC_DELETE)
{
bufprintf(" %lx:%lx v", op->req.sec_rw.oid.inode, op->req.sec_rw.oid.stripe);
if (op->req.sec_rw.version == UINT64_MAX)
{
bufprintf("%s", "max");
}
else
{
bufprintf("%lu", op->req.sec_rw.version);
}
if (op->req.hdr.opcode != OSD_OP_SEC_DELETE)
{
bufprintf(" offset=%x len=%x", op->req.sec_rw.offset, op->req.sec_rw.len);
}
}
else if (op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
{
for (uint64_t i = 0; i < op->req.sec_stab.len; i += sizeof(obj_ver_id))
{
obj_ver_id *ov = (obj_ver_id*)(op->buf + i);
bufprintf(i == 0 ? " %lx:%lx v%lu" : ", %lx:%lx v%lu", ov->oid.inode, ov->oid.stripe, ov->version);
}
}
else if (op->req.hdr.opcode == OSD_OP_SEC_LIST)
{
bufprintf(
" inode=%lx-%lx pg=%u/%u, stripe=%lu",
op->req.sec_list.min_inode, op->req.sec_list.max_inode,
op->req.sec_list.list_pg, op->req.sec_list.pg_count,
op->req.sec_list.pg_stripe_size
);
}
else if (op->req.hdr.opcode == OSD_OP_READ || op->req.hdr.opcode == OSD_OP_WRITE ||
op->req.hdr.opcode == OSD_OP_DELETE)
{
bufprintf(" inode=%lx offset=%lx len=%x", op->req.rw.inode, op->req.rw.offset, op->req.rw.len);
}
#undef bufprintf
printf("%s\n", alloc);
}
}
}
}

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#pragma once
@@ -37,6 +37,7 @@
#define DEFAULT_AUTOSYNC_INTERVAL 5
#define MAX_RECOVERY_QUEUE 2048
#define DEFAULT_RECOVERY_QUEUE 4
#define DEFAULT_RECOVERY_BATCH 16
//#define OSD_STUB
@@ -64,15 +65,19 @@ class osd_t
bool readonly = false;
osd_num_t osd_num = 1; // OSD numbers start with 1
bool run_primary = false;
bool no_rebalance = false;
bool no_recovery = false;
std::string bind_address;
int bind_port, listen_backlog;
// FIXME: Implement client queue depth limit
int client_queue_depth = 128;
bool allow_test_ops = true;
int print_stats_interval = 3;
int slow_log_interval = 10;
int immediate_commit = IMMEDIATE_NONE;
int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // sync every 5 seconds
int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
int log_level = 0;
// cluster state
@@ -94,9 +99,11 @@ class osd_t
std::map<pool_pg_num_t, pg_t> pgs;
std::set<pool_pg_num_t> dirty_pgs;
std::set<osd_num_t> dirty_osds;
int copies_to_delete_after_sync_count = 0;
uint64_t misplaced_objects = 0, degraded_objects = 0, incomplete_objects = 0;
int peering_state = 0;
std::map<object_id, osd_recovery_op_t> recovery_ops;
int recovery_done = 0;
osd_op_t *autosync_op = NULL;
// Unstable writes
@@ -138,6 +145,7 @@ class osd_t
void create_osd_state();
void renew_lease();
void print_stats();
void print_slow();
void reset_stats();
json11::Json get_statistics();
void report_statistics();
@@ -158,6 +166,7 @@ class osd_t
void submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);
void discard_list_subop(osd_op_t *list_op);
bool stop_pg(pg_t & pg);
void reset_pg(pg_t & pg);
void finish_stop_pg(pg_t & pg);
// flushing, recovery and backfill
@@ -196,6 +205,7 @@ class osd_t
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
void submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count);
void submit_primary_sync_subops(osd_op_t *cur_op);
void submit_primary_stab_subops(osd_op_t *cur_op);

View File

@@ -1,9 +1,10 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "osd.h"
#include "base64.h"
#include "etcd_state_client.h"
#include "osd_rmw.h"
// Startup sequence:
// Start etcd watcher -> Load global OSD configuration -> Bind socket -> Acquire lease -> Report&lock OSD state
@@ -32,12 +33,26 @@ void osd_t::init_cluster()
}
pgs[{ 1, 1 }] = (pg_t){
.state = PG_PEERING,
.scheme = POOL_SCHEME_XOR,
.pg_cursize = 0,
.pg_size = 3,
.pg_minsize = 2,
.pg_data_size = 2,
.pool_id = 1,
.pg_num = 1,
.target_set = { 1, 2, 3 },
.cur_set = { 0, 0, 0 },
};
st_cli.pool_config[1] = (pool_config_t){
.exists = true,
.id = 1,
.name = "testpool",
.scheme = POOL_SCHEME_XOR,
.pg_size = 3,
.pg_minsize = 2,
.pg_count = 1,
.real_pg_count = 1,
};
report_pg_state(pgs[{ 1, 1 }]);
pg_counts[1] = 1;
}
@@ -127,7 +142,7 @@ json11::Json osd_t::get_statistics()
}
st["host"] = self_state["host"];
json11::Json::object op_stats, subop_stats;
for (int i = 0; i <= OSD_OP_MAX; i++)
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
op_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.op_stat_count[i] },
@@ -135,7 +150,7 @@ json11::Json osd_t::get_statistics()
{ "bytes", c_cli.stats.op_stat_bytes[i] },
};
}
for (int i = 0; i <= OSD_OP_MAX; i++)
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
subop_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.subop_stat_count[i] },
@@ -223,8 +238,11 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
void osd_t::on_change_etcd_state_hook(json11::Json::object & changes)
{
// FIXME apply config changes in runtime (maybe, some)
apply_pg_count();
apply_pg_config();
if (run_primary)
{
apply_pg_count();
apply_pg_config();
}
}
void osd_t::on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num)
@@ -270,7 +288,6 @@ void osd_t::on_load_config_hook(json11::Json::object & global_config)
}
parse_config(osd_config);
bind_socket();
st_cli.start_etcd_watcher();
acquire_lease();
}
@@ -367,6 +384,7 @@ void osd_t::create_osd_state()
{
st_cli.load_pgs();
}
report_statistics();
});
}
@@ -477,7 +495,11 @@ void osd_t::apply_pg_count()
}
if (still_active > 0)
{
printf("[OSD %lu] PG count change detected, but %d PG(s) are still active. This is not allowed. Exiting\n", this->osd_num, still_active);
printf(
"[OSD %lu] PG count change detected for pool %u (new is %lu, old is %u),"
" but %u PG(s) are still active. This is not allowed. Exiting\n",
this->osd_num, pool_item.first, pool_item.second.real_pg_count, pg_counts[pool_item.first], still_active
);
force_stop(1);
return;
}
@@ -571,7 +593,10 @@ void osd_t::apply_pg_config()
}
else
{
throw std::runtime_error("Unexpected PG "+std::to_string(pg_num)+" state: "+std::to_string(pg_it->second.state));
throw std::runtime_error(
"Unexpected PG "+std::to_string(pool_id)+"/"+std::to_string(pg_num)+
" state: "+std::to_string(pg_it->second.state)
);
}
}
auto & pg = this->pgs[{ .pool_id = pool_id, .pg_num = pg_num }];
@@ -581,6 +606,8 @@ void osd_t::apply_pg_config()
.pg_cursize = 0,
.pg_size = pool_item.second.pg_size,
.pg_minsize = pool_item.second.pg_minsize,
.pg_data_size = pg.scheme == POOL_SCHEME_REPLICATED
? 1 : pool_item.second.pg_size - pool_item.second.parity_chunks,
.pool_id = pool_id,
.pg_num = pg_num,
.reported_epoch = pg_cfg.epoch,
@@ -588,6 +615,10 @@ void osd_t::apply_pg_config()
.all_peers = std::vector<osd_num_t>(all_peers.begin(), all_peers.end()),
.target_set = pg_cfg.target_set,
};
if (pg.scheme == POOL_SCHEME_JERASURE)
{
use_jerasure(pg.pg_size, pg.pg_data_size, true);
}
this->pg_state_dirty.insert({ .pool_id = pool_id, .pg_num = pg_num });
pg.print_state();
if (pg_cfg.cur_primary == this->osd_num)
@@ -634,7 +665,21 @@ void osd_t::report_pg_states()
auto & pg = pg_it->second;
reporting_pgs.push_back({ *it, pg.history_changed });
std::string state_key_base64 = base64_encode(st_cli.etcd_prefix+"/pg/state/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num));
if (pg.state == PG_STARTING)
bool pg_state_exists = false;
if (pg.state != PG_STARTING)
{
auto pool_it = st_cli.pool_config.find(pg.pool_id);
if (pool_it != st_cli.pool_config.end())
{
auto pg_it = pool_it->second.pg_config.find(pg.pg_num);
if (pg_it != pool_it->second.pg_config.end() &&
pg_it->second.cur_state != 0)
{
pg_state_exists = true;
}
}
}
if (!pg_state_exists)
{
// Check that the PG key does not exist
// Failed check indicates an unsuccessful PG lock attempt in this case
@@ -646,9 +691,7 @@ void osd_t::report_pg_states()
}
else
{
// Check that the key is ours
// Failed check indicates success for OFFLINE pgs (PG lock is already deleted)
// and an unexpected race condition for started pgs (PG lock is held by someone else)
// Check that the key is ours if it already exists
checks.push_back(json11::Json::object {
{ "target", "LEASE" },
{ "lease", etcd_lease_id },
@@ -770,13 +813,16 @@ void osd_t::report_pg_states()
for (auto pp: reporting_pgs)
{
auto pg_it = this->pgs.find(pp.first);
if (pg_it != this->pgs.end())
if (pg_it != this->pgs.end() &&
pg_it->second.state == PG_OFFLINE &&
pg_state_dirty.find(pp.first) == pg_state_dirty.end())
{
if (pg_it->second.state == PG_OFFLINE)
// Forget offline PGs after reporting their state
if (pg_it->second.scheme == POOL_SCHEME_JERASURE)
{
// Remove offline PGs after reporting their state
this->pgs.erase(pg_it);
use_jerasure(pg_it->second.pg_size, pg_it->second.pg_data_size, false);
}
this->pgs.erase(pg_it);
}
}
// Push other PG state updates, if any

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "osd.h"
@@ -95,7 +95,7 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
{
// This flush batch is done
std::vector<osd_op_t*> continue_ops;
auto & pg = pgs[pg_id];
auto & pg = pgs.at(pg_id);
auto it = pg.flush_actions.begin(), prev_it = it;
auto erase_start = it;
while (1)
@@ -166,7 +166,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
{
// local
clock_gettime(CLOCK_REALTIME, &op->tv_begin);
op->bs_op = new blockstore_op_t({
op->bs_op = new blockstore_op_t((blockstore_op_t){
.opcode = (uint64_t)(rollback ? BS_OP_ROLLBACK : BS_OP_STABLE),
.callback = [this, op, pool_id, pg_num, fb](blockstore_op_t *bs_op)
{
@@ -188,7 +188,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
op->op_type = OSD_OP_OUT;
op->iov.push_back(op->buf, count * sizeof(obj_ver_id));
op->peer_fd = peer_fd;
op->req = {
op->req = (osd_any_op_t){
.sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
@@ -209,32 +209,38 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
{
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
if (!no_recovery)
{
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_DEGRADED)) == (PG_ACTIVE | PG_HAS_DEGRADED))
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{
for (auto obj_it = pg_it->second.degraded_objects.begin(); obj_it != pg_it->second.degraded_objects.end(); obj_it++)
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_DEGRADED)) == (PG_ACTIVE | PG_HAS_DEGRADED))
{
if (recovery_ops.find(obj_it->first) == recovery_ops.end())
for (auto obj_it = pg_it->second.degraded_objects.begin(); obj_it != pg_it->second.degraded_objects.end(); obj_it++)
{
op.degraded = true;
op.oid = obj_it->first;
return true;
if (recovery_ops.find(obj_it->first) == recovery_ops.end())
{
op.degraded = true;
op.oid = obj_it->first;
return true;
}
}
}
}
}
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
if (!no_rebalance)
{
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{
for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++)
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
{
if (recovery_ops.find(obj_it->first) == recovery_ops.end())
for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++)
{
op.degraded = false;
op.oid = obj_it->first;
return true;
if (recovery_ops.find(obj_it->first) == recovery_ops.end())
{
op.degraded = false;
op.oid = obj_it->first;
return true;
}
}
}
}
@@ -246,7 +252,7 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
{
op->osd_op = new osd_op_t();
op->osd_op->op_type = OSD_OP_OUT;
op->osd_op->req = {
op->osd_op->req = (osd_any_op_t){
.rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
@@ -258,15 +264,23 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
.len = 0,
},
};
if (log_level > 2)
{
printf("Submitting recovery operation for %lx:%lx\n", op->oid.inode, op->oid.stripe);
}
op->osd_op->callback = [this, op](osd_op_t *osd_op)
{
// Don't sync the write, it will be synced by our regular sync coroutine
if (osd_op->reply.hdr.retval < 0)
{
// Error recovering object
if (osd_op->reply.hdr.retval == -EPIPE)
{
// PG is stopped or one of the OSDs is gone, error is harmless
printf(
"Recovery operation failed with object %lx:%lx (PG %u/%u)\n",
op->oid.inode, op->oid.stripe, INODE_POOL(op->oid.inode),
map_to_pg(op->oid, st_cli.pool_config.at(INODE_POOL(op->oid.inode)).pg_stripe_size)
);
}
else
{
@@ -277,6 +291,17 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
op->osd_op = NULL;
recovery_ops.erase(op->oid);
delete osd_op;
if (immediate_commit != IMMEDIATE_ALL)
{
recovery_done++;
if (recovery_done >= recovery_sync_batch)
{
// Force sync every <recovery_sync_batch> operations
// This is required not to pile up an excessive amount of delete operations
autosync();
recovery_done = 0;
}
}
continue_recovery();
};
exec_op(op->osd_op);

View File

@@ -1,10 +1,11 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#define POOL_SCHEME_REPLICATED 1
#define POOL_SCHEME_XOR 2
#define POOL_SCHEME_JERASURE 3
#define POOL_ID_MAX 0x10000
#define POOL_ID_BITS 16
#define INODE_POOL(inode) (pool_id_t)((inode) >> (64 - POOL_ID_BITS))

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// License: VNPL-1.1 (see README.md for details)
#include "osd.h"

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h"
@@ -19,4 +19,5 @@ const char* osd_op_names[] = {
"primary_write",
"primary_sync",
"primary_delete",
"ping",
};

View File

@@ -1,5 +1,5 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
@@ -27,7 +27,8 @@
#define OSD_OP_WRITE 12
#define OSD_OP_SYNC 13
#define OSD_OP_DELETE 14
#define OSD_OP_MAX 14
#define OSD_OP_PING 15
#define OSD_OP_MAX 15
// Alignment & limit for read/write operations
#ifndef MEM_ALIGNMENT
#define MEM_ALIGNMENT 512

Some files were not shown because too many files have changed in this diff Show More