Compare commits

...

223 Commits

Author SHA1 Message Date
Vitaliy Filippov 9f2a948712 Make pg_stripe_size a per-pool config 2020-10-01 18:51:49 +03:00
Vitaliy Filippov ba74eece4a More fixes to the failure model (why am I doing this?..) 2020-10-01 18:38:30 +03:00
Vitaliy Filippov 2fdd8a1b38 More correct failure model (I hope so) 2020-10-01 02:33:48 +03:00
Vitaliy Filippov 526983f7a9 Add usable CLI commands for NBD proxy (map/unmap/list) 2020-09-29 02:06:19 +03:00
Vitaliy Filippov 8e36f04482 One more experiment with cluster AFR% 2020-09-27 19:42:42 +03:00
Vitaliy Filippov f460d8c1c8 Add note about NBD 2020-09-26 00:11:55 +03:00
Vitaliy Filippov 7619a789c0 Set request size in NBD 2020-09-26 00:01:23 +03:00
Vitaliy Filippov e65a28e27e Implement a simple NBD proxy (does not daemonize yet) 2020-09-25 20:51:01 +03:00
Vitaliy Filippov 6852f299ae Add functions to calculate AFR for a cluster 2020-09-24 23:15:26 +03:00
Vitaliy Filippov 1967269c13 Resume operations in cluster_client when PGs are loaded (fixes a hang in qemu-img) 2020-09-20 01:50:19 +03:00
Vitaliy Filippov 7574183ba6 Make qemu driver build with QEMU 3.x 2020-09-20 01:50:19 +03:00
Vitaliy Filippov 108cd6312d Correct some typos in README, add note about qemu-img 2020-09-20 01:50:19 +03:00
Vitaliy Filippov 588b9e6393 Add README 2020-09-17 23:07:50 +03:00
Vitaliy Filippov 0471b09b9c Add license notices to all source code files 2020-09-17 23:07:06 +03:00
Vitaliy Filippov ef911555ed Add cpp-btree and json11 submodules 2020-09-17 23:07:06 +03:00
Vitaliy Filippov 9d20839a02 Add license texts 2020-09-17 23:07:06 +03:00
Vitaliy Filippov 67a2e5640c Fix a GIANT memory leak on read :D 2020-09-17 00:45:59 +03:00
Vitaliy Filippov 28a0f08ce7 Add a very simple tool for calculating device offsets 2020-09-17 00:45:59 +03:00
Vitaliy Filippov 9b4e5b64ae Move monitor to mon/ 2020-09-16 02:15:26 +03:00
Vitaliy Filippov 4ca2eeafff Prefer data OSDs for EC/XOR because they can actually read something locally 2020-09-16 01:34:18 +03:00
Vitaliy Filippov 79156e0ee1 Add test systemd unit generation script 2020-09-13 12:50:15 +03:00
Vitaliy Filippov ed26c33f85 React to down OSDs instantly, set timer to recheck PGs after <osd_out_time> 2020-09-13 12:23:18 +03:00
Vitaliy Filippov 18692517be Increase receive_buffer_size 2020-09-13 00:39:20 +03:00
Vitaliy Filippov de6919b02b Add option to disable multiple overwrites of the same journal sector
This makes sense for some SSDs like Intel D3-4510 because they don't
like overwrites of the same sector:

$ fio -direct=1 -rw=write -bs=4k -size=4k -loops=100000 -iodepth=1
  write: IOPS=3142, BW=12.3MiB/s (12.9MB/s)(97.9MiB/7977msec)

$ fio -direct=1 -rw=write -bs=4k -size=128k -loops=100000 -iodepth=1
  write: IOPS=20.8k, BW=81.4MiB/s (85.3MB/s)(543MiB/6675msec)
2020-09-13 00:37:39 +03:00
Vitaliy Filippov 8f9f438e25 Allow zero reweights, fix changing pgs 2020-09-11 23:59:30 +03:00
Vitaliy Filippov db4b82089e connecting=true was also forgotten 2020-09-11 19:51:17 +03:00
Vitaliy Filippov faa871090f Do not die in mon on bad JSON in etcd 2020-09-11 19:25:52 +03:00
Vitaliy Filippov 49ec8c7c63 Add --verbose 1 flag for mon 2020-09-11 18:49:07 +03:00
Vitaliy Filippov eadd454992 Fix etcd key regexps 2020-09-11 18:33:49 +03:00
Vitaliy Filippov a15bd23ebd Missed a bad PG key 2020-09-11 17:56:18 +03:00
Vitaliy Filippov e3f502b466 Oops, I forgot a file 2020-09-11 17:49:44 +03:00
Vitaliy Filippov 6e72cf2732 Disable stdout/stderr buffering 2020-09-11 16:52:51 +03:00
Vitaliy Filippov 53832d184a Allow to use lazy sync with replicated pools 2020-09-06 12:08:44 +03:00
Vitaliy Filippov 352caeba14 Fix one more bug with replicated reads 2020-09-06 02:27:44 +03:00
Vitaliy Filippov fb533991b7 "Lock" retried objects from other flushers when accounting for overruns
Fixes a rare 100% CPU consuming hang
2020-09-06 02:19:36 +03:00
Vitaliy Filippov 73e26dbbea Add up_wait_retry_interval to config and fix it so it actually works 2020-09-05 22:05:21 +03:00
Vitaliy Filippov 44973e7f27 Fix replicated pool bugs 2020-09-05 21:45:04 +03:00
Vitaliy Filippov 242d9a42a2 Change object format in prints to %lx:%lx v%lu 2020-09-05 17:44:05 +03:00
Vitaliy Filippov 68c3e96e46 Add pool setting to fio and qemu drivers 2020-09-05 17:44:00 +03:00
Vitaliy Filippov cc4714a3a7 Basic fixes for the Monitor 2020-09-05 02:14:43 +03:00
Vitaliy Filippov e051db5a73 Check for unsuccessful memory allocations 2020-09-05 01:42:11 +03:00
Vitaliy Filippov 4f9b5286a0 Add replicated pool support to OSD logic
...in theory :-D now it needs some testing
2020-09-05 01:42:11 +03:00
Vitaliy Filippov 168cc2c803 Add pool support to OSD, part 1
This just fixes all the code so it builds and works like before,
but doesn't yet bring the support for replicated pools.
2020-09-04 17:04:17 +03:00
Vitaliy Filippov 4cdad634b5 Add pool support to the cluster client 2020-09-03 00:52:41 +03:00
Vitaliy Filippov 293cb5bd1d Parse pool configuration in etcd_state_client 2020-09-02 21:54:32 +03:00
Vitaliy Filippov 0918ea08fa Implement min/max inode filters in LIST operation 2020-09-02 14:42:40 +03:00
Vitaliy Filippov a8b3cbd6af Implement per-pool PG calculation, fix some lint warnings 2020-09-01 18:53:54 +03:00
Vitaliy Filippov fe0d78bf8e Filter configuration keys with regexps instead of "osd_tree" value checks 2020-09-01 17:07:06 +03:00
Vitaliy Filippov 085c145a18 Document etcd data (to-be state with pools) at least in some form 2020-09-01 16:29:45 +03:00
Vitaliy Filippov 30da4bddbe Extract scale_pg_count into a separate file 2020-09-01 16:18:58 +03:00
Vitaliy Filippov 14b4a4617e (re)move placement_tree 2020-09-01 16:18:58 +03:00
Vitaliy Filippov 3932c9b2e2 Add WRITE_STABLE to the secondary OSD for the upcoming replication support 2020-09-01 16:18:58 +03:00
Vitaliy Filippov 2e8c69fc5b Rename OSD_OP_SECONDARY_* to OSD_OP_SEC_* 2020-08-31 23:57:50 +03:00
Vitaliy Filippov a86788fe3b Support optimizing for the case when parity chunks occupy more space than data chunks
Mostly as an experiment because the problem solved by this commit
comes from Ceph's EC+compression implementation details and I'm not sure
if my implementation will be the same
2020-08-17 01:44:19 +03:00
Vitaliy Filippov 95ebfad283 Final name is Vitastor 2020-08-03 23:50:59 +03:00
Vitaliy Filippov 6022f28dc9 Add pseudo-random PG generation 2020-07-07 23:13:07 +03:00
Vitaliy Filippov 9d10a4d057 Support arbitrary pg_size in LPOptimizer 2020-07-05 20:28:05 +03:00
Vitaliy Filippov ec7acc8f3a Add WRITE_STABLE operation for future replication support 2020-07-05 01:48:02 +03:00
Vitaliy Filippov 416a80b099 Make blockstore object state a combination of type and workflow 2020-07-04 22:20:32 +03:00
Vitaliy Filippov a7929931eb Implement PG epochs to prevent the "version split"
The "version split" is when:
- A block is written to 1 OSD out of 3, all of them die
- OSDs 2 and 3 come up, the same block is written to both of them
- The remaining OSD comes up. Now all 3 OSDs have the same version of the same object,
  but with different data.
2020-07-04 00:55:27 +03:00
Vitaliy Filippov e680d6c1c3 Rename reconstruct_stripe and calc_rmw_parity to indicate that they are only for XOR N+1 2020-06-30 10:40:43 +03:00
Vitaliy Filippov 9b33f598d3 Fix two more cluster client bugs
1) Sync could delete an unfinished write due to the lack of ordering (fixed by introducing syncing_writes)
2) Writes could be postponed indefinitely due to bad resuming of operations after a sync
2020-06-27 02:13:35 +03:00
Vitaliy Filippov 592bcd3699 Fix QEMU driver bugs (QEMU and qemu-img now work! hooray!) 2020-06-26 18:25:43 +03:00
Vitaliy Filippov 5e1e39633d Implement QEMU block driver 2020-06-25 11:59:43 +03:00
Vitaliy Filippov 41c2655edd Disconnect sockets when read returns zero 2020-06-24 01:32:19 +03:00
Vitaliy Filippov d68370304e Support iovecs in cluster_client_t 2020-06-24 01:31:48 +03:00
Vitaliy Filippov a22d9f38aa Only use EPOLLOUT while connecting 2020-06-23 20:18:31 +03:00
Vitaliy Filippov 8736b3ad32 Add destructors, make ringloop optional in cluster_client_t 2020-06-23 20:10:33 +03:00
Vitaliy Filippov 62343c8022 Allow to turn synchronous recvmsg/sendmsg on with a config option 2020-06-23 01:15:07 +03:00
Vitaliy Filippov 9abaf5b735 Use epoll_manager in osd 2020-06-20 01:28:18 +03:00
Vitaliy Filippov badf68c039 Support iovecs for read operations 2020-06-19 19:47:05 +03:00
Vitaliy Filippov 0f6d193d73 Postpone op callbacks to the end of handle_read(), fix a bug where primary OSD could reply -EPIPE with data to a read operation 2020-06-16 01:36:38 +03:00
Vitaliy Filippov 27ee14a4e6 Fix bugs in cluster_client 2020-06-16 00:08:45 +03:00
Vitaliy Filippov 64afec03ec In theory, implement syncs and replay for the non-immediate commit mode 2020-06-15 00:04:16 +03:00
Vitaliy Filippov 4dde8b8a42 Oops, fix fio_sec_osd block_order parsing 2020-06-09 00:52:00 +03:00
Vitaliy Filippov f5ccb154af Benchmark reads in stub_bench, too 2020-06-08 01:54:44 +03:00
Vitaliy Filippov 73c80e2c39 Move accept_connections() to osd_messenger_t, add a simple uring OSD stub 2020-06-08 01:32:16 +03:00
Vitaliy Filippov 437dc5b630 Implement a FIO engine for testing cluster I/O 2020-06-07 00:30:15 +03:00
Vitaliy Filippov 226f5a2945 Allow to override block_size in fio_sec_osd 2020-06-07 00:10:13 +03:00
Vitaliy Filippov 2187d06eac Add a parameter to pass the initial config to client 2020-06-07 00:10:12 +03:00
Vitaliy Filippov c573bc6bb3 (Probably almost) implement cluster client 2020-06-07 00:09:36 +03:00
Vitaliy Filippov 2f6cf605a1 Rename cluster_client to osd_messenger 2020-06-04 12:57:54 +03:00
Vitaliy Filippov 05ea97119f Fix BS_OP_LIST to account for deleted objects: only list the newest stable entry of each object
This allows list responses to be unaffected by journal flushes, which, in turn,
fixes PG peering when a peer OSD is replaying journal and journal contains deletions
2020-06-02 23:52:48 +03:00
Vitaliy Filippov 571be0f380 Make deletions instantly stable
"2-phase" (write->stabilize) process is pointless for deletions because it
doesn't protect us from incomplete objects. This happens because it removes
the version information from metadata after stabilization. Deletions require
"3-phase" process with a potentially very long 3rd phase.

So, deletions will be allowed to generate degraded and incomplete objects,
and for it to not affect users' ability to delete something, the cluster
will allow to delete whole inodes while storing a list of them in etcd.
Proper TRIM will be impossible until the implementation of the aforementioned
"3-phase" process, though.

By the way, this change also fixes a possible write stall after rebalancing
which was caused by the lack of "stabilize delete" operations.
2020-06-02 23:45:22 +03:00
Vitaliy Filippov 985c309d7f Remove duplicate code between blockstore_{rollback,stable} and blockstore_init 2020-06-02 20:37:00 +03:00
Vitaliy Filippov a56f8cd14e Simplify handle_primary_subop() arguments 2020-06-02 18:44:23 +03:00
Vitaliy Filippov 46e111272f Replace assert(this_it == cur_op) with if() for the case of PG repeering 2020-06-02 14:30:57 +03:00
Vitaliy Filippov 165c204555 Fix BS_OP_DELETE (the implementation was untested up to this point) 2020-06-02 14:26:01 +03:00
Vitaliy Filippov af5cd45071 Oh crap, got SIGPIPE. Add MSG_NOSIGNAL 2020-06-02 11:41:08 +03:00
Vitaliy Filippov c3fe9ad0d1 Fix rebalancing writes (add a forgotten state resume) 2020-06-02 01:26:14 +03:00
Vitaliy Filippov 0fcdeae18b Do not die if a peer is already stopped on flush error 2020-06-01 23:07:08 +03:00
Vitaliy Filippov e6a4b634f8 Fix possible write stall
The stall occurred during fio Q=128 random write tests with low flusher_count (4).
It was caused by flushers being unable to flush the beginning of the journal
because it contained older writes to an object that also had writes in the very end
of the journal, after dirty_start.
2020-06-01 16:18:23 +03:00
Vitaliy Filippov c22e096943 Output journal offsets in debug trace in hex, add detailed "still waiting" messages 2020-06-01 16:18:19 +03:00
Vitaliy Filippov 45b1c2fbf1 Fix canceling of write operations on PG re-peer (which led to use-after-free, too...) 2020-06-01 16:18:14 +03:00
Vitaliy Filippov 3469bead67 Protect "delete this" with a stack refcounter
(to fix use-after-free, too, but "delete this" was a time bomb anyway)
2020-06-01 16:18:09 +03:00
Vitaliy Filippov 3a5d488f19 Fix use-after-free in osd_flush.cpp 2020-06-01 01:56:24 +03:00
Vitaliy Filippov 73e4e30b1f Auto-generate C++ header dependencies 2020-06-01 00:25:25 +03:00
Vitaliy Filippov 5feff1ffb9 Slightly cleanup socket send/receive code 2020-05-31 15:03:27 +03:00
Vitaliy Filippov b466e215f0 Fix queued OP_SYNC execution 2020-05-27 13:55:25 +03:00
Vitaliy Filippov 36f995367f Fix bind_address reporting 2020-05-27 10:58:40 +03:00
Vitaliy Filippov 0aca6e9ca8 Extract peer connect and read-write loop into a separate file (to be shared with the client library) 2020-05-26 22:11:30 +03:00
Vitaliy Filippov fa98be6bc0 Allow to specify multiple etcd addresses 2020-05-25 16:30:05 +03:00
Vitaliy Filippov 256a7f2667 Free op->bs_op manually 2020-05-25 15:31:22 +03:00
Vitaliy Filippov 79bf57b6e2 Allow to override pg_stripe_size 2020-05-25 15:31:22 +03:00
Vitaliy Filippov 53f6aba3e6 Die when journal_sector_buffer_count is too small 2020-05-24 17:26:47 +03:00
Vitaliy Filippov 36595eb669 Print "Ran out of journal sector buffers" warning 2020-05-24 16:48:50 +03:00
Vitaliy Filippov e09d0e0678 Several bug fixes
- Do not block flock() requests
- Fix stop_client(0) attempts leading to std::bad_function_call
- Fix degraded writes crashing due to an unset stripes[i].missing (at least with a missing parity device)
- Fix recovery B/W reporting
2020-05-24 01:51:35 +03:00
Vitaliy Filippov d1602b50b3 Fix BS_OP_ROLLBACK removing an incorrect version
Instead of only removing versions with oid == X and version > Y it was
also removing the previous version in list (with the previous oid or
with version == Y)
2020-05-24 01:51:28 +03:00
Vitaliy Filippov 7df384031a Re-peer PGs after stopping the peer
Fixes the bug where two peers killed at once have lead to PG state PG_DEGRADED|PG_HAS_INCOMPLETE instead of PG_INCOMPLETE
2020-05-23 18:45:12 +03:00
Vitaliy Filippov e614a98543 Add a sad FIXME :-) 2020-05-23 15:43:37 +03:00
Vitaliy Filippov 01dd3ef89e Fix timerfd_manager triggering of multiple times at the same time 2020-05-23 15:43:37 +03:00
Vitaliy Filippov cdccc23aff Print [OSD $osd_num] in stats, print B/W only for ops that log bytes 2020-05-23 15:43:37 +03:00
Vitaliy Filippov 700428829a Fix autosync_interval default not setting when autosync_interval is skipped in config 2020-05-23 15:43:37 +03:00
Vitaliy Filippov 6488d0044a Ignore EPOLL_CTL_DEL ENOENT, fix detection of the rollback version 2020-05-23 15:43:37 +03:00
Vitaliy Filippov 393fe75900 Fix creepy (osd_op_t*)(long) casts 2020-05-23 15:43:37 +03:00
Vitaliy Filippov f036eecf1c Fix osd_rmw object recovery case (len==0) 2020-05-23 15:43:37 +03:00
Vitaliy Filippov e56909fb45 Remove tv_send (unused) and timerfd_interval from blockstore 2020-05-22 15:57:08 +03:00
Vitaliy Filippov fac75b0b57 Handle reweights in mon 2020-05-22 12:52:27 +03:00
Vitaliy Filippov 9f842ec9a5 Remove connect callback because it is always the same 2020-05-22 12:45:12 +03:00
Vitaliy Filippov f6a01a4819 Extract "state-watching" etcd client into a separate file 2020-05-22 12:38:40 +03:00
Vitaliy Filippov 6202260018 Extract HTTP client functions from osd_t 2020-05-21 11:39:01 +03:00
Vitaliy Filippov a61ede9951 Remove io_uring usage from osd_http and timerfd_manager
For better future interoperability with external event loops such as QEMU's one
2020-05-21 01:25:38 +03:00
Vitaliy Filippov f57731f8ca Calculate total stats in the monitor 2020-05-15 01:37:17 +03:00
Vitaliy Filippov 19f25c7cd5 Handle integer overflow of the op_stat_count 2020-05-15 01:37:17 +03:00
Vitaliy Filippov 2c3e84cc41 Implement stop_all_pgs() 2020-05-15 01:37:17 +03:00
Vitaliy Filippov 7bda66b866 Do not crash when optimising PGs in an undersized cluster 2020-05-15 01:29:15 +03:00
Vitaliy Filippov b467d0559f Begin node.js storage monitor service 2020-05-15 01:29:15 +03:00
Vitaliy Filippov c2c2eefea4 Duplicate host in osd/state and osd/stats, take PGs from /config/pgs.items 2020-05-15 01:29:15 +03:00
Vitaliy Filippov 5084ff7c6c Measure & report recovery op count and bandwidth 2020-05-15 01:29:15 +03:00
Vitaliy Filippov 47b6f64106 Support level names 2020-05-11 15:57:21 +03:00
Vitaliy Filippov f71d0c117b Measure & report op bandwidth, include local blockstore ops in stats 2020-05-11 02:58:13 +03:00
Vitaliy Filippov 2b854948f9 Remove dead code 2020-05-09 16:15:02 +03:00
Vitaliy Filippov e7f897ed65 Report hostname to etcd 2020-05-09 02:33:43 +03:00
Vitaliy Filippov c26b6e1fc3 Support CRUSH-like multi-level placement trees 2020-05-09 00:55:24 +03:00
Vitaliy Filippov aaa054e644 Fix optimize_change generating infeasible problems
Mainly happened when removing PG combinations (removing OSDs)

Also randomize OSD combinations when there's a lot of them

Also remove Perl version
2020-05-08 16:42:40 +03:00
Vitaliy Filippov 706a44d4d4 Fix optimize_initial in both perl and js versions 2020-05-06 23:12:03 +03:00
Vitaliy Filippov 842f88f94f Rewrite LPOptimizer.pm to nodejs 2020-05-06 02:08:15 +03:00
Vitaliy Filippov e8149e5848 Implement OSD_OP_DELETE 2020-05-05 00:39:51 +03:00
Vitaliy Filippov 6355b968f4 Track osd_set history and all_peers separately 2020-05-04 15:28:07 +03:00
Vitaliy Filippov 00cf24fbd7 Split osd_primary.cpp 2020-05-03 11:04:20 +03:00
Vitaliy Filippov 1bc08174f9 Sync before listing objects so flushes do not fail thereafter 2020-05-01 12:56:49 +03:00
Vitaliy Filippov cd87333091 Fix PG state comparison leading to unclean PGs not flushing
(a & b == b) -> ((a & b) == b) !
2020-05-01 12:56:46 +03:00
Vitaliy Filippov bd0fe6e4cc Fix PGs not stopping during sync, fix state reporting autovivification of erased PGs 2020-05-01 01:33:14 +03:00
Vitaliy Filippov ce78454215 Reply with -EROFS to write commands in readonly mode 2020-05-01 00:54:34 +03:00
Vitaliy Filippov 762bd42096 Fix use-after-free caused by "delete this" in handle_read 2020-04-30 02:15:53 +03:00
Vitaliy Filippov 7b57eeeeb3 Implement PG state locking and PG moving in response to etcd events 2020-04-29 22:23:38 +03:00
Vitaliy Filippov ec4a52af48 Fix websocket (and timer!) bugs 2020-04-26 01:59:56 +03:00
Vitaliy Filippov 268b497c0b Implement simple websocket client 2020-04-25 23:11:50 +03:00
Vitaliy Filippov 35481925b1 Implement very simple HTTP streaming to handle etcd watches 2020-04-25 01:35:52 +03:00
Vitaliy Filippov 895a80dfc4 Fix etcd 3.2 compatibility (no compare.target == LEASE, /kv/lease/revoke), fix small bugs 2020-04-25 01:35:52 +03:00
Vitaliy Filippov caa01c6aaf Acquire etcd leases, prevent starting two OSDs with the same number 2020-04-25 01:35:52 +03:00
Vitaliy Filippov d398ddfd3b Use snake_case for etcd requests 2020-04-25 01:35:52 +03:00
Vitaliy Filippov 0f2b8dbf6f Use a single timerfd_manager for all timers 2020-04-25 01:35:49 +03:00
Vitaliy Filippov 4f42e9659e Use etcd instead of Consul 2020-04-24 01:03:55 +03:00
Vitaliy Filippov 7cf71a8031 Fix timerfd_manager: remove timer, then call callback 2020-04-21 12:45:18 +03:00
Vitaliy Filippov 9d22559bcf Start peering immediately when loading PGs 2020-04-21 02:27:13 +03:00
Vitaliy Filippov 8c03e3ebab Lock Blockstore devices exclusively by default 2020-04-21 01:59:11 +03:00
Vitaliy Filippov 2a640ba2e8 Remove range port selection (leads to races) 2020-04-21 00:10:59 +03:00
Vitaliy Filippov 6a21ea207e Check peer config (at least, number) after connecting 2020-04-21 00:08:54 +03:00
Vitaliy Filippov 642802b595 Auto-select port numbers 2020-04-20 17:45:27 +03:00
Vitaliy Filippov ff38b464a5 Add consul & connect timeouts, report state before loading PGs, move init_primary to osd_cluster 2020-04-20 15:43:07 +03:00
Vitaliy Filippov 663153713b Reconnect to peers after connecting drops 2020-04-19 01:01:26 +03:00
Vitaliy Filippov dc57c5c362 Report PG states again, clear PG history on reaching active+clean 2020-04-19 00:48:23 +03:00
Vitaliy Filippov f95299b769 Take PG history into account when starting PGs 2020-04-19 00:20:18 +03:00
Vitaliy Filippov 9126ffb0f9 Fix PG loading - now it works, at least once 2020-04-17 02:33:44 +03:00
Vitaliy Filippov 2a8e40835e Fix reporting to Consul, report even if we are purely secondary 2020-04-17 01:59:06 +03:00
Vitaliy Filippov 309486d746 Implement loading PGs from Consul (in theory) 2020-04-16 23:22:32 +03:00
Vitaliy Filippov 582f485578 Extract http & getifaddr_list into a separate file 2020-04-15 15:47:06 +03:00
Vitaliy Filippov 089b4eb208 Retry consul connection attempts and then die 2020-04-15 15:33:18 +03:00
Vitaliy Filippov d78ce509c6 Add simple timer manager 2020-04-15 13:41:44 +03:00
Vitaliy Filippov f3a7ccff50 Use 4K blockstore block by default, use MEM_ALIGNMENT in osd code 2020-04-14 19:19:56 +03:00
Vitaliy Filippov 37b27c3025 Implement basic OSD status reporting to Consul 2020-04-14 14:52:06 +03:00
Vitaliy Filippov edf6d6f897 Fix http_request 2020-04-12 02:08:00 +03:00
Vitaliy Filippov d11e8dcb5e Do not flush or recover in readonly mode 2020-04-11 12:06:18 +03:00
Vitaliy Filippov dd02bc1c44 Add base64 implementation 2020-04-11 12:06:18 +03:00
Vitaliy Filippov 298b013eae Add simple http request function 2020-04-11 12:05:58 +03:00
Vitaliy Filippov 0880a77c1a 2 FIXME for the future 2020-04-06 00:55:47 +03:00
Vitaliy Filippov aa849ea07b Add a test for missing chunk overwrite 2020-04-05 16:14:03 +03:00
Vitaliy Filippov d59be0e8b4 Delete misplaced chunks after moving the object, reset object state in primary_write 2020-04-05 15:51:22 +03:00
Vitaliy Filippov cf7de0f181 (Almost) Implement misplaced recovery, integrating it into calc_rmw() 2020-04-05 15:50:53 +03:00
Vitaliy Filippov 6212195440 Implement parallel recovery 2020-04-04 19:23:12 +03:00
Vitaliy Filippov dfb6e15eaa Implement graceful stopping of PGs 2020-04-03 13:03:42 +03:00
Vitaliy Filippov afe2e76c87 Implement regular automatic syncs, split osd_t constructor into some methods 2020-04-02 22:16:46 +03:00
Vitaliy Filippov 0f43f6d3f6 Fix crashes, print some stats
Notably:
- fix the `delete op` inside lambda callback crash (it frees the lambda itself
  which results in use-after-free with g++)
- fix stop_client() reenterability
- fix a bug in the blockstore layer which resulted in always returning version=0
  for zero-length reads
- change error codes for blockstore_stabilize
2020-03-31 17:55:31 +03:00
Vitaliy Filippov 92c800bb64 Forget unstable writes when re-peering, rename parity_block_size -> pg_stripe_size, pg_parity_size -> pg_block_size 2020-03-31 02:09:25 +03:00
Vitaliy Filippov 8a8b619875 Handle secondary OSD connection errors [in theory] 2020-03-30 19:51:34 +03:00
Vitaliy Filippov 43fe1d88e7 Fix memory leaks with subops, fix recovery crashes 2020-03-28 19:09:20 +03:00
Vitaliy Filippov 1b30120918 Fix stripe reconstruction in recovery, only write modified object parts 2020-03-28 13:58:42 +03:00
Vitaliy Filippov c0a22d825d Fix degraded object recovery (it seems to work now) 2020-03-25 02:17:41 +03:00
Vitaliy Filippov 7acfc95f75 CONFIG_HAVE_GETTID 2020-03-25 01:20:20 +03:00
Vitaliy Filippov 250f22c0b6 Implement basic degraded object recovery (integrated into primary_write) 2020-03-25 01:17:50 +03:00
Vitaliy Filippov dbd8418798 Reply using a single finish_op() method, allow to call OSD ops from inside the OSD 2020-03-24 00:18:52 +03:00
Vitaliy Filippov 036f4c5bf3 Fix unstable flushing, include extra OSDs with old object versions in osd_set 2020-03-23 20:28:47 +03:00
Vitaliy Filippov fd8e1a8418 Slightly reorganize object state check code 2020-03-23 00:42:17 +03:00
Vitaliy Filippov a08e0bfacd Treat misplaced and degraded as separate state parts 2020-03-23 00:40:31 +03:00
Vitaliy Filippov ddc3e927d3 Solve it in integers 2020-03-20 13:58:54 +03:00
Vitaliy Filippov 2aa605f2bb Do not check 2020-03-20 13:38:35 +03:00
Vitaliy Filippov 18915b264a Extract to .pm + fix all_combinations 2020-03-19 21:35:47 +03:00
Vitaliy Filippov 60f795e7eb Add lp_solve based data distribution optimizer 2020-03-19 17:23:24 +03:00
Vitaliy Filippov 3a4279adbf Hash-based PG distribution experiments 2020-03-17 18:52:39 +03:00
Vitaliy Filippov 1ec9794376 Extract flushing into a separate file 2020-03-15 18:39:31 +03:00
Vitaliy Filippov d8164e9d84 Print PG states on every change 2020-03-14 22:19:45 +03:00
Vitaliy Filippov 21d0b06959 Implement flushing (stabilize/rollback) of unstable entries on start of the PG 2020-03-14 02:49:34 +03:00
Vitaliy Filippov 46f9bd2a69 Make blockstore list operation return consistent snapshots 2020-03-14 02:10:25 +03:00
Vitaliy Filippov 6982fe1255 Do not block reads by previous unfinished writes 2020-03-13 21:28:49 +03:00
Vitaliy Filippov eba053febe Do not start small writes before finishing the last big write to the same object 2020-03-12 02:15:01 +03:00
Vitaliy Filippov 899946ff96 Add osd_test function to unblock an OSD blocked by the lack of journal space 2020-03-10 17:19:24 +03:00
Vitaliy Filippov 3dd1b22d55 Fix segfault with concurrent OP_SYNCs 2020-03-10 17:00:23 +03:00
Vitaliy Filippov 31f9445030 Use immediate_commit to benefit the primary OSD 2020-03-10 02:20:16 +03:00
Vitaliy Filippov 3f522c66e6 Implement immediate commit mode 2020-03-10 01:59:15 +03:00
Vitaliy Filippov c3737ae3ff Add journal fsync to stabilize/rollback 2020-03-09 00:35:58 +03:00
Vitaliy Filippov c863543bfe Fix possible journal corruption caused by concurrent flushing and writing of the same journal sector 2020-03-08 01:21:19 +03:00
Vitaliy Filippov 1696446545 Rename min/max _used to _flushed 2020-03-07 16:41:58 +03:00
Vitaliy Filippov 41dddddbf2 Fix some logging 2020-03-07 16:41:53 +03:00
Vitaliy Filippov 2d4e24c9ce Add journal dumper debugging tool 2020-03-06 02:29:43 +03:00
Vitaliy Filippov 844cacd357 Allow incorrectly forbidden BS_OP_LIST in readonly mode 2020-03-06 02:29:39 +03:00
Vitaliy Filippov e19d9fde5f Fix peering_pg, begin tests 2020-03-06 02:02:49 +03:00
Vitaliy Filippov 9cb07d844b Make [un]register_consumer operate on pointers, rename get_loop_again() to has_work() 2020-03-04 21:00:20 +03:00
Vitaliy Filippov 1e21555343 Add FIXME with Oops 2020-03-04 20:34:45 +03:00
Vitaliy Filippov 94cdbcd085 Stop reading when less than <buffer> data is available 2020-03-04 18:03:16 +03:00
Vitaliy Filippov 8315407558 Incoming data pre-buffering 2020-03-04 17:34:45 +03:00
Vitaliy Filippov b27ad550cf Use btree_map instead of sparsepp 2020-03-04 17:12:27 +03:00
Vitaliy Filippov 8e63995306 Allow to specify data area size 2020-03-04 02:32:49 +03:00
104 changed files with 16199 additions and 2777 deletions

6
.gitmodules vendored Normal file
View File

@ -0,0 +1,6 @@
[submodule "cpp-btree"]
path = cpp-btree
url = ../cpp-btree.git
[submodule "json11"]
path = json11
url = ../json11.git

339
GPL-2.0.txt Normal file
View File

@ -0,0 +1,339 @@
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.

46
Make-gen.pl Executable file
View File

@ -0,0 +1,46 @@
#!/usr/bin/perl
use strict;
my $deps = {};
for my $line (split /\n/, `grep '^#include "' *.cpp *.h`)
{
if ($line =~ /^([^:]+):\#include "([^"]+)"/s)
{
$deps->{$1}->{$2} = 1;
}
}
my $added;
do
{
$added = 0;
for my $file (keys %$deps)
{
for my $dep (keys %{$deps->{$file}})
{
if ($deps->{$dep})
{
for my $subdep (keys %{$deps->{$dep}})
{
if (!$deps->{$file}->{$subdep})
{
$added = 1;
$deps->{$file}->{$subdep} = 1;
}
}
}
}
}
} while ($added);
for my $file (sort keys %$deps)
{
if ($file =~ /\.cpp$/)
{
my $obj = $file;
$obj =~ s/\.cpp$/.o/s;
print "$obj: $file ".join(" ", sort keys %{$deps->{$file}})."\n";
print "\tg++ \$(CXXFLAGS) -c -o \$\@ \$\<\n";
}
}

205
Makefile
View File

@ -1,66 +1,169 @@
BLOCKSTORE_OBJS := allocator.o blockstore.o blockstore_impl.o blockstore_init.o blockstore_open.o blockstore_journal.o blockstore_read.o \
blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o timerfd_interval.o
blockstore_write.o blockstore_sync.o blockstore_stable.o blockstore_rollback.o blockstore_flush.o crc32c.o ringloop.o
# -fsanitize=address
CXXFLAGS := -g -O3 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fPIC -fdiagnostics-color=always
all: $(BLOCKSTORE_OBJS) libfio_blockstore.so osd libfio_sec_osd.so test_blockstore stub_osd stub_bench osd_test
all: libfio_blockstore.so osd libfio_sec_osd.so libfio_cluster.so stub_osd stub_uring_osd stub_bench osd_test dump_journal qemu_driver.so nbd_proxy
clean:
rm -f *.o
crc32c.o: crc32c.c
dump_journal: dump_journal.cpp crc32c.o blockstore_journal.h
g++ $(CXXFLAGS) -o $@ $< crc32c.o
libblockstore.so: $(BLOCKSTORE_OBJS)
g++ $(CXXFLAGS) -o $@ -shared $(BLOCKSTORE_OBJS) -ltcmalloc_minimal -luring
libfio_blockstore.so: ./libblockstore.so fio_engine.o json11.o
g++ $(CXXFLAGS) -shared -o $@ fio_engine.o json11.o ./libblockstore.so -ltcmalloc_minimal -luring
OSD_OBJS := osd.o osd_secondary.o msgr_receive.o msgr_send.o osd_peering.o osd_flush.o osd_peering_pg.o \
osd_primary.o osd_primary_subops.o etcd_state_client.o messenger.o osd_cluster.o http_client.o osd_ops.o pg_states.o \
osd_rmw.o json11.o base64.o timerfd_manager.o epoll_manager.o
osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
g++ $(CXXFLAGS) -o $@ osd_main.cpp $(OSD_OBJS) ./libblockstore.so -ltcmalloc_minimal -luring
stub_osd: stub_osd.o rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_osd.o rw_blocking.o -ltcmalloc_minimal
STUB_URING_OSD_OBJS := stub_uring_osd.o epoll_manager.o messenger.o msgr_send.o msgr_receive.o ringloop.o timerfd_manager.o json11.o
stub_uring_osd: $(STUB_URING_OSD_OBJS)
g++ $(CXXFLAGS) -o $@ -ltcmalloc_minimal $(STUB_URING_OSD_OBJS) -luring
stub_bench: stub_bench.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ stub_bench.cpp rw_blocking.o -ltcmalloc_minimal
osd_test: osd_test.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o $@ osd_test.cpp rw_blocking.o -ltcmalloc_minimal
osd_peering_pg_test: osd_peering_pg_test.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o $@ $< osd_peering_pg.o -ltcmalloc_minimal
libfio_sec_osd.so: fio_sec_osd.o rw_blocking.o
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ fio_sec_osd.o rw_blocking.o
FIO_CLUSTER_OBJS := cluster_client.o epoll_manager.o etcd_state_client.o \
messenger.o msgr_send.o msgr_receive.o ringloop.o json11.o http_client.o osd_ops.o pg_states.o timerfd_manager.o base64.o
libfio_cluster.so: fio_cluster.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $< $(FIO_CLUSTER_OBJS) -luring
nbd_proxy: nbd_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -o $@ $< $(FIO_CLUSTER_OBJS) -luring
qemu_driver.o: qemu_driver.c qemu_proxy.h
gcc -I qemu/b/qemu `pkg-config glib-2.0 --cflags` \
-I qemu/include $(CXXFLAGS) -c -o $@ $<
qemu_driver.so: qemu_driver.o qemu_proxy.o $(FIO_CLUSTER_OBJS)
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o $@ $< $(FIO_CLUSTER_OBJS) qemu_driver.o qemu_proxy.o -luring
test_blockstore: ./libblockstore.so test_blockstore.cpp timerfd_interval.o
g++ $(CXXFLAGS) -o test_blockstore test_blockstore.cpp timerfd_interval.o ./libblockstore.so -ltcmalloc_minimal -luring
test: test.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o test test.cpp osd_peering_pg.o -luring -lm
test_allocator: test_allocator.cpp allocator.o
g++ $(CXXFLAGS) -o test_allocator test_allocator.cpp allocator.o
crc32c.o: crc32c.c crc32c.h
g++ $(CXXFLAGS) -c -o $@ $<
json11.o: json11/json11.cpp
g++ $(CXXFLAGS) -c -o json11.o json11/json11.cpp
# Autogenerated
allocator.o: allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
base64.o: base64.cpp base64.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore.o: blockstore.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_flush.o: blockstore_flush.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_impl.o: blockstore_impl.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_init.o: blockstore_init.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_journal.o: blockstore_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_open.o: blockstore_open.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_read.o: blockstore_read.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_rollback.o: blockstore_rollback.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_stable.o: blockstore_stable.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_sync.o: blockstore_sync.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
blockstore_write.o: blockstore_write.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
cluster_client.o: cluster_client.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
dump_journal.o: dump_journal.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
epoll_manager.o: epoll_manager.cpp epoll_manager.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
etcd_state_client.o: etcd_state_client.cpp base64.h etcd_state_client.h http_client.h json11/json11.hpp object_id.h osd_id.h osd_ops.h pg_states.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_cluster.o: fio_cluster.cpp cluster_client.h epoll_manager.h etcd_state_client.h fio/fio.h fio/optgroup.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_engine.o: fio_engine.cpp blockstore.h fio/fio.h fio/optgroup.h json11/json11.hpp object_id.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
fio_sec_osd.o: fio_sec_osd.cpp fio/fio.h fio/optgroup.h object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
http_client.o: http_client.cpp http_client.h json11/json11.hpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
messenger.o: messenger.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_receive.o: msgr_receive.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
msgr_send.o: msgr_send.cpp json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
nbd_proxy.o: nbd_proxy.cpp cluster_client.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd.o: osd.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_cluster.o: osd_cluster.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_flush.o: osd_flush.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_main.o: osd_main.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_ops.o: osd_ops.cpp object_id.h osd_id.h osd_ops.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering.o: osd_peering.cpp base64.h blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg.o: osd_peering_pg.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg_test.o: osd_peering_pg_test.cpp cpp-btree/btree_map.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary.o: osd_primary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_primary_subops.o: osd_primary_subops.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h osd_primary.h osd_rmw.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw.o: osd_rmw.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw_test.o: osd_rmw_test.cpp malloc_or_die.h object_id.h osd_id.h osd_rmw.cpp osd_rmw.h test_pattern.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_secondary.o: osd_secondary.cpp blockstore.h cpp-btree/btree_map.h epoll_manager.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_test.o: osd_test.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h test_pattern.h
g++ $(CXXFLAGS) -c -o $@ $<
pg_states.o: pg_states.cpp pg_states.h
g++ $(CXXFLAGS) -c -o $@ $<
qemu_proxy.o: qemu_proxy.cpp cluster_client.h etcd_state_client.h http_client.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h qemu_proxy.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
ringloop.o: ringloop.cpp ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_interval.o: timerfd_interval.cpp timerfd_interval.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
%.o: %.cpp allocator.h blockstore_flush.h blockstore.h blockstore_impl.h blockstore_init.h blockstore_journal.h crc32c.h ringloop.h timerfd_interval.h object_id.h
g++ $(CXXFLAGS) -c -o $@ $<
libblockstore.so: $(BLOCKSTORE_OBJS)
g++ $(CXXFLAGS) -o libblockstore.so -shared $(BLOCKSTORE_OBJS) -ltcmalloc_minimal -luring
libfio_blockstore.so: ./libblockstore.so fio_engine.cpp json11.o
g++ $(CXXFLAGS) -shared -o libfio_blockstore.so fio_engine.cpp json11.o ./libblockstore.so -ltcmalloc_minimal -luring
OSD_OBJS := osd.o osd_secondary.o osd_receive.o osd_send.o osd_peering.o osd_peering_pg.o osd_primary.o osd_rmw.o json11.o timerfd_interval.o
osd_secondary.o: osd_secondary.cpp osd.h osd_ops.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_receive.o: osd_receive.cpp osd.h osd_ops.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_send.o: osd_send.cpp osd.h osd_ops.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering.o: osd_peering.cpp osd.h osd_ops.h osd_peering_pg.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_peering_pg.o: osd_peering_pg.cpp object_id.h osd_peering_pg.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw.o: osd_rmw.cpp osd_rmw.h xor.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_rmw_test: osd_rmw_test.cpp osd_rmw.cpp osd_rmw.h xor.h
g++ $(CXXFLAGS) -o $@ $<
osd_primary.o: osd_primary.cpp osd.h osd_ops.h osd_peering_pg.h xor.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
osd.o: osd.cpp osd.h osd_ops.h osd_peering_pg.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
osd: ./libblockstore.so osd_main.cpp osd.h osd_ops.h $(OSD_OBJS)
g++ $(CXXFLAGS) -o osd osd_main.cpp $(OSD_OBJS) ./libblockstore.so -ltcmalloc_minimal -luring
stub_osd: stub_osd.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o stub_osd stub_osd.cpp rw_blocking.o -ltcmalloc_minimal
stub_bench: stub_bench.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o stub_bench stub_bench.cpp rw_blocking.o -ltcmalloc_minimal
rw_blocking.o: rw_blocking.cpp rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
osd_test: osd_test.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -o osd_test osd_test.cpp rw_blocking.o -ltcmalloc_minimal
libfio_sec_osd.so: fio_sec_osd.cpp osd_ops.h rw_blocking.o
g++ $(CXXFLAGS) -ltcmalloc_minimal -shared -o libfio_sec_osd.so fio_sec_osd.cpp rw_blocking.o -luring
test_blockstore: ./libblockstore.so test_blockstore.cpp
g++ $(CXXFLAGS) -o test_blockstore test_blockstore.cpp ./libblockstore.so -ltcmalloc_minimal -luring
test: test.cpp osd_peering_pg.o
g++ $(CXXFLAGS) -o test test.cpp osd_peering_pg.o -luring
test_allocator: test_allocator.cpp allocator.o
g++ $(CXXFLAGS) -o test_allocator test_allocator.cpp allocator.o
stub_bench.o: stub_bench.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_osd.o: stub_osd.cpp object_id.h osd_id.h osd_ops.h rw_blocking.h
g++ $(CXXFLAGS) -c -o $@ $<
stub_uring_osd.o: stub_uring_osd.cpp epoll_manager.h json11/json11.hpp malloc_or_die.h messenger.h object_id.h osd_id.h osd_ops.h ringloop.h timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<
test.o: test.cpp allocator.h blockstore.h blockstore_flush.h blockstore_impl.h blockstore_init.h blockstore_journal.h cpp-btree/btree_map.h crc32c.h malloc_or_die.h object_id.h osd_id.h osd_ops.h osd_peering_pg.h pg_states.h ringloop.h
g++ $(CXXFLAGS) -c -o $@ $<
test_allocator.o: test_allocator.cpp allocator.h
g++ $(CXXFLAGS) -c -o $@ $<
test_blockstore.o: test_blockstore.cpp blockstore.h object_id.h ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_interval.o: timerfd_interval.cpp ringloop.h timerfd_interval.h
g++ $(CXXFLAGS) -c -o $@ $<
timerfd_manager.o: timerfd_manager.cpp timerfd_manager.h
g++ $(CXXFLAGS) -c -o $@ $<

387
README.md Normal file
View File

@ -0,0 +1,387 @@
## Vitastor
## The Idea
Make Software-Defined Block Storage Great Again.
Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
architecturally similar to Ceph which means strong consistency, primary-replication, symmetric
clustering and automatic data distribution over any number of drives of any size
with configurable redundancy (replication or erasure codes/XOR).
## Features
Vitastor is currently a pre-release, a lot of features are missing and you can still expect
breaking changes in the future. However, the following is implemented:
- Basic part: highly-available block storage with symmetric clustering and no SPOF
- Performance ;-D
- Two redundancy schemes: Replication and XOR n+1 (simplest case of EC)
- Configuration via simple JSON data structures in etcd
- Automatic data distribution over OSDs, with support for:
- Mathematical optimization for better uniformity and less data movement
- Multiple pools
- Placement tree
- Configurable failure domains
- Recovery of degraded blocks
- Rebalancing (data movement between OSDs)
- Lazy fsync support
- I/O statistics reporting to etcd
- Generic user-space client library
- QEMU driver (built out-of-tree)
- Loadable fio engine for benchmarks (also built out-of-tree)
- NBD proxy for kernel mounts
## Roadmap
- Packaging for Debian and, probably, CentOS too
- OSD creation tool (OSDs currently have to be created by hand)
- Inode deletion tool (currently you can't delete anything :))
- Other administrative tools
- Per-inode I/O and space usage statistics
- jerasure EC support with any number of data and parity drives in a group
- Parallel usage of multiple network interfaces
- Proxmox and OpenNebula plugins
- iSCSI proxy
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Operation timeouts and better failure detection
- Checksums
- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
- RDMA and NVDIMM support
- Compression (possibly)
- Read caching using system page cache (possibly)
## Architecture
Similarities:
- Just like Ceph, Vitastor has Pools, PGs, OSDs, Monitors, Failure Domains, Placement Tree.
- Just like Ceph, Vitastor is transactional (even though there's a "lazy fsync mode" which
doesn't implicitly flush every operation to disks).
- OSDs also have journal and metadata and they can also be put on separate drives.
- Just like in Ceph, client library attempts to recover from any cluster failure so
you can basically reboot the whole cluster and only pause, but not crash, your clients
(I consider this a bug if the client crashes in that case).
Some basic terms for people not familiar with Ceph:
- OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
- PG (Placement Group) is a container for data that (normally) shares the same replicas.
- Pool is a container for data that has the same redundancy scheme and placement rules.
- Monitor is a separate daemon that watches cluster state and handles failures.
- Failure Domain is a group of OSDs that you allow to fail. It's "host" by default.
- Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains.
Architectural differences from Ceph:
- Vitastor's primary focus is on SSDs. Proper SSD+HDD optimizations may be added in the future, though.
- Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core
per drive you should run multiple OSDs each on a different partition of the drive.
Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
- Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
and data store block size which is 128 KB by default. With 128 KB blocks, metadata should occupy
around 512 MB per 1 TB (which is still less than Ceph wants). Journal doesn't have to be big,
the example test below was conducted with only 16 MB journal. A big journal is probably even
harmful as dirty write metadata also take some memory.
- Vitastor storage layer doesn't have internal copy-on-write or redirect-write. I know that maybe
it's possible to create a good copy-on-write storage, but it's much harder and makes performance
less deterministic, so CoW isn't used in Vitastor.
- The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with
rich semantics like in Ceph (RADOS).
- There's a "lazy fsync" mode which allows to batch writes before flushing them to the disk.
This allows to use Vitastor with desktop SSDs, but still lowers performance due to additional
network roundtrips, so use server SSDs with capacitor-based power loss protection
("Advanced Power Loss Protection") for best performance.
- PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory
while OSDs are running.
- Recovery process is per-object (per-block), not per-PG. Also there are no PGLOGs.
- Monitors don't store data. Cluster configuration and state is stored in etcd in simple human-readable
JSON structures. Monitors only watch cluster state and handle data movement.
Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's Manager.
Vitastor's Monitor is implemented in node.js.
- PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd.
Rebalancing PGs between OSDs is done by mathematical optimization - data distribution problem
is reduced to a linear programming problem and solved by lp_solve. This allows for almost
perfect (96-99% uniformity compared to Ceph's 80-90%) data distribution in most cases, ability
to map PGs by hand without breaking rebalancing logic, reduced OSD peer-to-peer communication
(on average, OSDs have fewer peers) and less data movement. It also probably has a drawback -
this method may fail in very large clusters, but up to several hundreds of OSDs it's perfectly fine.
It's also easy to add consistent hashes in the future if something proves their necessity.
- There's no separate CRUSH layer. You select pool redundancy scheme, placement root, failure domain
and so on directly in pool configuration.
## Understanding Storage Performance
The most important thing for fast storage is latency, not parallel iops.
The best possible latency is achieved with one thread and queue depth of 1 which basically means
"client load as low as possible". In this case IOPS = 1/latency, and this number doesn't
scale with number of servers, drives, server processes or threads and so on.
Single-threaded IOPS and latency numbers only depend on *how fast a single daemon is*.
Why is it important? It's important because some of the applications *can't* use
queue depth greater than 1 because their task isn't parallelizable. A notable example
is any ACID DBMS because all of them write their WALs sequentially with fsync()s.
fsync, by the way, is another important thing often missing in benchmarks. The point is
that drives have cache buffers and don't guarantee that your data is actually persisted
until you call fsync() which is translated to a FLUSH CACHE command by the OS.
Desktop SSDs are very fast without fsync - NVMes, for example, can process ~80000 write
operations per second with queue depth of 1 without fsync - but they're really slow with
fsync because they have to actually write data to flash chips when you call fsync. Typical
number is around 1000-2000 iops with fsync.
Server SSDs often have supercapacitors that act as a built-in UPS and allow the drive
to flush its DRAM cache to the persistent flash storage when a power loss occurs.
This makes them perform equally well with and without fsync. This feature is called
"Advanced Power Loss Protection" by Intel; other vendors either call it similarly
or directly as "Full Capacitor-Based Power Loss Protection".
All software-defined storages that I currently know are slow in terms of latency.
Notable examples are Ceph and internal SDSes used by cloud providers like Amazon, Google,
Yandex and so on. They're all slow and can only reach ~0.3ms read and ~0.6ms 4 KB write latency
with best-in-slot hardware.
And that's in the SSD era when you can buy an SSD that has ~0.04ms latency for 100 $.
I use the following 6 commands with small variations to benchmark any storage:
- Linear write:
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Linear read:
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Random write latency (this hurts storages the most):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Random read latency:
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Parallel write iops (use numjobs if a single CPU core is insufficient to saturate the load):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Parallel read iops (use numjobs if a single CPU core is insufficient to saturate the load):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
## Vitastor's Theoretical Maximum Random Access Performance
Replicated setups:
- Single-threaded (T1Q1) read latency: 1 network roundtrip + 1 disk read.
- Single-threaded write+fsync latency:
- With immediate commit: 2 network roundtrips + 1 disk write.
- With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
EC/XOR setups:
- Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
- Single-threaded write+fsync latency:
- With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
- With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
- 0.5 in actually (k-1)/k which means that an additional roundtrip doesn't happen when
the read sub-operation can be served locally.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
In fact, you should put disk write iops under the condition of ~10% reads / ~90% writes in this formula.
Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
1. Journal block write
2. Journal data write
3. Metadata block write
4. Another journal block write for EC/XOR setups
5. Data block write
If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.
Lazy fsync also reduces WA for parallel workloads because journal blocks are only
written when they fill up or fsync is requested.
## Example Comparison with Ceph
Hardware configuration: 4 nodes, each with:
- 6x SATA SSD Intel D3-4510 3.84 TB
- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
- 384 GB RAM
- 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch
CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.
All of the results below apply to 4 KB blocks.
Raw drive performance:
- T1Q1 write ~27000 iops (~0.037ms latency)
- T1Q1 read ~9800 iops (~0.101ms latency)
- T1Q32 write ~60000 iops
- T1Q32 read ~81700 iops
Ceph 15.2.4 (Bluestore):
- T1Q1 write ~1000 iops (~1ms latency)
- T1Q1 read ~1750 iops (~0.57ms latency)
- T8Q64 write ~100000 iops, total CPU usage by OSDs about 40 virtual cores on each node
- T8Q64 read ~480000 iops, total CPU usage by OSDs about 40 virtual cores on each node
T8Q64 tests were conducted over 8 400GB RBD images from all hosts (every host was running 2 instances of fio).
This is because Ceph has performance penalties related to running multiple clients over a single RBD image.
cephx_sign_messages was set to false during tests, RocksDB and Bluestore settings were left at defaults.
In fact, not that bad for Ceph. These servers are an example of well-balanced Ceph nodes.
However, CPU usage and I/O latency were through the roof, as usual.
Vitastor:
- T1Q1 write: 7087 iops (0.14ms latency)
- T1Q1 read: 6838 iops (0.145ms latency)
- T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
- T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
T8Q64 read test was conducted over 1 larger inode (3.2T) from all hosts (every host was running 2 instances of fio).
Vitastor has no performance penalties related to running multiple clients over a single inode.
If conducted from one node with all primary OSDs moved to other nodes the result was slightly lower (689000 iops),
this is because all operations resulted in network roundtrips between the client and the primary OSD.
When fio was colocated with OSDs (like in Ceph benchmarks above), 1/4 of the read workload actually
used the loopback network.
Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
--journal_size 16777216`.
### NBD
NBD is currently required to mount Vitastor via kernel, but it imposes additional overhead
due to additional copying between the kernel and userspace. This mostly hurts linear
bandwidth, not iops.
Vitastor with single-thread NBD on the same hardware:
- T1Q1 write: 6000 iops (0.166ms latency)
- T1Q1 read: 5518 iops (0.18ms latency)
- T1Q128 write: 94400 iops
- T1Q128 read: 103000 iops
- Linear write (4M T1Q128): 1266 MB/s (compared to 2600 MB/s via fio)
- Linear read (4M T1Q128): 975 MB/s (compared to 1400 MB/s via fio)
## Building
- Install Linux kernel 5.4 or newer for io_uring support.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd.
- Install node.js 12 or newer.
- Install gcc and g++ 9.x.
- Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
- Install QEMU 4.x or 5.x, get its source, begin to build it, stop the build and copy headers:
- `<qemu>/include` &rarr; `<vitastor>/qemu/include`
- Debian:
* Use qemu packages from the main repository
* `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
* `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
- CentOS 8:
* Use qemu packages from the Advanced-Virtualization repository. To enable it, run
`yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
* `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
* `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
- `config-host.h` and `qapi` are required because they contain generated headers
- Install fio 3.16, get its source and symlink it into `<vitastor>/fio`. It doesn't currently
build with fio 3.20 or newer due to the conflicts between g++ and gcc's atomics. This will
be fixed in the future.
- Build Vitastor with `make -j8`.
- Copy binaries somewhere.
## Running
Please note that startup procedure isn't currently simple - you specify configuration
and calculate disk offsets almost by hand. This will be fixed in near future.
- Get some SATA or NVMe SSDs with capacitors (server-grade drives). You can use desktop SSDs
with lazy fsync, but prepare for inferior single-thread latency.
- Get a fast network (at least 10 Gbit/s).
- Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Install etcd with `--max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision` options.
- Create global configuration in etcd: `etcdctl put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives have capacitors).
- Create pool configuration in etcd: `etcdctl put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
- Calculate offsets for your drives with `node ./mon/simple-offsets.js /dev/sdX`.
- Make systemd units for your OSDs. Look at `./mon/make-units.sh` for example.
Notable configuration variables from the example:
- `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
- `immediate_commit all` - use this if all your drives are server-grade.
- `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
- `flusher_count 16` - flusher is a micro-thread that removes old data from the journal.
More flushers mean more aggressive journal flushing which allows for more throughput
but slightly hurts latency under less load. Flushing will probably be improved in the future
because currently high queue depths sometimes lead to performance degradation.
- `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
block size of your SSDs which is 4096 on most drives.
- `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
Some SSDs (like Intel D3-4510) don't like such overwrites so they benefit from this setting.
When this setting is set, it is also required to raise `journal_sector_buffer_count` setting,
which is the number of dirty journal sectors that may be written to at the same time.
- `systemctl start vitastor.target` everywhere.
- Start any number of monitors: `cd mon; node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5`.
- At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl get --prefix /vitastor/pg/state`. All PGs should become 'active'.
- Run tests with (for example): `fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Upload VM disk image with qemu-img (for example):
```
LD_PRELOAD=./qemu_driver.so qemu-img convert -f qcow2 debian10.qcow2 -p
-O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```
- Run QEMU with (for example):
```
LD_PRELOAD=./qemu_driver.so qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
## Known Problems
- OSDs may currently crash with "can't get SQE, will fall out of sync with EPOLLET"
if you try to load them with very long iodepths because io_uring queue (ring) is limited
and OSDs don't check if it fills up.
- Object deletion requests may currently lead to unfound objects on crashes because
proper handling of deletions in a cluster requires a "three-phase cleanup process"
and it's currently not implemented. In fact, even though deletion requests are
implemented, there's no user tool to delete anything from the cluster yet :).
Of course I'll create such tool, but its first implementation will be vulnerable to this issue.
It's not a big deal though, because you'll be able to just repeat the deletion request
in this case.
## Implementation Principles
- I like simple and stupid solutions, so expect Vitastor to stay simple.
- I also like reinventing the wheel to some extent, like writing my own HTTP client
for etcd interaction instead of using prebuilt libraries, because in this case
I'm confident about what my code does and what it doesn't do.
- I don't care about C++ "best practices" like RAII or proper inheritance or usage of
smart pointers or whatever and I don't intend to change my mind, so if you're here
looking for ideal reference C++ code, this probably isn't the right place.
- I like node.js better than any other dynamically-typed language interpreter
because it's faster than any other interpreter in the world, has neutral C-like
syntax and built-in event loop. That's why Monitor is implemented in node.js.
## Author and License
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network ("Proxy Programs"). Proxy Programs may be made public
not only under the terms of the same license, but also under the terms of any
GPL-Compatible Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.
Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.
Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.
You can find the full text of VNPL-1.0 in the file [VNPL-1.0.txt](VNPL-1.0.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

645
VNPL-1.0.txt Normal file
View File

@ -0,0 +1,645 @@
VITASTOR NETWORK PUBLIC LICENSE
Version 1, 17 September 2020
Copyright (C) 2020 Vitaliy Filippov <vitalif@yourcmc.ru>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The Vitastor Network Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
GNU General Public Licenses and Vitastor Network Public License are
intended to guarantee your freedom to share and change all versions
of a program--to make sure it remains free software for all its users.
When we speak of free software, we are referring to freedom, not
price. GNU General Public Licenses and Vitastor Network Public License
are designed to make sure that you have the freedom to distribute copies
of free software (and charge for them if you wish), that you receive
source code or can get it if you want it, that you can change the software
or use pieces of it in new free programs, and that you know you can do these
things.
Developers that use GNU General Public Licenses and Vitastor
Network Public License protect your rights with two steps:
(1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.
A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public. Even the GNU Affero General Public License
permits running a modified version in a closed environment where
public users only interact with it through a closed-source proxy, again,
without making the program and the proxy available to the public
for free.
The Vitastor Network Public License is designed specifically to
ensure that, in such cases, the modified program and the proxy stays
available to the community. It requires the operator of a network server to
provide the source code of the original program and all other programs
communicating with it running there to the users of that server.
Therefore, public use of a modified version, on a server accessible
directly or indirectly to the public, gives the public access to the source
code of the modified version.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 1 of the Vitastor Network Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction.
Notwithstanding any other provision of this License, if you provide
any user an opportunity to interact with the covered work directly
or indirectly through a computer network, an imitation of such network,
or an additional program (hereinafter referred to as a "Proxy Program")
that, in turn, interacts with the covered work through a computer network,
an imitation of such network, or another Proxy Program itself,
you must prominently offer that user an opportunity to receive the
Corresponding Source of the covered work and all Proxy Programs from a
network server at no charge, through some standard or customary means of
facilitating copying of software. The Corresponding Source for the covered
work must be made available under the conditions of this License, and
the Corresponding Source for all Proxy Programs must be made available
under the conditions of either this License or any GPL-Compatible
Free Software License, as described by the Free Software Foundation
in their "GPL-Compatible License List".
14. Revised Versions of this License.
Vitastor Author may publish revised and/or new versions of
the Vitastor Network Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the Vitastor Network
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version. If the Program does not specify a version
number of the Vitastor Network Public License, you may choose any version
ever published.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the Vitastor Network Public License as published by
the Vitastor Author, either version 1 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
Vitastor Network Public License for more details.
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <stdexcept>
#include "allocator.h"

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include <stdint.h>

55
base64.cpp Normal file
View File

@ -0,0 +1,55 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "base64.h"
std::string base64_encode(const std::string &in)
{
std::string out;
unsigned val = 0;
int valb = -6;
for (unsigned char c: in)
{
val = (val << 8) + c;
valb += 8;
while (valb >= 0)
{
out.push_back("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[(val>>valb) & 0x3F]);
valb -= 6;
}
}
if (valb > -6)
out.push_back("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[((val<<8)>>(valb+8)) & 0x3F]);
while (out.size() % 4)
out.push_back('=');
return out;
}
static char T[256] = { 0 };
std::string base64_decode(const std::string &in)
{
std::string out;
if (T[0] == 0)
{
for (int i = 0; i < 256; i++)
T[i] = -1;
for (int i = 0; i < 64; i++)
T[(unsigned char)("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[i])] = i;
}
unsigned val = 0;
int valb = -8;
for (unsigned char c: in)
{
if (T[c] == -1)
break;
val = (val<<6) + T[c];
valb += 6;
if (valb >= 0)
{
out.push_back(char((val >> valb) & 0xFF));
valb -= 8;
}
}
return out;
}

8
base64.h Normal file
View File

@ -0,0 +1,8 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include <string>
std::string base64_encode(const std::string &in);
std::string base64_decode(const std::string &in);

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
@ -55,6 +58,11 @@ uint64_t blockstore_t::get_block_count()
return impl->get_block_count();
}
uint64_t blockstore_t::get_free_block_count()
{
return impl->get_free_block_count();
}
uint32_t blockstore_t::get_disk_alignment()
{
return impl->get_disk_alignment();

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#ifndef _LARGEFILE64_SOURCE
@ -15,7 +18,9 @@
// Memory alignment for direct I/O (usually 512 bytes)
// All other alignments must be a multiple of this one
#ifndef MEM_ALIGNMENT
#define MEM_ALIGNMENT 512
#endif
// Default block size is 128 KB, current allowed range is 4K - 128M
#define DEFAULT_ORDER 17
@ -25,13 +30,14 @@
#define BS_OP_MIN 1
#define BS_OP_READ 1
#define BS_OP_WRITE 2
#define BS_OP_SYNC 3
#define BS_OP_STABLE 4
#define BS_OP_DELETE 5
#define BS_OP_LIST 6
#define BS_OP_ROLLBACK 7
#define BS_OP_SYNC_STAB_ALL 8
#define BS_OP_MAX 8
#define BS_OP_WRITE_STABLE 3
#define BS_OP_SYNC 4
#define BS_OP_STABLE 5
#define BS_OP_DELETE 6
#define BS_OP_LIST 7
#define BS_OP_ROLLBACK 8
#define BS_OP_SYNC_STAB_ALL 9
#define BS_OP_MAX 9
#define BS_OP_PRIVATE_DATA_SIZE 256
@ -39,9 +45,9 @@
Blockstore opcode documentation:
## BS_OP_READ / BS_OP_WRITE
## BS_OP_READ / BS_OP_WRITE / BS_OP_WRITE_STABLE
Read or write object data.
Read or write object data. WRITE_STABLE writes a version that doesn't require marking as stable.
Input:
- oid = requested object
@ -50,6 +56,7 @@ Input:
- version == 0: read the last stable version,
- version == UINT64_MAX: read the last version,
- otherwise: read the newest version that is <= the specified version
- reads aren't guaranteed to return data from previous unfinished writes
For writes:
- if version == 0, a new version is assigned automatically
- if version != 0, it is assigned for the new write if possible, otherwise -EINVAL is returned
@ -92,7 +99,7 @@ Input:
- buf = pre-allocated obj_ver_id array <len> units long
Output:
- retval = 0 or negative error number (-EINVAL)
- retval = 0 or negative error number (-EINVAL, -ENOENT if no such version or -EBUSY if not synced)
## BS_OP_SYNC_STAB_ALL
@ -110,6 +117,8 @@ Input:
- oid.stripe = PG alignment
- len = PG count or 0 to list all objects
- offset = PG number
- oid.inode = min inode number or 0 to list all inodes
- version = max inode number or 0 to list all inodes
Output:
- retval = total obj_ver_id count
@ -175,6 +184,7 @@ public:
// FIXME rename to object_size
uint32_t get_block_size();
uint64_t get_block_count();
uint64_t get_free_block_count();
uint32_t get_disk_alignment();
};

View File

@ -1,14 +1,19 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
{
this->bs = bs;
this->flusher_count = flusher_count;
dequeuing = false;
active_flushers = 0;
sync_threshold = flusher_count == 1 ? 1 : flusher_count/2;
journal_trim_interval = sync_threshold;
syncing_flushers = 0;
flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
journal_trim_interval = flusher_start_threshold;
journal_trim_counter = 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign(MEM_ALIGNMENT, bs->journal_block_size);
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
co = new journal_flusher_co[flusher_count];
for (int i = 0; i < flusher_count; i++)
{
@ -55,17 +60,13 @@ journal_flusher_t::~journal_flusher_t()
bool journal_flusher_t::is_active()
{
return active_flushers > 0 || start_forced && flush_queue.size() > 0 || flush_queue.size() >= sync_threshold;
return active_flushers > 0 || dequeuing;
}
void journal_flusher_t::loop()
{
for (int i = 0; i < flusher_count; i++)
for (int i = 0; (active_flushers > 0 || dequeuing) && i < flusher_count; i++)
{
if (!active_flushers && (start_forced ? !flush_queue.size() : (flush_queue.size() < sync_threshold)))
{
return;
}
co[i].loop();
}
}
@ -83,6 +84,11 @@ void journal_flusher_t::enqueue_flush(obj_ver_id ov)
flush_versions[ov.oid] = ov.version;
flush_queue.push_back(ov.oid);
}
if (!dequeuing && flush_queue.size() >= flusher_start_threshold)
{
dequeuing = true;
bs->ringloop->wakeup();
}
}
void journal_flusher_t::unshift_flush(obj_ver_id ov)
@ -98,14 +104,25 @@ void journal_flusher_t::unshift_flush(obj_ver_id ov)
flush_versions[ov.oid] = ov.version;
flush_queue.push_front(ov.oid);
}
if (!dequeuing && flush_queue.size() >= flusher_start_threshold)
{
dequeuing = true;
bs->ringloop->wakeup();
}
}
void journal_flusher_t::force_start()
void journal_flusher_t::request_trim()
{
start_forced = true;
dequeuing = true;
trim_wanted++;
bs->ringloop->wakeup();
}
void journal_flusher_t::release_trim()
{
trim_wanted--;
}
#define await_sqe(label) \
resume_##label:\
sqe = bs->get_sqe();\
@ -116,6 +133,7 @@ void journal_flusher_t::force_start()
}\
data = ((ring_data_t*)sqe->user_data);
// FIXME: Implement batch flushing
bool journal_flusher_co::loop()
{
// This is much better than implementing the whole function as an FSM
@ -155,10 +173,9 @@ bool journal_flusher_co::loop()
else if (wait_state == 18)
goto resume_18;
resume_0:
if (!flusher->flush_queue.size() ||
!flusher->start_forced && !flusher->active_flushers && flusher->flush_queue.size() < flusher->sync_threshold)
if (!flusher->flush_queue.size() || !flusher->dequeuing)
{
flusher->start_forced = false;
flusher->dequeuing = false;
wait_state = 0;
return true;
}
@ -173,7 +190,7 @@ resume_0:
if (repeat_it != flusher->sync_to_repeat.end())
{
#ifdef BLOCKSTORE_DEBUG
printf("Postpone %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
printf("Postpone %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
// We don't flush different parts of history of the same object in parallel
// So we check if someone is already flushing this object
@ -186,42 +203,112 @@ resume_0:
}
else
flusher->sync_to_repeat[cur.oid] = 0;
if (dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start))
{
flusher->enqueue_flush(cur);
// We can't flush journal sectors that are still written to
// However, as we group flushes by oid, current oid may have older writes to flush!
// And it may even block writes if we don't flush the older version
// (if it's in the beginning of the journal)...
// So first try to find an older version of the same object to flush.
bool found = false;
while (dirty_end != bs->dirty_db.begin())
{
dirty_end--;
if (dirty_end->first.oid != cur.oid)
{
break;
}
if (!(dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start)))
{
found = true;
cur.version = dirty_end->first.version;
break;
}
}
if (!found)
{
// Try other objects
flusher->sync_to_repeat.erase(cur.oid);
int search_left = flusher->flush_queue.size() - 1;
#ifdef BLOCKSTORE_DEBUG
printf("Flushing %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
printf("Flusher overran writers (dirty_start=%08lx) - searching for older flushes (%d left)\n", bs->journal.dirty_start, search_left);
#endif
while (search_left > 0)
{
cur.oid = flusher->flush_queue.front();
cur.version = flusher->flush_versions[cur.oid];
flusher->flush_queue.pop_front();
flusher->flush_versions.erase(cur.oid);
dirty_end = bs->dirty_db.find(cur);
if (dirty_end != bs->dirty_db.end())
{
if (dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start))
{
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu is too new: offset=%08lx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
#endif
flusher->enqueue_flush(cur);
}
else
{
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it == flusher->sync_to_repeat.end())
{
flusher->sync_to_repeat[cur.oid] = 0;
break;
}
}
}
search_left--;
}
if (search_left <= 0)
{
#ifdef BLOCKSTORE_DEBUG
printf("No older flushes, stopping\n");
#endif
flusher->dequeuing = false;
wait_state = 0;
return true;
}
}
}
#ifdef BLOCKSTORE_DEBUG
printf("Flushing %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
flusher->active_flushers++;
resume_1:
// Find it in clean_db
clean_it = bs->clean_db.find(cur.oid);
old_clean_loc = (clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX);
// Scan dirty versions of the object
if (!scan_dirty(1))
{
wait_state += 1;
return false;
}
if (copy_count == 0 && clean_loc == UINT64_MAX && !has_delete && !has_empty)
// Writes and deletes shouldn't happen at the same time
assert(!(copy_count > 0 || has_writes) || !has_delete);
if (copy_count == 0 && !has_writes && !has_delete || has_delete && old_clean_loc == UINT64_MAX)
{
// Nothing to flush
flusher->active_flushers--;
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it != flusher->sync_to_repeat.end() && repeat_it->second > cur.version)
{
// Requeue version
flusher->unshift_flush({ .oid = cur.oid, .version = repeat_it->second });
}
flusher->sync_to_repeat.erase(repeat_it);
wait_state = 0;
goto resume_0;
bs->erase_dirty(dirty_start, std::next(dirty_end), clean_loc);
goto trim_journal;
}
// Find it in clean_db
clean_it = bs->clean_db.find(cur.oid);
old_clean_loc = (clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX);
if (clean_loc == UINT64_MAX)
{
if (copy_count > 0 && has_delete || old_clean_loc == UINT64_MAX)
if (old_clean_loc == UINT64_MAX)
{
// Object not allocated. This is a bug.
char err[1024];
snprintf(
err, 1024, "BUG: Object %lu:%lu v%lu that we are trying to flush is not allocated on the data device",
err, 1024, "BUG: Object %lx:%lx v%lu that we are trying to flush is not allocated on the data device",
cur.oid.inode, cur.oid.stripe, cur.version
);
throw std::runtime_error(err);
@ -331,6 +418,7 @@ resume_1:
else
{
clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
assert(new_entry->oid.inode == 0 || new_entry->oid == cur.oid);
new_entry->oid = cur.oid;
new_entry->version = cur.version;
if (!bs->inmemory_meta)
@ -386,8 +474,9 @@ resume_1:
}
// Update clean_db and dirty_db, free old data locations
update_clean_db();
trim_journal:
// Clear unused part of the journal every <journal_trim_interval> flushes
if (!((++flusher->journal_trim_counter) % flusher->journal_trim_interval))
if (!((++flusher->journal_trim_counter) % flusher->journal_trim_interval) || flusher->trim_wanted > 0)
{
flusher->journal_trim_counter = 0;
if (bs->journal.trim())
@ -417,7 +506,8 @@ resume_1:
}
// All done
#ifdef BLOCKSTORE_DEBUG
printf("Flushed %lu:%lu v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
printf("Flushed %lx:%lx v%lu (%d copies, wr:%d, del:%d), %ld left\n", cur.oid.inode, cur.oid.stripe, cur.version,
copy_count, has_writes, has_delete, flusher->flush_queue.size());
#endif
flusher->active_flushers--;
repeat_it = flusher->sync_to_repeat.find(cur.oid);
@ -445,19 +535,25 @@ bool journal_flusher_co::scan_dirty(int wait_base)
copy_count = 0;
clean_loc = UINT64_MAX;
has_delete = false;
has_empty = false;
has_writes = false;
skip_copy = false;
clean_init_bitmap = false;
while (1)
{
if (dirty_it->second.state == ST_J_STABLE && !skip_copy)
if (!IS_STABLE(dirty_it->second.state))
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu state during flush: %d",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
}
else if (IS_JOURNAL(dirty_it->second.state) && !skip_copy)
{
// First we submit all reads
if (dirty_it->second.len == 0)
{
has_empty = true;
}
else
has_writes = true;
if (dirty_it->second.len != 0)
{
offset = dirty_it->second.offset;
end_offset = dirty_it->second.offset + dirty_it->second.len;
@ -471,18 +567,18 @@ bool journal_flusher_co::scan_dirty(int wait_base)
{
submit_offset = dirty_it->second.location + offset - dirty_it->second.offset;
submit_len = it == v.end() || it->offset >= end_offset ? end_offset-offset : it->offset-offset;
it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len, .buf = memalign(MEM_ALIGNMENT, submit_len) });
it = v.insert(it, (copy_buffer_t){ .offset = offset, .len = submit_len, .buf = memalign_or_die(MEM_ALIGNMENT, submit_len) });
copy_count++;
if (bs->journal.inmemory)
{
// Take it from memory
memcpy(v.back().buf, bs->journal.buffer + submit_offset, submit_len);
memcpy(it->buf, bs->journal.buffer + submit_offset, submit_len);
}
else
{
// Read it from disk
await_sqe(0);
data->iov = (struct iovec){ v.back().buf, (size_t)submit_len };
data->iov = (struct iovec){ it->buf, (size_t)submit_len };
data->callback = simple_callback_r;
my_uring_prep_readv(
sqe, bs->journal.fd, &data->iov, 1, bs->journal.offset + submit_offset
@ -496,30 +592,22 @@ bool journal_flusher_co::scan_dirty(int wait_base)
}
}
}
else if (dirty_it->second.state == ST_D_STABLE && !skip_copy)
else if (IS_BIG_WRITE(dirty_it->second.state) && !skip_copy)
{
// There is an unflushed big write. Copy small writes in its position
has_writes = true;
clean_loc = dirty_it->second.location;
clean_init_bitmap = true;
clean_bitmap_offset = dirty_it->second.offset;
clean_bitmap_len = dirty_it->second.len;
skip_copy = true;
}
else if (dirty_it->second.state == ST_DEL_STABLE && !skip_copy)
else if (IS_DELETE(dirty_it->second.state) && !skip_copy)
{
// There is an unflushed delete
has_delete = true;
skip_copy = true;
}
else if (!IS_STABLE(dirty_it->second.state))
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lu:%lu v%lu state during flush: %d",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
}
dirty_start = dirty_it;
if (dirty_it == bs->dirty_db.begin())
{
@ -555,7 +643,7 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
if (wr.it == flusher->meta_sectors.end())
{
// Not in memory yet, read it
wr.buf = memalign(MEM_ALIGNMENT, bs->meta_block_size);
wr.buf = memalign_or_die(MEM_ALIGNMENT, bs->meta_block_size);
wr.it = flusher->meta_sectors.emplace(wr.sector, (meta_sector_t){
.offset = wr.sector,
.len = bs->meta_block_size,
@ -585,7 +673,7 @@ void journal_flusher_co::update_clean_db()
if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu\n", old_clean_loc >> bs->block_order);
printf("Free block %lu (new location is %lu)\n", old_clean_loc >> bs->block_order, clean_loc >> bs->block_order);
#endif
bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
}
@ -632,7 +720,8 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
});
sync_found:
cur_sync->ready_count++;
if (cur_sync->ready_count >= flusher->sync_threshold || !flusher->flush_queue.size())
flusher->syncing_flushers++;
if (flusher->syncing_flushers >= flusher->flusher_count || !flusher->flush_queue.size())
{
// Sync batch is ready. Do it.
await_sqe(0);
@ -658,6 +747,7 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
wait_state = 2;
return false;
}
flusher->syncing_flushers--;
cur_sync->ready_count--;
if (cur_sync->ready_count == 0)
{

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
struct copy_buffer_t
{
uint64_t offset, len;
@ -45,8 +48,8 @@ class journal_flusher_co
std::map<object_id, uint64_t>::iterator repeat_it;
std::function<void(ring_data_t*)> simple_callback_r, simple_callback_w;
bool skip_copy, has_delete, has_empty;
spp::sparse_hash_map<object_id, clean_entry>::iterator clean_it;
bool skip_copy, has_delete, has_writes;
blockstore_clean_db_t::iterator clean_it;
std::vector<copy_buffer_t> v;
std::vector<copy_buffer_t>::iterator it;
int copy_count;
@ -73,9 +76,10 @@ public:
// Journal flusher itself
class journal_flusher_t
{
bool start_forced = false;
int trim_wanted = 0;
bool dequeuing;
int flusher_count;
int sync_threshold;
int flusher_start_threshold;
journal_flusher_co *co;
blockstore_impl_t *bs;
friend class journal_flusher_co;
@ -84,6 +88,7 @@ class journal_flusher_t
void* journal_superblock;
int active_flushers;
int syncing_flushers;
std::list<flusher_sync_t> syncs;
std::map<object_id, uint64_t> sync_to_repeat;
@ -95,7 +100,8 @@ public:
~journal_flusher_t();
void loop();
bool is_active();
void force_start();
void request_trim();
void release_trim();
void enqueue_flush(obj_ver_id oid);
void unshift_flush(obj_ver_id oid);
};

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
@ -5,9 +8,9 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
assert(sizeof(blockstore_op_private_t) <= BS_OP_PRIVATE_DATA_SIZE);
this->ringloop = ringloop;
ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(ring_consumer);
ringloop->register_consumer(&ring_consumer);
initialized = 0;
zero_object = (uint8_t*)memalign(MEM_ALIGNMENT, block_size);
zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
data_fd = meta_fd = journal.fd = -1;
parse_config(config);
try
@ -36,7 +39,7 @@ blockstore_impl_t::~blockstore_impl_t()
delete data_alloc;
delete flusher;
free(zero_object);
ringloop->unregister_consumer(ring_consumer);
ringloop->unregister_consumer(&ring_consumer);
if (data_fd >= 0)
close(data_fd);
if (meta_fd >= 0 && meta_fd != data_fd)
@ -98,10 +101,19 @@ void blockstore_impl_t::loop()
{
// try to submit ops
unsigned initial_ring_space = ringloop->space_left();
// FIXME: rework this "sync polling"
auto cur_sync = in_progress_syncs.begin();
while (cur_sync != in_progress_syncs.end())
{
continue_sync(*cur_sync++);
if (continue_sync(*cur_sync) != 2)
{
// List is unmodified
cur_sync++;
}
else
{
cur_sync = in_progress_syncs.begin();
}
}
auto cur = submit_queue.begin();
int has_writes = 0;
@ -115,19 +127,13 @@ void blockstore_impl_t::loop()
if (PRIV(op)->wait_for)
{
check_wait(op);
#ifdef BLOCKSTORE_DEBUG
if (PRIV(op)->wait_for)
{
printf("still waiting for %d\n", PRIV(op)->wait_for);
}
#endif
if (PRIV(op)->wait_for == WAIT_SQE)
{
break;
}
else if (PRIV(op)->wait_for)
{
if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_DELETE)
if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE || op->opcode == BS_OP_DELETE)
{
has_writes = 2;
}
@ -136,12 +142,12 @@ void blockstore_impl_t::loop()
}
unsigned ring_space = ringloop->space_left();
unsigned prev_sqe_pos = ringloop->save();
int dequeue_op = 0;
bool dequeue_op = false;
if (op->opcode == BS_OP_READ)
{
dequeue_op = dequeue_read(op);
}
else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_DELETE)
else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE)
{
if (has_writes == 2)
{
@ -151,6 +157,16 @@ void blockstore_impl_t::loop()
dequeue_op = dequeue_write(op);
has_writes = dequeue_op ? 1 : 2;
}
else if (op->opcode == BS_OP_DELETE)
{
if (has_writes == 2)
{
// Some writes could not be submitted
break;
}
dequeue_op = dequeue_del(op);
has_writes = dequeue_op ? 1 : 2;
}
else if (op->opcode == BS_OP_SYNC)
{
// wait for all small writes to be submitted
@ -166,16 +182,33 @@ void blockstore_impl_t::loop()
}
else if (op->opcode == BS_OP_STABLE)
{
if (has_writes == 2)
{
// Don't submit additional flushes before completing previous LISTs
break;
}
dequeue_op = dequeue_stable(op);
}
else if (op->opcode == BS_OP_ROLLBACK)
{
if (has_writes == 2)
{
// Don't submit additional flushes before completing previous LISTs
break;
}
dequeue_op = dequeue_rollback(op);
}
else if (op->opcode == BS_OP_LIST)
{
process_list(op);
dequeue_op = true;
// Block LIST operation by previous modifications,
// so it always returns a consistent state snapshot
if (has_writes == 2 || inflight_writes > 0)
has_writes = 2;
else
{
process_list(op);
dequeue_op = true;
}
}
if (dequeue_op)
{
@ -205,7 +238,7 @@ void blockstore_impl_t::loop()
{
live = true;
}
queue_stall = !live && !ringloop->get_loop_again();
queue_stall = !live && !ringloop->has_work();
live = false;
}
}
@ -245,19 +278,9 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
if (ringloop->space_left() < PRIV(op)->wait_detail)
{
// stop submission if there's still no free space
return;
}
PRIV(op)->wait_for = 0;
}
else if (PRIV(op)->wait_for == WAIT_IN_FLIGHT)
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = PRIV(op)->wait_detail,
});
if (dirty_it != dirty_db.end() && IS_IN_FLIGHT(dirty_it->second.state))
{
// do not submit
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for %lu SQE(s)\n", PRIV(op)->wait_detail);
#endif
return;
}
PRIV(op)->wait_for = 0;
@ -267,8 +290,12 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
if (journal.used_start == PRIV(op)->wait_detail)
{
// do not submit
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting to flush journal offset %08lx\n", PRIV(op)->wait_detail);
#endif
return;
}
flusher->release_trim();
PRIV(op)->wait_for = 0;
}
else if (PRIV(op)->wait_for == WAIT_JOURNAL_BUFFER)
@ -278,6 +305,9 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
journal.sector_info[next].dirty)
{
// do not submit
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for a journal buffer\n");
#endif
return;
}
PRIV(op)->wait_for = 0;
@ -286,6 +316,9 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
{
if (!data_alloc->get_free_count() && !flusher->is_active())
{
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for free space on the data device\n");
#endif
return;
}
PRIV(op)->wait_for = 0;
@ -299,17 +332,17 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
{
if (op->opcode < BS_OP_MIN || op->opcode > BS_OP_MAX ||
((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE) && (
((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE) && (
op->offset >= block_size ||
op->len > block_size-op->offset ||
(op->len % disk_alignment)
)) ||
readonly && op->opcode != BS_OP_READ ||
first && op->opcode == BS_OP_WRITE)
readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST ||
first && (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE))
{
// Basic verification not passed
op->retval = -EINVAL;
op->callback(op);
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
if (op->opcode == BS_OP_SYNC_STAB_ALL)
@ -350,21 +383,21 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
}
};
}
if (op->opcode == BS_OP_WRITE && !enqueue_write(op))
if ((op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE || op->opcode == BS_OP_DELETE) && !enqueue_write(op))
{
op->callback(op);
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
if (0 && op->opcode == BS_OP_SYNC && immediate_commit)
if (op->opcode == BS_OP_SYNC && immediate_commit == IMMEDIATE_ALL)
{
op->retval = 0;
op->callback(op);
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
// Call constructor without allocating memory. We'll call destructor before returning op back
new ((void*)op->private_data) blockstore_op_private_t;
PRIV(op)->wait_for = 0;
PRIV(op)->sync_state = 0;
PRIV(op)->op_state = 0;
PRIV(op)->pending_ops = 0;
if (!first)
{
@ -377,82 +410,201 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
ringloop->wakeup();
}
static bool replace_stable(object_id oid, uint64_t version, int search_start, int search_end, obj_ver_id* list)
{
while (search_start < search_end)
{
int pos = search_start+(search_end-search_start)/2;
if (oid < list[pos].oid)
{
search_end = pos;
}
else if (list[pos].oid < oid)
{
search_start = pos+1;
}
else
{
list[pos].version = version;
return true;
}
}
return false;
}
void blockstore_impl_t::process_list(blockstore_op_t *op)
{
// Count objects
uint32_t list_pg = op->offset;
uint32_t pg_count = op->len;
uint64_t parity_block_size = op->oid.stripe;
if (pg_count != 0 && (parity_block_size < MIN_BLOCK_SIZE || list_pg >= pg_count))
uint64_t pg_stripe_size = op->oid.stripe;
uint64_t min_inode = op->oid.inode;
uint64_t max_inode = op->version;
// Check PG
if (pg_count != 0 && (pg_stripe_size < MIN_BLOCK_SIZE || list_pg >= pg_count))
{
op->retval = -EINVAL;
FINISH_OP(op);
return;
}
uint64_t stable_count = 0;
if (pg_count > 0)
{
for (auto it = clean_db.begin(); it != clean_db.end(); it++)
{
uint32_t pg = (it->first.inode + it->first.stripe / parity_block_size) % pg_count;
if (pg == list_pg)
{
stable_count++;
}
}
}
else
{
stable_count = clean_db.size();
}
uint64_t total_count = stable_count;
for (auto it = dirty_db.begin(); it != dirty_db.end(); it++)
{
if (!pg_count || ((it->first.oid.inode + it->first.oid.stripe / parity_block_size) % pg_count) == list_pg)
{
if (IS_STABLE(it->second.state))
{
stable_count++;
}
total_count++;
}
}
// Allocate memory
op->version = stable_count;
op->retval = total_count;
op->buf = malloc(sizeof(obj_ver_id) * total_count);
if (!op->buf)
// Copy clean_db entries (sorted)
int stable_count = 0, stable_alloc = clean_db.size() / (pg_count ? pg_count : 1);
obj_ver_id *stable = (obj_ver_id*)malloc(sizeof(obj_ver_id) * stable_alloc);
if (!stable)
{
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
obj_ver_id *vers = (obj_ver_id*)op->buf;
int i = 0;
for (auto it = clean_db.begin(); it != clean_db.end(); it++)
{
if (!pg_count || ((it->first.inode + it->first.stripe / parity_block_size) % pg_count) == list_pg)
auto clean_it = clean_db.begin(), clean_end = clean_db.end();
if ((min_inode != 0 || max_inode != 0) && min_inode <= max_inode)
{
vers[i++] = {
.oid = it->first,
.version = it->second.version,
};
clean_it = clean_db.lower_bound({
.inode = min_inode,
.stripe = 0,
});
clean_end = clean_db.upper_bound({
.inode = max_inode,
.stripe = UINT64_MAX,
});
}
}
int j = stable_count;
for (auto it = dirty_db.begin(); it != dirty_db.end(); it++)
{
if (!pg_count || ((it->first.oid.inode + it->first.oid.stripe / parity_block_size) % pg_count) == list_pg)
for (; clean_it != clean_end; clean_it++)
{
if (IS_STABLE(it->second.state))
if (!pg_count || ((clean_it->first.inode + clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg)
{
vers[i++] = it->first;
}
else
{
vers[j++] = it->first;
if (stable_count >= stable_alloc)
{
stable_alloc += 32768;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
{
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
}
stable[stable_count++] = {
.oid = clean_it->first,
.version = clean_it->second.version,
};
}
}
}
int clean_stable_count = stable_count;
// Copy dirty_db entries (sorted, too)
int unstable_count = 0, unstable_alloc = 0;
obj_ver_id *unstable = NULL;
{
auto dirty_it = dirty_db.begin(), dirty_end = dirty_db.end();
if ((min_inode != 0 || max_inode != 0) && min_inode <= max_inode)
{
dirty_it = dirty_db.lower_bound({
.oid = {
.inode = min_inode,
.stripe = 0,
},
.version = 0,
});
dirty_end = dirty_db.upper_bound({
.oid = {
.inode = max_inode,
.stripe = UINT64_MAX,
},
.version = UINT64_MAX,
});
}
for (; dirty_it != dirty_end; dirty_it++)
{
if (!pg_count || ((dirty_it->first.oid.inode + dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg)
{
if (IS_DELETE(dirty_it->second.state))
{
// Deletions are always stable, so try to zero out two possible entries
if (!replace_stable(dirty_it->first.oid, 0, 0, clean_stable_count, stable))
{
replace_stable(dirty_it->first.oid, 0, clean_stable_count, stable_count, stable);
}
}
else if (IS_STABLE(dirty_it->second.state))
{
// First try to replace a clean stable version in the first part of the list
if (!replace_stable(dirty_it->first.oid, dirty_it->first.version, 0, clean_stable_count, stable))
{
// Then try to replace the last dirty stable version in the second part of the list
if (stable[stable_count-1].oid == dirty_it->first.oid)
{
stable[stable_count-1].version = dirty_it->first.version;
}
else
{
if (stable_count >= stable_alloc)
{
stable_alloc += 32768;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
{
if (unstable)
free(unstable);
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
}
stable[stable_count++] = dirty_it->first;
}
}
}
else
{
if (unstable_count >= unstable_alloc)
{
unstable_alloc += 32768;
unstable = (obj_ver_id*)realloc(unstable, sizeof(obj_ver_id) * unstable_alloc);
if (!unstable)
{
if (stable)
free(stable);
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
}
unstable[unstable_count++] = dirty_it->first;
}
}
}
}
// Remove zeroed out stable entries
int j = 0;
for (int i = 0; i < stable_count; i++)
{
if (stable[i].version != 0)
{
stable[j++] = stable[i];
}
}
stable_count = j;
if (stable_count+unstable_count > stable_alloc)
{
stable_alloc = stable_count+unstable_count;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
{
if (unstable)
free(unstable);
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
}
// Copy unstable entries
for (int i = 0; i < unstable_count; i++)
{
stable[j++] = unstable[i];
}
free(unstable);
op->version = stable_count;
op->retval = stable_count+unstable_count;
op->buf = stable;
FINISH_OP(op);
}

View File

@ -1,14 +1,15 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include "blockstore.h"
#include "timerfd_interval.h"
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h>
#include <linux/fs.h>
#include <vector>
@ -16,43 +17,40 @@
#include <deque>
#include <new>
#include "sparsepp/sparsepp/spp.h"
#include "cpp-btree/btree_map.h"
#include "malloc_or_die.h"
#include "allocator.h"
//#define BLOCKSTORE_DEBUG
// States are not stored on disk. Instead, they're deduced from the journal
// FIXME: Rename to BS_ST_*
#define ST_J_IN_FLIGHT 1
#define ST_J_SUBMITTED 2
#define ST_J_WRITTEN 3
#define ST_J_SYNCED 4
#define ST_J_STABLE 5
#define BS_ST_SMALL_WRITE 0x01
#define BS_ST_BIG_WRITE 0x02
#define BS_ST_DELETE 0x03
#define ST_D_IN_FLIGHT 15
#define ST_D_SUBMITTED 16
#define ST_D_WRITTEN 17
#define ST_D_META_WRITTEN 19
#define ST_D_META_SYNCED 20
#define ST_D_STABLE 21
#define BS_ST_WAIT_BIG 0x10
#define BS_ST_IN_FLIGHT 0x20
#define BS_ST_SUBMITTED 0x30
#define BS_ST_WRITTEN 0x40
#define BS_ST_SYNCED 0x50
#define BS_ST_STABLE 0x60
#define ST_DEL_IN_FLIGHT 31
#define ST_DEL_SUBMITTED 32
#define ST_DEL_WRITTEN 33
#define ST_DEL_SYNCED 34
#define ST_DEL_STABLE 35
#define BS_ST_INSTANT 0x100
#define ST_CURRENT 48
#define IMMEDIATE_NONE 0
#define IMMEDIATE_SMALL 1
#define IMMEDIATE_ALL 2
#define IS_IN_FLIGHT(st) (st == ST_J_IN_FLIGHT || st == ST_D_IN_FLIGHT || st == ST_DEL_IN_FLIGHT || st == ST_J_SUBMITTED || st == ST_D_SUBMITTED || st == ST_DEL_SUBMITTED)
#define IS_STABLE(st) (st == ST_J_STABLE || st == ST_D_STABLE || st == ST_DEL_STABLE || st == ST_CURRENT)
#define IS_SYNCED(st) (IS_STABLE(st) || st == ST_J_SYNCED || st == ST_D_META_SYNCED || st == ST_DEL_SYNCED)
#define IS_JOURNAL(st) (st >= ST_J_SUBMITTED && st <= ST_J_STABLE)
#define IS_BIG_WRITE(st) (st >= ST_D_SUBMITTED && st <= ST_D_STABLE)
#define IS_DELETE(st) (st >= ST_DEL_SUBMITTED && st <= ST_DEL_STABLE)
#define IS_UNSYNCED(st) (st >= ST_J_SUBMITTED && st <= ST_J_WRITTEN || st >= ST_D_SUBMITTED && st <= ST_D_META_WRITTEN || st >= ST_DEL_SUBMITTED && st <= ST_DEL_WRITTEN)
#define BS_ST_TYPE_MASK 0x0F
#define BS_ST_WORKFLOW_MASK 0xF0
#define IS_IN_FLIGHT(st) (((st) & 0xF0) <= BS_ST_SUBMITTED)
#define IS_STABLE(st) (((st) & 0xF0) == BS_ST_STABLE)
#define IS_SYNCED(st) (((st) & 0xF0) >= BS_ST_SYNCED)
#define IS_JOURNAL(st) (((st) & 0x0F) == BS_ST_SMALL_WRITE)
#define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE)
#define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)
#define BS_SUBMIT_GET_SQE(sqe, data) \
BS_SUBMIT_GET_ONLY_SQE(sqe); \
@ -124,8 +122,6 @@ struct __attribute__((__packed__)) dirty_entry
// Suspend operation until there are more free SQEs
#define WAIT_SQE 1
// Suspend operation until version <wait_detail> of object <oid> is written
#define WAIT_IN_FLIGHT 2
// Suspend operation until there are <wait_detail> bytes of free space in the journal on disk
#define WAIT_JOURNAL 3
// Suspend operation until the next journal sector buffer is free
@ -139,7 +135,7 @@ struct fulfill_read_t
};
#define PRIV(op) ((blockstore_op_private_t*)(op)->private_data)
#define FINISH_OP(op) PRIV(op)->~blockstore_op_private_t(); op->callback(op)
#define FINISH_OP(op) PRIV(op)->~blockstore_op_private_t(); std::function<void (blockstore_op_t*)>(op->callback)(op)
struct blockstore_op_private_t
{
@ -147,12 +143,13 @@ struct blockstore_op_private_t
int wait_for;
uint64_t wait_detail;
int pending_ops;
int op_state;
// Read
std::vector<fulfill_read_t> read_vec;
// Sync, write
uint64_t min_used_journal_sector, max_used_journal_sector;
uint64_t min_flushed_journal_sector, max_flushed_journal_sector;
// Write
struct iovec iov_zerofill[3];
@ -161,9 +158,13 @@ struct blockstore_op_private_t
std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
int sync_small_checked, sync_big_checked;
std::list<blockstore_op_t*>::iterator in_progress_ptr;
int sync_state, prev_sync_count;
int prev_sync_count;
};
// https://github.com/algorithm-ninja/cpp-btree
// https://github.com/greg7mdp/sparsepp/ was used previously, but it was TERRIBLY slow after resizing
// with sparsepp, random reads dropped to ~700 iops very fast with just as much as ~32k objects in the DB
typedef btree::btree_map<object_id, clean_entry> blockstore_clean_db_t;
typedef std::map<obj_ver_id, dirty_entry> blockstore_dirty_db_t;
#include "blockstore_init.h"
@ -177,29 +178,30 @@ class blockstore_impl_t
uint32_t block_size;
uint64_t meta_offset;
uint64_t data_offset;
uint64_t cfg_journal_size;
uint64_t cfg_journal_size, cfg_data_size;
// Required write alignment and journal/metadata/data areas' location alignment
uint32_t disk_alignment = 512;
uint32_t disk_alignment = 4096;
// Journal block size - minimum_io_size of the journal device is the best choice
uint64_t journal_block_size = 512;
uint64_t journal_block_size = 4096;
// Metadata block size - minimum_io_size of the metadata device is the best choice
uint64_t meta_block_size = 512;
uint64_t meta_block_size = 4096;
// Sparse write tracking granularity. 4 KB is a good choice. Must be a multiple of disk_alignment
uint64_t bitmap_granularity = 4096;
bool readonly = false;
// By default, Blockstore locks all opened devices exclusively. This option can be used to disable locking
bool disable_flock = false;
// It is safe to disable fsync() if drive write cache is writethrough
bool disable_data_fsync = false, disable_meta_fsync = false, disable_journal_fsync = false;
// Enable if you want every operation to be executed with an "implicit fsync"
// FIXME Not implemented yet
bool immediate_commit = false;
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
int immediate_commit = IMMEDIATE_NONE;
bool inmemory_meta = false;
int flusher_count;
/******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer;
// Another option is https://github.com/algorithm-ninja/cpp-btree
spp::sparse_hash_map<object_id, clean_entry> clean_db;
blockstore_clean_db_t clean_db;
uint8_t *clean_bitmap = NULL;
blockstore_dirty_db_t dirty_db;
std::list<blockstore_op_t*> submit_queue; // FIXME: funny thing is that vector is better here
@ -224,6 +226,7 @@ class blockstore_impl_t
bool live = false, queue_stall = false;
ring_loop_t *ringloop;
int inflight_writes = 0;
bool stop_sync_submitted;
@ -264,7 +267,7 @@ class blockstore_impl_t
bool enqueue_write(blockstore_op_t *op);
int dequeue_write(blockstore_op_t *op);
int dequeue_del(blockstore_op_t *op);
void ack_write(blockstore_op_t *op);
int continue_write(blockstore_op_t *op);
void release_journal_sectors(blockstore_op_t *op);
void handle_write_event(ring_data_t *data, blockstore_op_t *op);
@ -277,11 +280,15 @@ class blockstore_impl_t
// Stabilize
int dequeue_stable(blockstore_op_t *op);
int continue_stable(blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov);
void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
void stabilize_object(object_id oid, uint64_t max_ver);
// Rollback
int dequeue_rollback(blockstore_op_t *op);
int continue_rollback(blockstore_op_t *op);
void mark_rolled_back(const obj_ver_id & ov);
void handle_rollback_event(ring_data_t *data, blockstore_op_t *op);
void erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc);
@ -316,5 +323,6 @@ public:
inline uint32_t get_block_size() { return block_size; }
inline uint64_t get_block_count() { return block_count; }
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
inline uint32_t get_disk_alignment() { return disk_alignment; }
};

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
@ -108,13 +111,13 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
{
// free the previous block
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu\n", clean_it->second.location >> bs->block_order);
printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i >> block_order);
#endif
bs->data_alloc->set(clean_it->second.location >> block_order, false);
}
entries_loaded++;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block (clean entry) %lu: %lu:%lu v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif
bs->data_alloc->set(done_cnt+i, true);
bs->clean_db[entry->oid] = (struct clean_entry){
@ -125,7 +128,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
else
{
#ifdef BLOCKSTORE_DEBUG
printf("Old clean entry %lu: %lu:%lu v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
printf("Old clean entry %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif
}
}
@ -202,11 +205,7 @@ int blockstore_init_journal::loop()
goto resume_7;
printf("Reading blockstore journal\n");
if (!bs->journal.inmemory)
{
submitted_buf = memalign(MEM_ALIGNMENT, 2*bs->journal.block_size);
if (!submitted_buf)
throw std::bad_alloc();
}
submitted_buf = memalign_or_die(MEM_ALIGNMENT, 2*bs->journal.block_size);
else
submitted_buf = bs->journal.buffer;
// Read first block of the journal
@ -317,7 +316,7 @@ resume_1:
if (journal_pos < bs->journal.used_start)
end = bs->journal.used_start;
if (!bs->journal.inmemory)
submitted_buf = memalign(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE);
submitted_buf = memalign_or_die(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE);
else
submitted_buf = bs->journal.buffer + journal_pos;
data->iov = {
@ -402,8 +401,9 @@ resume_1:
}
// Trim journal on start so we don't stall when all entries are older
bs->journal.trim();
bs->journal.dirty_start = bs->journal.next_free;
printf(
"Journal entries loaded: %lu, free journal space: %lu bytes (%lu..%lu is used), free blocks: %lu / %lu\n",
"Journal entries loaded: %lu, free journal space: %lu bytes (%08lx..%08lx is used), free blocks: %lu / %lu\n",
entries_loaded,
(bs->journal.next_free >= bs->journal.used_start
? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start)
@ -439,7 +439,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
{
journal_entry *je = (journal_entry*)(buf + proc_pos - done_pos + pos);
if (je->magic != JOURNAL_MAGIC || je_crc32(je) != je->crc32 ||
je->type < JE_SMALL_WRITE || je->type > JE_DELETE || started && je->crc32_prev != crc32_last)
je->type < JE_MIN || je->type > JE_MAX || started && je->crc32_prev != crc32_last)
{
if (pos == 0)
{
@ -453,10 +453,15 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
break;
}
}
if (je->type == JE_SMALL_WRITE)
if (je->type == JE_SMALL_WRITE || je->type == JE_SMALL_WRITE_INSTANT)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_small_write oid=%lu:%lu ver=%lu offset=%u len=%u\n", je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version, je->small_write.offset, je->small_write.len);
printf(
"je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len
);
#endif
// oid, version, offset, len
uint64_t prev_free = next_free;
@ -474,7 +479,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (location != je->small_write.data_offset)
{
char err[1024];
snprintf(err, 1024, "BUG: calculated journal data offset (%lu) != stored journal data offset (%lu)", location, je->small_write.data_offset);
snprintf(err, 1024, "BUG: calculated journal data offset (%08lx) != stored journal data offset (%08lx)", location, je->small_write.data_offset);
throw std::runtime_error(err);
}
uint32_t data_crc32 = 0;
@ -509,7 +514,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (data_crc32 != je->small_write.crc32_data)
{
// journal entry is corrupt, stop here
// interesting thing is that we must clear the corrupt entry if we're not readonly
// interesting thing is that we must clear the corrupt entry if we're not readonly,
// because we don't write next entries in the same journal block
printf("Journal entry data is corrupt (data crc32 %x != %x)\n", data_crc32, je->small_write.crc32_data);
memset(buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
bs->journal.next_free = prev_free;
init_write_buf = buf + proc_pos - done_pos;
@ -518,14 +525,14 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
}
auto clean_it = bs->clean_db.find(je->small_write.oid);
if (clean_it == bs->clean_db.end() ||
clean_it->second.version < je->big_write.version)
clean_it->second.version < je->small_write.version)
{
obj_ver_id ov = {
.oid = je->small_write.oid,
.version = je->small_write.version,
};
bs->dirty_db.emplace(ov, (dirty_entry){
.state = ST_J_SYNCED,
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
.flags = 0,
.location = location,
.offset = je->small_write.offset,
@ -534,16 +541,27 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
});
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf("journal offset %lu is used by %lu:%lu v%lu\n", proc_pos, ov.oid.inode, ov.oid.stripe, ov.version);
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif
auto & unstab = bs->unstable_writes[ov.oid];
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_SMALL_WRITE_INSTANT)
{
bs->mark_stable(ov);
}
}
}
else if (je->type == JE_BIG_WRITE)
else if (je->type == JE_BIG_WRITE || je->type == JE_BIG_WRITE_INSTANT)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_big_write oid=%lu:%lu ver=%lu loc=%lu\n", je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location);
printf(
"je_big_write%s oid=%lx:%lx ver=%lu loc=%lu\n",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location
);
#endif
auto clean_it = bs->clean_db.find(je->big_write.oid);
if (clean_it == bs->clean_db.end() ||
@ -555,7 +573,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.version = je->big_write.version,
};
bs->dirty_db.emplace(ov, (dirty_entry){
.state = ST_D_META_SYNCED,
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
.flags = 0,
.location = je->big_write.location,
.offset = je->big_write.offset,
@ -569,116 +587,63 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
auto & unstab = bs->unstable_writes[ov.oid];
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_BIG_WRITE_INSTANT)
{
bs->mark_stable(ov);
}
}
}
else if (je->type == JE_STABLE)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_stable oid=%lu:%lu ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
printf("je_stable oid=%lx:%lx ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
#endif
// oid, version
obj_ver_id ov = {
.oid = je->stable.oid,
.version = je->stable.version,
};
auto it = bs->dirty_db.find(ov);
if (it == bs->dirty_db.end())
{
// journal contains a legitimate STABLE entry for a non-existing dirty write
// this probably means that journal was trimmed between WRITE and STABLE entries
// skip it
}
else
{
while (1)
{
it->second.state = (it->second.state == ST_D_META_SYNCED
? ST_D_STABLE
: (it->second.state == ST_DEL_SYNCED ? ST_DEL_STABLE : ST_J_STABLE));
if (it == bs->dirty_db.begin())
break;
it--;
if (it->first.oid != ov.oid || IS_STABLE(it->second.state))
break;
}
bs->flusher->enqueue_flush(ov);
}
auto unstab_it = bs->unstable_writes.find(ov.oid);
if (unstab_it != bs->unstable_writes.end() && unstab_it->second <= ov.version)
{
bs->unstable_writes.erase(unstab_it);
}
bs->mark_stable(ov);
}
else if (je->type == JE_ROLLBACK)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_rollback oid=%lu:%lu ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
printf("je_rollback oid=%lx:%lx ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
#endif
// rollback dirty writes of <oid> up to <version>
auto it = bs->dirty_db.lower_bound((obj_ver_id){
obj_ver_id ov = {
.oid = je->rollback.oid,
.version = UINT64_MAX,
});
if (it != bs->dirty_db.begin())
{
uint64_t max_unstable = 0;
auto rm_start = it;
auto rm_end = it;
it--;
while (it->first.oid == je->rollback.oid &&
it->first.version > je->rollback.version &&
!IS_IN_FLIGHT(it->second.state) &&
!IS_STABLE(it->second.state))
{
if (it->first.oid != je->rollback.oid)
break;
else if (it->first.version <= je->rollback.version)
{
if (!IS_STABLE(it->second.state))
max_unstable = it->first.version;
break;
}
else if (IS_STABLE(it->second.state))
break;
// Remove entry
rm_start = it;
if (it == bs->dirty_db.begin())
break;
it--;
}
if (rm_start != rm_end)
{
bs->erase_dirty(rm_start, rm_end, UINT64_MAX);
}
auto unstab_it = bs->unstable_writes.find(je->rollback.oid);
if (unstab_it != bs->unstable_writes.end())
{
if (max_unstable == 0)
bs->unstable_writes.erase(unstab_it);
else
unstab_it->second = max_unstable;
}
}
.version = je->rollback.version,
};
bs->mark_rolled_back(ov);
}
else if (je->type == JE_DELETE)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_delete oid=%lu:%lu ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif
// oid, version
obj_ver_id ov = {
.oid = je->del.oid,
.version = je->del.version,
};
bs->dirty_db.emplace(ov, (dirty_entry){
.state = ST_DEL_SYNCED,
.flags = 0,
.location = 0,
.offset = 0,
.len = 0,
.journal_sector = proc_pos,
});
bs->journal.used_sectors[proc_pos]++;
auto clean_it = bs->clean_db.find(je->del.oid);
if (clean_it == bs->clean_db.end() ||
clean_it->second.version < je->del.version)
{
// oid, version
obj_ver_id ov = {
.oid = je->del.oid,
.version = je->del.version,
};
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_DELETE | BS_ST_SYNCED),
.flags = 0,
.location = 0,
.offset = 0,
.len = 0,
.journal_sector = proc_pos,
});
bs->journal.used_sectors[proc_pos]++;
// Deletions are treated as immediately stable, because
// "2-phase commit" (write->stabilize) isn't sufficient for them anyway
bs->mark_stable(ov);
}
}
started = true;
pos += je->size;

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
class blockstore_init_meta

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs)
@ -6,18 +9,26 @@ blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs)
sectors_required = 0;
next_pos = bs->journal.next_free;
next_sector = bs->journal.cur_sector;
first_sector = -1;
next_in_pos = bs->journal.in_sector_pos;
right_dir = next_pos >= bs->journal.used_start;
}
// Check if we can write <required> entries of <size> bytes and <data_after> data bytes after them to the journal
int blockstore_journal_check_t::check_available(blockstore_op_t *op, int required, int size, int data_after)
int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries_required, int size, int data_after)
{
int required = entries_required;
while (1)
{
int fits = (bs->journal.block_size - next_in_pos) / size;
int fits = bs->journal.no_same_sector_overwrites && bs->journal.sector_info[next_sector].written
? 0
: (bs->journal.block_size - next_in_pos) / size;
if (fits > 0)
{
if (first_sector == -1)
{
first_sector = next_sector;
}
required -= fits;
next_in_pos += fits * size;
sectors_required++;
@ -38,19 +49,40 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int require
right_dir = false;
}
next_in_pos = 0;
if (bs->journal.sector_info[next_sector].usage_count > 0 ||
bs->journal.sector_info[next_sector].dirty)
next_sector = ((next_sector + 1) % bs->journal.sector_count);
if (next_sector == first_sector)
{
next_sector = ((next_sector + 1) % bs->journal.sector_count);
// next_sector may wrap when all sectors are flushed and the incoming batch is too big
// This is an error condition, we can't wait for anything in this case
throw std::runtime_error(
"Blockstore journal_sector_buffer_count="+std::to_string(bs->journal.sector_count)+
" is too small for a batch of "+std::to_string(entries_required)+" entries of "+std::to_string(size)+" bytes"
);
}
if (bs->journal.sector_info[next_sector].usage_count > 0 ||
bs->journal.sector_info[next_sector].dirty)
{
// No memory buffer available. Wait for it.
#ifdef BLOCKSTORE_DEBUG
printf("next journal buffer %d is still dirty=%d used=%d\n", next_sector,
bs->journal.sector_info[next_sector].dirty, bs->journal.sector_info[next_sector].usage_count);
#endif
int used = 0, dirty = 0;
for (int i = 0; i < bs->journal.sector_count; i++)
{
if (bs->journal.sector_info[i].dirty)
{
dirty++;
used++;
}
if (bs->journal.sector_info[i].usage_count > 0)
{
used++;
}
}
// In fact, it's even more rare than "ran out of journal space", so print a warning
printf(
"Ran out of journal sector buffers: %d/%lu buffers used (%d dirty), next buffer (%ld) is %s and flushed %lu times\n",
used, bs->journal.sector_count, dirty, next_sector,
bs->journal.sector_info[next_sector].dirty ? "dirty" : "not dirty",
bs->journal.sector_info[next_sector].usage_count
);
PRIV(op)->wait_for = WAIT_JOURNAL_BUFFER;
return 0;
}
@ -74,7 +106,7 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int require
: bs->journal.used_start - bs->journal.next_free)
);
PRIV(op)->wait_for = WAIT_JOURNAL;
bs->flusher->force_start();
bs->flusher->request_trim();
PRIV(op)->wait_detail = bs->journal.used_start;
return 0;
}
@ -83,14 +115,21 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int require
journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size)
{
if (journal.block_size - journal.in_sector_pos < size)
if (journal.block_size - journal.in_sector_pos < size ||
journal.no_same_sector_overwrites && journal.sector_info[journal.cur_sector].written)
{
assert(!journal.sector_info[journal.cur_sector].dirty);
// Move to the next journal sector
journal.sector_info[journal.cur_sector].written = false;
if (journal.sector_info[journal.cur_sector].usage_count > 0)
{
// Also select next sector buffer in memory
journal.cur_sector = ((journal.cur_sector + 1) % journal.sector_count);
assert(!journal.sector_info[journal.cur_sector].usage_count);
}
else
{
journal.dirty_start = journal.next_free;
}
journal.sector_info[journal.cur_sector].offset = journal.next_free;
journal.in_sector_pos = 0;
@ -116,6 +155,7 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
void prepare_journal_sector_write(journal_t & journal, int cur_sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb)
{
journal.sector_info[cur_sector].dirty = false;
journal.sector_info[cur_sector].written = true;
journal.sector_info[cur_sector].usage_count++;
ring_data_t *data = ((ring_data_t*)sqe->user_data);
data->iov = (struct iovec){
@ -148,8 +188,8 @@ bool journal_t::trim()
auto journal_used_it = used_sectors.lower_bound(used_start);
#ifdef BLOCKSTORE_DEBUG
printf(
"Trimming journal (used_start=%lu, next_free=%lu, first_used=%lu, usage_count=%lu)\n",
used_start, next_free,
"Trimming journal (used_start=%08lx, next_free=%08lx, dirty_start=%08lx, new_start=%08lx, new_refcount=%ld)\n",
used_start, next_free, dirty_start,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->first,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->second
);
@ -180,7 +220,7 @@ bool journal_t::trim()
return false;
}
#ifdef BLOCKSTORE_DEBUG
printf("Journal trimmed to %lu (next_free=%lu)\n", used_start, next_free);
printf("Journal trimmed to %08lx (next_free=%08lx)\n", used_start, next_free);
#endif
return true;
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include "crc32c.h"
@ -12,12 +15,16 @@
// Journal entries
// Journal entries are linked to each other by their crc32 value
// The journal is almost a blockchain, because object versions constantly increase
#define JE_MIN 0x01
#define JE_START 0x01
#define JE_SMALL_WRITE 0x02
#define JE_BIG_WRITE 0x03
#define JE_STABLE 0x04
#define JE_DELETE 0x05
#define JE_ROLLBACK 0x06
#define JE_SMALL_WRITE_INSTANT 0x07
#define JE_BIG_WRITE_INSTANT 0x08
#define JE_MAX 0x08
// crc32c comes first to ease calculation and is equal to crc32()
struct __attribute__((__packed__)) journal_entry_start
@ -125,6 +132,7 @@ struct journal_sector_info_t
{
uint64_t offset;
uint64_t usage_count;
bool written;
bool dirty;
};
@ -135,16 +143,21 @@ struct journal_t
bool inmemory = false;
void *buffer = NULL;
uint64_t block_size = 512;
uint64_t block_size;
uint64_t offset, len;
// Next free block offset
uint64_t next_free = 0;
// First occupied block offset
uint64_t used_start = 0;
// End of the last block not used for writing anymore
uint64_t dirty_start = 0;
uint32_t crc32_last = 0;
// Current sector(s) used for writing
void *sector_buf = NULL;
journal_sector_info_t *sector_info = NULL;
uint64_t sector_count;
bool no_same_sector_overwrites = false;
int cur_sector = 0;
int in_sector_pos = 0;
@ -160,7 +173,7 @@ struct blockstore_journal_check_t
{
blockstore_impl_t *bs;
uint64_t next_pos, next_sector, next_in_pos;
int sectors_required;
int sectors_required, first_sector;
bool right_dir; // writing to the end or the beginning of the ring buffer
blockstore_journal_check_t(blockstore_impl_t *bs);

View File

@ -1,3 +1,7 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <sys/file.h>
#include "blockstore_impl.h"
static uint32_t is_power_of_two(uint64_t value)
@ -34,10 +38,23 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
disable_journal_fsync = true;
}
if (config["disable_device_lock"] == "true" || config["disable_device_lock"] == "1" || config["disable_device_lock"] == "yes")
{
disable_flock = true;
}
if (config["immediate_commit"] == "all")
{
immediate_commit = IMMEDIATE_ALL;
}
else if (config["immediate_commit"] == "small")
{
immediate_commit = IMMEDIATE_SMALL;
}
metadata_buf_size = strtoull(config["meta_buf_size"].c_str(), NULL, 10);
cfg_journal_size = strtoull(config["journal_size"].c_str(), NULL, 10);
data_device = config["data_device"];
data_offset = strtoull(config["data_offset"].c_str(), NULL, 10);
cfg_data_size = strtoull(config["data_size"].c_str(), NULL, 10);
meta_device = config["meta_device"];
meta_offset = strtoull(config["meta_offset"].c_str(), NULL, 10);
block_size = strtoull(config["block_size"].c_str(), NULL, 10);
@ -45,6 +62,8 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
journal_device = config["journal_device"];
journal.offset = strtoull(config["journal_offset"].c_str(), NULL, 10);
journal.sector_count = strtoull(config["journal_sector_buffer_count"].c_str(), NULL, 10);
journal.no_same_sector_overwrites = config["journal_no_same_sector_overwrites"] == "true" ||
config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes";
journal.inmemory = config["inmemory_journal"] != "false";
disk_alignment = strtoull(config["disk_alignment"].c_str(), NULL, 10);
journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
@ -66,7 +85,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
}
if (!disk_alignment)
{
disk_alignment = 512;
disk_alignment = 4096;
}
else if (disk_alignment % MEM_ALIGNMENT)
{
@ -74,7 +93,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
}
if (!journal_block_size)
{
journal_block_size = 512;
journal_block_size = 4096;
}
else if (journal_block_size % MEM_ALIGNMENT)
{
@ -82,7 +101,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
}
if (!meta_block_size)
{
meta_block_size = 512;
meta_block_size = 4096;
}
else if (meta_block_size % MEM_ALIGNMENT)
{
@ -128,6 +147,22 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
metadata_buf_size = 4*1024*1024;
}
if (meta_device == "")
{
disable_meta_fsync = disable_data_fsync;
}
if (journal_device == "")
{
disable_journal_fsync = disable_meta_fsync;
}
if (immediate_commit != IMMEDIATE_NONE && !disable_journal_fsync)
{
throw std::runtime_error("immediate_commit requires disable_journal_fsync");
}
if (immediate_commit == IMMEDIATE_ALL && !disable_data_fsync)
{
throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
}
// init some fields
clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
clean_entry_size = sizeof(clean_disk_entry) + clean_entry_bitmap_size;
@ -151,6 +186,15 @@ void blockstore_impl_t::calc_lengths()
data_len = data_len < journal.offset-data_offset
? data_len : journal.offset-data_offset;
}
if (cfg_data_size != 0)
{
if (data_len < cfg_data_size)
{
throw std::runtime_error("Data area ("+std::to_string(data_len)+
" bytes) is less than configured size ("+std::to_string(cfg_data_size)+" bytes)");
}
data_len = cfg_data_size;
}
// meta
meta_area = (meta_fd == data_fd ? data_size : meta_size) - meta_offset;
if (meta_fd == data_fd && meta_offset <= data_offset)
@ -252,6 +296,10 @@ void blockstore_impl_t::open_data()
{
throw std::runtime_error("data_offset exceeds device size = "+std::to_string(data_size));
}
if (!disable_flock && flock(data_fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock data device: ") + strerror(errno));
}
}
void blockstore_impl_t::open_meta()
@ -269,11 +317,14 @@ void blockstore_impl_t::open_meta()
{
throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(meta_size));
}
if (!disable_flock && flock(meta_fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock metadata device: ") + strerror(errno));
}
}
else
{
meta_fd = data_fd;
disable_meta_fsync = disable_data_fsync;
meta_size = 0;
if (meta_offset >= data_size)
{
@ -291,12 +342,15 @@ void blockstore_impl_t::open_journal()
{
throw std::runtime_error("Failed to open journal device");
}
check_size(journal.fd, &journal.device_size, "metadata device");
check_size(journal.fd, &journal.device_size, "journal device");
if (!disable_flock && flock(journal.fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock journal device: ") + strerror(errno));
}
}
else
{
journal.fd = meta_fd;
disable_journal_fsync = disable_meta_fsync;
journal.device_size = 0;
if (journal.offset >= data_size)
{

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_t offset, uint64_t len,
@ -8,12 +11,10 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
// Zero-length version - skip
return 1;
}
if (IS_IN_FLIGHT(item_state))
else if (IS_IN_FLIGHT(item_state))
{
// Pause until it's written somewhere
PRIV(op)->wait_for = WAIT_IN_FLIGHT;
PRIV(op)->wait_detail = item_version;
return 0;
// Write not finished yet - skip
return 1;
}
else if (IS_DELETE(item_state))
{
@ -39,6 +40,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
return 1;
}
// FIXME I've seen a bug here so I want some tests
int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
uint32_t item_state, uint64_t item_version, uint64_t item_location)
{
@ -51,8 +53,20 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
while (1)
{
for (; it != PRIV(read_op)->read_vec.end(); it++)
{
if (it->offset >= cur_start)
{
break;
}
else if (it->offset + it->len > cur_start)
{
cur_start = it->offset + it->len;
if (cur_start >= item_end)
{
goto endwhile;
}
}
}
if (it == PRIV(read_op)->read_vec.end() || it->offset > cur_start)
{
fulfill_read_t el = {
@ -71,9 +85,12 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
}
cur_start = it->offset + it->len;
if (it == PRIV(read_op)->read_vec.end() || cur_start >= item_end)
{
break;
}
}
}
endwhile:
return 1;
}
@ -133,63 +150,67 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
dirty_it--;
}
}
if (clean_it != clean_db.end() && fulfilled < read_op->len)
if (clean_it != clean_db.end())
{
if (!result_version)
{
result_version = clean_it->second.version;
}
if (!clean_entry_bitmap_size)
if (fulfilled < read_op->len)
{
if (!fulfill_read(read_op, fulfilled, 0, block_size, ST_CURRENT, 0, clean_it->second.location))
if (!clean_entry_bitmap_size)
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
}
}
else
{
uint64_t meta_loc = clean_it->second.location >> block_order;
uint8_t *clean_entry_bitmap;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry));
if (!fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0, clean_it->second.location))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
}
}
else
{
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*clean_entry_bitmap_size);
}
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity;
while (bmp_start < bmp_size)
{
while (!(clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))) && bmp_end < bmp_size)
uint64_t meta_loc = clean_it->second.location >> block_order;
uint8_t *clean_entry_bitmap;
if (inmemory_meta)
{
bmp_end++;
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry));
}
if (bmp_end > bmp_start)
else
{
// fill with zeroes
fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, ST_DEL_STABLE, 0, 0);
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*clean_entry_bitmap_size);
}
bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity;
while (bmp_start < bmp_size)
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
if (!fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, ST_CURRENT, 0, clean_it->second.location + bmp_start * bitmap_granularity))
while (!(clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))) && bmp_end < bmp_size)
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
bmp_end++;
}
if (bmp_end > bmp_start)
{
// fill with zeroes
fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
}
bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
if (!fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0,
clean_it->second.location + bmp_start * bitmap_granularity))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
}
bmp_start = bmp_end;
}
}
}
}
@ -197,7 +218,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
else if (fulfilled < read_op->len)
{
// fill remaining parts with zeroes
fulfill_read(read_op, fulfilled, 0, block_size, ST_DEL_STABLE, 0, 0);
fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
}
assert(fulfilled == read_op->len);
read_op->version = result_version;

View File

@ -1,7 +1,14 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
{
if (PRIV(op)->op_state)
{
return continue_rollback(op);
}
obj_ver_id* v;
int i, todo = op->len;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
@ -14,8 +21,13 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
});
if (dirty_it == dirty_db.begin())
{
if (v->version == 0)
{
// Already rolled back
// FIXME Skip this object version
}
bad_op:
op->retval = -EINVAL;
op->retval = -ENOENT;
FINISH_OP(op);
return 1;
}
@ -31,7 +43,9 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
if (!IS_SYNCED(dirty_it->second.state) ||
IS_STABLE(dirty_it->second.state))
{
goto bad_op;
op->retval = -EBUSY;
FINISH_OP(op);
return 1;
}
if (dirty_it == dirty_db.begin())
{
@ -60,39 +74,12 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// FIXME This is here only for the purpose of tracking unstable_writes. Remove if not required
// FIXME ...aaaand this is similar to blockstore_init.cpp - maybe dedup it?
auto dirty_it = dirty_db.lower_bound((obj_ver_id){
.oid = v->oid,
.version = UINT64_MAX,
});
uint64_t max_unstable = 0;
while (dirty_it != dirty_db.begin())
{
dirty_it--;
if (dirty_it->first.oid != v->oid)
break;
else if (dirty_it->first.version <= v->version)
{
if (!IS_STABLE(dirty_it->second.state))
max_unstable = dirty_it->first.version;
break;
}
}
auto unstab_it = unstable_writes.find(v->oid);
if (unstab_it != unstable_writes.end())
{
if (max_unstable == 0)
unstable_writes.erase(unstab_it);
else
unstab_it->second = max_unstable;
}
journal_entry_rollback *je = (journal_entry_rollback*)
prefill_single_journal_entry(journal, JE_ROLLBACK, sizeof(journal_entry_rollback));
journal.sector_info[journal.cur_sector].dirty = false;
@ -103,21 +90,117 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
PRIV(op)->op_state = 1;
inflight_writes++;
return 1;
}
int blockstore_impl_t::continue_rollback(blockstore_op_t *op)
{
if (PRIV(op)->op_state == 2)
goto resume_2;
else if (PRIV(op)->op_state == 3)
goto resume_3;
else if (PRIV(op)->op_state == 5)
goto resume_5;
else
return 1;
resume_2:
// Release used journal sectors
release_journal_sectors(op);
resume_3:
if (!disable_journal_fsync)
{
io_uring_sqe *sqe = get_sqe();
if (!sqe)
{
return 0;
}
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 4;
return 1;
}
resume_5:
obj_ver_id* v;
int i;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
mark_rolled_back(*v);
}
journal.trim();
inflight_writes--;
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
return 1;
}
void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
{
auto it = dirty_db.lower_bound((obj_ver_id){
.oid = ov.oid,
.version = UINT64_MAX,
});
if (it != dirty_db.begin())
{
uint64_t max_unstable = 0;
auto rm_start = it;
auto rm_end = it;
it--;
while (it->first.oid == ov.oid &&
it->first.version > ov.version &&
!IS_IN_FLIGHT(it->second.state) &&
!IS_STABLE(it->second.state))
{
if (it->first.oid != ov.oid)
break;
else if (it->first.version <= ov.version)
{
if (!IS_STABLE(it->second.state))
max_unstable = it->first.version;
break;
}
else if (IS_STABLE(it->second.state))
break;
// Remove entry
rm_start = it;
if (it == dirty_db.begin())
break;
it--;
}
if (rm_start != rm_end)
{
erase_dirty(rm_start, rm_end, UINT64_MAX);
}
auto unstab_it = unstable_writes.find(ov.oid);
if (unstab_it != unstable_writes.end())
{
if (max_unstable == 0)
unstable_writes.erase(unstab_it);
else
unstab_it->second = max_unstable;
}
}
}
void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t *op)
{
live = true;
if (data->res != data->iov.iov_len)
{
inflight_writes--;
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
@ -126,37 +209,11 @@ void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t
PRIV(op)->pending_ops--;
if (PRIV(op)->pending_ops == 0)
{
// Release used journal sectors
release_journal_sectors(op);
obj_ver_id* v;
int i;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
PRIV(op)->op_state++;
if (!continue_rollback(op))
{
// Erase dirty_db entries
auto rm_end = dirty_db.lower_bound((obj_ver_id){
.oid = v->oid,
.version = UINT64_MAX,
});
rm_end--;
auto rm_start = rm_end;
while (1)
{
if (rm_end->first.oid != v->oid)
break;
else if (rm_end->first.version <= v->version)
break;
rm_start = rm_end;
if (rm_end == dirty_db.begin())
break;
rm_end--;
}
if (rm_end != rm_start)
erase_dirty(rm_start, rm_end, UINT64_MAX);
submit_queue.push_front(op);
}
journal.trim();
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
}
}
@ -173,11 +230,13 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
#endif
data_alloc->set(dirty_it->second.location >> block_order, false);
}
#ifdef BLOCKSTORE_DEBUG
printf("remove usage of journal offset %lu by %lu:%lu v%lu\n", dirty_it->second.journal_sector,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
int used = --journal.used_sectors[dirty_it->second.journal_sector];
#ifdef BLOCKSTORE_DEBUG
printf(
"remove usage of journal offset %08lx by %lx:%lx v%lu (%d refs)\n", dirty_it->second.journal_sector,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, used
);
#endif
if (used == 0)
{
journal.used_sectors.erase(dirty_it->second.journal_sector);

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
// Stabilize small write:
@ -40,6 +43,10 @@
int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
{
if (PRIV(op)->op_state)
{
return continue_stable(op);
}
obj_ver_id* v;
int i, todo = 0;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
@ -51,7 +58,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
if (clean_it == clean_db.end() || clean_it->second.version < v->version)
{
// No such object version
op->retval = -EINVAL;
op->retval = -ENOENT;
FINISH_OP(op);
return 1;
}
@ -60,10 +67,10 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// Already stable
}
}
else if (IS_UNSYNCED(dirty_it->second.state))
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet. Caller must sync it first
op->retval = EAGAIN;
op->retval = -EBUSY;
FINISH_OP(op);
return 1;
}
@ -98,18 +105,13 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
auto unstab_it = unstable_writes.find(v->oid);
if (unstab_it != unstable_writes.end() &&
unstab_it->second <= v->version)
{
unstable_writes.erase(unstab_it);
}
// FIXME: Only stabilize versions that aren't stable yet
journal_entry_stable *je = (journal_entry_stable*)
prefill_single_journal_entry(journal, JE_STABLE, sizeof(journal_entry_stable));
journal.sector_info[journal.cur_sector].dirty = false;
@ -120,21 +122,108 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
PRIV(op)->op_state = 1;
inflight_writes++;
return 1;
}
int blockstore_impl_t::continue_stable(blockstore_op_t *op)
{
if (PRIV(op)->op_state == 2)
goto resume_2;
else if (PRIV(op)->op_state == 3)
goto resume_3;
else if (PRIV(op)->op_state == 5)
goto resume_5;
else
return 1;
resume_2:
// Release used journal sectors
release_journal_sectors(op);
resume_3:
if (!disable_journal_fsync)
{
io_uring_sqe *sqe = get_sqe();
if (!sqe)
{
return 0;
}
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 4;
return 1;
}
resume_5:
// Mark dirty_db entries as stable, acknowledge op completion
obj_ver_id* v;
int i;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// Mark all dirty_db entries up to op->version as stable
mark_stable(*v);
}
inflight_writes--;
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
return 1;
}
void blockstore_impl_t::mark_stable(const obj_ver_id & v)
{
auto dirty_it = dirty_db.find(v);
if (dirty_it != dirty_db.end())
{
while (1)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
}
else if (IS_STABLE(dirty_it->second.state))
{
break;
}
if (dirty_it == dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != v.oid)
{
break;
}
}
#ifdef BLOCKSTORE_DEBUG
printf("enqueue_flush %lx:%lx v%lu\n", v.oid.inode, v.oid.stripe, v.version);
#endif
flusher->enqueue_flush(v);
}
auto unstab_it = unstable_writes.find(v.oid);
if (unstab_it != unstable_writes.end() &&
unstab_it->second <= v.version)
{
unstable_writes.erase(unstab_it);
}
}
void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *op)
{
live = true;
if (data->res != data->iov.iov_len)
{
inflight_writes--;
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
"). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
@ -143,53 +232,10 @@ void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *
PRIV(op)->pending_ops--;
if (PRIV(op)->pending_ops == 0)
{
// Release used journal sectors
release_journal_sectors(op);
// Mark dirty_db entries as stable, acknowledge op completion
obj_ver_id* v;
int i;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
PRIV(op)->op_state++;
if (!continue_stable(op))
{
// Mark all dirty_db entries up to op->version as stable
auto dirty_it = dirty_db.find(*v);
if (dirty_it != dirty_db.end())
{
while (1)
{
if (dirty_it->second.state == ST_J_SYNCED)
{
dirty_it->second.state = ST_J_STABLE;
}
else if (dirty_it->second.state == ST_D_META_SYNCED)
{
dirty_it->second.state = ST_D_STABLE;
}
else if (dirty_it->second.state == ST_DEL_SYNCED)
{
dirty_it->second.state = ST_DEL_STABLE;
}
else if (IS_STABLE(dirty_it->second.state))
{
break;
}
if (dirty_it == dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != v->oid)
{
break;
}
}
#ifdef BLOCKSTORE_DEBUG
printf("enqueue_flush %lu:%lu v%lu\n", v->oid.inode, v->oid.stripe, v->version);
#endif
flusher->enqueue_flush(*v);
}
submit_queue.push_front(op);
}
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
}
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
#define SYNC_HAS_SMALL 1
@ -11,7 +14,7 @@
int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
{
if (PRIV(op)->sync_state == 0)
if (PRIV(op)->op_state == 0)
{
stop_sync_submitted = false;
PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
@ -21,11 +24,11 @@ int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
unsynced_big_writes.clear();
unsynced_small_writes.clear();
if (PRIV(op)->sync_big_writes.size() > 0)
PRIV(op)->sync_state = SYNC_HAS_BIG;
PRIV(op)->op_state = SYNC_HAS_BIG;
else if (PRIV(op)->sync_small_writes.size() > 0)
PRIV(op)->sync_state = SYNC_HAS_SMALL;
PRIV(op)->op_state = SYNC_HAS_SMALL;
else
PRIV(op)->sync_state = SYNC_DONE;
PRIV(op)->op_state = SYNC_DONE;
// Always add sync to in_progress_syncs because we clear unsynced_big_writes and unsynced_small_writes
PRIV(op)->prev_sync_count = in_progress_syncs.size();
PRIV(op)->in_progress_ptr = in_progress_syncs.insert(in_progress_syncs.end(), op);
@ -38,7 +41,7 @@ int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
int blockstore_impl_t::continue_sync(blockstore_op_t *op)
{
auto cb = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
if (PRIV(op)->sync_state == SYNC_HAS_SMALL)
if (PRIV(op)->op_state == SYNC_HAS_SMALL)
{
// No big writes, just fsync the journal
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
@ -54,17 +57,17 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
// Write out the last journal sector if it happens to be dirty
BS_SUBMIT_GET_ONLY_SQE(sqe);
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_SENT;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
return 1;
}
else
{
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_DONE;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_DONE;
}
}
if (PRIV(op)->sync_state == SYNC_HAS_BIG)
if (PRIV(op)->op_state == SYNC_HAS_BIG)
{
for (; PRIV(op)->sync_big_checked < PRIV(op)->sync_big_writes.size(); PRIV(op)->sync_big_checked++)
{
@ -81,17 +84,17 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
my_uring_prep_fsync(sqe, data_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = cb;
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->sync_state = SYNC_DATA_SYNC_SENT;
PRIV(op)->op_state = SYNC_DATA_SYNC_SENT;
return 1;
}
else
{
PRIV(op)->sync_state = SYNC_DATA_SYNC_DONE;
PRIV(op)->op_state = SYNC_DATA_SYNC_DONE;
}
}
if (PRIV(op)->sync_state == SYNC_DATA_SYNC_DONE)
if (PRIV(op)->op_state == SYNC_DATA_SYNC_DONE)
{
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
@ -121,19 +124,25 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
while (it != PRIV(op)->sync_big_writes.end())
{
journal_entry_big_write *je = (journal_entry_big_write*)
prefill_single_journal_entry(journal, JE_BIG_WRITE, sizeof(journal_entry_big_write));
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
);
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.sector_info[journal.cur_sector].dirty = false;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version);
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = it->oid;
je->version = it->version;
@ -146,17 +155,17 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
if (cur_sector != journal.cur_sector)
{
if (cur_sector == -1)
PRIV(op)->min_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
cur_sector = journal.cur_sector;
prepare_journal_sector_write(journal, cur_sector, sqe[s++], cb);
}
}
PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = s;
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_SENT;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
return 1;
}
if (PRIV(op)->sync_state == SYNC_JOURNAL_WRITE_DONE)
if (PRIV(op)->op_state == SYNC_JOURNAL_WRITE_DONE)
{
if (!disable_journal_fsync)
{
@ -165,17 +174,17 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
data->iov = { 0 };
data->callback = cb;
PRIV(op)->pending_ops = 1;
PRIV(op)->sync_state = SYNC_JOURNAL_SYNC_SENT;
PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
return 1;
}
else
{
PRIV(op)->sync_state = SYNC_DONE;
PRIV(op)->op_state = SYNC_DONE;
}
}
if (PRIV(op)->sync_state == SYNC_DONE)
if (PRIV(op)->op_state == SYNC_DONE)
{
ack_sync(op);
return ack_sync(op);
}
return 1;
}
@ -196,17 +205,17 @@ void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op
// Release used journal sectors
release_journal_sectors(op);
// Handle states
if (PRIV(op)->sync_state == SYNC_DATA_SYNC_SENT)
if (PRIV(op)->op_state == SYNC_DATA_SYNC_SENT)
{
PRIV(op)->sync_state = SYNC_DATA_SYNC_DONE;
PRIV(op)->op_state = SYNC_DATA_SYNC_DONE;
}
else if (PRIV(op)->sync_state == SYNC_JOURNAL_WRITE_SENT)
else if (PRIV(op)->op_state == SYNC_JOURNAL_WRITE_SENT)
{
PRIV(op)->sync_state = SYNC_JOURNAL_WRITE_DONE;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_DONE;
}
else if (PRIV(op)->sync_state == SYNC_JOURNAL_SYNC_SENT)
else if (PRIV(op)->op_state == SYNC_JOURNAL_SYNC_SENT)
{
PRIV(op)->sync_state = SYNC_DONE;
PRIV(op)->op_state = SYNC_DONE;
ack_sync(op);
}
else
@ -218,7 +227,7 @@ void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op
int blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
if (PRIV(op)->sync_state == SYNC_DONE && PRIV(op)->prev_sync_count == 0)
if (PRIV(op)->op_state == SYNC_DONE && PRIV(op)->prev_sync_count == 0)
{
// Remove dependency of subsequent syncs
auto it = PRIV(op)->in_progress_ptr;
@ -230,14 +239,14 @@ int blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
auto & next_sync = *it++;
PRIV(next_sync)->prev_sync_count -= done_syncs;
if (PRIV(next_sync)->prev_sync_count == 0 && PRIV(next_sync)->sync_state == SYNC_DONE)
if (PRIV(next_sync)->prev_sync_count == 0 && PRIV(next_sync)->op_state == SYNC_DONE)
{
done_syncs++;
// Acknowledge next_sync
ack_one_sync(next_sync);
}
}
return 1;
return 2;
}
return 0;
}
@ -248,20 +257,47 @@ void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync big %lu:%lu v%lu\n", it->oid.inode, it->oid.stripe, it->version);
printf("Ack sync big %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;
dirty_db[*it].state = ST_D_META_SYNCED;
auto dirty_it = dirty_db.find(*it);
dirty_it->second.state = ((dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SYNCED);
if (dirty_it->second.state & BS_ST_INSTANT)
{
mark_stable(dirty_it->first);
}
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == it->oid)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
}
dirty_it++;
}
}
for (auto it = PRIV(op)->sync_small_writes.begin(); it != PRIV(op)->sync_small_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync small %lu:%lu v%lu\n", it->oid.inode, it->oid.stripe, it->version);
printf("Ack sync small %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;
dirty_db[*it].state = dirty_db[*it].state == ST_DEL_WRITTEN ? ST_DEL_SYNCED : ST_J_SYNCED;
if (dirty_db[*it].state == (BS_ST_DELETE | BS_ST_WRITTEN))
{
dirty_db[*it].state = (BS_ST_DELETE | BS_ST_SYNCED);
// Deletions are treated as immediately stable
mark_stable(*it);
}
else /* (BS_ST_INSTANT?) | BS_ST_SMALL_WRITE | BS_ST_WRITTEN */
{
dirty_db[*it].state = (dirty_db[*it].state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SYNCED;
if (dirty_db[*it].state & BS_ST_INSTANT)
{
mark_stable(*it);
}
}
}
in_progress_syncs.erase(PRIV(op)->in_progress_ptr);
op->retval = 0;

View File

@ -1,9 +1,13 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "blockstore_impl.h"
bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{
// Check or assign version number
bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
bool is_inflight_big = false;
uint64_t version = 1;
if (dirty_db.size() > 0)
{
@ -17,6 +21,9 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
found = true;
version = dirty_it->first.version + 1;
deleted = IS_DELETE(dirty_it->second.state);
is_inflight_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? !IS_SYNCED(dirty_it->second.state)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
}
}
if (!found)
@ -38,7 +45,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
else if (op->version < version)
{
// Invalid version requested
op->retval = -EINVAL;
op->retval = -EEXIST;
return false;
}
if (deleted && is_del)
@ -47,18 +54,36 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
op->retval = 0;
return false;
}
// Immediately add the operation into dirty_db, so subsequent reads could see it
if (is_inflight_big && !is_del && !deleted && op->len < block_size &&
immediate_commit != IMMEDIATE_ALL)
{
// Issue an additional sync so that the previous big write can reach the journal
blockstore_op_t *sync_op = new blockstore_op_t;
sync_op->opcode = BS_OP_SYNC;
sync_op->callback = [this, op](blockstore_op_t *sync_op)
{
delete sync_op;
};
enqueue_op(sync_op);
}
#ifdef BLOCKSTORE_DEBUG
printf("%s %lu:%lu v%lu\n", is_del ? "Delete" : "Write", op->oid.inode, op->oid.stripe, op->version);
if (is_del)
printf("Delete %lx:%lx v%lu\n", op->oid.inode, op->oid.stripe, op->version);
else
printf("Write %lx:%lx v%lu offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
#endif
// No strict need to add it into dirty_db here, it's just left
// from the previous implementation where reads waited for writes
dirty_db.emplace((obj_ver_id){
.oid = op->oid,
.version = op->version,
}, (dirty_entry){
.state = (uint32_t)(
is_del
? ST_DEL_IN_FLIGHT
: (op->len == block_size || deleted ? ST_D_IN_FLIGHT : ST_J_IN_FLIGHT)
? (BS_ST_DELETE | BS_ST_IN_FLIGHT)
: (op->opcode == BS_OP_WRITE_STABLE ? BS_ST_INSTANT : 0) | (op->len == block_size || deleted
? (BS_ST_BIG_WRITE | BS_ST_IN_FLIGHT)
: (is_inflight_big ? (BS_ST_SMALL_WRITE | BS_ST_WAIT_BIG) : (BS_ST_SMALL_WRITE | BS_ST_IN_FLIGHT)))
),
.flags = 0,
.location = 0,
@ -72,11 +97,21 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// First step of the write algorithm: dequeue operation and submit initial write(s)
int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
if (PRIV(op)->op_state)
{
return continue_write(op);
}
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
if (dirty_it->second.state == ST_D_IN_FLIGHT)
assert(dirty_it != dirty_db.end());
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
{
// Don't dequeue
return 0;
}
else if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
@ -100,7 +135,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
BS_SUBMIT_GET_SQE(sqe, data);
dirty_it->second.location = loc << block_order;
dirty_it->second.state = ST_D_SUBMITTED;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block %lu\n", loc);
#endif
@ -125,14 +160,22 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
sqe, data_fd, PRIV(op)->iov_zerofill, vcnt, data_offset + (loc << block_order) + op->offset - stripe_offset
);
PRIV(op)->pending_ops = 1;
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
if (immediate_commit != IMMEDIATE_ALL)
{
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
PRIV(op)->op_state = 3;
}
else
{
PRIV(op)->op_state = 1;
}
}
else
else /* if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_SMALL_WRITE) */
{
// Small (journaled) write
// First check if the journal has sufficient space
@ -144,10 +187,11 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
// There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL;
if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
journal.sector_info[journal.cur_sector].dirty)
{
// Write current journal sector only if it's dirty and full
// Write current journal sector only if it's dirty and full, or in the immediate_commit mode
BS_SUBMIT_GET_SQE_DECL(sqe1);
}
struct io_uring_sqe *sqe2 = NULL;
@ -157,24 +201,32 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
// Got SQEs. Prepare previous journal sector write if required
auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
if (sqe1)
if (immediate_commit == IMMEDIATE_NONE)
{
prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
// FIXME rename to min/max _flushing
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
else
{
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
if (sqe1)
{
prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
else
{
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
}
}
// Then pre-fill journal entry
journal_entry_small_write *je = (journal_entry_small_write*)
prefill_single_journal_entry(journal, JE_SMALL_WRITE, sizeof(journal_entry_small_write));
journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
sizeof(journal_entry_small_write)
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
// Figure out where data will be
journal.next_free = (journal.next_free + op->len) <= journal.len ? journal.next_free : journal_block_size;
@ -186,6 +238,12 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
je->crc32_data = crc32c(0, op->buf, op->len);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (immediate_commit != IMMEDIATE_NONE)
{
prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
if (op->len > 0)
{
// Prepare journal data write
@ -207,22 +265,119 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Zero-length overwrite. Allowed to bump object version in EC placement groups without actually writing data
}
dirty_it->second.location = journal.next_free;
dirty_it->second.state = ST_J_SUBMITTED;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
journal.next_free += op->len;
if (journal.next_free >= journal.len)
{
journal.next_free = journal_block_size;
}
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
if (immediate_commit == IMMEDIATE_NONE)
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
ack_write(op);
PRIV(op)->op_state = 4;
continue_write(op);
}
else
{
PRIV(op)->op_state = 3;
}
}
inflight_writes++;
return 1;
}
int blockstore_impl_t::continue_write(blockstore_op_t *op)
{
io_uring_sqe *sqe = NULL;
journal_entry_big_write *je;
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
if (PRIV(op)->op_state == 2)
goto resume_2;
else if (PRIV(op)->op_state == 4)
goto resume_4;
else
return 1;
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
sqe = get_sqe();
if (!sqe)
{
return 0;
}
je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.sector_info[journal.cur_sector].dirty = false;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe,
[this, op](ring_data_t *data) { handle_write_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
resume_4:
// Switch object state
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = %d\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? (immediate_commit == IMMEDIATE_ALL)
: (immediate_commit != IMMEDIATE_NONE);
if (imm)
{
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Deletions are treated as immediately stable
mark_stable(dirty_it->first);
}
if (immediate_commit == IMMEDIATE_ALL)
{
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
}
dirty_it++;
}
}
inflight_writes--;
// Acknowledge write
op->retval = op->len;
FINISH_OP(op);
return 1;
}
@ -231,6 +386,7 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
live = true;
if (data->res != data->iov.iov_len)
{
inflight_writes--;
// FIXME: our state becomes corrupted after a write error. maybe do something better than just die
throw std::runtime_error(
"write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
@ -241,88 +397,118 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
if (PRIV(op)->pending_ops == 0)
{
release_journal_sectors(op);
ack_write(op);
PRIV(op)->op_state++;
if (!continue_write(op))
{
submit_queue.push_front(op);
}
}
}
void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
{
// Release used journal sectors
if (PRIV(op)->min_used_journal_sector > 0 &&
PRIV(op)->max_used_journal_sector > 0)
// Release flushed journal sectors
if (PRIV(op)->min_flushed_journal_sector > 0 &&
PRIV(op)->max_flushed_journal_sector > 0)
{
uint64_t s = PRIV(op)->min_used_journal_sector;
uint64_t s = PRIV(op)->min_flushed_journal_sector;
while (1)
{
journal.sector_info[s-1].usage_count--;
if (s == PRIV(op)->max_used_journal_sector)
if (s != (1+journal.cur_sector) && journal.sector_info[s-1].usage_count == 0)
{
// We know for sure that we won't write into this sector anymore
uint64_t new_ds = journal.sector_info[s-1].offset + journal.block_size;
if (new_ds >= journal.len)
{
new_ds = journal.block_size;
}
if ((journal.dirty_start + (journal.dirty_start >= journal.used_start ? 0 : journal.len)) <
(new_ds + (new_ds >= journal.used_start ? 0 : journal.len)))
{
journal.dirty_start = new_ds;
}
}
if (s == PRIV(op)->max_flushed_journal_sector)
break;
s = 1 + s % journal.sector_count;
}
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 0;
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
}
}
void blockstore_impl_t::ack_write(blockstore_op_t *op)
{
// Switch object state
auto & dirty_entry = dirty_db[(obj_ver_id){
.oid = op->oid,
.version = op->version,
}];
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lu:%lu v%lu = %d\n", op->oid.inode, op->oid.stripe, op->version, dirty_entry.state);
#endif
if (dirty_entry.state == ST_J_SUBMITTED)
{
dirty_entry.state = ST_J_WRITTEN;
}
else if (dirty_entry.state == ST_D_SUBMITTED)
{
dirty_entry.state = ST_D_WRITTEN;
}
else if (dirty_entry.state == ST_DEL_SUBMITTED)
{
dirty_entry.state = ST_DEL_WRITTEN;
}
// Acknowledge write without sync
op->retval = op->len;
FINISH_OP(op);
}
int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, 1, sizeof(journal_entry_del), 0))
{
return 0;
}
BS_SUBMIT_GET_ONLY_SQE(sqe);
io_uring_sqe *sqe = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
journal.sector_info[journal.cur_sector].dirty)
{
// Write current journal sector only if it's dirty and full, or in the immediate_commit mode
BS_SUBMIT_GET_SQE_DECL(sqe);
}
auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
// Prepare journal sector write
journal_entry_del *je = (journal_entry_del*)
prefill_single_journal_entry(journal, JE_DELETE, sizeof(struct journal_entry_del));
if (immediate_commit == IMMEDIATE_NONE)
{
if (sqe)
{
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
else
{
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
}
}
// Pre-fill journal entry
journal_entry_del *je = (journal_entry_del*)prefill_single_journal_entry(
journal, JE_DELETE, sizeof(struct journal_entry_del)
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf("journal offset %lu is used by %lu:%lu v%lu\n", dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = op->oid;
je->version = op->version;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
PRIV(op)->min_used_journal_sector = PRIV(op)->max_used_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
dirty_it->second.state = ST_DEL_SUBMITTED;
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
dirty_it->second.state = BS_ST_DELETE | BS_ST_SUBMITTED;
if (immediate_commit != IMMEDIATE_NONE)
{
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
continue_write(op);
}
else
{
PRIV(op)->op_state = 3;
}
return 1;
}

745
cluster_client.cpp Normal file
View File

@ -0,0 +1,745 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "cluster_client.h"
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{
this->ringloop = ringloop;
this->tfd = tfd;
msgr.osd_num = 0;
msgr.tfd = tfd;
msgr.ringloop = ringloop;
msgr.repeer_pgs = [this](osd_num_t peer_osd)
{
if (msgr.osd_peer_fds.find(peer_osd) != msgr.osd_peer_fds.end())
{
// peer_osd just connected
continue_ops();
}
else if (unsynced_writes.size())
{
// peer_osd just dropped connection
for (auto op: syncing_writes)
{
for (auto & part: op->parts)
{
if (part.osd_num == peer_osd && part.done)
{
// repeat this operation
part.osd_num = 0;
part.done = false;
assert(!part.sent);
op->done_count--;
}
}
}
for (auto op: unsynced_writes)
{
for (auto & part: op->parts)
{
if (part.osd_num == peer_osd && part.done)
{
// repeat this operation
part.osd_num = 0;
part.done = false;
assert(!part.sent);
op->done_count--;
}
}
if (op->done_count < op->parts.size())
{
cur_ops.insert(op);
}
}
continue_ops();
}
};
msgr.exec_op = [this](osd_op_t *op)
{
// Garbage in
printf("Incoming garbage from peer %d\n", op->peer_fd);
msgr.stop_client(op->peer_fd);
delete op;
};
msgr.use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
st_cli.tfd = tfd;
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
st_cli.on_change_osd_state_hook = [this](uint64_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_hook(changes); };
st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
log_level = config["log_level"].int64_value();
st_cli.parse_config(config);
st_cli.load_global_config();
if (ringloop)
{
consumer.loop = [this]()
{
msgr.read_requests();
msgr.send_replies();
this->ringloop->submit();
};
ringloop->register_consumer(&consumer);
}
}
cluster_client_t::~cluster_client_t()
{
if (ringloop)
{
ringloop->unregister_consumer(&consumer);
}
}
void cluster_client_t::stop()
{
while (msgr.clients.size() > 0)
{
msgr.stop_client(msgr.clients.begin()->first);
}
}
void cluster_client_t::continue_ops(bool up_retry)
{
for (auto op_it = cur_ops.begin(); op_it != cur_ops.end(); )
{
if ((*op_it)->up_wait)
{
if (up_retry)
{
(*op_it)->up_wait = false;
continue_rw(*op_it++);
}
else
op_it++;
}
else
continue_rw(*op_it++);
}
}
static uint32_t is_power_of_two(uint64_t value)
{
uint32_t l = 0;
while (value > 1)
{
if (value & 1)
{
return 64;
}
value = value >> 1;
l++;
}
return l;
}
void cluster_client_t::on_load_config_hook(json11::Json::object & config)
{
bs_block_size = config["block_size"].uint64_value();
bs_disk_alignment = config["disk_alignment"].uint64_value();
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_block_size)
{
bs_block_size = DEFAULT_BLOCK_SIZE;
}
if (!bs_disk_alignment)
{
bs_disk_alignment = DEFAULT_DISK_ALIGNMENT;
}
if (!bs_bitmap_granularity)
{
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
}
uint32_t block_order;
if ((block_order = is_power_of_two(bs_block_size)) >= 64 || bs_block_size < MIN_BLOCK_SIZE || bs_block_size >= MAX_BLOCK_SIZE)
{
throw std::runtime_error("Bad block size");
}
if (config["immediate_commit"] == "all")
{
// Cluster-wide immediate_commit mode
immediate_commit = true;
}
else if (config.find("client_dirty_limit") != config.end())
{
client_dirty_limit = config["client_dirty_limit"].uint64_value();
}
if (!client_dirty_limit)
{
client_dirty_limit = DEFAULT_CLIENT_DIRTY_LIMIT;
}
up_wait_retry_interval = config["up_wait_retry_interval"].uint64_value();
if (!up_wait_retry_interval)
{
up_wait_retry_interval = 500;
}
else if (up_wait_retry_interval < 50)
{
up_wait_retry_interval = 50;
}
msgr.peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!msgr.peer_connect_interval)
{
msgr.peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
msgr.peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!msgr.peer_connect_timeout)
{
msgr.peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
st_cli.start_etcd_watcher();
st_cli.load_pgs();
}
void cluster_client_t::on_load_pgs_hook(bool success)
{
for (auto pool_item: st_cli.pool_config)
{
pg_counts[pool_item.first] = pool_item.second.real_pg_count;
}
for (auto op: offline_ops)
{
execute(op);
}
offline_ops.clear();
continue_ops();
}
void cluster_client_t::on_change_hook(json11::Json::object & changes)
{
for (auto pool_item: st_cli.pool_config)
{
if (pg_counts[pool_item.first] != pool_item.second.real_pg_count)
{
// At this point, all pool operations should have been suspended
// And now they have to be resliced!
for (auto op: cur_ops)
{
if (INODE_POOL(op->inode) == pool_item.first)
{
op->needs_reslice = true;
}
}
for (auto op: unsynced_writes)
{
if (INODE_POOL(op->inode) == pool_item.first)
{
op->needs_reslice = true;
}
}
for (auto op: syncing_writes)
{
if (INODE_POOL(op->inode) == pool_item.first)
{
op->needs_reslice = true;
}
}
pg_counts[pool_item.first] = pool_item.second.real_pg_count;
}
}
continue_ops();
}
void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
{
if (msgr.wanted_peers.find(peer_osd) != msgr.wanted_peers.end())
{
msgr.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
}
/**
* How writes are synced when immediate_commit is false
*
* 1) accept up to <client_dirty_limit> write operations for execution,
* queue all subsequent writes into <next_writes>
* 2) accept exactly one SYNC, queue all subsequent SYNCs into <next_writes>, too
* 3) "continue" all accepted writes
*
* "Continue" WRITE:
* 1) if the operation is not a copy yet - copy it (required for replay)
* 2) if the operation is not sliced yet - slice it
* 3) if the operation doesn't require reslice - try to connect & send all remaining parts
* 4) if any of them fail due to disconnected peers or PGs not up, repeat after reconnecting or small timeout
* 5) if any of them fail due to other errors, fail the operation and forget it from the current "unsynced batch"
* 6) if PG count changes before all parts are done, wait for all in-progress parts to finish,
* throw all results away, reslice and resubmit op
* 7) when all parts are done, try to "continue" the current SYNC
* 8) if the operation succeeds, but then some OSDs drop their connections, repeat
* parts from the current "unsynced batch" previously sent to those OSDs in any order
*
* "Continue" current SYNC:
* 1) take all unsynced operations from the current batch
* 2) check if all affected OSDs are still alive
* 3) if yes, send all SYNCs. otherwise, leave current SYNC as is.
* 4) if any of them fail due to disconnected peers, repeat SYNC after repeating all writes
* 5) if any of them fail due to other errors, fail the SYNC operation
*/
void cluster_client_t::execute(cluster_op_t *op)
{
if (!bs_disk_alignment)
{
// We're offline
offline_ops.push_back(op);
return;
}
op->retval = 0;
if (op->opcode != OSD_OP_SYNC && op->opcode != OSD_OP_READ && op->opcode != OSD_OP_WRITE ||
(op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && (!op->inode || !op->len ||
op->offset % bs_disk_alignment || op->len % bs_disk_alignment))
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
if (op->opcode == OSD_OP_SYNC)
{
execute_sync(op);
return;
}
if (op->opcode == OSD_OP_WRITE && !immediate_commit)
{
if (next_writes.size() > 0)
{
assert(cur_sync);
next_writes.push_back(op);
return;
}
if (queued_bytes >= client_dirty_limit)
{
// Push an extra SYNC operation to flush previous writes
next_writes.push_back(op);
cluster_op_t *sync_op = new cluster_op_t;
sync_op->is_internal = true;
sync_op->opcode = OSD_OP_SYNC;
sync_op->callback = [](cluster_op_t* sync_op) {};
execute_sync(sync_op);
return;
}
queued_bytes += op->len;
}
cur_ops.insert(op);
continue_rw(op);
}
void cluster_client_t::continue_rw(cluster_op_t *op)
{
pool_id_t pool_id = INODE_POOL(op->inode);
if (!pool_id)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
if (st_cli.pool_config.find(pool_id) == st_cli.pool_config.end() ||
st_cli.pool_config[pool_id].real_pg_count == 0)
{
// Postpone operations to unknown pools
return;
}
if (op->opcode == OSD_OP_WRITE && !immediate_commit && !op->is_internal)
{
// Save operation for replay when PG goes out of sync
// (primary OSD drops our connection in this case)
cluster_op_t *op_copy = new cluster_op_t();
op_copy->is_internal = true;
op_copy->orig_op = op;
op_copy->opcode = op->opcode;
op_copy->inode = op->inode;
op_copy->offset = op->offset;
op_copy->len = op->len;
op_copy->buf = malloc_or_die(op->len);
op_copy->iov.push_back(op_copy->buf, op->len);
op_copy->callback = [](cluster_op_t* op_copy)
{
if (op_copy->orig_op)
{
// Acknowledge write and forget the original pointer
op_copy->orig_op->retval = op_copy->retval;
std::function<void(cluster_op_t*)>(op_copy->orig_op->callback)(op_copy->orig_op);
op_copy->orig_op = NULL;
}
};
void *cur_buf = op_copy->buf;
for (int i = 0; i < op->iov.count; i++)
{
memcpy(cur_buf, op->iov.buf[i].iov_base, op->iov.buf[i].iov_len);
cur_buf += op->iov.buf[i].iov_len;
}
unsynced_writes.push_back(op_copy);
cur_ops.erase(op);
cur_ops.insert(op_copy);
op = op_copy;
}
if (!op->parts.size())
{
// Slice the operation into parts
slice_rw(op);
}
if (!op->needs_reslice)
{
// Send unsent parts, if they're not subject to change
for (auto & op_part: op->parts)
{
if (!op_part.sent && !op_part.done)
{
try_send(op, &op_part);
}
}
}
if (!op->sent_count)
{
if (op->done_count >= op->parts.size())
{
// Finished successfully
// Even if the PG count has changed in meanwhile we treat it as success
// because if some operations were invalid for the new PG count we'd get errors
cur_ops.erase(op);
op->retval = op->len;
std::function<void(cluster_op_t*)>(op->callback)(op);
continue_sync();
return;
}
else if (op->retval != 0 && op->retval != -EPIPE)
{
// Fatal error (not -EPIPE)
cur_ops.erase(op);
if (!immediate_commit && op->opcode == OSD_OP_WRITE)
{
for (int i = 0; i < unsynced_writes.size(); i++)
{
if (unsynced_writes[i] == op)
{
unsynced_writes.erase(unsynced_writes.begin()+i, unsynced_writes.begin()+i+1);
break;
}
}
}
bool del = op->is_internal;
std::function<void(cluster_op_t*)>(op->callback)(op);
if (del)
{
if (op->buf)
free(op->buf);
delete op;
}
continue_sync();
return;
}
else
{
// -EPIPE or no error - clear the error
op->retval = 0;
if (op->needs_reslice)
{
op->parts.clear();
op->done_count = 0;
op->needs_reslice = false;
continue_rw(op);
}
}
}
}
void cluster_client_t::slice_rw(cluster_op_t *op)
{
// Slice the request into individual object stripe requests
// Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
uint64_t pg_block_size = bs_block_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_minsize
);
uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
op->retval = 0;
op->parts.resize((last_stripe - first_stripe) / pg_block_size + 1);
int iov_idx = 0;
size_t iov_pos = 0;
int i = 0;
for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
{
pg_num_t pg_num = (op->inode + stripe/pool_cfg.pg_stripe_size) % pool_cfg.real_pg_count + 1;
uint64_t begin = (op->offset < stripe ? stripe : op->offset);
uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
? (stripe + pg_block_size) : (op->offset + op->len);
op->parts[i] = {
.parent = op,
.offset = begin,
.len = (uint32_t)(end - begin),
.pg_num = pg_num,
.sent = false,
.done = false,
};
int left = end-begin;
while (left > 0 && iov_idx < op->iov.count)
{
if (op->iov.buf[iov_idx].iov_len - iov_pos < left)
{
op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, op->iov.buf[iov_idx].iov_len - iov_pos);
left -= (op->iov.buf[iov_idx].iov_len - iov_pos);
iov_pos = 0;
iov_idx++;
}
else
{
op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, left);
iov_pos += left;
left = 0;
}
}
assert(left == 0);
i++;
}
}
bool cluster_client_t::try_send(cluster_op_t *op, cluster_op_part_t *part)
{
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
auto pg_it = pool_cfg.pg_config.find(part->pg_num);
if (pg_it != pool_cfg.pg_config.end() &&
!pg_it->second.pause && pg_it->second.cur_primary)
{
osd_num_t primary_osd = pg_it->second.cur_primary;
auto peer_it = msgr.osd_peer_fds.find(primary_osd);
if (peer_it != msgr.osd_peer_fds.end())
{
int peer_fd = peer_it->second;
part->osd_num = primary_osd;
part->sent = true;
op->sent_count++;
part->op = {
.op_type = OSD_OP_OUT,
.peer_fd = peer_fd,
.req = { .rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = op_id++,
.opcode = op->opcode,
},
.inode = op->inode,
.offset = part->offset,
.len = part->len,
} },
.callback = [this, part](osd_op_t *op_part)
{
handle_op_part(part);
},
};
part->op.iov = part->iov;
msgr.outbox_push(&part->op);
return true;
}
else if (msgr.wanted_peers.find(primary_osd) == msgr.wanted_peers.end())
{
msgr.connect_peer(primary_osd, st_cli.peer_states[primary_osd]);
}
}
return false;
}
void cluster_client_t::execute_sync(cluster_op_t *op)
{
if (immediate_commit)
{
// Syncs are not required in the immediate_commit mode
op->retval = 0;
std::function<void(cluster_op_t*)>(op->callback)(op);
}
else if (cur_sync != NULL)
{
next_writes.push_back(op);
}
else
{
cur_sync = op;
continue_sync();
}
}
void cluster_client_t::continue_sync()
{
if (!cur_sync || cur_sync->parts.size() > 0)
{
// Already submitted
return;
}
cur_sync->retval = 0;
std::set<osd_num_t> sync_osds;
for (auto prev_op: unsynced_writes)
{
if (prev_op->done_count < prev_op->parts.size())
{
// Writes not finished yet
return;
}
for (auto & part: prev_op->parts)
{
if (part.osd_num)
{
sync_osds.insert(part.osd_num);
}
}
}
if (!sync_osds.size())
{
// No dirty writes
finish_sync();
return;
}
// Check that all OSD connections are still alive
for (auto sync_osd: sync_osds)
{
auto peer_it = msgr.osd_peer_fds.find(sync_osd);
if (peer_it == msgr.osd_peer_fds.end())
{
// SYNC is pointless to send to a non connected OSD
return;
}
}
syncing_writes.swap(unsynced_writes);
// Post sync to affected OSDs
cur_sync->parts.resize(sync_osds.size());
int i = 0;
for (auto sync_osd: sync_osds)
{
cur_sync->parts[i] = {
.parent = cur_sync,
.osd_num = sync_osd,
.sent = false,
.done = false,
};
send_sync(cur_sync, &cur_sync->parts[i]);
i++;
}
}
void cluster_client_t::finish_sync()
{
int retval = cur_sync->retval;
if (retval != 0)
{
for (auto op: syncing_writes)
{
if (op->done_count < op->parts.size())
{
cur_ops.insert(op);
}
}
unsynced_writes.insert(unsynced_writes.begin(), syncing_writes.begin(), syncing_writes.end());
syncing_writes.clear();
}
if (retval == -EPIPE)
{
// Retry later
cur_sync->parts.clear();
cur_sync->retval = 0;
cur_sync->sent_count = 0;
cur_sync->done_count = 0;
return;
}
std::function<void(cluster_op_t*)>(cur_sync->callback)(cur_sync);
if (!retval)
{
for (auto op: syncing_writes)
{
assert(op->sent_count == 0);
if (op->is_internal)
{
if (op->buf)
free(op->buf);
delete op;
}
}
syncing_writes.clear();
}
cur_sync = NULL;
queued_bytes = 0;
std::vector<cluster_op_t*> next_wr_copy;
next_wr_copy.swap(next_writes);
for (auto next_op: next_wr_copy)
{
execute(next_op);
}
}
void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
{
auto peer_it = msgr.osd_peer_fds.find(part->osd_num);
assert(peer_it != msgr.osd_peer_fds.end());
part->sent = true;
op->sent_count++;
part->op = {
.op_type = OSD_OP_OUT,
.peer_fd = peer_it->second,
.req = {
.hdr = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = op_id++,
.opcode = OSD_OP_SYNC,
},
},
.callback = [this, part](osd_op_t *op_part)
{
handle_op_part(part);
},
};
msgr.outbox_push(&part->op);
}
void cluster_client_t::handle_op_part(cluster_op_part_t *part)
{
cluster_op_t *op = part->parent;
part->sent = false;
op->sent_count--;
int expected = part->op.req.hdr.opcode == OSD_OP_SYNC ? 0 : part->op.req.rw.len;
if (part->op.reply.hdr.retval != expected)
{
// Operation failed, retry
printf(
"Operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
part->osd_num, part->op.reply.hdr.retval, expected
);
msgr.stop_client(part->op.peer_fd);
if (part->op.reply.hdr.retval == -EPIPE)
{
op->up_wait = true;
if (!retry_timeout_id)
{
retry_timeout_id = tfd->set_timer(up_wait_retry_interval, false, [this](int)
{
retry_timeout_id = 0;
continue_ops(true);
});
}
}
if (!op->retval || op->retval == -EPIPE)
{
// Don't overwrite other errors with -EPIPE
op->retval = part->op.reply.hdr.retval;
}
}
else
{
// OK
part->done = true;
op->done_count++;
}
if (op->sent_count == 0)
{
if (op->opcode == OSD_OP_SYNC)
{
assert(op == cur_sync);
finish_sync();
}
else
{
continue_rw(op);
}
}
}

102
cluster_client.h Normal file
View File

@ -0,0 +1,102 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include "messenger.h"
#include "etcd_state_client.h"
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_BLOCK_SIZE 128*1024
#define DEFAULT_DISK_ALIGNMENT 4096
#define DEFAULT_BITMAP_GRANULARITY 4096
#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024
struct cluster_op_t;
struct cluster_op_part_t
{
cluster_op_t *parent;
uint64_t offset;
uint32_t len;
pg_num_t pg_num;
osd_num_t osd_num;
osd_op_buf_list_t iov;
bool sent;
bool done;
osd_op_t op;
};
struct cluster_op_t
{
uint64_t opcode; // OSD_OP_READ, OSD_OP_WRITE, OSD_OP_SYNC
uint64_t inode;
uint64_t offset;
uint64_t len;
int retval;
osd_op_buf_list_t iov;
std::function<void(cluster_op_t*)> callback;
protected:
void *buf = NULL;
cluster_op_t *orig_op = NULL;
bool is_internal = false;
bool needs_reslice = false;
bool up_wait = false;
int sent_count = 0, done_count = 0;
std::vector<cluster_op_part_t> parts;
friend class cluster_client_t;
};
class cluster_client_t
{
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
uint64_t bs_block_size = 0;
uint64_t bs_disk_alignment = 0;
uint64_t bs_bitmap_granularity = 0;
std::map<pool_id_t, uint64_t> pg_counts;
bool immediate_commit = false;
// FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
uint64_t client_dirty_limit = 0;
int log_level;
int up_wait_retry_interval = 500; // ms
uint64_t op_id = 1;
etcd_state_client_t st_cli;
osd_messenger_t msgr;
ring_consumer_t consumer;
// operations currently in progress
std::set<cluster_op_t*> cur_ops;
int retry_timeout_id = 0;
// unsynced operations are copied in memory to allow replay when cluster isn't in the immediate_commit mode
// unsynced_writes are replayed in any order (because only the SYNC operation guarantees ordering)
std::vector<cluster_op_t*> unsynced_writes;
std::vector<cluster_op_t*> syncing_writes;
cluster_op_t* cur_sync = NULL;
std::vector<cluster_op_t*> next_writes;
std::vector<cluster_op_t*> offline_ops;
uint64_t queued_bytes = 0;
public:
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
~cluster_client_t();
void execute(cluster_op_t *op);
void stop();
protected:
void continue_ops(bool up_retry = false);
void on_load_config_hook(json11::Json::object & config);
void on_load_pgs_hook(bool success);
void on_change_hook(json11::Json::object & changes);
void on_change_osd_state_hook(uint64_t peer_osd);
void continue_rw(cluster_op_t *op);
void slice_rw(cluster_op_t *op);
bool try_send(cluster_op_t *op, cluster_op_part_t *part);
void execute_sync(cluster_op_t *op);
void continue_sync();
void finish_sync();
void send_sync(cluster_op_t *op, cluster_op_part_t *part);
void handle_op_part(cluster_op_part_t *part);
};

1
cpp-btree Submodule

@ -0,0 +1 @@
Subproject commit 5dc108754ad40d3b1d024f9bd7cca0595ef1a1db

173
dump_journal.cpp Normal file
View File

@ -0,0 +1,173 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#define _LARGEFILE64_SOURCE
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <malloc.h>
#include <linux/fs.h>
#include <string.h>
#include <errno.h>
#include <assert.h>
#include <stdio.h>
#include "blockstore_impl.h"
#include "crc32c.h"
struct journal_dump_t
{
char *journal_device;
uint32_t journal_block;
uint64_t journal_offset;
uint64_t journal_len;
uint64_t journal_pos;
int fd;
void dump_block(void *buf);
};
int main(int argc, char *argv[])
{
if (argc < 5)
{
printf("USAGE: %s <journal_file> <journal_block_size> <offset> <size>\n", argv[0]);
return 1;
}
journal_dump_t self;
self.journal_device = argv[1];
self.journal_block = strtoul(argv[2], NULL, 10);
self.journal_offset = strtoull(argv[3], NULL, 10);
self.journal_len = strtoull(argv[4], NULL, 10);
if (self.journal_block < MEM_ALIGNMENT || (self.journal_block % MEM_ALIGNMENT) ||
self.journal_block > 128*1024)
{
printf("Invalid journal block size\n");
return 1;
}
self.fd = open(self.journal_device, O_DIRECT|O_RDONLY);
if (self.fd == -1)
{
printf("Failed to open journal\n");
return 1;
}
void *data = memalign(MEM_ALIGNMENT, self.journal_block);
self.journal_pos = 0;
while (self.journal_pos < self.journal_len)
{
int r = pread(self.fd, data, self.journal_block, self.journal_offset+self.journal_pos);
assert(r == self.journal_block);
uint64_t s;
for (s = 0; s < self.journal_block; s += 8)
{
if (*((uint64_t*)(data+s)) != 0)
break;
}
if (s == self.journal_block)
{
printf("offset %08lx: zeroes\n", self.journal_pos);
self.journal_pos += self.journal_block;
}
else if (((journal_entry*)data)->magic == JOURNAL_MAGIC)
{
printf("offset %08lx:\n", self.journal_pos);
self.dump_block(data);
}
else
{
printf("offset %08lx: no magic in the beginning, looks like random data (pattern=%lx)\n", self.journal_pos, *((uint64_t*)data));
self.journal_pos += self.journal_block;
}
}
free(data);
close(self.fd);
return 0;
}
void journal_dump_t::dump_block(void *buf)
{
uint32_t pos = 0;
journal_pos += journal_block;
int entry = 0;
bool wrapped = false;
while (pos < journal_block)
{
journal_entry *je = (journal_entry*)(buf + pos);
if (je->magic != JOURNAL_MAGIC || je->type < JE_MIN || je->type > JE_MAX)
{
break;
}
const char *crc32_valid = je_crc32(je) == je->crc32 ? "(valid)" : "(invalid)";
printf("entry % 3d: crc32=%08x %s prev=%08x ", entry, je->crc32, crc32_valid, je->crc32_prev);
if (je->type == JE_START)
{
printf("je_start start=%08lx\n", je->start.journal_start);
}
else if (je->type == JE_SMALL_WRITE || je->type == JE_SMALL_WRITE_INSTANT)
{
printf(
"je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u loc=%08lx",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe,
je->small_write.version, je->small_write.offset, je->small_write.len,
je->small_write.data_offset
);
if (journal_pos + je->small_write.len > journal_len)
{
// data continues from the beginning of the journal
journal_pos = journal_block;
wrapped = true;
}
if (journal_pos != je->small_write.data_offset)
{
printf(" (mismatched, calculated = %lu)", journal_pos);
}
journal_pos += je->small_write.len;
if (journal_pos >= journal_len)
{
journal_pos = journal_block;
wrapped = true;
}
uint32_t data_crc32 = 0;
void *data = memalign(MEM_ALIGNMENT, je->small_write.len);
assert(pread(fd, data, je->small_write.len, journal_offset+je->small_write.data_offset) == je->small_write.len);
data_crc32 = crc32c(0, data, je->small_write.len);
free(data);
printf(
" data_crc32=%08x%s", je->small_write.crc32_data,
(data_crc32 != je->small_write.crc32_data) ? " (invalid)" : " (valid)"
);
printf("\n");
}
else if (je->type == JE_BIG_WRITE || je->type == JE_BIG_WRITE_INSTANT)
{
printf(
"je_big_write%s oid=%lx:%lx ver=%lu loc=%08lx\n",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location
);
}
else if (je->type == JE_STABLE)
{
printf("je_stable oid=%lx:%lx ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
}
else if (je->type == JE_ROLLBACK)
{
printf("je_rollback oid=%lx:%lx ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
}
else if (je->type == JE_DELETE)
{
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
}
pos += je->size;
entry++;
}
if (wrapped)
{
journal_pos = journal_len;
}
}

90
epoll_manager.cpp Normal file
View File

@ -0,0 +1,90 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include <sys/epoll.h>
#include <sys/poll.h>
#include <unistd.h>
#include "epoll_manager.h"
#define MAX_EPOLL_EVENTS 64
epoll_manager_t::epoll_manager_t(ring_loop_t *ringloop)
{
this->ringloop = ringloop;
epoll_fd = epoll_create(1);
if (epoll_fd < 0)
{
throw std::runtime_error(std::string("epoll_create: ") + strerror(errno));
}
tfd = new timerfd_manager_t([this](int fd, bool wr, std::function<void(int, int)> handler) { set_fd_handler(fd, wr, handler); });
handle_epoll_events();
}
epoll_manager_t::~epoll_manager_t()
{
if (tfd)
{
delete tfd;
tfd = NULL;
}
close(epoll_fd);
}
void epoll_manager_t::set_fd_handler(int fd, bool wr, std::function<void(int, int)> handler)
{
if (handler != NULL)
{
bool exists = epoll_handlers.find(fd) != epoll_handlers.end();
epoll_event ev;
ev.data.fd = fd;
ev.events = (wr ? EPOLLOUT : 0) | EPOLLIN | EPOLLRDHUP | EPOLLET;
if (epoll_ctl(epoll_fd, exists ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev) < 0)
{
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
}
epoll_handlers[fd] = handler;
}
else
{
if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL) < 0 && errno != ENOENT)
{
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
}
epoll_handlers.erase(fd);
}
}
void epoll_manager_t::handle_epoll_events()
{
io_uring_sqe *sqe = ringloop->get_sqe();
if (!sqe)
{
throw std::runtime_error("can't get SQE, will fall out of sync with EPOLLET");
}
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_poll_add(sqe, epoll_fd, POLLIN);
data->callback = [this](ring_data_t *data)
{
if (data->res < 0)
{
throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
}
handle_epoll_events();
};
ringloop->submit();
int nfds;
epoll_event events[MAX_EPOLL_EVENTS];
do
{
nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, 0);
for (int i = 0; i < nfds; i++)
{
auto & cb = epoll_handlers[events[i].data.fd];
cb(events[i].data.fd, events[i].events);
}
} while (nfds == MAX_EPOLL_EVENTS);
}

23
epoll_manager.h Normal file
View File

@ -0,0 +1,23 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <map>
#include "ringloop.h"
#include "timerfd_manager.h"
class epoll_manager_t
{
int epoll_fd;
ring_loop_t *ringloop;
std::map<int, std::function<void(int, int)>> epoll_handlers;
public:
epoll_manager_t(ring_loop_t *ringloop);
~epoll_manager_t();
void set_fd_handler(int fd, bool wr, std::function<void(int, int)> handler);
void handle_epoll_events();
timerfd_manager_t *tfd;
};

556
etcd_state_client.cpp Normal file
View File

@ -0,0 +1,556 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h"
#include "pg_states.h"
#include "etcd_state_client.h"
#include "http_client.h"
#include "base64.h"
json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
{
json_kv_t kv;
kv.key = base64_decode(kv_json["key"].string_value());
std::string json_err, json_text = base64_decode(kv_json["value"].string_value());
kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
if (json_err != "")
{
printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
kv.key = "";
}
return kv;
}
void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback)
{
etcd_call("/kv/txn", txn, timeout, callback);
}
void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback)
{
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
if (pos >= 0)
{
etcd_api_path = etcd_address.substr(pos);
etcd_address = etcd_address.substr(0, pos);
}
std::string req = payload.dump();
req = "POST "+etcd_api_path+api+" HTTP/1.1\r\n"
"Host: "+etcd_address+"\r\n"
"Content-Type: application/json\r\n"
"Content-Length: "+std::to_string(req.size())+"\r\n"
"Connection: close\r\n"
"\r\n"+req;
http_request_json(tfd, etcd_address, req, timeout, callback);
}
void etcd_state_client_t::parse_config(json11::Json & config)
{
this->etcd_addresses.clear();
if (config["etcd_address"].is_string())
{
std::string ea = config["etcd_address"].string_value();
while (1)
{
int pos = ea.find(',');
std::string addr = pos >= 0 ? ea.substr(0, pos) : ea;
if (addr.length() > 0)
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
if (pos >= 0)
ea = ea.substr(pos+1);
else
break;
}
}
else if (config["etcd_address"].array_items().size())
{
for (auto & ea: config["etcd_address"].array_items())
{
std::string addr = ea.string_value();
if (addr != "")
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
}
}
this->etcd_prefix = config["etcd_prefix"].string_value();
if (this->etcd_prefix == "")
{
this->etcd_prefix = "/vitastor";
}
else if (this->etcd_prefix[0] != '/')
{
this->etcd_prefix = "/"+this->etcd_prefix;
}
this->log_level = config["log_level"].int64_value();
}
void etcd_state_client_t::start_etcd_watcher()
{
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
if (pos >= 0)
{
etcd_api_path = etcd_address.substr(pos);
etcd_address = etcd_address.substr(0, pos);
}
etcd_watches_initialised = 0;
etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", ETCD_SLOW_TIMEOUT, [this](const http_response_t *msg)
{
if (msg->body.length())
{
std::string json_err;
json11::Json data = json11::Json::parse(msg->body, json_err);
if (json_err != "")
{
printf("Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
}
else
{
if (data["result"]["created"].bool_value())
{
etcd_watches_initialised++;
}
if (etcd_watches_initialised == 4)
{
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
}
// First gather all changes into a hash to remove multiple overwrites
json11::Json::object changes;
for (auto & ev: data["result"]["events"].array_items())
{
auto kv = parse_etcd_kv(ev["kv"]);
if (kv.key != "")
{
changes[kv.key] = kv.value;
}
}
for (auto & kv: changes)
{
if (this->log_level > 0)
{
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
}
parse_state(kv.first, kv.second);
}
// React to changes
if (on_change_hook != NULL)
{
on_change_hook(changes);
}
}
}
if (msg->eof)
{
etcd_watch_ws = NULL;
if (etcd_watches_initialised == 0)
{
// Connection not established, retry in <ETCD_SLOW_TIMEOUT>
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int)
{
start_etcd_watcher();
});
}
else
{
// Connection was live, retry immediately
start_etcd_watcher();
}
}
});
etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
{ "create_request", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/config/") },
{ "range_end", base64_encode(etcd_prefix+"/config0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_CONFIG_WATCH_ID },
} }
}).dump());
etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
{ "create_request", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/osd/state/") },
{ "range_end", base64_encode(etcd_prefix+"/osd/state0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_OSD_STATE_WATCH_ID },
} }
}).dump());
etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
{ "create_request", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/pg/state/") },
{ "range_end", base64_encode(etcd_prefix+"/pg/state0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_PG_STATE_WATCH_ID },
} }
}).dump());
etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
{ "create_request", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/pg/history/") },
{ "range_end", base64_encode(etcd_prefix+"/pg/history0") },
{ "start_revision", etcd_watch_revision+1 },
{ "watch_id", ETCD_PG_HISTORY_WATCH_ID },
} }
}).dump());
}
void etcd_state_client_t::load_global_config()
{
etcd_call("/kv/range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/config/global") }
}, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
{
if (err != "")
{
printf("Error reading OSD configuration from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_global_config();
});
return;
}
if (!etcd_watch_revision)
{
etcd_watch_revision = data["header"]["revision"].uint64_value();
}
json11::Json::object global_config;
if (data["kvs"].array_items().size() > 0)
{
auto kv = parse_etcd_kv(data["kvs"][0]);
if (kv.value.is_object())
{
global_config = kv.value.object_items();
}
}
on_load_config_hook(global_config);
});
}
void etcd_state_client_t::load_pgs()
{
json11::Json::array txn = {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/config/pools") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/config/pgs") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/pg/history/") },
{ "range_end", base64_encode(etcd_prefix+"/pg/history0") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/pg/state/") },
{ "range_end", base64_encode(etcd_prefix+"/pg/state0") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/osd/state/") },
{ "range_end", base64_encode(etcd_prefix+"/osd/state0") },
} }
},
};
json11::Json::object req = { { "success", txn } };
json11::Json checks = load_pgs_checks_hook != NULL ? load_pgs_checks_hook() : json11::Json();
if (checks.array_items().size() > 0)
{
req["compare"] = checks;
}
etcd_txn(req, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
{
if (err != "")
{
printf("Error loading PGs from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_pgs();
});
return;
}
if (!data["succeeded"].bool_value())
{
on_load_pgs_hook(false);
return;
}
for (auto & res: data["responses"].array_items())
{
for (auto & kv_json: res["response_range"]["kvs"].array_items())
{
auto kv = parse_etcd_kv(kv_json);
parse_state(kv.key, kv.value);
}
}
on_load_pgs_hook(true);
});
}
void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
{
if (key == etcd_prefix+"/config/pools")
{
for (auto & pool_item: this->pool_config)
{
pool_item.second.exists = false;
}
for (auto & pool_item: value.object_items())
{
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
{
printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
if (pool_item.second["pg_size"].uint64_value() < 1 ||
pool_item.second["scheme"] == "xor" && pool_item.second["pg_size"].uint64_value() < 3)
{
printf("Pool %u has invalid pg_size, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["pg_minsize"].uint64_value() < 1 ||
pool_item.second["pg_minsize"].uint64_value() > pool_item.second["pg_size"].uint64_value() ||
pool_item.second["pg_minsize"].uint64_value() < (pool_item.second["pg_size"].uint64_value() - 1))
{
printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["pg_count"].uint64_value() < 1)
{
printf("Pool %u has invalid pg_count, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["name"].string_value() == "")
{
printf("Pool %u has empty name, skipping pool\n", pool_id);
continue;
}
if (pool_item.second["scheme"] != "replicated" && pool_item.second["scheme"] != "xor")
{
printf("Pool %u has invalid coding scheme (only \"xor\" and \"replicated\" are allowed), skipping pool\n", pool_id);
continue;
}
if (pool_item.second["max_osd_combinations"].uint64_value() > 0 &&
pool_item.second["max_osd_combinations"].uint64_value() < 100)
{
printf("Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
continue;
}
auto & parsed_cfg = this->pool_config[pool_id];
parsed_cfg.exists = true;
parsed_cfg.id = pool_id;
parsed_cfg.name = pool_item.second["name"].string_value();
parsed_cfg.scheme = pool_item.second["scheme"] == "replicated" ? POOL_SCHEME_REPLICATED : POOL_SCHEME_XOR;
parsed_cfg.pg_size = pool_item.second["pg_size"].uint64_value();
parsed_cfg.pg_minsize = pool_item.second["pg_minsize"].uint64_value();
parsed_cfg.pg_count = pool_item.second["pg_count"].uint64_value();
parsed_cfg.failure_domain = pool_item.second["failure_domain"].string_value();
parsed_cfg.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
if (!parsed_cfg.pg_stripe_size)
{
parsed_cfg.pg_stripe_size = DEFAULT_PG_STRIPE_SIZE;
}
parsed_cfg.max_osd_combinations = pool_item.second["max_osd_combinations"].uint64_value();
if (!parsed_cfg.max_osd_combinations)
{
parsed_cfg.max_osd_combinations = 10000;
}
for (auto & pg_item: parsed_cfg.pg_config)
{
if (pg_item.second.target_set.size() != parsed_cfg.pg_size)
{
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_item.first, pg_item.second.target_set.size(), parsed_cfg.pg_size);
pg_item.second.pause = true;
}
}
}
}
else if (key == etcd_prefix+"/config/pgs")
{
for (auto & pool_item: this->pool_config)
{
for (auto & pg_item: pool_item.second.pg_config)
{
pg_item.second.exists = false;
}
}
for (auto & pool_item: value["items"].object_items())
{
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
{
printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
for (auto & pg_item: pool_item.second.object_items())
{
pg_num_t pg_num = stoull_full(pg_item.first);
if (!pg_num)
{
printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
continue;
}
auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
parsed_cfg.exists = true;
parsed_cfg.pause = pg_item.second["pause"].bool_value();
parsed_cfg.primary = pg_item.second["primary"].uint64_value();
parsed_cfg.target_set.clear();
for (auto & pg_osd: pg_item.second["osd_set"].array_items())
{
parsed_cfg.target_set.push_back(pg_osd.uint64_value());
}
if (parsed_cfg.target_set.size() != pool_config[pool_id].pg_size)
{
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_num, parsed_cfg.target_set.size(), pool_config[pool_id].pg_size);
parsed_cfg.pause = true;
}
}
}
for (auto & pool_item: this->pool_config)
{
int n = 0;
for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
{
if (pg_it->second.exists && pg_it->first != ++n)
{
printf(
"Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
pool_item.second.id, pool_item.second.pg_config.size()
);
for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
{
pg_it->second.exists = false;
}
n = 0;
break;
}
}
pool_item.second.real_pg_count = n;
}
}
else if (key.substr(0, etcd_prefix.length()+12) == etcd_prefix+"/pg/history/")
{
// <etcd_prefix>/pg/history/%d/%d
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
printf("Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
pg_cfg.target_history.clear();
pg_cfg.all_peers.clear();
// Refuse to start PG if any set of the <osd_sets> has no live OSDs
for (auto hist_item: value["osd_sets"].array_items())
{
std::vector<osd_num_t> history_set;
for (auto pg_osd: hist_item.array_items())
{
history_set.push_back(pg_osd.uint64_value());
}
pg_cfg.target_history.push_back(history_set);
}
// Include these additional OSDs when peering the PG
for (auto pg_osd: value["all_peers"].array_items())
{
pg_cfg.all_peers.push_back(pg_osd.uint64_value());
}
// Read epoch
pg_cfg.epoch = value["epoch"].uint64_value();
if (on_change_pg_history_hook != NULL)
{
on_change_pg_history_hook(pool_id, pg_num);
}
}
}
else if (key.substr(0, etcd_prefix.length()+10) == etcd_prefix+"/pg/state/")
{
// <etcd_prefix>/pg/state/%d/%d
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
printf("Bad etcd key %s, ignoring\n", key.c_str());
}
else if (value.is_null())
{
this->pool_config[pool_id].pg_config[pg_num].cur_primary = 0;
this->pool_config[pool_id].pg_config[pg_num].cur_state = 0;
}
else
{
osd_num_t cur_primary = value["primary"].uint64_value();
int state = 0;
for (auto & e: value["state"].array_items())
{
int i;
for (i = 0; i < pg_state_bit_count; i++)
{
if (e.string_value() == pg_state_names[i])
{
state = state | pg_state_bits[i];
break;
}
}
if (i >= pg_state_bit_count)
{
printf("Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
return;
}
}
if (!cur_primary || !value["state"].is_array() || !state ||
(state & PG_OFFLINE) && state != PG_OFFLINE ||
(state & PG_PEERING) && state != PG_PEERING ||
(state & PG_INCOMPLETE) && state != PG_INCOMPLETE)
{
printf("Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
return;
}
this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
this->pool_config[pool_id].pg_config[pg_num].cur_state = state;
}
}
else if (key.substr(0, etcd_prefix.length()+11) == etcd_prefix+"/osd/state/")
{
// <etcd_prefix>/osd/state/%d
osd_num_t peer_osd = std::stoull(key.substr(etcd_prefix.length()+11));
if (peer_osd > 0)
{
if (value.is_object() && value["state"] == "up" &&
value["addresses"].is_array() &&
value["port"].int64_value() > 0 && value["port"].int64_value() < 65536)
{
this->peer_states[peer_osd] = value;
}
else
{
this->peer_states.erase(peer_osd);
}
if (on_change_osd_state_hook != NULL)
{
on_change_osd_state_hook(peer_osd);
}
}
}
}

83
etcd_state_client.h Normal file
View File

@ -0,0 +1,83 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include "osd_id.h"
#include "http_client.h"
#include "timerfd_manager.h"
#define ETCD_CONFIG_WATCH_ID 1
#define ETCD_PG_STATE_WATCH_ID 2
#define ETCD_PG_HISTORY_WATCH_ID 3
#define ETCD_OSD_STATE_WATCH_ID 4
#define MAX_ETCD_ATTEMPTS 5
#define ETCD_SLOW_TIMEOUT 5000
#define ETCD_QUICK_TIMEOUT 1000
#define DEFAULT_PG_STRIPE_SIZE 4*1024*1024
struct json_kv_t
{
std::string key;
json11::Json value;
};
struct pg_config_t
{
bool exists;
osd_num_t primary;
std::vector<osd_num_t> target_set;
std::vector<std::vector<osd_num_t>> target_history;
std::vector<osd_num_t> all_peers;
bool pause;
osd_num_t cur_primary;
int cur_state;
uint64_t epoch;
};
struct pool_config_t
{
bool exists;
pool_id_t id;
std::string name;
uint64_t scheme;
uint64_t pg_size, pg_minsize;
uint64_t pg_count;
uint64_t real_pg_count;
std::string failure_domain;
uint64_t max_osd_combinations;
uint64_t pg_stripe_size;
std::map<pg_num_t, pg_config_t> pg_config;
};
struct etcd_state_client_t
{
std::vector<std::string> etcd_addresses;
std::string etcd_prefix;
int log_level = 0;
timerfd_manager_t *tfd = NULL;
int etcd_watches_initialised = 0;
uint64_t etcd_watch_revision = 0;
websocket_t *etcd_watch_ws = NULL;
std::map<pool_id_t, pool_config_t> pool_config;
std::map<osd_num_t, json11::Json> peer_states;
std::function<void(json11::Json::object &)> on_change_hook;
std::function<void(json11::Json::object &)> on_load_config_hook;
std::function<json11::Json()> load_pgs_checks_hook;
std::function<void(bool)> on_load_pgs_hook;
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;
json_kv_t parse_etcd_kv(const json11::Json & kv_json);
void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
void start_etcd_watcher();
void load_global_config();
void load_pgs();
void parse_state(const std::string & key, const json11::Json & value);
void parse_config(json11::Json & config);
};

338
fio_cluster.cpp Normal file
View File

@ -0,0 +1,338 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test cluster I/O
//
// Random write:
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
//
// Linear write:
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=128k -direct=1 -fsync=32 -iodepth=32 -rw=write \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
//
// Random read (run with -iodepth=32 or -iodepth=1):
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=32 -rw=randread \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <vector>
#include <unordered_map>
#include "epoll_manager.h"
#include "cluster_client.h"
extern "C" {
#define CONFIG_HAVE_GETTID
#define CONFIG_PWRITEV2
#include "fio/fio.h"
#include "fio/optgroup.h"
}
struct sec_data
{
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
bool last_sync = false;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
uint64_t op_n = 0, inflight = 0;
bool trace = false;
};
struct sec_options
{
int __pad;
char *etcd_host = NULL;
char *etcd_prefix = NULL;
uint64_t pool = 0;
uint64_t inode = 0;
int cluster_log = 0;
int trace = 0;
};
static struct fio_option options[] = {
{
.name = "etcd",
.lname = "etcd address",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, etcd_host),
.help = "etcd address in the form HOST:PORT[/PATH]",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "etcd",
.lname = "etcd key prefix",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, etcd_prefix),
.help = "etcd key prefix, by default /vitastor",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "pool",
.lname = "pool number for the inode",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, pool),
.help = "pool number for the inode to run tests on",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "inode",
.lname = "inode to run tests on",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, inode),
.help = "inode to run tests on (1 by default)",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "cluster_log_level",
.lname = "cluster log level",
.type = FIO_OPT_BOOL,
.off1 = offsetof(struct sec_options, cluster_log),
.help = "Set log level for the Vitastor client",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "osd_trace",
.lname = "OSD trace",
.type = FIO_OPT_BOOL,
.off1 = offsetof(struct sec_options, trace),
.help = "Trace OSD operations",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = NULL,
},
};
static int sec_setup(struct thread_data *td)
{
sec_data *bsd;
bsd = new sec_data;
if (!bsd)
{
td_verror(td, errno, "calloc");
return 1;
}
td->io_ops_data = bsd;
if (!td->files_index)
{
add_file(td, "osd_cluster", 0, 0);
td->o.nr_files = td->o.nr_files ? : 1;
td->o.open_files++;
}
return 0;
}
static void sec_cleanup(struct thread_data *td)
{
sec_data *bsd = (sec_data*)td->io_ops_data;
if (bsd)
{
delete bsd->cli;
delete bsd->epmgr;
delete bsd->ringloop;
bsd->cli = NULL;
bsd->epmgr = NULL;
bsd->ringloop = NULL;
}
}
/* Connect to the server from each thread. */
static int sec_init(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(o->etcd_host) },
{ "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
{ "log_level", o->cluster_log },
};
if (o->pool)
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
if (!(o->inode >> (64-POOL_ID_BITS)))
{
td_verror(td, EINVAL, "pool is missing");
return 1;
}
bsd->ringloop = new ring_loop_t(512);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
bsd->trace = o->trace ? true : false;
return 0;
}
/* Begin read or write request. */
static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
{
sec_options *opt = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
int n = bsd->op_n;
fio_ro_check(td, io);
if (io->ddir == DDIR_SYNC && bsd->last_sync)
{
return FIO_Q_COMPLETED;
}
io->engine_data = bsd;
cluster_op_t *op = new cluster_op_t;
switch (io->ddir)
{
case DDIR_READ:
op->opcode = OSD_OP_READ;
op->inode = opt->inode;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
bsd->last_sync = false;
break;
case DDIR_WRITE:
op->opcode = OSD_OP_WRITE;
op->inode = opt->inode;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
bsd->last_sync = false;
break;
case DDIR_SYNC:
op->opcode = OSD_OP_SYNC;
bsd->last_sync = true;
break;
default:
io->error = EINVAL;
return FIO_Q_COMPLETED;
}
op->callback = [io, n](cluster_op_t *op)
{
io->error = op->retval < 0 ? -op->retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s n=%d retval=%d\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), n, op->retval);
}
delete op;
};
if (opt->trace)
{
if (io->ddir == DDIR_SYNC)
{
printf("+++ SYNC # %d\n", n);
}
else
{
printf("+++ %s # %d 0x%llx+%llx\n",
io->ddir == DDIR_READ ? "READ" : "WRITE",
n, io->offset, io->xfer_buflen);
}
}
io->error = 0;
bsd->inflight++;
bsd->op_n++;
bsd->cli->execute(op);
if (io->error != 0)
return FIO_Q_COMPLETED;
return FIO_Q_QUEUED;
}
static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int max, const struct timespec *t)
{
sec_data *bsd = (sec_data*)td->io_ops_data;
while (true)
{
bsd->ringloop->loop();
if (bsd->completed.size() >= min)
break;
bsd->ringloop->wait();
}
return bsd->completed.size();
}
static struct io_u *sec_event(struct thread_data *td, int event)
{
sec_data *bsd = (sec_data*)td->io_ops_data;
if (bsd->completed.size() == 0)
return NULL;
/* FIXME We ignore the event number and assume fio calls us exactly once for [0..nr_events-1] */
struct io_u *ev = bsd->completed.back();
bsd->completed.pop_back();
return ev;
}
static int sec_io_u_init(struct thread_data *td, struct io_u *io)
{
io->engine_data = NULL;
return 0;
}
static void sec_io_u_free(struct thread_data *td, struct io_u *io)
{
}
static int sec_open_file(struct thread_data *td, struct fio_file *f)
{
return 0;
}
static int sec_invalidate(struct thread_data *td, struct fio_file *f)
{
return 0;
}
struct ioengine_ops ioengine = {
.name = "vitastor_cluster",
.version = FIO_IOOPS_VERSION,
.flags = FIO_MEMALIGN | FIO_DISKLESSIO | FIO_NOEXTEND,
.setup = sec_setup,
.init = sec_init,
.queue = sec_queue,
.getevents = sec_getevents,
.event = sec_event,
.cleanup = sec_cleanup,
.open_file = sec_open_file,
.invalidate = sec_invalidate,
.io_u_init = sec_io_u_init,
.io_u_free = sec_io_u_free,
.option_struct_size = sizeof(struct sec_options),
.options = options,
};
static void fio_init fio_sec_register(void)
{
register_ioengine(&ioengine);
}
static void fio_exit fio_sec_unregister(void)
{
unregister_ioengine(&ioengine);
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// FIO engine to test Blockstore
//
// Initialize storage for tests:
@ -23,6 +26,7 @@
#include "blockstore.h"
extern "C" {
#define CONFIG_HAVE_GETTID
#define CONFIG_PWRITEV2
#include "fio/fio.h"
#include "fio/optgroup.h"
@ -100,7 +104,7 @@ static void bs_cleanup(struct thread_data *td)
bsd->ringloop->loop();
if (bsd->bs->is_safe_to_stop())
goto safe;
} while (bsd->ringloop->get_loop_again());
} while (bsd->ringloop->has_work());
bsd->ringloop->wait();
}
safe:
@ -289,7 +293,7 @@ static int bs_invalidate(struct thread_data *td, struct fio_file *f)
}
struct ioengine_ops ioengine = {
.name = "microceph_blockstore",
.name = "vitastor_blockstore",
.version = FIO_IOOPS_VERSION,
.flags = FIO_MEMALIGN | FIO_DISKLESSIO | FIO_NOEXTEND,
.setup = bs_setup,

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// FIO engine to test Blockstore through Secondary OSD interface
//
// Prepare storage like in fio_engine.cpp, then start OSD with ./osd, then test it
@ -5,7 +8,7 @@
// Random write:
//
// fio -thread -ioengine=./libfio_sec_osd.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \
// -host=127.0.0.1 -port=11203 [-single_primary=1] -size=1000M
// -host=127.0.0.1 -port=11203 [-block_size_order=17] [-single_primary=1] -size=1000M
//
// Linear write:
//
@ -28,6 +31,7 @@
#include "rw_blocking.h"
#include "osd_ops.h"
extern "C" {
#define CONFIG_HAVE_GETTID
#define CONFIG_PWRITEV2
#include "fio/fio.h"
#include "fio/optgroup.h"
@ -52,6 +56,7 @@ struct sec_options
int port = 0;
int single_primary = 0;
int trace = 0;
int block_order = 17;
};
static struct fio_option options[] = {
@ -73,6 +78,15 @@ static struct fio_option options[] = {
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "block_size_order",
.lname = "Blockstore block size order",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, block_order),
.help = "Blockstore block size order (size = 2^order)",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "single_primary",
.lname = "Single Primary",
@ -139,6 +153,8 @@ static int sec_init(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
bsd->block_order = o->block_order == 0 ? 17 : o->block_order;
bsd->block_size = 1 << o->block_order;
struct sockaddr_in addr;
int r;
@ -192,7 +208,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
case DDIR_READ:
if (!opt->single_primary)
{
op.hdr.opcode = OSD_OP_SECONDARY_READ;
op.hdr.opcode = OSD_OP_SEC_READ;
op.sec_rw.oid = {
.inode = 1,
.stripe = io->offset >> bsd->block_order,
@ -213,7 +229,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
case DDIR_WRITE:
if (!opt->single_primary)
{
op.hdr.opcode = OSD_OP_SECONDARY_WRITE;
op.hdr.opcode = OSD_OP_SEC_WRITE;
op.sec_rw.oid = {
.inode = 1,
.stripe = io->offset >> bsd->block_order,
@ -368,7 +384,7 @@ static int sec_invalidate(struct thread_data *td, struct fio_file *f)
}
struct ioengine_ops ioengine = {
.name = "microceph_secondary_osd",
.name = "vitastor_secondary_osd",
.version = FIO_IOOPS_VERSION,
.flags = FIO_MEMALIGN | FIO_DISKLESSIO | FIO_NOEXTEND,
.setup = sec_setup,

691
http_client.cpp Normal file
View File

@ -0,0 +1,691 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include "json11/json11.hpp"
#include "http_client.h"
#include "timerfd_manager.h"
#define READ_BUFFER_SIZE 9000
static int extract_port(std::string & host);
static std::string strtolower(const std::string & in);
static std::string trim(const std::string & in);
static std::string ws_format_frame(int type, uint64_t size);
static bool ws_parse_frame(std::string & buf, int & type, std::string & res);
// FIXME: Use keepalive
struct http_co_t
{
timerfd_manager_t *tfd;
int request_timeout = 0;
std::string host;
std::string request;
std::string ws_outbox;
std::string response;
bool want_streaming;
http_response_t parsed;
uint64_t target_response_size = 0;
int state = 0;
int peer_fd = -1;
int timeout_id = -1;
int epoll_events = 0;
int sent = 0;
std::vector<char> rbuf;
iovec read_iov, send_iov;
msghdr read_msg = { 0 }, send_msg = { 0 };
std::function<void(const http_response_t*)> callback;
websocket_t ws;
int onstack = 0;
bool ended = false;
~http_co_t();
inline void stackin() { onstack++; }
inline void stackout() { onstack--; if (!onstack && ended) end(); }
inline void end() { ended = true; if (!onstack) { delete this; } }
void start_connection();
void handle_events();
void handle_connect_result();
void submit_read();
void submit_send();
bool handle_read();
void post_message(int type, const std::string & msg);
};
#define HTTP_CO_CONNECTING 1
#define HTTP_CO_SENDING_REQUEST 2
#define HTTP_CO_REQUEST_SENT 3
#define HTTP_CO_HEADERS_RECEIVED 4
#define HTTP_CO_WEBSOCKET 5
#define HTTP_CO_CHUNKED 6
#define DEFAULT_TIMEOUT 5000
void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
const http_options_t & options, std::function<void(const http_response_t *response)> callback)
{
http_co_t *handler = new http_co_t();
handler->request_timeout = options.timeout < 0 ? 0 : (options.timeout == 0 ? DEFAULT_TIMEOUT : options.timeout);
handler->want_streaming = options.want_streaming;
handler->tfd = tfd;
handler->host = host;
handler->request = request;
handler->callback = callback;
handler->ws.co = handler;
handler->start_connection();
}
void http_request_json(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
int timeout, std::function<void(std::string, json11::Json r)> callback)
{
http_request(tfd, host, request, { .timeout = timeout }, [callback](const http_response_t* res)
{
if (res->error_code != 0)
{
callback("Error code: "+std::to_string(res->error_code)+" ("+std::string(strerror(res->error_code))+")", json11::Json());
return;
}
if (res->status_code != 200)
{
callback("HTTP "+std::to_string(res->status_code)+" "+res->status_line+" body: "+trim(res->body), json11::Json());
return;
}
std::string json_err;
json11::Json data = json11::Json::parse(res->body, json_err);
if (json_err != "")
{
callback("Bad JSON: "+json_err+" (response: "+trim(res->body)+")", json11::Json());
return;
}
callback(std::string(), data);
});
}
websocket_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, const std::string & path,
int timeout, std::function<void(const http_response_t *msg)> callback)
{
std::string request = "GET "+path+" HTTP/1.1\r\n"
"Host: "+host+"\r\n"
"Upgrade: websocket\r\n"
"Connection: upgrade\r\n"
"Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==\r\n"
"Sec-WebSocket-Version: 13\r\n"
"\r\n";
http_co_t *handler = new http_co_t();
handler->request_timeout = timeout < 0 ? -1 : (timeout == 0 ? DEFAULT_TIMEOUT : timeout);
handler->want_streaming = false;
handler->tfd = tfd;
handler->host = host;
handler->request = request;
handler->callback = callback;
handler->ws.co = handler;
handler->start_connection();
return &handler->ws;
}
void websocket_t::post_message(int type, const std::string & msg)
{
co->post_message(type, msg);
}
void websocket_t::close()
{
co->end();
}
http_co_t::~http_co_t()
{
if (timeout_id >= 0)
{
tfd->clear_timer(timeout_id);
timeout_id = -1;
}
if (peer_fd >= 0)
{
tfd->set_fd_handler(peer_fd, false, NULL);
close(peer_fd);
peer_fd = -1;
}
if (parsed.headers["transfer-encoding"] == "chunked")
{
int prev = 0, pos = 0;
while ((pos = response.find("\r\n", prev)) >= prev)
{
uint64_t len = strtoull(response.c_str()+prev, NULL, 16);
parsed.body += response.substr(pos+2, len);
prev = pos+2+len+2;
}
}
else
{
std::swap(parsed.body, response);
}
parsed.eof = true;
callback(&parsed);
}
void http_co_t::start_connection()
{
stackin();
int port = extract_port(host);
struct sockaddr_in addr;
int r;
if ((r = inet_pton(AF_INET, host.c_str(), &addr.sin_addr)) != 1)
{
parsed.error_code = ENXIO;
stackout();
end();
return;
}
addr.sin_family = AF_INET;
addr.sin_port = htons(port ? port : 80);
peer_fd = socket(AF_INET, SOCK_STREAM, 0);
if (peer_fd < 0)
{
parsed.error_code = errno;
stackout();
end();
return;
}
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
if (request_timeout > 0)
{
timeout_id = tfd->set_timer(request_timeout, false, [this](int timer_id)
{
if (response.length() == 0)
{
parsed.error_code = ETIME;
}
end();
});
}
epoll_events = 0;
// Finally call connect
r = ::connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
if (r < 0 && errno != EINPROGRESS)
{
parsed.error_code = errno;
stackout();
end();
return;
}
tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
{
this->epoll_events |= epoll_events;
handle_events();
});
state = HTTP_CO_CONNECTING;
stackout();
}
void http_co_t::handle_events()
{
stackin();
while (epoll_events)
{
if (state == HTTP_CO_CONNECTING)
{
handle_connect_result();
}
else
{
epoll_events &= ~EPOLLOUT;
if (epoll_events & EPOLLIN)
{
submit_read();
}
else if (epoll_events & (EPOLLRDHUP|EPOLLERR))
{
end();
break;
}
}
}
stackout();
}
void http_co_t::handle_connect_result()
{
stackin();
int result = 0;
socklen_t result_len = sizeof(result);
if (getsockopt(peer_fd, SOL_SOCKET, SO_ERROR, &result, &result_len) < 0)
{
result = errno;
}
if (result != 0)
{
parsed.error_code = result;
stackout();
end();
return;
}
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{
this->epoll_events |= epoll_events;
handle_events();
});
state = HTTP_CO_SENDING_REQUEST;
submit_send();
stackout();
}
void http_co_t::submit_read()
{
stackin();
int res;
if (rbuf.size() != READ_BUFFER_SIZE)
{
rbuf.resize(READ_BUFFER_SIZE);
}
read_iov = { .iov_base = rbuf.data(), .iov_len = READ_BUFFER_SIZE };
read_msg.msg_iov = &read_iov;
read_msg.msg_iovlen = 1;
res = recvmsg(peer_fd, &read_msg, 0);
if (res < 0)
{
res = -errno;
}
if (res == -EAGAIN)
{
epoll_events = epoll_events & ~EPOLLIN;
}
else if (res <= 0)
{
// < 0 means error, 0 means EOF
if (!res)
epoll_events = epoll_events & ~EPOLLIN;
end();
}
else
{
response += std::string(rbuf.data(), res);
handle_read();
}
stackout();
}
void http_co_t::submit_send()
{
stackin();
int res;
again:
if (sent < request.size())
{
send_iov = (iovec){ .iov_base = (void*)(request.c_str()+sent), .iov_len = request.size()-sent };
send_msg.msg_iov = &send_iov;
send_msg.msg_iovlen = 1;
res = sendmsg(peer_fd, &send_msg, MSG_NOSIGNAL);
if (res < 0)
{
res = -errno;
}
if (res == -EAGAIN)
{
res = 0;
}
else if (res < 0)
{
stackout();
end();
return;
}
sent += res;
if (state == HTTP_CO_SENDING_REQUEST)
{
if (sent >= request.size())
{
state = HTTP_CO_REQUEST_SENT;
}
else
goto again;
}
else if (state == HTTP_CO_WEBSOCKET)
{
request = request.substr(sent);
sent = 0;
goto again;
}
}
stackout();
}
bool http_co_t::handle_read()
{
stackin();
if (state == HTTP_CO_REQUEST_SENT)
{
int pos = response.find("\r\n\r\n");
if (pos >= 0)
{
if (timeout_id >= 0)
{
tfd->clear_timer(timeout_id);
timeout_id = -1;
}
state = HTTP_CO_HEADERS_RECEIVED;
parse_http_headers(response, &parsed);
if (parsed.status_code == 101 &&
parsed.headers.find("sec-websocket-accept") != parsed.headers.end() &&
parsed.headers["upgrade"] == "websocket" &&
parsed.headers["connection"] == "upgrade")
{
// Don't care about validating the key
state = HTTP_CO_WEBSOCKET;
request = ws_outbox;
ws_outbox = "";
sent = 0;
submit_send();
}
else if (parsed.headers["transfer-encoding"] == "chunked")
{
state = HTTP_CO_CHUNKED;
}
else if (parsed.headers["connection"] != "close")
{
target_response_size = stoull_full(parsed.headers["content-length"]);
if (!target_response_size)
{
// Sorry, unsupported response
stackout();
end();
return false;
}
}
}
}
if (state == HTTP_CO_HEADERS_RECEIVED && target_response_size > 0 && response.size() >= target_response_size)
{
stackout();
end();
return false;
}
if (state == HTTP_CO_CHUNKED && response.size() > 0)
{
int prev = 0, pos = 0;
while ((pos = response.find("\r\n", prev)) >= prev)
{
uint64_t len = strtoull(response.c_str()+prev, NULL, 16);
if (!len)
{
// Zero length chunk indicates EOF
parsed.eof = true;
break;
}
if (response.size() < pos+2+len+2)
{
break;
}
parsed.body += response.substr(pos+2, len);
prev = pos+2+len+2;
}
if (prev > 0)
{
response = response.substr(prev);
}
if (parsed.eof)
{
stackout();
end();
return false;
}
if (want_streaming && parsed.body.size() > 0)
{
callback(&parsed);
parsed.body = "";
}
}
if (state == HTTP_CO_WEBSOCKET && response.size() > 0)
{
while (ws_parse_frame(response, parsed.ws_msg_type, parsed.body))
{
callback(&parsed);
parsed.body = "";
}
}
stackout();
return true;
}
void http_co_t::post_message(int type, const std::string & msg)
{
stackin();
if (state == HTTP_CO_WEBSOCKET)
{
request += ws_format_frame(type, msg.size());
request += msg;
submit_send();
}
else
{
ws_outbox += ws_format_frame(type, msg.size());
ws_outbox += msg;
}
stackout();
}
uint64_t stoull_full(const std::string & str, int base)
{
if (isspace(str[0]))
{
return 0;
}
char *end = NULL;
uint64_t r = strtoull(str.c_str(), &end, base);
if (end != str.c_str()+str.length())
{
return 0;
}
return r;
}
void parse_http_headers(std::string & res, http_response_t *parsed)
{
int pos = res.find("\r\n");
pos = pos < 0 ? res.length() : pos+2;
std::string status_line = res.substr(0, pos);
int http_version;
char *status_text = NULL;
sscanf(status_line.c_str(), "HTTP/1.%d %d %ms", &http_version, &parsed->status_code, &status_text);
if (status_text)
{
parsed->status_line = status_text;
// %ms = allocate a buffer
free(status_text);
status_text = NULL;
}
int prev = pos;
while ((pos = res.find("\r\n", prev)) >= prev)
{
if (pos == prev)
{
res = res.substr(pos+2);
break;
}
std::string header = res.substr(prev, pos-prev);
int p2 = header.find(":");
if (p2 >= 0)
{
std::string key = strtolower(header.substr(0, p2));
int p3 = p2+1;
while (p3 < header.length() && isblank(header[p3]))
p3++;
parsed->headers[key] = key == "connection" || key == "upgrade" || key == "transfer-encoding"
? strtolower(header.substr(p3)) : header.substr(p3);
}
prev = pos+2;
}
}
static std::string ws_format_frame(int type, uint64_t size)
{
// Always zero mask
std::string res;
int p = 0;
res.resize(2 + (size >= 126 ? 2 : 0) + (size >= 65536 ? 6 : 0) + /*mask*/4);
res[p++] = 0x80 | type;
if (size < 126)
res[p++] = size | /*mask*/0x80;
else if (size < 65536)
{
res[p++] = 126 | /*mask*/0x80;
res[p++] = (size >> 8) & 0xFF;
res[p++] = (size >> 0) & 0xFF;
}
else
{
res[p++] = 127 | /*mask*/0x80;
res[p++] = (size >> 56) & 0xFF;
res[p++] = (size >> 48) & 0xFF;
res[p++] = (size >> 40) & 0xFF;
res[p++] = (size >> 32) & 0xFF;
res[p++] = (size >> 24) & 0xFF;
res[p++] = (size >> 16) & 0xFF;
res[p++] = (size >> 8) & 0xFF;
res[p++] = (size >> 0) & 0xFF;
}
res[p++] = 0;
res[p++] = 0;
res[p++] = 0;
res[p++] = 0;
return res;
}
static bool ws_parse_frame(std::string & buf, int & type, std::string & res)
{
uint64_t hdr = 2;
if (buf.size() < hdr)
{
return false;
}
type = buf[0] & ~0x80;
bool mask = !!(buf[1] & 0x80);
hdr += mask ? 4 : 0;
uint64_t len = ((uint8_t)buf[1] & ~0x80);
if (len == 126)
{
hdr += 2;
if (buf.size() < hdr)
{
return false;
}
len = ((uint64_t)(uint8_t)buf[2] << 8) | ((uint64_t)(uint8_t)buf[3] << 0);
}
else if (len == 127)
{
hdr += 8;
if (buf.size() < hdr)
{
return false;
}
len = ((uint64_t)(uint8_t)buf[2] << 56) |
((uint64_t)(uint8_t)buf[3] << 48) |
((uint64_t)(uint8_t)buf[4] << 40) |
((uint64_t)(uint8_t)buf[5] << 32) |
((uint64_t)(uint8_t)buf[6] << 24) |
((uint64_t)(uint8_t)buf[7] << 16) |
((uint64_t)(uint8_t)buf[8] << 8) |
((uint64_t)(uint8_t)buf[9] << 0);
}
if (buf.size() < hdr+len)
{
return false;
}
if (mask)
{
for (int i = 0; i < len; i++)
buf[hdr+i] ^= buf[hdr-4+(i & 3)];
}
res += buf.substr(hdr, len);
buf = buf.substr(hdr+len);
return true;
}
std::vector<std::string> getifaddr_list(bool include_v6)
{
std::vector<std::string> addresses;
ifaddrs *list, *ifa;
if (getifaddrs(&list) == -1)
{
throw std::runtime_error(std::string("getifaddrs: ") + strerror(errno));
}
for (ifa = list; ifa != NULL; ifa = ifa->ifa_next)
{
if (!ifa->ifa_addr)
{
continue;
}
int family = ifa->ifa_addr->sa_family;
if ((family == AF_INET || family == AF_INET6 && include_v6) &&
(ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
{
void *addr_ptr;
if (family == AF_INET)
addr_ptr = &((sockaddr_in *)ifa->ifa_addr)->sin_addr;
else
addr_ptr = &((sockaddr_in6 *)ifa->ifa_addr)->sin6_addr;
char addr[INET6_ADDRSTRLEN];
if (!inet_ntop(family, addr_ptr, addr, INET6_ADDRSTRLEN))
{
throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
}
addresses.push_back(std::string(addr));
}
}
freeifaddrs(list);
return addresses;
}
static int extract_port(std::string & host)
{
int port = 0;
int pos = 0;
if ((pos = host.find(':')) >= 0)
{
port = strtoull(host.c_str() + pos + 1, NULL, 10);
if (port >= 0x10000)
{
port = 0;
}
host = host.substr(0, pos);
}
return port;
}
static std::string strtolower(const std::string & in)
{
std::string s = in;
for (int i = 0; i < s.length(); i++)
{
s[i] = tolower(s[i]);
}
return s;
}
static std::string trim(const std::string & in)
{
int begin = in.find_first_not_of(" \n\r\t");
if (begin == -1)
return "";
int end = in.find_last_not_of(" \n\r\t");
return in.substr(begin, end+1-begin);
}

59
http_client.h Normal file
View File

@ -0,0 +1,59 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <string>
#include <vector>
#include <map>
#include <functional>
#include "json11/json11.hpp"
#define WS_CONTINUATION 0
#define WS_TEXT 1
#define WS_BINARY 2
#define WS_CLOSE 8
#define WS_PING 9
#define WS_PONG 10
class timerfd_manager_t;
struct http_options_t
{
int timeout;
bool want_streaming;
};
struct http_response_t
{
bool eof = false;
int error_code = 0;
int status_code = 0;
std::string status_line;
std::map<std::string, std::string> headers;
int ws_msg_type = -1;
std::string body;
};
struct http_co_t;
struct websocket_t
{
http_co_t *co;
void post_message(int type, const std::string & msg);
void close();
};
void parse_http_headers(std::string & res, http_response_t *parsed);
std::vector<std::string> getifaddr_list(bool include_v6 = false);
uint64_t stoull_full(const std::string & str, int base = 10);
void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
const http_options_t & options, std::function<void(const http_response_t *response)> callback);
void http_request_json(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
int timeout, std::function<void(std::string, json11::Json r)> callback);
websocket_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, const std::string & path,
int timeout, std::function<void(const http_response_t *msg)> callback);

1
json11 Submodule

@ -0,0 +1 @@
Subproject commit 97f06cb20c1e136fd37d58fb40f57dd8f8a3a4a7

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <iostream>
#include <functional>
#include <array>

50
malloc_or_die.h Normal file
View File

@ -0,0 +1,50 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <malloc.h>
inline void* memalign_or_die(size_t alignment, size_t size)
{
void *buf = memalign(alignment, size);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", size);
exit(1);
}
return buf;
}
inline void* malloc_or_die(size_t size)
{
void *buf = malloc(size);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", size);
exit(1);
}
return buf;
}
inline void* realloc_or_die(void *ptr, size_t size)
{
void *buf = realloc(ptr, size);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", size);
exit(1);
}
return buf;
}
inline void* calloc_or_die(size_t nmemb, size_t size)
{
void *buf = calloc(nmemb, size);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", size * nmemb);
exit(1);
}
return buf;
}

412
messenger.cpp Normal file
View File

@ -0,0 +1,412 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/tcp.h>
#include "messenger.h"
osd_op_t::~osd_op_t()
{
assert(!bs_op);
assert(!op_data);
if (rmw_buf)
{
free(rmw_buf);
}
if (buf)
{
// Note: reusing osd_op_t WILL currently lead to memory leaks
// So we don't reuse it, but free it every time
free(buf);
}
}
osd_messenger_t::~osd_messenger_t()
{
while (clients.size() > 0)
{
stop_client(clients.begin()->first);
}
}
void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
{
if (wanted_peers.find(peer_osd) == wanted_peers.end())
{
wanted_peers[peer_osd] = (osd_wanted_peer_t){
.address_list = peer_state["addresses"],
.port = (int)peer_state["port"].int64_value(),
};
}
else
{
wanted_peers[peer_osd].address_list = peer_state["addresses"];
wanted_peers[peer_osd].port = (int)peer_state["port"].int64_value();
}
wanted_peers[peer_osd].address_changed = true;
if (!wanted_peers[peer_osd].connecting &&
(time(NULL) - wanted_peers[peer_osd].last_connect_attempt) >= peer_connect_interval)
{
try_connect_peer(peer_osd);
}
}
void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
{
auto wp_it = wanted_peers.find(peer_osd);
if (wp_it == wanted_peers.end())
{
return;
}
if (osd_peer_fds.find(peer_osd) != osd_peer_fds.end())
{
wanted_peers.erase(peer_osd);
return;
}
auto & wp = wp_it->second;
if (wp.address_index >= wp.address_list.array_items().size())
{
return;
}
wp.cur_addr = wp.address_list[wp.address_index].string_value();
wp.cur_port = wp.port;
wp.connecting = true;
try_connect_peer_addr(peer_osd, wp.cur_addr.c_str(), wp.cur_port);
}
void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port)
{
struct sockaddr_in addr;
int r;
if ((r = inet_pton(AF_INET, peer_host, &addr.sin_addr)) != 1)
{
on_connect_peer(peer_osd, -EINVAL);
return;
}
addr.sin_family = AF_INET;
addr.sin_port = htons(peer_port ? peer_port : 11203);
int peer_fd = socket(AF_INET, SOCK_STREAM, 0);
if (peer_fd < 0)
{
on_connect_peer(peer_osd, -errno);
return;
}
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int timeout_id = -1;
if (peer_connect_timeout > 0)
{
timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients[peer_fd].osd_num;
stop_client(peer_fd);
on_connect_peer(peer_osd, -EIO);
return;
});
}
r = connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
if (r < 0 && errno != EINPROGRESS)
{
close(peer_fd);
on_connect_peer(peer_osd, -errno);
return;
}
assert(peer_osd != this->osd_num);
clients[peer_fd] = (osd_client_t){
.peer_addr = addr,
.peer_port = peer_port,
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTING,
.connect_timeout_id = timeout_id,
.osd_num = peer_osd,
.in_buf = malloc_or_die(receive_buffer_size),
};
tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
{
// Either OUT (connected) or HUP
handle_connect_epoll(peer_fd);
});
}
void osd_messenger_t::handle_connect_epoll(int peer_fd)
{
auto & cl = clients[peer_fd];
if (cl.connect_timeout_id >= 0)
{
tfd->clear_timer(cl.connect_timeout_id);
cl.connect_timeout_id = -1;
}
osd_num_t peer_osd = cl.osd_num;
int result = 0;
socklen_t result_len = sizeof(result);
if (getsockopt(peer_fd, SOL_SOCKET, SO_ERROR, &result, &result_len) < 0)
{
result = errno;
}
if (result != 0)
{
stop_client(peer_fd);
on_connect_peer(peer_osd, -result);
return;
}
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
cl.peer_state = PEER_CONNECTED;
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{
handle_peer_epoll(peer_fd, epoll_events);
});
// Check OSD number
check_peer_config(cl);
}
void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
{
// Mark client as ready (i.e. some data is available)
if (epoll_events & EPOLLRDHUP)
{
// Stop client
printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
stop_client(peer_fd);
}
else if (epoll_events & EPOLLIN)
{
// Mark client as ready (i.e. some data is available)
auto & cl = clients[peer_fd];
cl.read_ready++;
if (cl.read_ready == 1)
{
read_ready_clients.push_back(cl.peer_fd);
if (ringloop)
ringloop->wakeup();
else
read_requests();
}
}
}
void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
{
auto & wp = wanted_peers.at(peer_osd);
wp.connecting = false;
if (peer_fd < 0)
{
printf("Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
if (wp.address_changed)
{
wp.address_changed = false;
wp.address_index = 0;
try_connect_peer(peer_osd);
}
else if (wp.address_index < wp.address_list.array_items().size()-1)
{
// Try other addresses
wp.address_index++;
try_connect_peer(peer_osd);
}
else
{
// Retry again in <peer_connect_interval> seconds
wp.last_connect_attempt = time(NULL);
wp.address_index = 0;
tfd->set_timer(1000*peer_connect_interval, false, [this, peer_osd](int)
{
try_connect_peer(peer_osd);
});
}
return;
}
printf("Connected with peer OSD %lu (fd %d)\n", peer_osd, peer_fd);
wanted_peers.erase(peer_osd);
repeer_pgs(peer_osd);
}
void osd_messenger_t::check_peer_config(osd_client_t & cl)
{
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl.peer_fd;
op->req = {
.show_conf = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = this->next_subop_id++,
.opcode = OSD_OP_SHOW_CONFIG,
},
},
};
op->callback = [this](osd_op_t *op)
{
osd_client_t & cl = clients[op->peer_fd];
std::string json_err;
json11::Json config;
bool err = false;
if (op->reply.hdr.retval < 0)
{
err = true;
printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl.osd_num, op->reply.hdr.retval);
}
else
{
config = json11::Json::parse(std::string((char*)op->buf), json_err);
if (json_err != "")
{
err = true;
printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl.osd_num, json_err.c_str());
}
else if (config["osd_num"].uint64_value() != cl.osd_num)
{
err = true;
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl.osd_num);
}
}
if (err)
{
osd_num_t osd_num = cl.osd_num;
stop_client(op->peer_fd);
on_connect_peer(osd_num, -1);
delete op;
return;
}
osd_peer_fds[cl.osd_num] = cl.peer_fd;
on_connect_peer(cl.osd_num, cl.peer_fd);
delete op;
};
outbox_push(op);
}
void osd_messenger_t::cancel_osd_ops(osd_client_t & cl)
{
for (auto p: cl.sent_ops)
{
cancel_op(p.second);
}
cl.sent_ops.clear();
for (auto op: cl.outbox)
{
cancel_op(op);
}
cl.outbox.clear();
if (cl.write_op)
{
cancel_op(cl.write_op);
cl.write_op = NULL;
}
}
void osd_messenger_t::cancel_op(osd_op_t *op)
{
if (op->op_type == OSD_OP_OUT)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = -EPIPE;
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
}
else
{
// This function is only called in stop_client(), so it's fine to destroy the operation
delete op;
}
}
void osd_messenger_t::stop_client(int peer_fd)
{
assert(peer_fd != 0);
auto it = clients.find(peer_fd);
if (it == clients.end())
{
return;
}
uint64_t repeer_osd = 0;
osd_client_t cl = it->second;
if (cl.peer_state == PEER_CONNECTED)
{
if (cl.osd_num)
{
// Reload configuration from etcd when the connection is dropped
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl.osd_num);
repeer_osd = cl.osd_num;
}
else
{
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
clients.erase(it);
tfd->set_fd_handler(peer_fd, false, NULL);
if (cl.osd_num)
{
osd_peer_fds.erase(cl.osd_num);
// Cancel outbound operations
cancel_osd_ops(cl);
}
if (cl.read_op)
{
delete cl.read_op;
cl.read_op = NULL;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
if (*rit == peer_fd)
{
read_ready_clients.erase(rit);
break;
}
}
for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
{
if (*wit == peer_fd)
{
write_ready_clients.erase(wit);
break;
}
}
free(cl.in_buf);
close(peer_fd);
if (repeer_osd)
{
repeer_pgs(repeer_osd);
}
}
void osd_messenger_t::accept_connections(int listen_fd)
{
// Accept new connections
sockaddr_in addr;
socklen_t peer_addr_size = sizeof(addr);
int peer_fd;
while ((peer_fd = accept(listen_fd, (sockaddr*)&addr, &peer_addr_size)) >= 0)
{
assert(peer_fd != 0);
char peer_str[256];
printf("[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
clients[peer_fd] = {
.peer_addr = addr,
.peer_port = ntohs(addr.sin_port),
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTED,
.in_buf = malloc_or_die(receive_buffer_size),
};
// Add FD to epoll
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{
handle_peer_epoll(peer_fd, epoll_events);
});
// Try to accept next connection
peer_addr_size = sizeof(addr);
}
if (peer_fd == -1 && errno != EAGAIN)
{
throw std::runtime_error(std::string("accept: ") + strerror(errno));
}
}

303
messenger.h Normal file
View File

@ -0,0 +1,303 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <sys/types.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <set>
#include <map>
#include <deque>
#include <vector>
#include "malloc_or_die.h"
#include "json11/json11.hpp"
#include "osd_ops.h"
#include "timerfd_manager.h"
#include "ringloop.h"
#define OSD_OP_IN 0
#define OSD_OP_OUT 1
#define CL_READ_HDR 1
#define CL_READ_DATA 2
#define CL_READ_REPLY_DATA 3
#define CL_WRITE_READY 1
#define CL_WRITE_REPLY 2
#define OSD_OP_INLINE_BUF_COUNT 16
#define PEER_CONNECTING 1
#define PEER_CONNECTED 2
#define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5
// Kind of a vector with small-list-optimisation
struct osd_op_buf_list_t
{
int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
iovec *buf = NULL;
iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
inline osd_op_buf_list_t()
{
buf = inline_buf;
}
inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
{
buf = inline_buf;
append(other);
}
inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
{
reset();
append(other);
return *this;
}
inline ~osd_op_buf_list_t()
{
if (buf && buf != inline_buf)
{
free(buf);
}
}
inline void reset()
{
count = 0;
done = 0;
}
inline iovec* get_iovec()
{
return buf + done;
}
inline int get_size()
{
return count - done;
}
inline void append(const osd_op_buf_list_t & other)
{
if (count+other.count > alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec) * old);
}
else
{
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
for (int i = 0; i < other.count; i++)
{
buf[count++] = other.buf[i];
}
}
inline void push_back(void *nbuf, size_t len)
{
if (count >= alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec)*old);
}
else
{
alloc = alloc < 16 ? 16 : (alloc+16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
buf[count++] = { .iov_base = nbuf, .iov_len = len };
}
inline void eat(int result)
{
while (result > 0 && done < count)
{
iovec & iov = buf[done];
if (iov.iov_len <= result)
{
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base += result;
break;
}
}
}
};
struct blockstore_op_t;
struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin;
uint64_t op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;
osd_op_buf_list_t iov;
~osd_op_t();
};
struct osd_client_t
{
sockaddr_in peer_addr;
int peer_port;
int peer_fd;
int peer_state;
int connect_timeout_id = -1;
osd_num_t osd_num = 0;
void *in_buf = NULL;
// Read state
int read_ready = 0;
osd_op_t *read_op = NULL;
iovec read_iov;
msghdr read_msg;
int read_remaining = 0;
int read_state = 0;
osd_op_buf_list_t recv_list;
// Incoming operations
std::vector<osd_op_t*> received_ops;
// Outbound operations
std::deque<osd_op_t*> outbox;
std::map<int, osd_op_t*> sent_ops;
// PGs dirtied by this client's primary-writes
std::set<pool_pg_num_t> dirty_pgs;
// Write state
osd_op_t *write_op = NULL;
msghdr write_msg;
int write_state = 0;
osd_op_buf_list_t send_list;
};
struct osd_wanted_peer_t
{
json11::Json address_list;
int port;
time_t last_connect_attempt;
bool connecting, address_changed;
int address_index;
std::string cur_addr;
int cur_port;
};
struct osd_op_stats_t
{
uint64_t op_stat_sum[OSD_OP_MAX+1] = { 0 };
uint64_t op_stat_count[OSD_OP_MAX+1] = { 0 };
uint64_t op_stat_bytes[OSD_OP_MAX+1] = { 0 };
uint64_t subop_stat_sum[OSD_OP_MAX+1] = { 0 };
uint64_t subop_stat_count[OSD_OP_MAX+1] = { 0 };
};
struct osd_messenger_t
{
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
// FIXME: make receive_buffer_size configurable
int receive_buffer_size = 64*1024;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
int log_level = 0;
bool use_sync_send_recv = false;
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t> clients;
std::vector<int> read_ready_clients;
std::vector<int> write_ready_clients;
std::vector<std::function<void()>> set_immediate;
// op statistics
osd_op_stats_t stats;
public:
void connect_peer(uint64_t osd_num, json11::Json peer_state);
void stop_client(int peer_fd);
void outbox_push(osd_op_t *cur_op);
std::function<void(osd_op_t*)> exec_op;
std::function<void(osd_num_t)> repeer_pgs;
void handle_peer_epoll(int peer_fd, int epoll_events);
void read_requests();
void send_replies();
void accept_connections(int listen_fd);
~osd_messenger_t();
protected:
void try_connect_peer(uint64_t osd_num);
void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
void handle_connect_epoll(int peer_fd);
void on_connect_peer(osd_num_t peer_osd, int peer_fd);
void check_peer_config(osd_client_t & cl);
void cancel_osd_ops(osd_client_t & cl);
void cancel_op(osd_op_t *op);
bool try_send(osd_client_t & cl);
void handle_send(int result, int peer_fd);
bool handle_read(int result, int peer_fd);
bool handle_finished_read(osd_client_t & cl);
void handle_op_hdr(osd_client_t *cl);
bool handle_reply_hdr(osd_client_t *cl);
void handle_reply_ready(osd_op_t *op);
};

112
mon/PGUtil.js Normal file
View File

@ -0,0 +1,112 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
module.exports = {
scale_pg_count,
};
function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{
const old_pg_count = prev_pgs.length;
// Add all possibly intersecting PGs into the history of new PGs
if (!(new_pg_count % old_pg_count))
{
// New PG count is a multiple of the old PG count
const mul = (new_pg_count / old_pg_count);
for (let i = 0; i < new_pg_count; i++)
{
const old_i = Math.floor(new_pg_count / mul);
new_pg_history[i] = JSON.parse(JSON.stringify(prev_pg_history[1+old_i]));
}
}
else if (!(old_pg_count % new_pg_count))
{
// Old PG count is a multiple of the new PG count
const mul = (old_pg_count / new_pg_count);
for (let i = 0; i < new_pg_count; i++)
{
new_pg_history[i] = {
osd_sets: [],
all_peers: [],
epoch: 0,
};
for (let j = 0; j < mul; j++)
{
new_pg_history[i].osd_sets.push(prev_pgs[i*mul]);
const hist = prev_pg_history[1+i*mul+j];
if (hist && hist.osd_sets && hist.osd_sets.length)
{
Array.prototype.push.apply(new_pg_history[i].osd_sets, hist.osd_sets);
}
if (hist && hist.all_peers && hist.all_peers.length)
{
Array.prototype.push.apply(new_pg_history[i].all_peers, hist.all_peers);
}
if (hist && hist.epoch)
{
new_pg_history[i].epoch = new_pg_history[i].epoch < hist.epoch ? hist.epoch : new_pg_history[i].epoch;
}
}
}
}
else
{
// Any PG may intersect with any PG after non-multiple PG count change
// So, merge ALL PGs history
let all_sets = {};
let all_peers = {};
let max_epoch = 0;
for (const pg of prev_pgs)
{
all_sets[pg.join(' ')] = pg;
}
for (const pg in prev_pg_history)
{
const hist = prev_pg_history[pg];
if (hist && hist.osd_sets)
{
for (const pg of hist.osd_sets)
{
all_sets[pg.join(' ')] = pg;
}
}
if (hist && hist.all_peers)
{
for (const osd_num of hist.all_peers)
{
all_peers[osd_num] = Number(osd_num);
}
}
if (hist && hist.epoch)
{
max_epoch = max_epoch < hist.epoch ? hist.epoch : max_epoch;
}
}
all_sets = Object.values(all_sets);
all_peers = Object.values(all_peers);
for (let i = 0; i < new_pg_count; i++)
{
new_pg_history[i] = { osd_sets: all_sets, all_peers, epoch: max_epoch };
}
}
// Mark history keys for removed PGs as removed
for (let i = new_pg_count; i < old_pg_count; i++)
{
new_pg_history[i] = null;
}
if (old_pg_count < new_pg_count)
{
for (let i = new_pg_count-1; i >= 0; i--)
{
prev_pgs[i] = prev_pgs[Math.floor(i/new_pg_count*old_pg_count)];
}
}
else if (old_pg_count > new_pg_count)
{
for (let i = 0; i < new_pg_count; i++)
{
prev_pgs[i] = prev_pgs[Math.round(i/new_pg_count*old_pg_count)];
}
prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
}
}

159
mon/afr.js Normal file
View File

@ -0,0 +1,159 @@
// Functions to calculate Annualized Failure Rate of your cluster
// if you know AFR of your drives, number of drives, expected rebalance time
// and replication factor
// License: VNPL-1.0 (see README.md for details)
const { sprintf } = require('sprintf-js');
module.exports = {
cluster_afr_fullmesh,
failure_rate_fullmesh,
cluster_afr,
print_cluster_afr,
c_n_k,
};
print_cluster_afr({ n_hosts: 4, n_drives: 6, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, replicas: 2 });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 4, n_drives: 3, afr_drive: 0.03, afr_host: 0.05, capacity: 4000, speed: 0.1, ec: [ 2, 1 ] });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 2 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100 });
print_cluster_afr({ n_hosts: 10, n_drives: 10, afr_drive: 0.1, afr_host: 0.05, capacity: 8000, speed: 0.02, replicas: 3, pgs: 100, degraded_replacement: 1 });
/******** "FULL MESH": ASSUME EACH OSD COMMUNICATES WITH ALL OTHER OSDS ********/
// Estimate AFR of the cluster
// n - number of drives
// afr - annualized failure rate of a single drive
// l - expected rebalance time in days after a single drive failure
// k - replication factor / number of drives that must fail at the same time for the cluster to fail
function cluster_afr_fullmesh(n, afr, l, k)
{
return 1 - (1 - afr * failure_rate_fullmesh(n-(k-1), afr*l/365, k-1)) ** (n-(k-1));
}
// Probability of at least <f> failures in a cluster with <n> drives with AFR=<a>
function failure_rate_fullmesh(n, a, f)
{
if (f <= 0)
{
return (1-a)**n;
}
let p = 1;
for (let i = 0; i < f; i++)
{
p -= c_n_k(n, i) * (1-a)**(n-i) * a**i;
}
return p;
}
/******** PGS: EACH OSD ONLY COMMUNICATES WITH <pgs> OTHER OSDs ********/
// <n> hosts of <m> drives of <capacity> GB, each able to backfill at <speed> GB/s,
// <k> replicas, <pgs> unique peer PGs per OSD
//
// For each of n*m drives: P(drive fails in a year) * P(any of its peers fail in <l*365> next days).
// More peers per OSD increase rebalance speed (more drives work together to resilver) if you
// let them finish rebalance BEFORE replacing the failed drive.
// At the same time, more peers per OSD increase probability of any of them to fail!
//
// Probability of all except one drives in a replica group to fail is (AFR^(k-1)).
// So with <x> PGs it becomes ~ (x * (AFR*L/365)^(k-1)). Interesting but reasonable consequence
// is that, with k=2, total failure rate doesn't depend on number of peers per OSD,
// because it gets increased linearly by increased number of peers to fail
// and decreased linearly by reduced rebalance time.
function cluster_afr_pgs({ n_hosts, n_drives, afr_drive, capacity, speed, replicas, pgs = 1, degraded_replacement })
{
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
return 1 - (1 - afr_drive * (1-(1-(afr_drive*l)**(replicas-1))**pgs)) ** (n_hosts*n_drives);
}
function cluster_afr_pgs_ec({ n_hosts, n_drives, afr_drive, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
{
const ec_total = ec_data+ec_parity;
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
const l = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
return 1 - (1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, afr_drive*l, ec_parity))**pgs)) ** (n_hosts*n_drives);
}
// Same as above, but also take server failures into account
function cluster_afr_pgs_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, replicas, pgs = 1, degraded_replacement })
{
let otherhosts = Math.min(pgs, (n_hosts-1)/(replicas-1));
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(replicas-1));
let pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(replicas-1));
const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
const lh = n_drives*capacity/pgs/speed/86400/365;
const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
return 1 - ((1 - afr_host * (1-(1-p1**(replicas-1))**pgh)) ** n_hosts) *
((1 - afr_drive * (1-(1-p2**(replicas-1))**pgs)) ** (n_hosts*n_drives));
}
function cluster_afr_pgs_ec_hosts({ n_hosts, n_drives, afr_drive, afr_host, capacity, speed, ec: [ ec_data, ec_parity ], pgs = 1, degraded_replacement })
{
const ec_total = ec_data+ec_parity;
const otherhosts = Math.min(pgs, (n_hosts-1)/(ec_total-1));
pgs = Math.min(pgs, (n_hosts-1)*n_drives/(ec_total-1));
const pgh = Math.min(pgs*n_drives, (n_hosts-1)*n_drives/(ec_total-1));
const ld = capacity/(degraded_replacement ? 1 : pgs)/speed/86400/365;
const lh = n_drives*capacity/pgs/speed/86400/365;
const p1 = ((afr_drive+afr_host*pgs/otherhosts)*lh);
const p2 = ((afr_drive+afr_host*pgs/otherhosts)*ld);
return 1 - ((1 - afr_host * (1-(1-failure_rate_fullmesh(ec_total-1, p1, ec_parity))**pgh)) ** n_hosts) *
((1 - afr_drive * (1-(1-failure_rate_fullmesh(ec_total-1, p2, ec_parity))**pgs)) ** (n_hosts*n_drives));
}
// Wrapper for 4 above functions
function cluster_afr(config)
{
if (config.ec && config.afr_host)
{
return cluster_afr_pgs_ec_hosts(config);
}
else if (config.ec)
{
return cluster_afr_pgs_ec(config);
}
else if (config.afr_host)
{
return cluster_afr_pgs_hosts(config);
}
else
{
return cluster_afr_pgs(config);
}
}
function print_cluster_afr(config)
{
console.log(
`${config.n_hosts} nodes with ${config.n_drives} ${sprintf("%.1f", config.capacity/1000)}TB drives`+
`, capable to backfill at ${sprintf("%.1f", config.speed*1000)} MB/s, drive AFR ${sprintf("%.1f", config.afr_drive*100)}%`+
(config.afr_host ? `, host AFR ${sprintf("%.1f", config.afr_host*100)}%` : '')+
(config.ec ? `, EC ${config.ec[0]}+${config.ec[1]}` : `, ${config.replicas} replicas`)+
`, ${config.pgs||1} PG per OSD`+
(config.degraded_replacement ? `\n...and you don't let the rebalance finish before replacing drives` : '')
);
console.log('-> '+sprintf("%.7f%%", 100*cluster_afr(config))+'\n');
}
/******** UTILITY ********/
// Combination count
function c_n_k(n, k)
{
let r = 1;
for (let i = 0; i < k; i++)
{
r *= (n-i) / (i+1);
}
return r;
}

688
mon/lp-optimizer.js Normal file
View File

@ -0,0 +1,688 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// Data distribution optimizer using linear programming (lp_solve)
const child_process = require('child_process');
const NO_OSD = 'Z';
async function lp_solve(text)
{
const cp = child_process.spawn('lp_solve');
let stdout = '', stderr = '', finish_cb;
cp.stdout.on('data', buf => stdout += buf.toString());
cp.stderr.on('data', buf => stderr += buf.toString());
cp.on('exit', () => finish_cb && finish_cb());
cp.stdin.write(text);
cp.stdin.end();
if (cp.exitCode == null)
{
await new Promise(ok => finish_cb = ok);
}
if (!stdout.trim())
{
return null;
}
let score = 0;
let vars = {};
for (const line of stdout.split(/\n/))
{
let m = /^(^Value of objective function: (-?[\d\.]+)|Actual values of the variables:)\s*$/.exec(line);
if (m)
{
if (m[2])
{
score = m[2];
}
continue;
}
else if (/This problem is (infeasible|unbounded)/.exec(line))
{
return null;
}
let [ k, v ] = line.trim().split(/\s+/, 2);
if (v)
{
vars[k] = v;
}
}
return { score, vars };
}
async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
{
if (!pg_count || !osd_tree)
{
return null;
}
const all_weights = Object.assign({}, ...Object.values(osd_tree));
const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
const all_pgs = Object.values(random_combinations(osd_tree, pg_size, max_combinations));
const pg_per_osd = {};
for (const pg of all_pgs)
{
for (let i = 0; i < pg.length; i++)
{
const osd = pg[i];
pg_per_osd[osd] = pg_per_osd[osd] || [];
pg_per_osd[osd].push((i >= pg_minsize ? parity_space+'*' : '')+"pg_"+pg.join("_"));
}
}
const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+ Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
let lp = '';
lp += "max: "+all_pgs.map(pg => 'pg_'+pg.join('_')).join(' + ')+";\n";
for (const osd in pg_per_osd)
{
if (osd !== NO_OSD)
{
let osd_pg_count = all_weights[osd]/total_weight*pg_effsize*pg_count;
lp += pg_per_osd[osd].join(' + ')+' <= '+osd_pg_count+';\n';
}
}
for (const pg of all_pgs)
{
lp += 'pg_'+pg.join('_')+" >= 0;\n";
}
lp += "sec "+all_pgs.map(pg => 'pg_'+pg.join('_')).join(', ')+";\n";
const lp_result = await lp_solve(lp);
if (!lp_result)
{
console.log(lp);
throw new Error('Problem is infeasible or unbounded - is it a bug?');
}
const int_pgs = make_int_pgs(lp_result.vars, pg_count);
const eff = pg_list_space_efficiency(int_pgs, all_weights, pg_minsize, parity_space);
const res = {
score: lp_result.score,
weights: lp_result.vars,
int_pgs,
space: eff * pg_effsize,
total_space: total_weight,
};
return res;
}
function make_int_pgs(weights, pg_count)
{
const total_weight = Object.values(weights).reduce((a, c) => Number(a) + Number(c), 0);
let int_pgs = [];
let pg_left = pg_count;
let weight_left = total_weight;
for (const pg_name in weights)
{
let n = Math.round(weights[pg_name] / weight_left * pg_left);
for (let i = 0; i < n; i++)
{
int_pgs.push(pg_name.substr(3).split('_'));
}
weight_left -= weights[pg_name];
pg_left -= n;
}
return int_pgs;
}
function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
{
const move_weights = {};
if ((1 << pg_size) < pg_count)
{
const intersect = {};
for (const pg_name in prev_weights)
{
const pg = pg_name.substr(3).split(/_/);
for (let omit = 1; omit < (1 << pg_size); omit++)
{
let pg_omit = [ ...pg ];
let intersect_count = pg_size;
for (let i = 0; i < pg_size; i++)
{
if (omit & (1 << i))
{
pg_omit[i] = '';
intersect_count--;
}
}
pg_omit = pg_omit.join(':');
intersect[pg_omit] = Math.max(intersect[pg_omit] || 0, intersect_count);
}
}
for (const pg of all_pgs)
{
let max_int = 0;
for (let omit = 1; omit < (1 << pg_size); omit++)
{
let pg_omit = [ ...pg ];
for (let i = 0; i < pg_size; i++)
{
if (omit & (1 << i))
{
pg_omit[i] = '';
}
}
pg_omit = pg_omit.join(':');
max_int = Math.max(max_int, intersect[pg_omit] || 0);
}
move_weights['pg_'+pg.join('_')] = pg_size-max_int;
}
}
else
{
const prev_pg_hashed = Object.keys(prev_weights).map(pg_name => pg_name.substr(3).split(/_/).reduce((a, c) => { a[c] = 1; return a; }, {}));
for (const pg of all_pgs)
{
if (!prev_weights['pg_'+pg.join('_')])
{
let max_int = 0;
for (const prev_hash in prev_pg_hashed)
{
const intersect_count = pg.reduce((a, osd) => a + (prev_hash[osd] ? 1 : 0), 0);
if (max_int < intersect_count)
{
max_int = intersect_count;
if (max_int >= pg_size)
{
break;
}
}
}
move_weights['pg_'+pg.join('_')] = pg_size-max_int;
}
}
}
return move_weights;
}
function add_valid_previous(osd_tree, prev_weights, all_pgs)
{
// Add previous combinations that are still valid
const hosts = Object.keys(osd_tree).sort();
const host_per_osd = {};
for (const host in osd_tree)
{
for (const osd in osd_tree[host])
{
host_per_osd[osd] = host;
}
}
skip_pg: for (const pg_name in prev_weights)
{
const seen_hosts = {};
const pg = pg_name.substr(3).split(/_/);
for (const osd of pg)
{
if (!host_per_osd[osd] || seen_hosts[host_per_osd[osd]])
{
continue skip_pg;
}
seen_hosts[host_per_osd[osd]] = true;
}
if (!all_pgs[pg_name])
{
all_pgs[pg_name] = pg;
}
}
}
// Try to minimize data movement
async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
{
if (!osd_tree)
{
return null;
}
const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+ Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
const pg_count = prev_int_pgs.length;
const prev_weights = {};
const prev_pg_per_osd = {};
for (const pg of prev_int_pgs)
{
const pg_name = 'pg_'+pg.join('_');
prev_weights[pg_name] = (prev_weights[pg_name]||0) + 1;
for (let i = 0; i < pg.length; i++)
{
const osd = pg[i];
prev_pg_per_osd[osd] = prev_pg_per_osd[osd] || [];
prev_pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
}
}
// Get all combinations
let all_pgs = random_combinations(osd_tree, pg_size, max_combinations);
add_valid_previous(osd_tree, prev_weights, all_pgs);
all_pgs = Object.values(all_pgs);
const pg_per_osd = {};
for (const pg of all_pgs)
{
const pg_name = 'pg_'+pg.join('_');
for (let i = 0; i < pg.length; i++)
{
const osd = pg[i];
pg_per_osd[osd] = pg_per_osd[osd] || [];
pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
}
}
// Penalize PGs based on their similarity to old PGs
const move_weights = calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs);
// Calculate total weight - old PG weights
const all_pg_names = all_pgs.map(pg => 'pg_'+pg.join('_'));
const all_pgs_hash = all_pg_names.reduce((a, c) => { a[c] = true; return a; }, {});
const all_weights = Object.assign({}, ...Object.values(osd_tree));
const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
// Generate the LP problem
let lp = '';
lp += 'max: '+all_pg_names.map(pg_name => (
prev_weights[pg_name] ? `${pg_size+1}*add_${pg_name} - ${pg_size+1}*del_${pg_name}` : `${pg_size+1-move_weights[pg_name]}*${pg_name}`
)).join(' + ')+';\n';
for (const osd in pg_per_osd)
{
if (osd !== NO_OSD)
{
const osd_sum = (pg_per_osd[osd]||[]).map(([ pg_name, space ]) => (
prev_weights[pg_name] ? `${space} * add_${pg_name} - ${space} * del_${pg_name}` : `${space} * ${pg_name}`
)).join(' + ');
const rm_osd_pg_count = (prev_pg_per_osd[osd]||[])
.reduce((a, [ old_pg_name, space ]) => (a + (all_pgs_hash[old_pg_name] ? space : 0)), 0);
const osd_pg_count = all_weights[osd]*pg_effsize/total_weight*pg_count - rm_osd_pg_count;
lp += osd_sum + ' <= ' + osd_pg_count + ';\n';
}
}
let pg_vars = [];
for (const pg_name of all_pg_names)
{
if (prev_weights[pg_name])
{
pg_vars.push(`add_${pg_name}`, `del_${pg_name}`);
// Can't add or remove less than zero
lp += `add_${pg_name} >= 0;\n`;
lp += `del_${pg_name} >= 0;\n`;
// Can't remove more than the PG already has
lp += `add_${pg_name} - del_${pg_name} >= -${prev_weights[pg_name]};\n`;
}
else
{
pg_vars.push(pg_name);
lp += `${pg_name} >= 0;\n`;
}
}
lp += 'sec '+pg_vars.join(', ')+';\n';
// Solve it
const lp_result = await lp_solve(lp);
if (!lp_result)
{
console.log(lp);
throw new Error('Problem is infeasible or unbounded - is it a bug?');
}
// Generate the new distribution
const weights = { ...prev_weights };
for (const k in prev_weights)
{
if (!all_pgs_hash[k])
{
delete weights[k];
}
}
for (const k in lp_result.vars)
{
if (k.substr(0, 4) === 'add_')
{
weights[k.substr(4)] = (weights[k.substr(4)] || 0) + Number(lp_result.vars[k]);
}
else if (k.substr(0, 4) === 'del_')
{
weights[k.substr(4)] = (weights[k.substr(4)] || 0) - Number(lp_result.vars[k]);
}
else if (k.substr(0, 3) === 'pg_')
{
weights[k] = Number(lp_result.vars[k]);
}
}
for (const k in weights)
{
if (!weights[k])
{
delete weights[k];
}
}
const int_pgs = make_int_pgs(weights, pg_count);
// Align them with most similar previous PGs
const new_pgs = align_pgs(prev_int_pgs, int_pgs);
let differs = 0, osd_differs = 0;
for (let i = 0; i < pg_count; i++)
{
if (new_pgs[i].join('_') != prev_int_pgs[i].join('_'))
{
differs++;
}
for (let j = 0; j < pg_size; j++)
{
if (new_pgs[i][j] != prev_int_pgs[i][j])
{
osd_differs++;
}
}
}
return {
prev_pgs: prev_int_pgs,
score: lp_result.score,
weights,
int_pgs: new_pgs,
differs,
osd_differs,
space: pg_effsize * pg_list_space_efficiency(new_pgs, all_weights, pg_minsize, parity_space),
total_space: total_weight,
};
}
function print_change_stats(retval, detailed)
{
const new_pgs = retval.int_pgs;
const prev_int_pgs = retval.prev_pgs;
if (prev_int_pgs)
{
if (detailed)
{
for (let i = 0; i < new_pgs.length; i++)
{
if (new_pgs[i].join('_') != prev_int_pgs[i].join('_'))
{
console.log("pg "+i+": "+prev_int_pgs[i].join(' ')+" -> "+new_pgs[i].join(' '));
}
}
}
console.log(
"Data movement: "+retval.differs+" pgs, "+
retval.osd_differs+" pg*osds = "+Math.round(retval.osd_differs / prev_int_pgs.length / 3 * 10000)/100+" %"
);
}
console.log(
"Total space (raw): "+Math.round(retval.space*100)/100+" TB, space efficiency: "+
Math.round(retval.space/(retval.total_space||1)*10000)/100+" %"
);
}
function align_pgs(prev_int_pgs, int_pgs)
{
const aligned_pgs = [];
put_aligned_pgs(aligned_pgs, int_pgs, prev_int_pgs, (pg) => [ pg.join(':') ]);
put_aligned_pgs(aligned_pgs, int_pgs, prev_int_pgs, (pg) => [ pg[0]+'::'+pg[2], ':'+pg[1]+':'+pg[2], pg[0]+':'+pg[1]+':' ]);
put_aligned_pgs(aligned_pgs, int_pgs, prev_int_pgs, (pg) => [ pg[0]+'::', ':'+pg[1]+':', '::'+pg[2] ]);
const free_slots = prev_int_pgs.map((pg, i) => !aligned_pgs[i] ? i : null).filter(i => i != null);
for (const pg of int_pgs)
{
if (!free_slots.length)
{
throw new Error("Can't place unaligned PG");
}
aligned_pgs[free_slots.shift()] = pg;
}
return aligned_pgs;
}
function put_aligned_pgs(aligned_pgs, int_pgs, prev_int_pgs, keygen)
{
let prev_indexes = {};
for (let i = 0; i < prev_int_pgs.length; i++)
{
for (let k of keygen(prev_int_pgs[i]))
{
prev_indexes[k] = prev_indexes[k] || [];
prev_indexes[k].push(i);
}
}
PG: for (let i = int_pgs.length-1; i >= 0; i--)
{
let pg = int_pgs[i];
let keys = keygen(int_pgs[i]);
for (let k of keys)
{
while (prev_indexes[k] && prev_indexes[k].length)
{
let idx = prev_indexes[k].shift();
if (!aligned_pgs[idx])
{
aligned_pgs[idx] = pg;
int_pgs.splice(i, 1);
continue PG;
}
}
}
}
}
// Convert multi-level osd_tree = { level: number|string, id?: string, size?: number, children?: osd_tree }[]
// levels = { string: number }
// to a two-level osd_tree suitable for all_combinations()
function flatten_tree(osd_tree, levels, failure_domain_level, osd_level, domains = {}, i = { i: 1 })
{
osd_level = levels[osd_level] || osd_level;
failure_domain_level = levels[failure_domain_level] || failure_domain_level;
for (const node of osd_tree)
{
if ((levels[node.level] || node.level) < failure_domain_level)
{
flatten_tree(node.children||[], levels, failure_domain_level, osd_level, domains, i);
}
else
{
domains['dom'+(i.i++)] = extract_osds([ node ], levels, osd_level);
}
}
return domains;
}
function extract_osds(osd_tree, levels, osd_level, osds = {})
{
for (const node of osd_tree)
{
if ((levels[node.level] || node.level) >= osd_level)
{
osds[node.id] = node.size;
}
else
{
extract_osds(node.children||[], levels, osd_level, osds);
}
}
return osds;
}
function random_combinations(osd_tree, pg_size, count)
{
let seed = 0x5f020e43;
let rng = () =>
{
seed ^= seed << 13;
seed ^= seed >> 17;
seed ^= seed << 5;
return seed + 2147483648;
};
const hosts = Object.keys(osd_tree).sort();
const osds = Object.keys(osd_tree).reduce((a, c) => { a[c] = Object.keys(osd_tree[c]).sort(); return a; }, {});
const r = {};
// Generate random combinations including each OSD at least once
for (let h = 0; h < hosts.length; h++)
{
for (let o = 0; o < osds[hosts[h]].length; o++)
{
const pg = [ osds[hosts[h]][o] ];
const cur_hosts = [ ...hosts ];
cur_hosts.splice(h, 1);
for (let i = 1; i < pg_size && i < hosts.length; i++)
{
const next_host = rng() % cur_hosts.length;
const next_osd = rng() % osds[cur_hosts[next_host]].length;
pg.push(osds[cur_hosts[next_host]][next_osd]);
cur_hosts.splice(next_host, 1);
}
while (pg.length < pg_size)
{
pg.push(NO_OSD);
}
r['pg_'+pg.join('_')] = pg;
}
}
// Generate purely random combinations
restart: while (count > 0)
{
let host_idx = [];
for (let i = 0; i < pg_size && i < hosts.length; i++)
{
let start = i > 0 ? host_idx[i-1]+1 : 0;
if (start >= hosts.length)
{
continue restart;
}
host_idx[i] = start + rng() % (hosts.length-start);
}
let pg = host_idx.map(h => osds[hosts[h]][rng() % osds[hosts[h]].length]);
while (pg.length < pg_size)
{
pg.push(NO_OSD);
}
r['pg_'+pg.join('_')] = pg;
count--;
}
return r;
}
// Super-stupid algorithm. Given the current OSD tree, generate all possible OSD combinations
// osd_tree = { failure_domain1: { osd1: size1, ... }, ... }
// ordered = return combinations without duplicates having different order
function all_combinations(osd_tree, pg_size, ordered, count)
{
const hosts = Object.keys(osd_tree).sort();
const osds = Object.keys(osd_tree).reduce((a, c) => { a[c] = Object.keys(osd_tree[c]).sort(); return a; }, {});
while (hosts.length < pg_size)
{
osds[NO_OSD] = [ NO_OSD ];
hosts.push(NO_OSD);
}
let host_idx = [];
let osd_idx = [];
for (let i = 0; i < pg_size; i++)
{
host_idx.push(i);
osd_idx.push(0);
}
const r = [];
while (!count || count < 0 || r.length < count)
{
r.push(host_idx.map((hi, i) => osds[hosts[hi]][osd_idx[i]]));
let inc = pg_size-1;
while (inc >= 0)
{
osd_idx[inc]++;
if (osd_idx[inc] >= osds[hosts[host_idx[inc]]].length)
{
osd_idx[inc] = 0;
inc--;
}
else
{
break;
}
}
if (inc < 0)
{
// no osds left in the current host combination, select the next one
inc = pg_size-1;
same_again: while (inc >= 0)
{
host_idx[inc]++;
for (let prev_host = 0; prev_host < inc; prev_host++)
{
if (host_idx[prev_host] == host_idx[inc])
{
continue same_again;
}
}
if (host_idx[inc] < (ordered ? hosts.length-(pg_size-1-inc) : hosts.length))
{
while ((++inc) < pg_size)
{
host_idx[inc] = (ordered ? host_idx[inc-1]+1 : 0);
}
break;
}
else
{
inc--;
}
}
if (inc < 0)
{
break;
}
}
}
return r;
}
function pg_weights_space_efficiency(weights, pg_count, osd_sizes)
{
const per_osd = {};
for (const pg_name in weights)
{
for (const osd of pg_name.substr(3).split(/_/))
{
per_osd[osd] = (per_osd[osd]||0) + weights[pg_name];
}
}
return pg_per_osd_space_efficiency(per_osd, pg_count, osd_sizes);
}
function pg_list_space_efficiency(pgs, osd_sizes, pg_minsize, parity_space)
{
const per_osd = {};
for (const pg of pgs)
{
for (let i = 0; i < pg.length; i++)
{
const osd = pg[i];
per_osd[osd] = (per_osd[osd]||0) + (i >= pg_minsize ? (parity_space||1) : 1);
}
}
return pg_per_osd_space_efficiency(per_osd, pgs.length, osd_sizes);
}
function pg_per_osd_space_efficiency(per_osd, pg_count, osd_sizes)
{
// each PG gets randomly selected in 1/N cases
// & there are x PGs per OSD
// => an OSD is selected in x/N cases
// => total space * x/N <= OSD size
// => total space <= OSD size * N/x
let space;
for (let osd in per_osd)
{
if (osd in osd_sizes)
{
const space_estimate = osd_sizes[osd] * pg_count / per_osd[osd];
if (space == null || space > space_estimate)
{
space = space_estimate;
}
}
}
return space == null ? 0 : space;
}
module.exports = {
NO_OSD,
optimize_initial,
optimize_change,
print_change_stats,
pg_weights_space_efficiency,
pg_list_space_efficiency,
pg_per_osd_space_efficiency,
flatten_tree,
lp_solve,
make_int_pgs,
align_pgs,
random_combinations,
all_combinations,
};

136
mon/make-units.sh Normal file
View File

@ -0,0 +1,136 @@
#!/bin/bash
# Example startup script generator
# Of course this isn't a production solution yet, this is just for tests
# Copyright (c) Vitaliy Filippov, 2019+
# License: MIT
IP=`ip -json a s | jq -r '.[].addr_info[] | select(.broadcast == "10.115.0.255") | .local'`
[ "$IP" != "" ] || exit 1
useradd vitastor
chmod 755 /root
BASE=${IP/*./}
BASE=$(((BASE-10)*12))
cat >/etc/systemd/system/vitastor.target <<EOF
[Unit]
Description=vitastor target
[Install]
WantedBy=multi-user.target
EOF
i=1
for DEV in `ls /dev/disk/by-id/ | grep ata-INTEL_SSDSC2KB`; do
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2))
dd if=/dev/zero of=/dev/disk/by-id/$DEV bs=1048576 count=$(((427814912+1048575)/1048576+2)) seek=$((1920377991168/1048576))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$((BASE+i))
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--disable_device_lock 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 0 \\
--meta_offset 16777216 \\
--data_offset 427814912 \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
cat >/etc/systemd/system/vitastor-osd$((BASE+i)).service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$((BASE+i))
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/root/vitastor/osd \\
--etcd_address $IP:2379/v3 \\
--bind_address $IP \\
--osd_num $((BASE+i)) \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 8 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--journal_offset 1920377991168 \\
--meta_offset $((1920377991168+16777216)) \\
--data_offset $((1920377991168+427814912)) \\
--data_size $((1920377991168-427814912)) \\
--data_device /dev/disk/by-id/$DEV
WorkingDirectory=/root/vitastor
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-id/$DEV
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
StartLimitIntervalSec=0
RestartSec=10
[Install]
WantedBy=vitastor.target
EOF
systemctl enable vitastor-osd$((BASE+i))
i=$((i+1))
done
exit
node mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5
podman run -d --network host --restart always -v /var/lib/etcd0.etcd:/etcd0.etcd --name etcd quay.io/coreos/etcd:v3.4.13 etcd -name etcd0 \
-advertise-client-urls http://10.115.0.10:2379 -listen-client-urls http://10.115.0.10:2379 \
-initial-advertise-peer-urls http://10.115.0.10:2380 -listen-peer-urls http://10.115.0.10:2380 \
-initial-cluster-token vitastor-etcd-1 -initial-cluster etcd0=http://10.115.0.10:2380,etcd1=http://10.115.0.11:2380,etcd2=http://10.115.0.12:2380,etcd3=http://10.115.0.13:2380 \
-initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/global '{"immediate_commit":"all"}'
etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":48,"failure_domain":"host"}}'
#let pgs = {};
#for (let n = 0; n < 48; n++) { let i = n/2 | 0; pgs[1+n] = { osd_set: [ (1+i%12+(i/12 | 0)*24), (1+12+i%12+(i/12 | 0)*24) ], primary: (1+(n%2)*12+i%12+(i/12 | 0)*24) }; };
#console.log(JSON.stringify({ items: { 1: pgs } }));
#etcdctl --endpoints http://10.115.0.10:2379 put /vitastor/config/pgs ...
# --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
# --data_offset 427814912 \\
# --disk_alignment 4096 --journal_block_size 512 --meta_block_size 512 \\
# --data_offset 433434624 \\

25
mon/mon-main.js Normal file
View File

@ -0,0 +1,25 @@
#!/usr/bin/node
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
const Mon = require('./mon.js');
const options = {};
for (let i = 2; i < process.argv.length; i++)
{
if (process.argv[i].substr(0, 2) == '--')
{
options[process.argv[i].substr(2)] = process.argv[i+1];
i++;
}
}
if (!options.etcd_url)
{
console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' --etcd_url "http://127.0.0.1:2379,..." --etcd_prefix "/vitastor" --etcd_start_timeout 5 [--verbose 1]');
process.exit();
}
new Mon(options).start().catch(e => { console.error(e); process.exit(); });

1169
mon/mon.js Normal file

File diff suppressed because it is too large Load Diff

15
mon/package.json Normal file
View File

@ -0,0 +1,15 @@
{
"name": "vitastor-mon",
"version": "1.0.0",
"description": "Vitastor SDS monitor service",
"main": "mon-main.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"author": "Vitaliy Filippov",
"license": "UNLICENSED",
"dependencies": {
"sprintf-js": "^1.1.2",
"ws": "^7.2.5"
}
}

56
mon/simple-offsets.js Normal file
View File

@ -0,0 +1,56 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: MIT
// Simple tool to calculate journal and metadata offsets for a single device
// Will be replaced by smarter tools in the future
const child_process = require('child_process');
async function run()
{
const options = {
object_size: 128*1024,
bitmap_granularity: 4096,
journal_size: 16*1024*1024,
device_block_size: 4096,
journal_offset: 0,
device_size: 0,
};
for (let i = 2; i < process.argv.length; i++)
{
if (process.argv[i].substr(0, 2) == '--')
{
options[process.argv[i].substr(2)] = process.argv[i+1];
i++;
}
}
const device_size = Number(options.device_size || await system("blockdev --getsize64 "+options.device));
if (!device_size)
{
process.stderr.write('Failed to get device size\n');
process.exit(1);
}
options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size;
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size;
const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
: Math.round(meta_size/1024/1024*100)/100+" MB");
process.stdout.write(
`Metadata size: ${meta_size_fmt}\n`+
`Options for the OSD:\n`+
` --journal_offset ${options.journal_offset}\n`+
` --meta_offset ${meta_offset}\n`+
` --data_offset ${data_offset}\n`+
(options.device_size ? ` --data_size ${device_size-data_offset}\n` : '')
);
}
function system(cmd)
{
return new Promise((ok, no) => child_process.exec(cmd, { maxBuffer: 64*1024*1024 }, (err, stdout, stderr) => (err ? no(err) : ok(stdout))));
}
run().catch(console.error);

78
mon/stable-stringify.js Normal file
View File

@ -0,0 +1,78 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: MIT
function stableStringify(obj, opts)
{
if (!opts)
opts = {};
if (typeof opts === 'function')
opts = { cmp: opts };
let space = opts.space || '';
if (typeof space === 'number')
space = Array(space+1).join(' ');
const cycles = (typeof opts.cycles === 'boolean') ? opts.cycles : false;
const cmp = opts.cmp && (function (f)
{
return function (node)
{
return function (a, b)
{
let aobj = { key: a, value: node[a] };
let bobj = { key: b, value: node[b] };
return f(aobj, bobj);
};
};
})(opts.cmp);
const seen = new Map();
return (function stringify (parent, key, node, level)
{
const indent = space ? ('\n' + new Array(level + 1).join(space)) : '';
const colonSeparator = space ? ': ' : ':';
if (node === undefined)
{
return;
}
if (typeof node !== 'object' || node === null)
{
return JSON.stringify(node);
}
if (node instanceof Array)
{
const out = [];
for (let i = 0; i < node.length; i++)
{
const item = stringify(node, i, node[i], level+1) || JSON.stringify(null);
out.push(indent + space + item);
}
return '[' + out.join(',') + indent + ']';
}
else
{
if (seen.has(node))
{
if (cycles)
return JSON.stringify('__cycle__');
throw new TypeError('Converting circular structure to JSON');
}
else
seen.set(node, true);
const keys = Object.keys(node).sort(cmp && cmp(node));
const out = [];
for (let i = 0; i < keys.length; i++)
{
const key = keys[i];
const value = stringify(node, key, node[key], level+1);
if (!value)
continue;
const keyValue = JSON.stringify(key)
+ colonSeparator
+ value;
out.push(indent + space + keyValue);
}
seen.delete(node);
return '{' + out.join(',') + indent + '}';
}
})({ '': obj }, '', obj, 0);
}
module.exports = stableStringify;

130
mon/test-nonuniform.js Normal file
View File

@ -0,0 +1,130 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// Interesting real-world example coming from Ceph with EC and compression enabled.
// EC parity chunks can't be compressed as efficiently as data chunks,
// thus they occupy more space (2.26x more space) in OSD object stores.
// This leads to really uneven OSD fill ratio in Ceph even when PGs are perfectly balanced.
// But we support this case with the "parity_space" parameter in optimize_initial()/optimize_change().
const LPOptimizer = require('./lp-optimizer.js');
const osd_tree = {
ripper5: {
osd0: 3.493144989013672,
osd1: 3.493144989013672,
osd2: 3.454082489013672,
osd12: 3.461894989013672,
},
ripper7: {
osd4: 3.638690948486328,
osd5: 3.638690948486328,
osd6: 3.638690948486328,
},
ripper4: {
osd9: 3.4609375,
osd10: 3.4609375,
osd11: 3.4609375,
},
ripper6: {
osd3: 3.5849609375,
osd7: 3.5859336853027344,
osd8: 3.638690948486328,
osd13: 3.461894989013672
},
};
const prev_pgs = [[12,7,5],[6,11,12],[3,6,9],[10,0,5],[2,5,13],[9,8,6],[3,4,12],[7,4,12],[12,11,13],[13,6,0],[4,13,10],[9,7,6],[7,10,0],[10,8,0],[3,10,2],[3,0,4],[6,13,0],[13,10,0],[13,10,5],[8,11,6],[3,9,2],[2,8,5],[8,9,5],[3,12,11],[0,7,4],[13,11,1],[11,3,12],[12,8,10],[7,5,12],[2,13,5],[7,11,0],[13,2,6],[0,6,8],[13,1,6],[0,13,4],[0,8,10],[4,10,0],[8,12,4],[8,12,9],[12,7,4],[13,9,5],[3,2,11],[1,9,7],[1,8,5],[5,12,9],[3,5,12],[2,8,10],[0,8,4],[1,4,11],[7,10,2],[12,13,5],[3,1,11],[7,1,4],[4,12,8],[7,0,9],[11,1,8],[3,0,5],[11,13,0],[1,13,5],[12,7,10],[12,8,4],[11,13,5],[0,11,6],[2,11,3],[13,1,11],[2,7,10],[7,10,12],[7,12,10],[12,11,5],[13,12,10],[2,3,9],[4,3,9],[13,2,5],[7,12,6],[12,10,13],[9,8,1],[13,1,5],[9,5,12],[5,11,7],[6,2,9],[8,11,6],[12,5,8],[6,13,1],[7,6,11],[2,3,6],[8,5,9],[1,13,6],[9,3,2],[7,11,1],[3,10,1],[0,11,7],[3,0,5],[1,3,6],[6,0,9],[3,11,4],[8,10,2],[13,1,9],[12,6,9],[3,12,9],[12,8,9],[7,5,0],[8,12,5],[0,11,3],[12,11,13],[0,7,11],[0,3,10],[1,3,11],[2,7,11],[13,2,6],[9,12,13],[8,2,4],[0,7,4],[5,13,0],[13,12,9],[1,9,8],[0,10,3],[3,5,10],[7,12,9],[2,13,4],[12,7,5],[9,2,7],[3,2,9],[6,2,7],[3,1,9],[4,3,2],[5,3,11],[0,7,6],[1,6,13],[7,10,2],[12,4,8],[13,12,6],[7,5,11],[6,2,3],[2,7,6],[2,3,10],[2,7,10],[11,12,6],[0,13,5],[10,2,4],[13,0,11],[7,0,6],[8,9,4],[8,4,11],[7,11,2],[3,4,2],[6,1,3],[7,2,11],[8,9,4],[11,4,8],[10,3,1],[2,10,13],[1,7,11],[13,11,12],[2,6,9],[10,0,13],[7,10,4],[0,11,13],[13,10,1],[7,5,0],[7,12,10],[3,1,4],[7,1,5],[3,11,5],[7,5,0],[1,3,5],[10,5,12],[0,3,9],[7,1,11],[11,8,12],[3,6,2],[7,12,9],[7,11,12],[4,11,3],[0,11,13],[13,2,5],[1,5,8],[0,11,8],[3,5,1],[11,0,6],[3,11,2],[11,8,12],[4,1,3],[10,13,4],[13,9,6],[2,3,10],[12,7,9],[10,0,4],[10,13,2],[3,11,1],[7,2,9],[1,7,4],[13,1,4],[7,0,6],[5,3,9],[10,0,7],[0,7,10],[3,6,10],[13,0,5],[8,4,1],[3,1,10],[2,10,13],[13,0,5],[13,10,2],[12,7,9],[6,8,10],[6,1,8],[10,8,1],[13,5,0],[5,11,3],[7,6,1],[8,5,9],[2,13,11],[10,12,4],[13,4,1],[2,13,4],[11,7,0],[2,9,7],[1,7,6],[8,0,4],[8,1,9],[7,10,12],[13,9,6],[7,6,11],[13,0,4],[1,8,4],[3,12,5],[10,3,1],[10,2,13],[2,4,8],[6,2,3],[3,0,10],[6,7,12],[8,12,5],[3,0,6],[13,12,10],[11,3,6],[9,0,13],[10,0,6],[7,5,2],[1,3,11],[7,10,2],[2,9,8],[11,13,12],[0,8,4],[8,12,11],[6,0,3],[1,13,4],[11,8,2],[12,3,6],[4,7,1],[7,6,12],[3,10,6],[0,10,7],[8,9,1],[0,10,6],[8,10,1]]
.map(pg => pg.map(n => 'osd'+n));
const by_osd = {};
for (let i = 0; i < prev_pgs.length; i++)
{
for (let j = 0; j < prev_pgs[i].length; j++)
{
by_osd[prev_pgs[i][j]] = by_osd[prev_pgs[i][j]] || [];
by_osd[prev_pgs[i][j]][j] = (by_osd[prev_pgs[i][j]][j] || 0) + 1;
}
}
/*
This set of PGs was balanced by hand, by heavily tuning OSD weights in Ceph:
{
osd0: 4.2,
osd1: 3.5,
osd2: 3.45409,
osd3: 4.5,
osd4: 1.4,
osd5: 1.4,
osd6: 1.75,
osd7: 4.5,
osd8: 4.4,
osd9: 2.2,
osd10: 2.7,
osd11: 2,
osd12: 3.4,
osd13: 3.4,
}
EC+compression is a nightmare in Ceph, yeah :))
To calculate the average ratio between data chunks and parity chunks we
calculate the number of PG chunks for each chunk role for each OSD:
{
osd12: [ 18, 22, 17 ],
osd7: [ 35, 22, 8 ],
osd5: [ 6, 17, 27 ],
osd6: [ 13, 12, 28 ],
osd11: [ 13, 26, 20 ],
osd3: [ 30, 20, 10 ],
osd9: [ 8, 12, 26 ],
osd10: [ 15, 23, 20 ],
osd0: [ 22, 22, 14 ],
osd2: [ 22, 16, 16 ],
osd13: [ 29, 19, 13 ],
osd8: [ 20, 18, 12 ],
osd4: [ 8, 10, 28 ],
osd1: [ 17, 17, 17 ]
}
And now we can pick a pair of OSDs and determine the ratio by solving the following:
osd5 = 23*X + 27*Y = 3249728140
osd13 = 48*X + 13*Y = 2991675992
=>
osd5 - 27/13*osd13 = 23*X - 27/13*48*X = -76.6923076923077*X = -2963752766.46154
=>
X = 38644720.1243731
Y = (osd5-23*X)/27 = 87440725.0792377
Y/X = 2.26268232239284 ~= 2.26
Which means that parity chunks are compressed ~2.26 times worse than data chunks.
Fine, let's try to optimize for it.
*/
async function run()
{
const all_weights = Object.assign({}, ...Object.values(osd_tree));
const total_weight = Object.values(all_weights).reduce((a, c) => Number(a) + Number(c), 0);
const eff = LPOptimizer.pg_list_space_efficiency(prev_pgs, all_weights, 2, 2.26);
const orig = eff*4.26 / total_weight;
console.log('Original efficiency was: '+Math.round(orig*10000)/100+' %');
let prev = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 256, parity_space: 2.26 });
LPOptimizer.print_change_stats(prev);
let next = await LPOptimizer.optimize_change({ prev_pgs, osd_tree, pg_size: 3, max_combinations: 10000, parity_space: 2.26 });
LPOptimizer.print_change_stats(next);
}
run().catch(console.error);

View File

@ -0,0 +1,74 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js');
const crush_tree = [
{ level: 1, children: [
{ level: 2, children: [
{ level: 3, id: 1, size: 3 },
{ level: 3, id: 2, size: 3 },
] },
{ level: 2, children: [
{ level: 3, id: 3, size: 3 },
{ level: 3, id: 4, size: 3 },
] },
] },
{ level: 1, children: [
{ level: 2, children: [
{ level: 3, id: 5, size: 3 },
{ level: 3, id: 6, size: 3 },
] },
{ level: 2, children: [
{ level: 3, id: 7, size: 3 },
{ level: 3, id: 8, size: 3 },
] },
] },
{ level: 1, children: [
{ level: 2, children: [
{ level: 3, id: 9, size: 3 },
{ level: 3, id: 10, size: 3 },
] },
{ level: 2, children: [
{ level: 3, id: 11, size: 3 },
{ level: 3, id: 12, size: 3 },
] },
] },
];
const osd_tree = LPOptimizer.flatten_tree(crush_tree, {}, 1, 3);
console.log(osd_tree);
async function run()
{
const cur_tree = {};
console.log('Empty tree:');
let res = await LPOptimizer.optimize_initial({ osd_tree: cur_tree, pg_size: 3, pg_count: 256 });
LPOptimizer.print_change_stats(res, false);
console.log('\nAdding 1st failure domain:');
cur_tree['dom1'] = osd_tree['dom1'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
console.log('\nAdding 2nd failure domain:');
cur_tree['dom2'] = osd_tree['dom2'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
console.log('\nAdding 3rd failure domain:');
cur_tree['dom3'] = osd_tree['dom3'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemoving 3rd failure domain:');
delete cur_tree['dom3'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemoving 2nd failure domain:');
delete cur_tree['dom2'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemoving 1st failure domain:');
delete cur_tree['dom1'];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
}
run().catch(console.error);

115
mon/test-optimize.js Normal file
View File

@ -0,0 +1,115 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
const LPOptimizer = require('./lp-optimizer.js');
const osd_tree = {
100: {
7: 3.63869,
},
300: {
10: 3.46089,
11: 3.46089,
12: 3.46089,
},
400: {
1: 3.49309,
2: 3.49309,
3: 3.49309,
},
500: {
4: 3.58498,
// 8: 3.58589,
9: 3.63869,
},
600: {
5: 3.63869,
6: 3.63869,
},
/* 100: {
1: 2.72800,
},
200: {
2: 2.72900,
},
300: {
3: 1.87000,
},
400: {
4: 1.87000,
},
500: {
5: 3.63869,
},*/
};
const crush_tree = [
{ level: 1, children: [
{ level: 2, children: [
{ level: 3, id: 1, size: 3 },
{ level: 3, id: 2, size: 2 },
] },
{ level: 2, children: [
{ level: 3, id: 3, size: 4 },
{ level: 3, id: 4, size: 4 },
] },
] },
{ level: 1, children: [
{ level: 2, children: [
{ level: 3, id: 5, size: 4 },
{ level: 3, id: 6, size: 1 },
] },
{ level: 2, children: [
{ level: 3, id: 7, size: 3 },
{ level: 3, id: 8, size: 5 },
] },
] },
{ level: 1, children: [
{ level: 2, children: [
{ level: 3, id: 9, size: 5 },
{ level: 3, id: 10, size: 2 },
] },
{ level: 2, children: [
{ level: 3, id: 11, size: 3 },
{ level: 3, id: 12, size: 3 },
] },
] },
];
async function run()
{
let res;
// Test: add 1 OSD of almost the same size. Ideal data movement could be 1/12 = 8.33%. Actual is ~13%
// Space efficiency is ~99% in all cases.
console.log('256 PGs, size=2');
res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 2, pg_count: 256 });
LPOptimizer.print_change_stats(res, false);
console.log('\nAdding osd.8');
osd_tree[500][8] = 3.58589;
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemoving osd.8');
delete osd_tree[500][8];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
LPOptimizer.print_change_stats(res, false);
console.log('\n256 PGs, size=3');
res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 256 });
LPOptimizer.print_change_stats(res, false);
console.log('\nAdding osd.8');
osd_tree[500][8] = 3.58589;
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
console.log('\nRemoving osd.8');
delete osd_tree[500][8];
res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 3 });
LPOptimizer.print_change_stats(res, false);
console.log('\n256 PGs, size=3, failure domain=rack');
res = await LPOptimizer.optimize_initial({ osd_tree: LPOptimizer.flatten_tree(crush_tree, {}, 1, 3), pg_size: 3, pg_count: 256 });
LPOptimizer.print_change_stats(res, false);
}
run().catch(console.error);

307
msgr_receive.cpp Normal file
View File

@ -0,0 +1,307 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "messenger.h"
void osd_messenger_t::read_requests()
{
for (int i = 0; i < read_ready_clients.size(); i++)
{
int peer_fd = read_ready_clients[i];
auto & cl = clients[peer_fd];
if (cl.read_remaining < receive_buffer_size)
{
cl.read_iov.iov_base = cl.in_buf;
cl.read_iov.iov_len = receive_buffer_size;
cl.read_msg.msg_iov = &cl.read_iov;
cl.read_msg.msg_iovlen = 1;
}
else
{
cl.read_iov.iov_base = 0;
cl.read_iov.iov_len = cl.read_remaining;
cl.read_msg.msg_iov = cl.recv_list.get_iovec();
cl.read_msg.msg_iovlen = cl.recv_list.get_size();
}
if (ringloop && !use_sync_send_recv)
{
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + i);
return;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, peer_fd](ring_data_t *data) { handle_read(data->res, peer_fd); };
my_uring_prep_recvmsg(sqe, peer_fd, &cl.read_msg, 0);
}
else
{
int result = recvmsg(peer_fd, &cl.read_msg, 0);
if (result < 0)
{
result = -errno;
}
handle_read(result, peer_fd);
}
}
read_ready_clients.clear();
}
bool osd_messenger_t::handle_read(int result, int peer_fd)
{
bool ret = false;
auto cl_it = clients.find(peer_fd);
if (cl_it != clients.end())
{
auto & cl = cl_it->second;
if (result <= 0 && result != -EAGAIN)
{
// this is a client socket, so don't panic on error. just disconnect it
if (result != 0)
{
printf("Client %d socket read error: %d (%s). Disconnecting client\n", peer_fd, -result, strerror(-result));
}
stop_client(peer_fd);
return false;
}
if (result == -EAGAIN || result < cl.read_iov.iov_len)
{
cl.read_ready--;
if (cl.read_ready > 0)
read_ready_clients.push_back(peer_fd);
}
else
{
read_ready_clients.push_back(peer_fd);
}
if (result > 0)
{
if (cl.read_iov.iov_base == cl.in_buf)
{
// Compose operation(s) from the buffer
int remain = result;
void *curbuf = cl.in_buf;
while (remain > 0)
{
if (!cl.read_op)
{
cl.read_op = new osd_op_t;
cl.read_op->peer_fd = peer_fd;
cl.read_op->op_type = OSD_OP_IN;
cl.recv_list.push_back(cl.read_op->req.buf, OSD_PACKET_SIZE);
cl.read_remaining = OSD_PACKET_SIZE;
cl.read_state = CL_READ_HDR;
}
while (cl.recv_list.done < cl.recv_list.count && remain > 0)
{
iovec* cur = cl.recv_list.get_iovec();
if (cur->iov_len > remain)
{
memcpy(cur->iov_base, curbuf, remain);
cl.read_remaining -= remain;
cur->iov_len -= remain;
cur->iov_base += remain;
remain = 0;
}
else
{
memcpy(cur->iov_base, curbuf, cur->iov_len);
curbuf += cur->iov_len;
cl.read_remaining -= cur->iov_len;
remain -= cur->iov_len;
cur->iov_len = 0;
cl.recv_list.done++;
}
}
if (cl.recv_list.done >= cl.recv_list.count)
{
if (!handle_finished_read(cl))
{
goto fin;
}
}
}
}
else
{
// Long data
cl.read_remaining -= result;
cl.recv_list.eat(result);
if (cl.recv_list.done >= cl.recv_list.count)
{
handle_finished_read(cl);
}
}
if (result >= cl.read_iov.iov_len)
{
ret = true;
}
}
}
fin:
for (auto cb: set_immediate)
{
cb();
}
set_immediate.clear();
return ret;
}
bool osd_messenger_t::handle_finished_read(osd_client_t & cl)
{
cl.recv_list.reset();
if (cl.read_state == CL_READ_HDR)
{
if (cl.read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
return handle_reply_hdr(&cl);
else
handle_op_hdr(&cl);
}
else if (cl.read_state == CL_READ_DATA)
{
// Operation is ready
cl.received_ops.push_back(cl.read_op);
set_immediate.push_back([this, op = cl.read_op]() { exec_op(op); });
cl.read_op = NULL;
cl.read_state = 0;
}
else if (cl.read_state == CL_READ_REPLY_DATA)
{
// Reply is ready
handle_reply_ready(cl.read_op);
cl.read_op = NULL;
cl.read_state = 0;
}
else
{
assert(0);
}
return true;
}
void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
{
osd_op_t *cur_op = cl->read_op;
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ)
{
if (cur_op->req.sec_rw.len > 0)
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
cl->read_remaining = 0;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
if (cur_op->req.sec_rw.len > 0)
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
cl->read_remaining = cur_op->req.sec_rw.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
{
if (cur_op->req.sec_stab.len > 0)
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_stab.len);
cl->read_remaining = cur_op->req.sec_stab.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_READ)
{
cl->read_remaining = 0;
}
else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
{
if (cur_op->req.rw.len > 0)
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.rw.len);
cl->read_remaining = cur_op->req.rw.len;
}
if (cl->read_remaining > 0)
{
// Read data
cl->recv_list.push_back(cur_op->buf, cl->read_remaining);
cl->read_state = CL_READ_DATA;
}
else
{
// Operation is ready
cl->received_ops.push_back(cur_op);
set_immediate.push_back([this, cur_op]() { exec_op(cur_op); });
cl->read_op = NULL;
cl->read_state = 0;
}
}
bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
{
auto req_it = cl->sent_ops.find(cl->read_op->req.hdr.id);
if (req_it == cl->sent_ops.end())
{
// Command out of sync. Drop connection
printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id);
stop_client(cl->peer_fd);
return false;
}
osd_op_t *op = req_it->second;
memcpy(op->reply.buf, cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->sent_ops.erase(req_it);
if ((op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ) &&
op->reply.hdr.retval > 0)
{
// Read data. In this case we assume that the buffer is preallocated by the caller (!)
assert(op->iov.count > 0);
cl->recv_list.append(op->iov);
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
}
else if (op->reply.hdr.opcode == OSD_OP_SEC_LIST && op->reply.hdr.retval > 0)
{
assert(!op->iov.count);
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = sizeof(obj_ver_id) * op->reply.hdr.retval;
op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
cl->recv_list.push_back(op->buf, cl->read_remaining);
}
else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
{
assert(!op->iov.count);
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
op->buf = malloc_or_die(op->reply.hdr.retval);
cl->recv_list.push_back(op->buf, op->reply.hdr.retval);
}
else
{
// It's fine to reuse cl->read_op for the next reply
handle_reply_ready(op);
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->read_remaining = OSD_PACKET_SIZE;
cl->read_state = CL_READ_HDR;
}
return true;
}
void osd_messenger_t::handle_reply_ready(osd_op_t *op)
{
// Measure subop latency
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
stats.subop_stat_count[op->req.hdr.opcode]++;
if (!stats.subop_stat_count[op->req.hdr.opcode])
{
stats.subop_stat_count[op->req.hdr.opcode]++;
stats.subop_stat_sum[op->req.hdr.opcode] = 0;
}
stats.subop_stat_sum[op->req.hdr.opcode] += (
(tv_end.tv_sec - op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - op->tv_begin.tv_nsec)/1000
);
set_immediate.push_back([this, op]()
{
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
});
}

186
msgr_send.cpp Normal file
View File

@ -0,0 +1,186 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "messenger.h"
void osd_messenger_t::outbox_push(osd_op_t *cur_op)
{
assert(cur_op->peer_fd);
auto & cl = clients.at(cur_op->peer_fd);
if (cur_op->op_type == OSD_OP_OUT)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
}
else
{
// Check that operation actually belongs to this client
bool found = false;
for (auto it = cl.received_ops.begin(); it != cl.received_ops.end(); it++)
{
if (*it == cur_op)
{
found = true;
cl.received_ops.erase(it, it+1);
break;
}
}
if (!found)
{
delete cur_op;
return;
}
}
cl.outbox.push_back(cur_op);
if (!ringloop)
{
while (cl.write_op || cl.outbox.size())
{
try_send(cl);
}
}
else if (cl.write_op || cl.outbox.size() > 1 || !try_send(cl))
{
if (cl.write_state == 0)
{
cl.write_state = CL_WRITE_READY;
write_ready_clients.push_back(cur_op->peer_fd);
}
ringloop->wakeup();
}
}
bool osd_messenger_t::try_send(osd_client_t & cl)
{
int peer_fd = cl.peer_fd;
if (!cl.write_op)
{
// pick next command
cl.write_op = cl.outbox.front();
cl.outbox.pop_front();
cl.write_state = CL_WRITE_REPLY;
if (cl.write_op->op_type == OSD_OP_IN)
{
// Measure execution latency
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
stats.op_stat_count[cl.write_op->req.hdr.opcode]++;
if (!stats.op_stat_count[cl.write_op->req.hdr.opcode])
{
stats.op_stat_count[cl.write_op->req.hdr.opcode]++;
stats.op_stat_sum[cl.write_op->req.hdr.opcode] = 0;
stats.op_stat_bytes[cl.write_op->req.hdr.opcode] = 0;
}
stats.op_stat_sum[cl.write_op->req.hdr.opcode] += (
(tv_end.tv_sec - cl.write_op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - cl.write_op->tv_begin.tv_nsec)/1000
);
if (cl.write_op->req.hdr.opcode == OSD_OP_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_WRITE)
{
stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.rw.len;
}
else if (cl.write_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
stats.op_stat_bytes[cl.write_op->req.hdr.opcode] += cl.write_op->req.sec_rw.len;
}
cl.send_list.push_back(cl.write_op->reply.buf, OSD_PACKET_SIZE);
if (cl.write_op->req.hdr.opcode == OSD_OP_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_LIST ||
cl.write_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
{
cl.send_list.append(cl.write_op->iov);
}
}
else
{
cl.send_list.push_back(cl.write_op->req.buf, OSD_PACKET_SIZE);
if (cl.write_op->req.hdr.opcode == OSD_OP_WRITE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cl.write_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
{
cl.send_list.append(cl.write_op->iov);
}
}
}
cl.write_msg.msg_iov = cl.send_list.get_iovec();
cl.write_msg.msg_iovlen = cl.send_list.get_size();
if (ringloop && !use_sync_send_recv)
{
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
return false;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, peer_fd](ring_data_t *data) { handle_send(data->res, peer_fd); };
my_uring_prep_sendmsg(sqe, peer_fd, &cl.write_msg, 0);
}
else
{
int result = sendmsg(peer_fd, &cl.write_msg, MSG_NOSIGNAL);
if (result < 0)
{
result = -errno;
}
handle_send(result, peer_fd);
}
return true;
}
void osd_messenger_t::send_replies()
{
for (int i = 0; i < write_ready_clients.size(); i++)
{
int peer_fd = write_ready_clients[i];
if (!try_send(clients[peer_fd]))
{
write_ready_clients.erase(write_ready_clients.begin(), write_ready_clients.begin() + i);
return;
}
}
write_ready_clients.clear();
}
void osd_messenger_t::handle_send(int result, int peer_fd)
{
auto cl_it = clients.find(peer_fd);
if (cl_it != clients.end())
{
auto & cl = cl_it->second;
if (result < 0 && result != -EAGAIN)
{
// this is a client socket, so don't panic. just disconnect it
printf("Client %d socket write error: %d (%s). Disconnecting client\n", peer_fd, -result, strerror(-result));
stop_client(peer_fd);
return;
}
if (result >= 0)
{
cl.send_list.eat(result);
if (cl.send_list.done >= cl.send_list.count)
{
// Done
cl.send_list.reset();
if (cl.write_op->op_type == OSD_OP_IN)
{
delete cl.write_op;
}
else
{
cl.sent_ops[cl.write_op->req.hdr.id] = cl.write_op;
}
cl.write_op = NULL;
cl.write_state = cl.outbox.size() > 0 ? CL_WRITE_READY : 0;
}
}
if (cl.write_state != 0)
{
write_ready_clients.push_back(peer_fd);
}
}
}

673
nbd_proxy.cpp Normal file
View File

@ -0,0 +1,673 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
// Similar to qemu-nbd, but sets timeout and uses io_uring
#include <linux/nbd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <sys/un.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include "epoll_manager.h"
#include "cluster_client.h"
const char *exe_name = NULL;
class nbd_proxy
{
protected:
uint64_t inode = 0;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
ring_consumer_t consumer;
std::vector<iovec> send_list, next_send_list;
std::vector<void*> to_free;
int nbd_fd = -1;
void *recv_buf = NULL;
int receive_buffer_size = 9000;
nbd_request cur_req;
cluster_op_t *cur_op = NULL;
void *cur_buf = NULL;
int cur_left = 0;
int read_state = 0;
int read_ready = 0;
msghdr read_msg = { 0 }, send_msg = { 0 };
iovec read_iov = { 0 };
public:
static json11::Json::object parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
int pos = 0;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
help();
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
else if (pos == 0)
{
cfg["command"] = args[i];
pos++;
}
else if (pos == 1 && (cfg["command"] == "map" || cfg["command"] == "unmap"))
{
int n = 0;
if (sscanf(args[i], "/dev/nbd%d", &n) > 0)
cfg["dev_num"] = n;
else
cfg["dev_num"] = args[i];
pos++;
}
}
return cfg;
}
void exec(json11::Json cfg)
{
if (cfg["command"] == "map")
{
start(cfg);
}
else if (cfg["command"] == "unmap")
{
if (cfg["dev_num"].is_null())
{
fprintf(stderr, "device name or number is missing\n");
exit(1);
}
unmap(cfg["dev_num"].uint64_value());
}
else if (cfg["command"] == "list" || cfg["command"] == "list-mapped")
{
auto mapped = list_mapped();
print_mapped(mapped, !cfg["json"].is_null());
}
else
{
help();
}
}
static void help()
{
printf(
"Vitastor NBD proxy\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.0 or GNU GPL 2.0+)\n\n"
"USAGE:\n"
" %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n"
" %s unmap /dev/nbd0\n"
" %s list [--json]\n",
exe_name, exe_name, exe_name
);
exit(0);
}
void unmap(int dev_num)
{
char path[64] = { 0 };
sprintf(path, "/dev/nbd%d", dev_num);
int r, nbd = open(path, O_RDWR);
if (nbd < 0)
{
perror("open");
exit(1);
}
r = ioctl(nbd, NBD_DISCONNECT);
if (r < 0)
{
perror("NBD_DISCONNECT");
exit(1);
}
close(nbd);
}
void start(json11::Json cfg)
{
// Check options
if (cfg["etcd_address"].string_value() == "")
{
fprintf(stderr, "etcd_address is missing\n");
exit(1);
}
if (!cfg["size"].uint64_value())
{
fprintf(stderr, "device size is missing\n");
exit(1);
}
inode = cfg["inode"].uint64_value();
uint64_t pool = cfg["pool"].uint64_value();
if (pool)
{
inode = (inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}
if (!(inode >> (64-POOL_ID_BITS)))
{
fprintf(stderr, "pool is missing\n");
exit(1);
}
// Initialize NBD
int sockfd[2];
if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockfd) < 0)
{
perror("socketpair");
exit(1);
}
fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
nbd_fd = sockfd[0];
load_module();
if (!cfg["dev_num"].is_null())
{
if (run_nbd(sockfd, cfg["dev_num"].int64_value(), cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30) < 0)
{
perror("run_nbd");
exit(1);
}
}
else
{
// Find an unused device
int i = 0;
while (true)
{
int r = run_nbd(sockfd, i, cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30);
if (r == 0)
{
printf("/dev/nbd%d\n", i);
break;
}
else if (r == -1 && errno == ENOENT)
{
fprintf(stderr, "No free NBD devices found\n");
exit(1);
}
else if (r == -2 && errno == EBUSY)
{
i++;
}
else
{
printf("%d %d\n", r, errno);
perror("run_nbd");
exit(1);
}
}
}
if (cfg["foreground"].is_null())
{
daemonize();
}
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
// Initialize read state
read_state = CL_READ_HDR;
recv_buf = malloc_or_die(receive_buffer_size);
cur_buf = &cur_req;
cur_left = sizeof(nbd_request);
consumer.loop = [this]()
{
submit_read();
submit_send();
ringloop->submit();
};
ringloop->register_consumer(&consumer);
// Add FD to epoll
epmgr->tfd->set_fd_handler(sockfd[0], false, [this](int peer_fd, int epoll_events)
{
read_ready++;
submit_read();
});
while (1)
{
ringloop->loop();
ringloop->wait();
}
}
void load_module()
{
if (access("/sys/module/nbd", F_OK))
{
return;
}
int r;
if ((r = system("modprobe nbd")) != 0)
{
if (r < 0)
perror("Failed to load NBD kernel module");
else
fprintf(stderr, "Failed to load NBD kernel module\n");
exit(1);
}
}
void daemonize()
{
if (fork())
exit(0);
setsid();
if (fork())
exit(0);
chdir("/");
close(0);
close(1);
close(2);
open("/dev/null", O_RDONLY);
open("/dev/null", O_WRONLY);
open("/dev/null", O_WRONLY);
}
json11::Json::object list_mapped()
{
const char *self_filename = exe_name;
for (int i = 0; exe_name[i] != 0; i++)
{
if (exe_name[i] == '/')
self_filename = exe_name+i+1;
}
char path[64] = { 0 };
json11::Json::object mapped;
int dev_num = -1;
int pid;
while (true)
{
dev_num++;
sprintf(path, "/sys/block/nbd%d", dev_num);
if (access(path, F_OK) != 0)
break;
sprintf(path, "/sys/block/nbd%d/pid", dev_num);
std::string pid_str = read_file(path);
if (pid_str == "")
continue;
if (sscanf(pid_str.c_str(), "%d", &pid) < 1)
{
printf("Failed to read pid from /sys/block/nbd%d/pid\n", dev_num);
continue;
}
sprintf(path, "/proc/%d/cmdline", pid);
std::string cmdline = read_file(path);
std::vector<const char*> argv;
int last = 0;
for (int i = 0; i < cmdline.size(); i++)
{
if (cmdline[i] == 0)
{
argv.push_back(cmdline.c_str()+last);
last = i+1;
}
}
if (argv.size() > 0)
{
const char *pid_filename = argv[0];
for (int i = 0; argv[0][i] != 0; i++)
{
if (argv[0][i] == '/')
pid_filename = argv[0]+i+1;
}
if (!strcmp(pid_filename, self_filename))
{
json11::Json::object cfg = nbd_proxy::parse_args(argv.size(), argv.data());
if (cfg["command"] == "map")
{
cfg.erase("command");
cfg["pid"] = pid;
mapped["/dev/nbd"+std::to_string(dev_num)] = cfg;
}
}
}
}
return mapped;
}
void print_mapped(json11::Json mapped, bool json)
{
if (json)
{
printf("%s\n", mapped.dump().c_str());
}
else
{
for (auto & dev: mapped.object_items())
{
printf("%s\n", dev.first.c_str());
for (auto & k: dev.second.object_items())
{
printf("%s: %s\n", k.first.c_str(), k.second.string_value().c_str());
}
printf("\n");
}
}
}
std::string read_file(char *path)
{
int fd = open(path, O_RDONLY);
if (fd < 0)
{
if (errno == ENOENT)
return "";
auto err = "open "+std::string(path);
perror(err.c_str());
exit(1);
}
std::string r;
while (true)
{
int l = r.size();
r.resize(l + 1024);
int rd = read(fd, (void*)(r.c_str() + l), 1024);
if (rd <= 0)
{
r.resize(l);
break;
}
r.resize(l + rd);
}
close(fd);
return r;
}
protected:
int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout)
{
// Check handle size
assert(sizeof(cur_req.handle) == 8);
char path[64] = { 0 };
sprintf(path, "/dev/nbd%d", dev_num);
int r, nbd = open(path, O_RDWR), qd_fd;
if (nbd < 0)
{
return -1;
}
r = ioctl(nbd, NBD_SET_SOCK, sockfd[1]);
if (r < 0)
{
goto end_close;
}
r = ioctl(nbd, NBD_SET_BLKSIZE, 4096);
if (r < 0)
{
goto end_unmap;
}
r = ioctl(nbd, NBD_SET_SIZE, size);
if (r < 0)
{
goto end_unmap;
}
ioctl(nbd, NBD_SET_FLAGS, flags);
if (timeout >= 0)
{
r = ioctl(nbd, NBD_SET_TIMEOUT, (unsigned long)timeout);
if (r < 0)
{
goto end_unmap;
}
}
// Configure request size
sprintf(path, "/sys/block/nbd%d/queue/max_sectors_kb", dev_num);
qd_fd = open(path, O_WRONLY);
if (qd_fd < 0)
{
goto end_unmap;
}
write(qd_fd, "32768", 5);
close(qd_fd);
if (!fork())
{
// Run in child
close(sockfd[0]);
r = ioctl(nbd, NBD_DO_IT);
if (r < 0)
{
fprintf(stderr, "NBD device terminated with error: %s\n", strerror(errno));
kill(getppid(), SIGTERM);
}
close(sockfd[1]);
ioctl(nbd, NBD_CLEAR_QUE);
ioctl(nbd, NBD_CLEAR_SOCK);
exit(0);
}
close(sockfd[1]);
close(nbd);
return 0;
end_close:
r = errno;
close(nbd);
errno = r;
return -2;
end_unmap:
r = errno;
ioctl(nbd, NBD_CLEAR_SOCK);
close(nbd);
errno = r;
return -3;
}
void submit_send()
{
if (!send_list.size() || send_msg.msg_iovlen > 0)
{
return;
}
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
return;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this](ring_data_t *data) { handle_send(data->res); };
send_msg.msg_iov = send_list.data();
send_msg.msg_iovlen = send_list.size();
my_uring_prep_sendmsg(sqe, nbd_fd, &send_msg, MSG_ZEROCOPY);
}
void handle_send(int result)
{
send_msg.msg_iovlen = 0;
if (result < 0 && result != -EAGAIN)
{
fprintf(stderr, "Socket disconnected: %s\n", strerror(-result));
exit(1);
}
int to_eat = 0;
while (result > 0 && to_eat < send_list.size())
{
if (result >= send_list[to_eat].iov_len)
{
free(to_free[to_eat]);
result -= send_list[to_eat].iov_len;
to_eat++;
}
else
{
send_list[to_eat].iov_base += result;
send_list[to_eat].iov_len -= result;
break;
}
}
if (to_eat > 0)
{
send_list.erase(send_list.begin(), send_list.begin() + to_eat);
to_free.erase(to_free.begin(), to_free.begin() + to_eat);
}
for (int i = 0; i < next_send_list.size(); i++)
{
send_list.push_back(next_send_list[i]);
}
next_send_list.clear();
if (send_list.size() > 0)
{
ringloop->wakeup();
}
}
void submit_read()
{
if (!read_ready || read_msg.msg_iovlen > 0)
{
return;
}
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
return;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this](ring_data_t *data) { handle_read(data->res); };
if (cur_left < receive_buffer_size)
{
read_iov.iov_base = recv_buf;
read_iov.iov_len = receive_buffer_size;
}
else
{
read_iov.iov_base = cur_buf;
read_iov.iov_len = cur_left;
}
read_msg.msg_iov = &read_iov;
read_msg.msg_iovlen = 1;
my_uring_prep_recvmsg(sqe, nbd_fd, &read_msg, 0);
}
void handle_read(int result)
{
read_msg.msg_iovlen = 0;
if (result < 0 && result != -EAGAIN)
{
fprintf(stderr, "Socket disconnected: %s\n", strerror(-result));
exit(1);
}
if (result == -EAGAIN || result < read_iov.iov_len)
{
read_ready--;
}
if (read_ready > 0)
{
ringloop->wakeup();
}
void *b = recv_buf;
while (result > 0)
{
if (read_iov.iov_base == recv_buf)
{
int inc = result >= cur_left ? cur_left : result;
memcpy(cur_buf, b, inc);
cur_left -= inc;
result -= inc;
cur_buf += inc;
b += inc;
}
else
{
assert(result <= cur_left);
cur_left -= result;
result = 0;
}
if (cur_left <= 0)
{
handle_finished_read();
}
}
}
void handle_finished_read()
{
if (read_state == CL_READ_HDR)
{
int req_type = be32toh(cur_req.type);
if (be32toh(cur_req.magic) != NBD_REQUEST_MAGIC ||
req_type != NBD_CMD_READ && req_type != NBD_CMD_WRITE && req_type != NBD_CMD_FLUSH)
{
printf("Unexpected request: magic=%x type=%x, terminating\n", cur_req.magic, req_type);
exit(1);
}
uint64_t handle = *((uint64_t*)cur_req.handle);
#ifdef DEBUG
printf("request %lx +%x %lx\n", be64toh(cur_req.from), be32toh(cur_req.len), handle);
#endif
void *buf = NULL;
cluster_op_t *op = new cluster_op_t;
if (req_type == NBD_CMD_READ || req_type == NBD_CMD_WRITE)
{
op->opcode = req_type == NBD_CMD_READ ? OSD_OP_READ : OSD_OP_WRITE;
op->inode = inode;
op->offset = be64toh(cur_req.from);
op->len = be32toh(cur_req.len);
buf = malloc_or_die(sizeof(nbd_reply) + op->len);
op->iov.push_back(buf + sizeof(nbd_reply), op->len);
}
else if (req_type == NBD_CMD_FLUSH)
{
op->opcode = OSD_OP_SYNC;
buf = malloc_or_die(sizeof(nbd_reply));
}
op->callback = [this, buf, handle](cluster_op_t *op)
{
#ifdef DEBUG
printf("reply %lx e=%d\n", handle, op->retval);
#endif
nbd_reply *reply = (nbd_reply*)buf;
reply->magic = htobe32(NBD_REPLY_MAGIC);
memcpy(reply->handle, &handle, 8);
reply->error = htobe32(op->retval < 0 ? -op->retval : 0);
auto & to_list = send_msg.msg_iovlen > 0 ? next_send_list : send_list;
if (op->retval < 0 || op->opcode != OSD_OP_READ)
to_list.push_back({ .iov_base = buf, .iov_len = sizeof(nbd_reply) });
else
to_list.push_back({ .iov_base = buf, .iov_len = sizeof(nbd_reply) + op->len });
to_free.push_back(buf);
delete op;
ringloop->wakeup();
};
if (req_type == NBD_CMD_WRITE)
{
cur_op = op;
cur_buf = buf + sizeof(nbd_reply);
cur_left = op->len;
read_state = CL_READ_DATA;
}
else
{
cur_op = NULL;
cur_buf = &cur_req;
cur_left = sizeof(nbd_request);
read_state = CL_READ_HDR;
cli->execute(op);
}
}
else
{
cli->execute(cur_op);
cur_op = NULL;
cur_buf = &cur_req;
cur_left = sizeof(nbd_request);
read_state = CL_READ_HDR;
}
}
};
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
nbd_proxy *p = new nbd_proxy();
p->exec(nbd_proxy::parse_args(narg, args));
return 0;
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <stdint.h>

494
osd.cpp
View File

@ -1,5 +1,7 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/poll.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
@ -7,71 +9,100 @@
#include "osd.h"
static const char* osd_op_names[] = {
"",
"read",
"write",
"sync",
"stabilize",
"rollback",
"delete",
"sync_stab_all",
"list",
"show_config",
"primary_read",
"primary_write",
"primary_sync",
};
osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop)
{
this->config = config;
this->bs = bs;
this->ringloop = ringloop;
this->tick_tfd = new timerfd_interval(ringloop, 3, [this]()
{
for (int i = 0; i <= OSD_OP_MAX; i++)
{
if (op_stat_count[i] != 0)
{
printf("avg latency for op %d (%s): %ld us\n", i, osd_op_names[i], op_stat_sum[i]/op_stat_count[i]);
op_stat_count[i] = 0;
op_stat_sum[i] = 0;
}
}
for (int i = 0; i <= OSD_OP_MAX; i++)
{
if (subop_stat_count[i] != 0)
{
printf("avg latency for subop %d (%s): %ld us\n", i, osd_op_names[i], subop_stat_sum[i]/subop_stat_count[i]);
subop_stat_count[i] = 0;
subop_stat_sum[i] = 0;
}
}
if (send_stat_count != 0)
{
printf("avg latency to send stabilize subop: %ld us\n", send_stat_sum/send_stat_count);
send_stat_count = 0;
send_stat_sum = 0;
}
});
this->bs_block_size = bs->get_block_size();
// FIXME: use bitmap granularity instead
this->bs_disk_alignment = bs->get_disk_alignment();
bind_address = config["bind_address"];
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = strtoull(config["bind_port"].c_str(), NULL, 10);
if (!bind_port || bind_port > 65535)
bind_port = 11203;
parse_config(config);
epmgr = new epoll_manager_t(ringloop);
this->tfd = epmgr->tfd;
this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{
print_stats();
});
c_cli.tfd = this->tfd;
c_cli.ringloop = this->ringloop;
c_cli.exec_op = [this](osd_op_t *op) { exec_op(op); };
c_cli.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
init_cluster();
consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&consumer);
}
osd_t::~osd_t()
{
ringloop->unregister_consumer(&consumer);
delete epmgr;
close(listen_fd);
}
void osd_t::parse_config(blockstore_config_t & config)
{
// Initial startup configuration
json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config);
etcd_report_interval = strtoull(config["etcd_report_interval"].c_str(), NULL, 10);
if (etcd_report_interval <= 0)
etcd_report_interval = 30;
osd_num = strtoull(config["osd_num"].c_str(), NULL, 10);
if (!osd_num)
throw std::runtime_error("osd_num is required in the configuration");
run_primary = config["run_primary"] == "true" || config["run_primary"] == "1" || config["run_primary"] == "yes";
if (run_primary)
init_primary();
c_cli.osd_num = osd_num;
run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no";
// Cluster configuration
bind_address = config["bind_address"];
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = stoull_full(config["bind_port"]);
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
if (config["immediate_commit"] == "all")
immediate_commit = IMMEDIATE_ALL;
else if (config["immediate_commit"] == "small")
immediate_commit = IMMEDIATE_SMALL;
if (config.find("autosync_interval") != config.end())
{
autosync_interval = strtoull(config["autosync_interval"].c_str(), NULL, 10);
if (autosync_interval > MAX_AUTOSYNC_INTERVAL)
autosync_interval = DEFAULT_AUTOSYNC_INTERVAL;
}
if (config.find("client_queue_depth") != config.end())
{
client_queue_depth = strtoull(config["client_queue_depth"].c_str(), NULL, 10);
if (client_queue_depth < 128)
client_queue_depth = 128;
}
recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes")
readonly = true;
print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
if (!print_stats_interval)
print_stats_interval = 3;
c_cli.peer_connect_interval = strtoull(config["peer_connect_interval"].c_str(), NULL, 10);
if (!c_cli.peer_connect_interval)
c_cli.peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
c_cli.peer_connect_timeout = strtoull(config["peer_connect_timeout"].c_str(), NULL, 10);
if (!c_cli.peer_connect_timeout)
c_cli.peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
log_level = strtoull(config["log_level"].c_str(), NULL, 10);
c_cli.log_level = log_level;
}
void osd_t::bind_socket()
{
listen_fd = socket(AF_INET, SOCK_STREAM, 0);
if (listen_fd < 0)
{
@ -88,13 +119,27 @@ osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringlo
throw std::runtime_error("bind address "+bind_address+(r == 0 ? " is not valid" : ": no ipv4 support"));
}
addr.sin_family = AF_INET;
addr.sin_port = htons(bind_port);
addr.sin_port = htons(bind_port);
if (bind(listen_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
{
close(listen_fd);
throw std::runtime_error(std::string("bind: ") + strerror(errno));
}
if (bind_port == 0)
{
socklen_t len = sizeof(addr);
if (getsockname(listen_fd, (sockaddr *)&addr, &len) == -1)
{
close(listen_fd);
throw std::runtime_error(std::string("getsockname: ") + strerror(errno));
}
listening_port = ntohs(addr.sin_port);
}
else
{
listening_port = bind_port;
}
if (listen(listen_fd, listen_backlog) < 0)
{
@ -104,55 +149,10 @@ osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringlo
fcntl(listen_fd, F_SETFL, fcntl(listen_fd, F_GETFL, 0) | O_NONBLOCK);
epoll_fd = epoll_create(1);
if (epoll_fd < 0)
epmgr->set_fd_handler(listen_fd, false, [this](int fd, int events)
{
close(listen_fd);
throw std::runtime_error(std::string("epoll_create: ") + strerror(errno));
}
epoll_event ev;
ev.data.fd = listen_fd;
ev.events = EPOLLIN | EPOLLET;
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
{
close(listen_fd);
close(epoll_fd);
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
}
consumer.loop = [this]() { loop(); };
ringloop->register_consumer(consumer);
}
osd_t::~osd_t()
{
delete tick_tfd;
ringloop->unregister_consumer(consumer);
close(epoll_fd);
close(listen_fd);
}
osd_op_t::~osd_op_t()
{
if (bs_op)
{
delete bs_op;
}
if (op_data)
{
free(op_data);
}
if (rmw_buf)
{
free(rmw_buf);
}
if (buf)
{
// Note: reusing osd_op_t WILL currently lead to memory leaks
// So we don't reuse it, but free it every time
free(buf);
}
c_cli.accept_connections(listen_fd);
});
}
bool osd_t::shutdown()
@ -167,187 +167,12 @@ bool osd_t::shutdown()
void osd_t::loop()
{
if (!wait_state)
{
handle_epoll_events();
wait_state = 1;
}
handle_peers();
read_requests();
send_replies();
c_cli.read_requests();
c_cli.send_replies();
ringloop->submit();
}
void osd_t::handle_epoll_events()
{
io_uring_sqe *sqe = ringloop->get_sqe();
if (!sqe)
{
throw std::runtime_error("can't get SQE, will fall out of sync with EPOLLET");
}
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_poll_add(sqe, epoll_fd, POLLIN);
data->callback = [this](ring_data_t *data)
{
if (data->res < 0)
{
throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
}
handle_epoll_events();
};
ringloop->submit();
int nfds;
epoll_event events[MAX_EPOLL_EVENTS];
restart:
nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, 0);
for (int i = 0; i < nfds; i++)
{
if (events[i].data.fd == listen_fd)
{
// Accept new connections
sockaddr_in addr;
socklen_t peer_addr_size = sizeof(addr);
int peer_fd;
while ((peer_fd = accept(listen_fd, (sockaddr*)&addr, &peer_addr_size)) >= 0)
{
char peer_str[256];
printf("osd: new client %d: connection from %s port %d\n", peer_fd, inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
fcntl(peer_fd, F_SETFL, fcntl(listen_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
clients[peer_fd] = {
.peer_addr = addr,
.peer_port = ntohs(addr.sin_port),
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTED,
};
// Add FD to epoll
epoll_event ev;
ev.data.fd = peer_fd;
ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, peer_fd, &ev) < 0)
{
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
}
// Try to accept next connection
peer_addr_size = sizeof(addr);
}
if (peer_fd == -1 && errno != EAGAIN)
{
throw std::runtime_error(std::string("accept: ") + strerror(errno));
}
}
else
{
auto & cl = clients[events[i].data.fd];
if (cl.peer_state == PEER_CONNECTING)
{
// Either OUT (connected) or HUP
handle_connect_result(cl.peer_fd);
}
else if (events[i].events & EPOLLRDHUP)
{
// Stop client
printf("osd: client %d disconnected\n", cl.peer_fd);
stop_client(cl.peer_fd);
}
else
{
// Mark client as ready (i.e. some data is available)
cl.read_ready++;
if (cl.read_ready == 1)
{
read_ready_clients.push_back(cl.peer_fd);
ringloop->wakeup();
}
}
}
}
if (nfds == MAX_EPOLL_EVENTS)
{
goto restart;
}
}
void osd_t::cancel_osd_ops(osd_client_t & cl)
{
for (auto p: cl.sent_ops)
{
cancel_op(p.second);
}
cl.sent_ops.clear();
for (auto op: cl.outbox)
{
cancel_op(op);
}
cl.outbox.clear();
if (cl.write_op)
{
cancel_op(cl.write_op);
cl.write_op = NULL;
}
}
void osd_t::cancel_op(osd_op_t *op)
{
if (op->op_type == OSD_OP_OUT)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = -EPIPE;
op->callback(op);
}
else
{
delete op;
}
}
void osd_t::stop_client(int peer_fd)
{
auto it = clients.find(peer_fd);
if (it == clients.end())
{
return;
}
auto & cl = it->second;
if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, peer_fd, NULL) < 0)
{
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
}
if (cl.osd_num)
{
// Cancel outbound operations
cancel_osd_ops(cl);
osd_peer_fds.erase(cl.osd_num);
repeer_pgs(cl.osd_num, false);
peering_state |= OSD_PEERING_PEERS;
}
if (cl.read_op)
{
delete cl.read_op;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
if (*rit == peer_fd)
{
read_ready_clients.erase(rit);
break;
}
}
for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
{
if (*wit == peer_fd)
{
write_ready_clients.erase(wit);
break;
}
}
clients.erase(it);
close(peer_fd);
}
void osd_t::exec_op(osd_op_t *cur_op)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
@ -357,23 +182,36 @@ void osd_t::exec_op(osd_op_t *cur_op)
delete cur_op;
return;
}
cur_op->send_list.push_back(cur_op->reply.buf, OSD_PACKET_SIZE);
inflight_ops++;
if (cur_op->req.hdr.magic != SECONDARY_OSD_OP_MAGIC ||
cur_op->req.hdr.opcode < OSD_OP_MIN || cur_op->req.hdr.opcode > OSD_OP_MAX ||
(cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ || cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE) &&
(cur_op->req.sec_rw.len > OSD_RW_MAX || cur_op->req.sec_rw.len % OSD_RW_ALIGN || cur_op->req.sec_rw.offset % OSD_RW_ALIGN) ||
(cur_op->req.hdr.opcode == OSD_OP_READ || cur_op->req.hdr.opcode == OSD_OP_WRITE) &&
(cur_op->req.rw.len > OSD_RW_MAX || cur_op->req.rw.len % OSD_RW_ALIGN || cur_op->req.rw.offset % OSD_RW_ALIGN))
((cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
(cur_op->req.sec_rw.len > OSD_RW_MAX ||
cur_op->req.sec_rw.len % bs_disk_alignment ||
cur_op->req.sec_rw.offset % bs_disk_alignment)) ||
((cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_DELETE) &&
(cur_op->req.rw.len > OSD_RW_MAX ||
cur_op->req.rw.len % bs_disk_alignment ||
cur_op->req.rw.offset % bs_disk_alignment)))
{
// Bad command
cur_op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
cur_op->reply.hdr.id = cur_op->req.hdr.id;
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
cur_op->reply.hdr.retval = -EINVAL;
outbox_push(this->clients[cur_op->peer_fd], cur_op);
finish_op(cur_op, -EINVAL);
return;
}
if (readonly &&
cur_op->req.hdr.opcode != OSD_OP_SEC_READ &&
cur_op->req.hdr.opcode != OSD_OP_SEC_LIST &&
cur_op->req.hdr.opcode != OSD_OP_READ &&
cur_op->req.hdr.opcode != OSD_OP_SHOW_CONFIG)
{
// Readonly mode
finish_op(cur_op, -EROFS);
return;
}
inflight_ops++;
if (cur_op->req.hdr.opcode == OSD_OP_TEST_SYNC_STAB_ALL)
{
exec_sync_stab_all(cur_op);
@ -394,8 +232,84 @@ void osd_t::exec_op(osd_op_t *cur_op)
{
continue_primary_sync(cur_op);
}
else if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
{
continue_primary_del(cur_op);
}
else
{
exec_secondary(cur_op);
}
}
void osd_t::reset_stats()
{
c_cli.stats = { 0 };
prev_stats = { 0 };
memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
}
void osd_t::print_stats()
{
for (int i = 0; i <= OSD_OP_MAX; i++)
{
if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i])
{
uint64_t avg = (c_cli.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(c_cli.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
uint64_t bw = (c_cli.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
if (c_cli.stats.op_stat_bytes[i] != 0)
{
printf(
"[OSD %lu] avg latency for op %d (%s): %lu us, B/W: %.2f %s\n", osd_num, i, osd_op_names[i], avg,
(bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
(bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
);
}
else
{
printf("[OSD %lu] avg latency for op %d (%s): %lu us\n", osd_num, i, osd_op_names[i], avg);
}
prev_stats.op_stat_count[i] = c_cli.stats.op_stat_count[i];
prev_stats.op_stat_sum[i] = c_cli.stats.op_stat_sum[i];
prev_stats.op_stat_bytes[i] = c_cli.stats.op_stat_bytes[i];
}
}
for (int i = 0; i <= OSD_OP_MAX; i++)
{
if (c_cli.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
{
uint64_t avg = (c_cli.stats.subop_stat_sum[i] - prev_stats.subop_stat_sum[i])/(c_cli.stats.subop_stat_count[i] - prev_stats.subop_stat_count[i]);
printf("[OSD %lu] avg latency for subop %d (%s): %ld us\n", osd_num, i, osd_op_names[i], avg);
prev_stats.subop_stat_count[i] = c_cli.stats.subop_stat_count[i];
prev_stats.subop_stat_sum[i] = c_cli.stats.subop_stat_sum[i];
}
}
for (int i = 0; i < 2; i++)
{
if (recovery_stat_count[0][i] != recovery_stat_count[1][i])
{
uint64_t bw = (recovery_stat_bytes[0][i] - recovery_stat_bytes[1][i]) / print_stats_interval;
printf(
"[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s\n", osd_num, recovery_stat_names[i],
(recovery_stat_count[0][i] - recovery_stat_count[1][i]) * 1.0 / print_stats_interval,
(bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
(bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
);
recovery_stat_count[1][i] = recovery_stat_count[0][i];
recovery_stat_bytes[1][i] = recovery_stat_bytes[0][i];
}
}
if (incomplete_objects > 0)
{
printf("[OSD %lu] %lu object(s) incomplete\n", osd_num, incomplete_objects);
}
if (degraded_objects > 0)
{
printf("[OSD %lu] %lu object(s) degraded\n", osd_num, degraded_objects);
}
if (misplaced_objects > 0)
{
printf("[OSD %lu] %lu object(s) misplaced\n", osd_num, misplaced_objects);
}
}

301
osd.h
View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include <sys/types.h>
@ -15,171 +18,86 @@
#include "blockstore.h"
#include "ringloop.h"
#include "timerfd_interval.h"
#include "osd_ops.h"
#include "timerfd_manager.h"
#include "epoll_manager.h"
#include "osd_peering_pg.h"
#include "messenger.h"
#include "etcd_state_client.h"
#include "sparsepp/sparsepp/spp.h"
#define OSD_LOADING_PGS 0x01
#define OSD_PEERING_PGS 0x04
#define OSD_FLUSHING_PGS 0x08
#define OSD_RECOVERING 0x10
#define OSD_OP_IN 0
#define OSD_OP_OUT 1
#define IMMEDIATE_NONE 0
#define IMMEDIATE_SMALL 1
#define IMMEDIATE_ALL 2
#define CL_READ_OP 1
#define CL_READ_DATA 2
#define CL_READ_REPLY_DATA 3
#define CL_WRITE_READY 1
#define CL_WRITE_REPLY 2
#define MAX_EPOLL_EVENTS 64
#define OSD_OP_INLINE_BUF_COUNT 16
#define PEER_CONNECTING 1
#define PEER_CONNECTED 2
#define OSD_PEERING_PEERS 1
#define OSD_PEERING_PGS 2
#define MAX_AUTOSYNC_INTERVAL 3600
#define DEFAULT_AUTOSYNC_INTERVAL 5
#define MAX_RECOVERY_QUEUE 2048
#define DEFAULT_RECOVERY_QUEUE 4
//#define OSD_STUB
struct osd_op_buf_list_t
{
int count = 0, alloc = 0, sent = 0;
iovec *buf = NULL;
iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
~osd_op_buf_list_t()
{
if (buf && buf != inline_buf)
{
free(buf);
}
}
inline iovec* get_iovec()
{
return (buf ? buf : inline_buf) + sent;
}
inline int get_size()
{
return count - sent;
}
inline void push_back(void *nbuf, size_t len)
{
if (count >= alloc)
{
if (!alloc)
{
alloc = OSD_OP_INLINE_BUF_COUNT;
buf = inline_buf;
}
else if (buf == inline_buf)
{
int old = alloc;
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
memcpy(buf, inline_buf, sizeof(iovec)*old);
}
else
{
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
}
}
buf[count++] = { .iov_base = nbuf, .iov_len = len };
}
};
struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin;
timespec tv_send;
int op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;
osd_op_buf_list_t send_list;
~osd_op_t();
};
struct osd_peer_def_t
{
osd_num_t osd_num = 0;
std::string addr;
int port = 0;
time_t last_connect_attempt = 0;
};
struct osd_client_t
{
sockaddr_in peer_addr;
int peer_port;
int peer_fd;
int peer_state;
std::function<void(osd_num_t, int)> connect_callback;
osd_num_t osd_num = 0;
// Read state
int read_ready = 0;
osd_op_t *read_op = NULL;
int read_reply_id = 0;
iovec read_iov;
msghdr read_msg;
void *read_buf = NULL;
int read_remaining = 0;
int read_state = 0;
// Outbound operations sent to this client (which is probably an OSD peer)
std::map<int, osd_op_t*> sent_ops;
// Outbound messages (replies or requests)
std::deque<osd_op_t*> outbox;
// PGs dirtied by this client's primary-writes
std::set<pg_num_t> dirty_pgs;
// Write state
osd_op_t *write_op = NULL;
msghdr write_msg;
int write_state = 0;
};
struct osd_rmw_stripe_t;
struct osd_object_id_t
{
osd_num_t osd_num;
object_id oid;
};
struct osd_recovery_op_t
{
int st = 0;
bool degraded = false;
object_id oid = { 0 };
osd_op_t *osd_op = NULL;
};
class osd_t
{
// config
blockstore_config_t config;
int etcd_report_interval = 30;
bool readonly = false;
osd_num_t osd_num = 1; // OSD numbers start with 1
bool run_primary = false;
std::vector<osd_peer_def_t> peers;
blockstore_config_t config;
std::string bind_address;
int bind_port, listen_backlog;
// FIXME: Implement client queue depth limit
int client_queue_depth = 128;
bool allow_test_ops = true;
int print_stats_interval = 3;
int immediate_commit = IMMEDIATE_NONE;
int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // sync every 5 seconds
int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
int log_level = 0;
// peer OSDs
// cluster state
std::map<uint64_t, int> osd_peer_fds;
std::vector<pg_t> pgs;
etcd_state_client_t st_cli;
osd_messenger_t c_cli;
int etcd_failed_attempts = 0;
std::string etcd_lease_id;
json11::Json self_state;
bool loading_peer_config = false;
std::set<pool_pg_num_t> pg_state_dirty;
bool pg_config_applied = false;
bool etcd_reporting_pg_state = false;
bool etcd_reporting_stats = false;
// peers and PGs
std::map<pool_id_t, pg_num_t> pg_counts;
std::map<pool_pg_num_t, pg_t> pgs;
std::set<pool_pg_num_t> dirty_pgs;
std::set<osd_num_t> dirty_osds;
uint64_t misplaced_objects = 0, degraded_objects = 0, incomplete_objects = 0;
int peering_state = 0;
unsigned pg_count = 0;
uint64_t next_subop_id = 1;
std::map<object_id, osd_recovery_op_t> recovery_ops;
osd_op_t *autosync_op = NULL;
// Unstable writes
std::map<osd_object_id_t, uint64_t> unstable_writes;
@ -191,53 +109,69 @@ class osd_t
int inflight_ops = 0;
blockstore_t *bs;
uint32_t bs_block_size, bs_disk_alignment;
uint64_t parity_block_size = 4*1024*1024; // 4 MB by default
ring_loop_t *ringloop;
timerfd_interval *tick_tfd;
timerfd_manager_t *tfd = NULL;
epoll_manager_t *epmgr = NULL;
int wait_state = 0;
int epoll_fd = 0;
int listening_port = 0;
int listen_fd = 0;
ring_consumer_t consumer;
std::unordered_map<int,osd_client_t> clients;
std::vector<int> read_ready_clients;
std::vector<int> write_ready_clients;
uint64_t op_stat_sum[OSD_OP_MAX+1] = { 0 };
uint64_t op_stat_count[OSD_OP_MAX+1] = { 0 };
uint64_t subop_stat_sum[OSD_OP_MAX+1] = { 0 };
uint64_t subop_stat_count[OSD_OP_MAX+1] = { 0 };
uint64_t send_stat_sum = 0;
uint64_t send_stat_count = 0;
// op statistics
osd_op_stats_t prev_stats;
const char* recovery_stat_names[2] = { "degraded", "misplaced" };
uint64_t recovery_stat_count[2][2] = { 0 };
uint64_t recovery_stat_bytes[2][2] = { 0 };
// methods
// cluster connection
void parse_config(blockstore_config_t & config);
void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
void on_change_etcd_state_hook(json11::Json::object & changes);
void on_load_config_hook(json11::Json::object & changes);
json11::Json on_load_pgs_checks_hook();
void on_load_pgs_hook(bool success);
void bind_socket();
void acquire_lease();
json11::Json get_osd_state();
void create_osd_state();
void renew_lease();
void print_stats();
void reset_stats();
json11::Json get_statistics();
void report_statistics();
void report_pg_state(pg_t & pg);
void report_pg_states();
void apply_pg_count();
void apply_pg_config();
// event loop, socket read/write
void loop();
void handle_epoll_events();
void read_requests();
void handle_read(ring_data_t *data, int peer_fd);
void handle_op_hdr(osd_client_t *cl);
void handle_reply_hdr(osd_client_t *cl);
bool try_send(osd_client_t & cl);
void send_replies();
void handle_send(ring_data_t *data, int peer_fd);
void outbox_push(osd_client_t & cl, osd_op_t *op);
// peer handling (primary OSD logic)
void connect_peer(osd_num_t osd_num, const char *peer_host, int peer_port, std::function<void(osd_num_t, int)> callback);
void handle_connect_result(int peer_fd);
void cancel_osd_ops(osd_client_t & cl);
void cancel_op(osd_op_t *op);
void stop_client(int peer_fd);
osd_peer_def_t parse_peer(std::string peer);
void init_primary();
void parse_test_peer(std::string peer);
void handle_peers();
void repeer_pgs(osd_num_t osd_num, bool is_connected);
void start_pg_peering(int i);
void repeer_pgs(osd_num_t osd_num);
void start_pg_peering(pg_t & pg);
void submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);
void submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);
void discard_list_subop(osd_op_t *list_op);
bool stop_pg(pg_t & pg);
void finish_stop_pg(pg_t & pg);
// flushing, recovery and backfill
void submit_pg_flush_ops(pg_t & pg);
void handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval);
void submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
bool pick_next_recovery(osd_recovery_op_t &op);
void submit_recovery_op(osd_recovery_op_t *op);
bool continue_recovery();
pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg);
// op execution
void exec_op(osd_op_t *cur_op);
void finish_op(osd_op_t *cur_op, int retval);
// secondary ops
void exec_sync_stab_all(osd_op_t *cur_op);
@ -246,18 +180,37 @@ class osd_t
void secondary_op_callback(osd_op_t *cur_op);
// primary ops
void autosync();
bool prepare_primary_rw(osd_op_t *cur_op);
void continue_primary_read(osd_op_t *cur_op);
void continue_primary_write(osd_op_t *cur_op);
void cancel_primary_write(osd_op_t *cur_op);
void continue_primary_sync(osd_op_t *cur_op);
void finish_primary_op(osd_op_t *cur_op, int retval);
void handle_primary_subop(osd_op_t *cur_op, int ok, uint64_t version);
void submit_primary_subops(int submit_type, int read_pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void continue_primary_del(osd_op_t *cur_op);
bool check_write_queue(osd_op_t *cur_op, pg_t & pg);
void remove_object_from_state(object_id & oid, pg_osd_set_state_t *object_state, pg_t &pg);
bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
void handle_primary_bs_subop(osd_op_t *subop);
void add_bs_subop_stats(osd_op_t *subop);
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
void submit_primary_sync_subops(osd_op_t *cur_op);
void submit_primary_stab_subops(osd_op_t *cur_op);
inline pg_num_t map_to_pg(object_id oid, uint64_t pg_stripe_size)
{
uint64_t pg_count = pg_counts[INODE_POOL(oid.inode)];
if (!pg_count)
pg_count = 1;
return (oid.inode + oid.stripe / pg_stripe_size) % pg_count + 1;
}
public:
osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop);
~osd_t();
void force_stop(int exitcode);
bool shutdown();
};

View File

@ -1,40 +0,0 @@
void slice()
{
// Slice the request into blockstore requests to individual objects
// Primary OSD still operates individual stripes, except they're twice the size of the blockstore's stripe.
std::vector read_parts;
int block = bs->get_block_size();
uint64_t stripe1 = cur_op->req.rw.offset / block / 2;
uint64_t stripe2 = (cur_op->req.rw.offset + cur_op->req.rw.len + block*2 - 1) / block / 2 - 1;
for (uint64_t s = stripe1; s <= stripe2; s++)
{
uint64_t start = s == stripe1 ? cur_op->req.rw.offset - stripe1*block*2 : 0;
uint64_t end = s == stripe2 ? cur_op->req.rw.offset + cur_op->req.rw.len - stripe2*block*2 : block*2;
if (start < block)
{
read_parts.push_back({
.role = 1,
.oid = {
.inode = cur_op->req.rw.inode,
.stripe = (s << STRIPE_ROLE_BITS) | 1,
},
.version = UINT64_MAX,
.offset = start,
.len = (block < end ? block : end) - start,
});
}
if (end > block)
{
read_parts.push_back({
.role = 2,
.oid = {
.inode = cur_op->req.rw.inode,
.stripe = (s << STRIPE_ROLE_BITS) | 2,
},
.version = UINT64_MAX,
.offset = (start > block ? start-block : 0),
.len = end - (start > block ? start-block : 0),
});
}
}
}

791
osd_cluster.cpp Normal file
View File

@ -0,0 +1,791 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "osd.h"
#include "base64.h"
#include "etcd_state_client.h"
// Startup sequence:
// Start etcd watcher -> Load global OSD configuration -> Bind socket -> Acquire lease -> Report&lock OSD state
// -> Load PG config -> Report&lock PG states -> Load peers -> Connect to peers -> Peer PGs
// Event handling
// Wait for PG changes -> Start/Stop PGs when requested
// Peer connection is lost -> Reload connection data -> Try to reconnect
void osd_t::init_cluster()
{
if (!st_cli.etcd_addresses.size())
{
if (run_primary)
{
// Test version of clustering code with 1 pool, 1 PG and 2 peers
// Example: peers = 2:127.0.0.1:11204,3:127.0.0.1:11205
std::string peerstr = config["peers"];
while (peerstr.size())
{
int pos = peerstr.find(',');
parse_test_peer(pos < 0 ? peerstr : peerstr.substr(0, pos));
peerstr = pos < 0 ? std::string("") : peerstr.substr(pos+1);
}
if (st_cli.peer_states.size() < 2)
{
throw std::runtime_error("run_primary requires at least 2 peers");
}
pgs[{ 1, 1 }] = (pg_t){
.state = PG_PEERING,
.pg_cursize = 0,
.pool_id = 1,
.pg_num = 1,
.target_set = { 1, 2, 3 },
.cur_set = { 0, 0, 0 },
};
report_pg_state(pgs[{ 1, 1 }]);
pg_counts[1] = 1;
}
bind_socket();
}
else
{
st_cli.tfd = tfd;
st_cli.log_level = log_level;
st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
st_cli.load_pgs_checks_hook = [this]() { return on_load_pgs_checks_hook(); };
st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
peering_state = OSD_LOADING_PGS;
st_cli.load_global_config();
}
if (run_primary && autosync_interval > 0)
{
this->tfd->set_timer(autosync_interval*1000, true, [this](int timer_id)
{
autosync();
});
}
}
void osd_t::parse_test_peer(std::string peer)
{
// OSD_NUM:IP:PORT
int pos1 = peer.find(':');
int pos2 = peer.find(':', pos1+1);
if (pos1 < 0 || pos2 < 0)
throw new std::runtime_error("OSD peer string must be in the form OSD_NUM:IP:PORT");
std::string addr = peer.substr(pos1+1, pos2-pos1-1);
std::string osd_num_str = peer.substr(0, pos1);
std::string port_str = peer.substr(pos2+1);
osd_num_t peer_osd = strtoull(osd_num_str.c_str(), NULL, 10);
if (!peer_osd)
throw new std::runtime_error("Could not parse OSD peer osd_num");
else if (st_cli.peer_states.find(peer_osd) != st_cli.peer_states.end())
throw std::runtime_error("Same osd number "+std::to_string(peer_osd)+" specified twice in peers");
int port = strtoull(port_str.c_str(), NULL, 10);
if (!port)
throw new std::runtime_error("Could not parse OSD peer port");
st_cli.peer_states[peer_osd] = json11::Json::object {
{ "state", "up" },
{ "addresses", json11::Json::array { addr } },
{ "port", port },
};
c_cli.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
json11::Json osd_t::get_osd_state()
{
std::vector<char> hostname;
hostname.resize(1024);
while (gethostname(hostname.data(), hostname.size()) < 0 && errno == ENAMETOOLONG)
hostname.resize(hostname.size()+1024);
hostname.resize(strnlen(hostname.data(), hostname.size()));
json11::Json::object st;
st["state"] = "up";
if (bind_address != "0.0.0.0")
st["addresses"] = json11::Json::array { bind_address };
else
st["addresses"] = getifaddr_list();
st["host"] = std::string(hostname.data(), hostname.size());
st["port"] = listening_port;
st["primary_enabled"] = run_primary;
st["blockstore_enabled"] = bs ? true : false;
return st;
}
json11::Json osd_t::get_statistics()
{
json11::Json::object st;
timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
char time_str[50] = { 0 };
sprintf(time_str, "%ld.%03ld", ts.tv_sec, ts.tv_nsec/1000000);
st["time"] = time_str;
st["blockstore_ready"] = bs->is_started();
if (bs)
{
st["size"] = bs->get_block_count() * bs->get_block_size();
st["free"] = bs->get_free_block_count() * bs->get_block_size();
}
st["host"] = self_state["host"];
json11::Json::object op_stats, subop_stats;
for (int i = 0; i <= OSD_OP_MAX; i++)
{
op_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.op_stat_count[i] },
{ "usec", c_cli.stats.op_stat_sum[i] },
{ "bytes", c_cli.stats.op_stat_bytes[i] },
};
}
for (int i = 0; i <= OSD_OP_MAX; i++)
{
subop_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.subop_stat_count[i] },
{ "usec", c_cli.stats.subop_stat_sum[i] },
};
}
st["op_stats"] = op_stats;
st["subop_stats"] = subop_stats;
st["recovery_stats"] = json11::Json::object {
{ recovery_stat_names[0], json11::Json::object {
{ "count", recovery_stat_count[0][0] },
{ "bytes", recovery_stat_bytes[0][0] },
} },
{ recovery_stat_names[1], json11::Json::object {
{ "count", recovery_stat_count[0][1] },
{ "bytes", recovery_stat_bytes[0][1] },
} },
};
return st;
}
void osd_t::report_statistics()
{
if (etcd_reporting_stats)
{
return;
}
etcd_reporting_stats = true;
json11::Json::array txn = { json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
{ "value", base64_encode(get_statistics().dump()) },
} }
} };
for (auto & p: pgs)
{
auto & pg = p.second;
if (pg.state & (PG_OFFLINE | PG_STARTING))
{
// Don't report statistics for offline PGs
continue;
}
json11::Json::object pg_stats;
pg_stats["object_count"] = pg.total_count;
pg_stats["clean_count"] = pg.clean_count;
pg_stats["misplaced_count"] = pg.misplaced_objects.size();
pg_stats["degraded_count"] = pg.degraded_objects.size();
pg_stats["incomplete_count"] = pg.incomplete_objects.size();
pg_stats["write_osd_set"] = pg.cur_set;
txn.push_back(json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/pg/stats/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num)) },
{ "value", base64_encode(json11::Json(pg_stats).dump()) },
} }
});
}
st_cli.etcd_txn(json11::Json::object { { "success", txn } }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
{
etcd_reporting_stats = false;
if (err != "")
{
printf("[OSD %lu] Error reporting state to etcd: %s\n", this->osd_num, err.c_str());
// Retry indefinitely
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
report_statistics();
});
}
else if (res["error"].string_value() != "")
{
printf("[OSD %lu] Error reporting state to etcd: %s\n", this->osd_num, res["error"].string_value().c_str());
force_stop(1);
}
});
}
void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
{
if (c_cli.wanted_peers.find(peer_osd) != c_cli.wanted_peers.end())
{
c_cli.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
}
void osd_t::on_change_etcd_state_hook(json11::Json::object & changes)
{
// FIXME apply config changes in runtime (maybe, some)
apply_pg_count();
apply_pg_config();
}
void osd_t::on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num)
{
auto pg_it = pgs.find({
.pool_id = pool_id,
.pg_num = pg_num,
});
if (pg_it != pgs.end() && pg_it->second.epoch > pg_it->second.reported_epoch &&
st_cli.pool_config[pool_id].pg_config[pg_num].epoch >= pg_it->second.epoch)
{
pg_it->second.reported_epoch = st_cli.pool_config[pool_id].pg_config[pg_num].epoch;
object_id oid = { 0 };
bool first = true;
for (auto op: pg_it->second.write_queue)
{
if (first || oid != op.first)
{
oid = op.first;
first = false;
continue_primary_write(op.second);
}
}
}
}
void osd_t::on_load_config_hook(json11::Json::object & global_config)
{
blockstore_config_t osd_config = this->config;
for (auto & cfg_var: global_config)
{
if (this->config.find(cfg_var.first) == this->config.end())
{
if (cfg_var.second.is_string())
{
osd_config[cfg_var.first] = cfg_var.second.string_value();
}
else
{
osd_config[cfg_var.first] = cfg_var.second.dump();
}
}
}
parse_config(osd_config);
bind_socket();
st_cli.start_etcd_watcher();
acquire_lease();
}
// Acquire lease
void osd_t::acquire_lease()
{
// Maximum lease TTL is (report interval) + retries * (timeout + repeat interval)
st_cli.etcd_call("/lease/grant", json11::Json::object {
{ "TTL", etcd_report_interval+(MAX_ETCD_ATTEMPTS*(2*ETCD_QUICK_TIMEOUT)+999)/1000 }
}, ETCD_QUICK_TIMEOUT, [this](std::string err, json11::Json data)
{
if (err != "" || data["ID"].string_value() == "")
{
printf("Error acquiring a lease from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_QUICK_TIMEOUT, false, [this](int timer_id)
{
acquire_lease();
});
return;
}
etcd_lease_id = data["ID"].string_value();
create_osd_state();
});
printf("[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num, config["etcd_address"].c_str(), etcd_report_interval);
tfd->set_timer(etcd_report_interval*1000, true, [this](int timer_id)
{
renew_lease();
});
}
// Report "up" state once, then keep it alive using the lease
// Do it first to allow "monitors" check it when moving PGs
void osd_t::create_osd_state()
{
std::string state_key = base64_encode(st_cli.etcd_prefix+"/osd/state/"+std::to_string(osd_num));
self_state = get_osd_state();
st_cli.etcd_txn(json11::Json::object {
// Check that the state key does not exist
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "CREATE" },
{ "create_revision", 0 },
{ "key", state_key },
}
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", state_key },
{ "value", base64_encode(self_state.dump()) },
{ "lease", etcd_lease_id },
} }
},
} },
{ "failure", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", state_key },
} }
},
} },
}, ETCD_QUICK_TIMEOUT, [this](std::string err, json11::Json data)
{
if (err != "")
{
etcd_failed_attempts++;
printf("Error creating OSD state key: %s\n", err.c_str());
if (etcd_failed_attempts > MAX_ETCD_ATTEMPTS)
{
// Die
throw std::runtime_error("Cluster connection failed");
}
// Retry
tfd->set_timer(ETCD_QUICK_TIMEOUT, false, [this](int timer_id)
{
create_osd_state();
});
return;
}
if (!data["succeeded"].bool_value())
{
// OSD is already up
auto kv = st_cli.parse_etcd_kv(data["responses"][0]["response_range"]["kvs"][0]);
printf("Key %s already exists in etcd, OSD %lu is still up\n", kv.key.c_str(), this->osd_num);
int64_t port = kv.value["port"].int64_value();
for (auto & addr: kv.value["addresses"].array_items())
{
printf(" listening at: %s:%ld\n", addr.string_value().c_str(), port);
}
force_stop(0);
return;
}
if (run_primary)
{
st_cli.load_pgs();
}
});
}
// Renew lease
void osd_t::renew_lease()
{
st_cli.etcd_call("/lease/keepalive", json11::Json::object {
{ "ID", etcd_lease_id }
}, ETCD_QUICK_TIMEOUT, [this](std::string err, json11::Json data)
{
if (err == "" && data["result"]["TTL"].string_value() == "")
{
// Die
throw std::runtime_error("etcd lease has expired");
}
if (err != "")
{
etcd_failed_attempts++;
printf("Error renewing etcd lease: %s\n", err.c_str());
if (etcd_failed_attempts > MAX_ETCD_ATTEMPTS)
{
// Die
throw std::runtime_error("Cluster connection failed");
}
// Retry
tfd->set_timer(ETCD_QUICK_TIMEOUT, false, [this](int timer_id)
{
renew_lease();
});
}
else
{
etcd_failed_attempts = 0;
report_statistics();
}
});
}
void osd_t::force_stop(int exitcode)
{
if (etcd_lease_id != "")
{
st_cli.etcd_call("/kv/lease/revoke", json11::Json::object {
{ "ID", etcd_lease_id }
}, ETCD_QUICK_TIMEOUT, [this, exitcode](std::string err, json11::Json data)
{
if (err != "")
{
printf("Error revoking etcd lease: %s\n", err.c_str());
}
printf("[OSD %lu] Force stopping\n", this->osd_num);
exit(exitcode);
});
}
else
{
printf("[OSD %lu] Force stopping\n", this->osd_num);
exit(exitcode);
}
}
json11::Json osd_t::on_load_pgs_checks_hook()
{
assert(this->pgs.size() == 0);
json11::Json::array checks = {
json11::Json::object {
{ "target", "LEASE" },
{ "lease", etcd_lease_id },
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/state/"+std::to_string(osd_num)) },
}
};
return checks;
}
void osd_t::on_load_pgs_hook(bool success)
{
if (!success)
{
printf("Error loading PGs from etcd: lease expired\n");
force_stop(1);
}
else
{
peering_state &= ~OSD_LOADING_PGS;
apply_pg_count();
apply_pg_config();
}
}
void osd_t::apply_pg_count()
{
for (auto & pool_item: st_cli.pool_config)
{
if (pool_item.second.real_pg_count != 0 &&
pool_item.second.real_pg_count != pg_counts[pool_item.first])
{
// Check that all pool PGs are offline. It is not allowed to change PG count when any PGs are online
// The external tool must wait for all PGs to come down before changing PG count
// If it doesn't wait, a restarted OSD may apply the new count immediately which will lead to bugs
// So an OSD just dies if it detects PG count change while there are active PGs
int still_active = 0;
for (auto & kv: pgs)
{
if (kv.first.pool_id == pool_item.first && (kv.second.state & PG_ACTIVE))
{
still_active++;
}
}
if (still_active > 0)
{
printf("[OSD %lu] PG count change detected, but %d PG(s) are still active. This is not allowed. Exiting\n", this->osd_num, still_active);
force_stop(1);
return;
}
}
this->pg_counts[pool_item.first] = pool_item.second.real_pg_count;
}
}
void osd_t::apply_pg_config()
{
bool all_applied = true;
for (auto & pool_item: st_cli.pool_config)
{
auto pool_id = pool_item.first;
for (auto & kv: pool_item.second.pg_config)
{
pg_num_t pg_num = kv.first;
auto & pg_cfg = kv.second;
bool take = pg_cfg.exists && pg_cfg.primary == this->osd_num &&
!pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num);
auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;
if (currently_taken && !take)
{
// Stop this PG
stop_pg(pg_it->second);
}
else if (take)
{
// Take this PG
std::set<osd_num_t> all_peers;
for (osd_num_t pg_osd: pg_cfg.target_set)
{
if (pg_osd != 0)
{
all_peers.insert(pg_osd);
}
}
for (osd_num_t pg_osd: pg_cfg.all_peers)
{
if (pg_osd != 0)
{
all_peers.insert(pg_osd);
}
}
for (auto & hist_item: pg_cfg.target_history)
{
for (auto pg_osd: hist_item)
{
if (pg_osd != 0)
{
all_peers.insert(pg_osd);
}
}
}
if (currently_taken)
{
if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING))
{
if (pg_it->second.target_set == pg_cfg.target_set)
{
// No change in osd_set; history changes are ignored
continue;
}
else
{
// Stop PG, reapply change after stopping
stop_pg(pg_it->second);
all_applied = false;
continue;
}
}
else if (pg_it->second.state & PG_STOPPING)
{
// Reapply change after stopping
all_applied = false;
continue;
}
else if (pg_it->second.state & PG_STARTING)
{
if (pg_cfg.cur_primary == this->osd_num)
{
// PG locked, continue
}
else
{
// Reapply change after locking the PG
all_applied = false;
continue;
}
}
else
{
throw std::runtime_error("Unexpected PG "+std::to_string(pg_num)+" state: "+std::to_string(pg_it->second.state));
}
}
auto & pg = this->pgs[{ .pool_id = pool_id, .pg_num = pg_num }];
pg = (pg_t){
.state = pg_cfg.cur_primary == this->osd_num ? PG_PEERING : PG_STARTING,
.scheme = pool_item.second.scheme,
.pg_cursize = 0,
.pg_size = pool_item.second.pg_size,
.pg_minsize = pool_item.second.pg_minsize,
.pool_id = pool_id,
.pg_num = pg_num,
.reported_epoch = pg_cfg.epoch,
.target_history = pg_cfg.target_history,
.all_peers = std::vector<osd_num_t>(all_peers.begin(), all_peers.end()),
.target_set = pg_cfg.target_set,
};
this->pg_state_dirty.insert({ .pool_id = pool_id, .pg_num = pg_num });
pg.print_state();
if (pg_cfg.cur_primary == this->osd_num)
{
// Add peers
for (auto pg_osd: all_peers)
{
if (pg_osd != this->osd_num && c_cli.osd_peer_fds.find(pg_osd) == c_cli.osd_peer_fds.end())
{
c_cli.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
}
}
start_pg_peering(pg);
}
else
{
// Reapply change after locking the PG
all_applied = false;
}
}
}
}
report_pg_states();
this->pg_config_applied = all_applied;
}
void osd_t::report_pg_states()
{
if (etcd_reporting_pg_state || !this->pg_state_dirty.size() || !st_cli.etcd_addresses.size())
{
return;
}
std::vector<std::pair<pool_pg_num_t,bool>> reporting_pgs;
json11::Json::array checks;
json11::Json::array success;
json11::Json::array failure;
for (auto it = pg_state_dirty.begin(); it != pg_state_dirty.end(); it++)
{
auto pg_it = this->pgs.find(*it);
if (pg_it == this->pgs.end())
{
continue;
}
auto & pg = pg_it->second;
reporting_pgs.push_back({ *it, pg.history_changed });
std::string state_key_base64 = base64_encode(st_cli.etcd_prefix+"/pg/state/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num));
if (pg.state == PG_STARTING)
{
// Check that the PG key does not exist
// Failed check indicates an unsuccessful PG lock attempt in this case
checks.push_back(json11::Json::object {
{ "target", "VERSION" },
{ "version", 0 },
{ "key", state_key_base64 },
});
}
else
{
// Check that the key is ours
// Failed check indicates success for OFFLINE pgs (PG lock is already deleted)
// and an unexpected race condition for started pgs (PG lock is held by someone else)
checks.push_back(json11::Json::object {
{ "target", "LEASE" },
{ "lease", etcd_lease_id },
{ "key", state_key_base64 },
});
}
if (pg.state == PG_OFFLINE)
{
success.push_back(json11::Json::object {
{ "request_delete_range", json11::Json::object {
{ "key", state_key_base64 },
} }
});
}
else
{
json11::Json::array pg_state_keywords;
for (int i = 0; i < pg_state_bit_count; i++)
{
if (pg.state & pg_state_bits[i])
{
pg_state_keywords.push_back(pg_state_names[i]);
}
}
success.push_back(json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", state_key_base64 },
{ "value", base64_encode(json11::Json(json11::Json::object {
{ "primary", this->osd_num },
{ "state", pg_state_keywords },
{ "peers", pg.cur_peers },
}).dump()) },
{ "lease", etcd_lease_id },
} }
});
if (pg.history_changed)
{
// Prevent race conditions (for the case when the monitor is updating this key at the same time)
pg.history_changed = false;
std::string history_key = base64_encode(st_cli.etcd_prefix+"/pg/history/"+std::to_string(pg.pool_id)+"/"+std::to_string(pg.pg_num));
json11::Json::object history_value = {
{ "epoch", pg.epoch },
{ "all_peers", pg.all_peers },
{ "osd_sets", pg.target_history },
};
checks.push_back(json11::Json::object {
{ "target", "MOD" },
{ "key", history_key },
{ "result", "LESS" },
{ "mod_revision", st_cli.etcd_watch_revision+1 },
});
success.push_back(json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", history_key },
{ "value", base64_encode(json11::Json(history_value).dump()) },
} }
});
}
}
failure.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", state_key_base64 },
} }
});
}
pg_state_dirty.clear();
etcd_reporting_pg_state = true;
st_cli.etcd_txn(json11::Json::object {
{ "compare", checks }, { "success", success }, { "failure", failure }
}, ETCD_QUICK_TIMEOUT, [this, reporting_pgs](std::string err, json11::Json data)
{
etcd_reporting_pg_state = false;
if (!data["succeeded"].bool_value())
{
// One of PG state updates failed, put dirty flags back
for (auto pp: reporting_pgs)
{
this->pg_state_dirty.insert(pp.first);
if (pp.second)
{
auto pg_it = this->pgs.find(pp.first);
if (pg_it != this->pgs.end())
{
pg_it->second.history_changed = true;
}
}
}
for (auto & res: data["responses"].array_items())
{
if (res["kvs"].array_items().size())
{
auto kv = st_cli.parse_etcd_kv(res["kvs"][0]);
if (kv.key.substr(st_cli.etcd_prefix.length()+10) == st_cli.etcd_prefix+"/pg/state/")
{
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
char null_byte = 0;
sscanf(kv.key.c_str() + st_cli.etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (null_byte == 0)
{
auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
if (pg_it != pgs.end() && pg_it->second.state != PG_OFFLINE && pg_it->second.state != PG_STARTING)
{
// Live PG state update failed
printf("Failed to report state of pool %u PG %u which is live. Race condition detected, exiting\n", pool_id, pg_num);
force_stop(1);
return;
}
}
}
}
}
// Retry after a short pause (hope we'll get some updates and update PG states accordingly)
tfd->set_timer(500, false, [this](int) { report_pg_states(); });
}
else
{
// Success. We'll get our changes back via the watcher and react to them
for (auto pp: reporting_pgs)
{
auto pg_it = this->pgs.find(pp.first);
if (pg_it != this->pgs.end())
{
if (pg_it->second.state == PG_OFFLINE)
{
// Remove offline PGs after reporting their state
this->pgs.erase(pg_it);
}
}
}
// Push other PG state updates, if any
report_pg_states();
if (!this->pg_state_dirty.size())
{
// Update statistics
report_statistics();
}
}
});
}

300
osd_flush.cpp Normal file
View File

@ -0,0 +1,300 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "osd.h"
#define FLUSH_BATCH 512
void osd_t::submit_pg_flush_ops(pg_t & pg)
{
pg_flush_batch_t *fb = new pg_flush_batch_t();
pg.flush_batch = fb;
auto it = pg.flush_actions.begin(), prev_it = pg.flush_actions.begin();
bool first = true;
while (it != pg.flush_actions.end())
{
if (!first && (it->first.oid.inode != prev_it->first.oid.inode ||
(it->first.oid.stripe & ~STRIPE_MASK) != (prev_it->first.oid.stripe & ~STRIPE_MASK)) &&
fb->rollback_lists[it->first.osd_num].size() >= FLUSH_BATCH ||
fb->stable_lists[it->first.osd_num].size() >= FLUSH_BATCH)
{
// Stop only at the object boundary
break;
}
it->second.submitted = true;
if (it->second.rollback)
{
fb->flush_objects++;
fb->rollback_lists[it->first.osd_num].push_back((obj_ver_id){
.oid = it->first.oid,
.version = it->second.rollback_to,
});
}
if (it->second.make_stable)
{
fb->flush_objects++;
fb->stable_lists[it->first.osd_num].push_back((obj_ver_id){
.oid = it->first.oid,
.version = it->second.stable_to,
});
}
prev_it = it;
first = false;
it++;
}
for (auto & l: fb->rollback_lists)
{
if (l.second.size() > 0)
{
fb->flush_ops++;
submit_flush_op(pg.pool_id, pg.pg_num, fb, true, l.first, l.second.size(), l.second.data());
}
}
for (auto & l: fb->stable_lists)
{
if (l.second.size() > 0)
{
fb->flush_ops++;
submit_flush_op(pg.pool_id, pg.pg_num, fb, false, l.first, l.second.size(), l.second.data());
}
}
}
void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval)
{
pool_pg_num_t pg_id = { .pool_id = pool_id, .pg_num = pg_num };
if (pgs.find(pg_id) == pgs.end() || pgs[pg_id].flush_batch != fb)
{
// Throw the result away
return;
}
if (retval != 0)
{
if (peer_osd == this->osd_num)
{
throw std::runtime_error(
std::string(rollback
? "Error while doing local rollback operation: "
: "Error while doing local stabilize operation: "
) + strerror(-retval)
);
}
else
{
printf("Error while doing flush on OSD %lu: %d (%s)\n", osd_num, retval, strerror(-retval));
auto fd_it = c_cli.osd_peer_fds.find(peer_osd);
if (fd_it != c_cli.osd_peer_fds.end())
{
c_cli.stop_client(fd_it->second);
}
return;
}
}
fb->flush_done++;
if (fb->flush_done == fb->flush_ops)
{
// This flush batch is done
std::vector<osd_op_t*> continue_ops;
auto & pg = pgs[pg_id];
auto it = pg.flush_actions.begin(), prev_it = it;
auto erase_start = it;
while (1)
{
if (it == pg.flush_actions.end() ||
it->first.oid.inode != prev_it->first.oid.inode ||
(it->first.oid.stripe & ~STRIPE_MASK) != (prev_it->first.oid.stripe & ~STRIPE_MASK))
{
pg.ver_override.erase((object_id){
.inode = prev_it->first.oid.inode,
.stripe = (prev_it->first.oid.stripe & ~STRIPE_MASK),
});
auto wr_it = pg.write_queue.find((object_id){
.inode = prev_it->first.oid.inode,
.stripe = (prev_it->first.oid.stripe & ~STRIPE_MASK),
});
if (wr_it != pg.write_queue.end())
{
continue_ops.push_back(wr_it->second);
pg.write_queue.erase(wr_it);
}
}
if ((it == pg.flush_actions.end() || !it->second.submitted) &&
erase_start != it)
{
pg.flush_actions.erase(erase_start, it);
}
if (it == pg.flush_actions.end())
{
break;
}
prev_it = it;
if (!it->second.submitted)
{
it++;
erase_start = it;
}
else
{
it++;
}
}
delete fb;
pg.flush_batch = NULL;
if (!pg.flush_actions.size())
{
pg.state = pg.state & ~PG_HAS_UNCLEAN;
report_pg_state(pg);
}
for (osd_op_t *op: continue_ops)
{
continue_primary_write(op);
}
if (pg.inflight == 0 && (pg.state & PG_STOPPING))
{
finish_stop_pg(pg);
}
}
}
void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data)
{
osd_op_t *op = new osd_op_t();
// Copy buffer so it gets freed along with the operation
op->buf = malloc_or_die(sizeof(obj_ver_id) * count);
memcpy(op->buf, data, sizeof(obj_ver_id) * count);
if (peer_osd == this->osd_num)
{
// local
clock_gettime(CLOCK_REALTIME, &op->tv_begin);
op->bs_op = new blockstore_op_t({
.opcode = (uint64_t)(rollback ? BS_OP_ROLLBACK : BS_OP_STABLE),
.callback = [this, op, pool_id, pg_num, fb](blockstore_op_t *bs_op)
{
add_bs_subop_stats(op);
handle_flush_op(bs_op->opcode == BS_OP_ROLLBACK, pool_id, pg_num, fb, this->osd_num, bs_op->retval);
delete op->bs_op;
op->bs_op = NULL;
delete op;
},
.len = (uint32_t)count,
.buf = op->buf,
});
bs->enqueue_op(op->bs_op);
}
else
{
// Peer
int peer_fd = c_cli.osd_peer_fds[peer_osd];
op->op_type = OSD_OP_OUT;
op->iov.push_back(op->buf, count * sizeof(obj_ver_id));
op->peer_fd = peer_fd;
op->req = {
.sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = (uint64_t)(rollback ? OSD_OP_SEC_ROLLBACK : OSD_OP_SEC_STABILIZE),
},
.len = count * sizeof(obj_ver_id),
},
};
op->callback = [this, pool_id, pg_num, fb, peer_osd](osd_op_t *op)
{
handle_flush_op(op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK, pool_id, pg_num, fb, peer_osd, op->reply.hdr.retval);
delete op;
};
c_cli.outbox_push(op);
}
}
bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
{
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_DEGRADED)) == (PG_ACTIVE | PG_HAS_DEGRADED))
{
for (auto obj_it = pg_it->second.degraded_objects.begin(); obj_it != pg_it->second.degraded_objects.end(); obj_it++)
{
if (recovery_ops.find(obj_it->first) == recovery_ops.end())
{
op.degraded = true;
op.oid = obj_it->first;
return true;
}
}
}
}
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
{
for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++)
{
if (recovery_ops.find(obj_it->first) == recovery_ops.end())
{
op.degraded = false;
op.oid = obj_it->first;
return true;
}
}
}
}
return false;
}
void osd_t::submit_recovery_op(osd_recovery_op_t *op)
{
op->osd_op = new osd_op_t();
op->osd_op->op_type = OSD_OP_OUT;
op->osd_op->req = {
.rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = 1,
.opcode = OSD_OP_WRITE,
},
.inode = op->oid.inode,
.offset = op->oid.stripe,
.len = 0,
},
};
op->osd_op->callback = [this, op](osd_op_t *osd_op)
{
// Don't sync the write, it will be synced by our regular sync coroutine
if (osd_op->reply.hdr.retval < 0)
{
// Error recovering object
if (osd_op->reply.hdr.retval == -EPIPE)
{
// PG is stopped or one of the OSDs is gone, error is harmless
}
else
{
throw std::runtime_error("Failed to recover an object");
}
}
// CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
op->osd_op = NULL;
recovery_ops.erase(op->oid);
delete osd_op;
continue_recovery();
};
exec_op(op->osd_op);
}
// Just trigger write requests for degraded objects. They'll be recovered during writing
bool osd_t::continue_recovery()
{
while (recovery_ops.size() < recovery_queue_depth)
{
osd_recovery_op_t op;
if (pick_next_recovery(op))
{
recovery_ops[op.oid] = op;
submit_recovery_op(&recovery_ops[op.oid]);
}
else
return false;
}
return true;
}

View File

@ -1,4 +1,27 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#define POOL_SCHEME_REPLICATED 1
#define POOL_SCHEME_XOR 2
#define POOL_ID_MAX 0x10000
#define POOL_ID_BITS 16
#define INODE_POOL(inode) (pool_id_t)((inode) >> (64 - POOL_ID_BITS))
// Pool ID is 16 bits long
typedef uint32_t pool_id_t;
typedef uint64_t osd_num_t;
typedef uint32_t pg_num_t;
struct pool_pg_num_t
{
pool_id_t pool_id;
pg_num_t pg_num;
};
inline bool operator < (const pool_pg_num_t & a, const pool_pg_num_t & b)
{
return a.pool_id < b.pool_id || a.pool_id == b.pool_id && a.pg_num < b.pg_num;
}

View File

@ -1,14 +1,28 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "osd.h"
#include <signal.h>
void handle_sigint(int sig)
static osd_t *osd = NULL;
static bool force_stopping = false;
static void handle_sigint(int sig)
{
if (osd && !force_stopping)
{
force_stopping = true;
osd->force_stop(0);
return;
}
exit(0);
}
int main(int narg, char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
if (sizeof(osd_any_op_t) > OSD_PACKET_SIZE ||
sizeof(osd_any_reply_t) > OSD_PACKET_SIZE)
{
@ -25,9 +39,11 @@ int main(int narg, char *args[])
}
}
signal(SIGINT, handle_sigint);
signal(SIGTERM, handle_sigint);
ring_loop_t *ringloop = new ring_loop_t(512);
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
blockstore_t *bs = new blockstore_t(config, ringloop);
osd_t *osd = new osd_t(config, bs, ringloop);
osd = new osd_t(config, bs, ringloop);
while (1)
{
ringloop->loop();

22
osd_ops.cpp Normal file
View File

@ -0,0 +1,22 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "osd_ops.h"
const char* osd_op_names[] = {
"",
"read",
"write",
"write_stable",
"sync",
"stabilize",
"rollback",
"delete",
"sync_stab_all",
"list",
"show_config",
"primary_read",
"primary_write",
"primary_sync",
"primary_delete",
};

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include "object_id.h"
@ -10,21 +13,25 @@
#define OSD_PACKET_SIZE 0x80
// Opcodes
#define OSD_OP_MIN 1
#define OSD_OP_SECONDARY_READ 1
#define OSD_OP_SECONDARY_WRITE 2
#define OSD_OP_SECONDARY_SYNC 3
#define OSD_OP_SECONDARY_STABILIZE 4
#define OSD_OP_SECONDARY_ROLLBACK 5
#define OSD_OP_SECONDARY_DELETE 6
#define OSD_OP_TEST_SYNC_STAB_ALL 7
#define OSD_OP_SECONDARY_LIST 8
#define OSD_OP_SHOW_CONFIG 9
#define OSD_OP_READ 10
#define OSD_OP_WRITE 11
#define OSD_OP_SYNC 12
#define OSD_OP_MAX 12
#define OSD_OP_SEC_READ 1
#define OSD_OP_SEC_WRITE 2
#define OSD_OP_SEC_WRITE_STABLE 3
#define OSD_OP_SEC_SYNC 4
#define OSD_OP_SEC_STABILIZE 5
#define OSD_OP_SEC_ROLLBACK 6
#define OSD_OP_SEC_DELETE 7
#define OSD_OP_TEST_SYNC_STAB_ALL 8
#define OSD_OP_SEC_LIST 9
#define OSD_OP_SHOW_CONFIG 10
#define OSD_OP_READ 11
#define OSD_OP_WRITE 12
#define OSD_OP_SYNC 13
#define OSD_OP_DELETE 14
#define OSD_OP_MAX 14
// Alignment & limit for read/write operations
#define OSD_RW_ALIGN 512
#ifndef MEM_ALIGNMENT
#define MEM_ALIGNMENT 512
#endif
#define OSD_RW_MAX 64*1024*1024
// common request and reply headers
@ -57,6 +64,7 @@ struct __attribute__((__packed__)) osd_op_secondary_rw_t
// object
object_id oid;
// read/write version (automatic or specific)
// FIXME deny values close to UINT64_MAX
uint64_t version;
// offset
uint32_t offset;
@ -130,7 +138,10 @@ struct __attribute__((__packed__)) osd_op_secondary_list_t
osd_op_header_t header;
// placement group total number and total count
pg_num_t list_pg, pg_count;
uint64_t parity_block_size;
// size of an area that maps to one PG continuously
uint64_t pg_stripe_size;
// inode range (used to select pools)
uint64_t min_inode, max_inode;
};
struct __attribute__((__packed__)) osd_reply_secondary_list_t
@ -142,6 +153,7 @@ struct __attribute__((__packed__)) osd_reply_secondary_list_t
};
// read or write to the primary OSD (must be within individual stripe)
// FIXME: allow to return used block bitmap (required for snapshots)
struct __attribute__((__packed__)) osd_op_rw_t
{
osd_op_header_t header;
@ -169,6 +181,7 @@ struct __attribute__((__packed__)) osd_reply_sync_t
osd_reply_header_t header;
};
// FIXME it would be interesting to try to unify blockstore_op and osd_op formats
union osd_any_op_t
{
osd_op_header_t hdr;
@ -196,3 +209,5 @@ union osd_any_reply_t
osd_reply_sync_t sync;
uint8_t buf[OSD_PACKET_SIZE];
};
extern const char* osd_op_names[];

View File

@ -1,288 +1,234 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <algorithm>
#include "base64.h"
#include "osd.h"
void osd_t::init_primary()
{
// Initial test version of clustering code requires exactly 2 peers
// FIXME Hardcode
std::string peerstr = config["peers"];
while (peerstr.size())
{
int pos = peerstr.find(',');
peers.push_back(parse_peer(pos < 0 ? peerstr : peerstr.substr(0, pos)));
peerstr = pos < 0 ? std::string("") : peerstr.substr(pos+1);
for (int i = 0; i < peers.size()-1; i++)
if (peers[i].osd_num == peers[peers.size()-1].osd_num)
throw std::runtime_error("same osd number "+std::to_string(peers[i].osd_num)+" specified twice in peers");
}
if (peers.size() < 2)
throw std::runtime_error("run_primary requires at least 2 peers");
pgs.push_back((pg_t){
.state = PG_OFFLINE,
.pg_cursize = 0,
.pg_num = 1,
.target_set = { 1, 2, 3 },
.cur_set = { 1, 0, 0 },
});
pg_count = 1;
peering_state = OSD_PEERING_PEERS;
}
osd_peer_def_t osd_t::parse_peer(std::string peer)
{
// OSD_NUM:IP:PORT
int pos1 = peer.find(':');
int pos2 = peer.find(':', pos1+1);
if (pos1 < 0 || pos2 < 0)
throw new std::runtime_error("OSD peer string must be in the form OSD_NUM:IP:PORT");
osd_peer_def_t r;
r.addr = peer.substr(pos1+1, pos2-pos1-1);
std::string osd_num_str = peer.substr(0, pos1);
std::string port_str = peer.substr(pos2+1);
r.osd_num = strtoull(osd_num_str.c_str(), NULL, 10);
if (!r.osd_num)
throw new std::runtime_error("Could not parse OSD peer osd_num");
r.port = strtoull(port_str.c_str(), NULL, 10);
if (!r.port)
throw new std::runtime_error("Could not parse OSD peer port");
return r;
}
void osd_t::connect_peer(osd_num_t osd_num, const char *peer_host, int peer_port, std::function<void(osd_num_t, int)> callback)
{
struct sockaddr_in addr;
int r;
if ((r = inet_pton(AF_INET, peer_host, &addr.sin_addr)) != 1)
{
callback(osd_num, -EINVAL);
return;
}
addr.sin_family = AF_INET;
addr.sin_port = htons(peer_port ? peer_port : 11203);
int peer_fd = socket(AF_INET, SOCK_STREAM, 0);
if (peer_fd < 0)
{
callback(osd_num, -errno);
return;
}
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
r = connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
if (r < 0 && errno != EINPROGRESS)
{
close(peer_fd);
callback(osd_num, -errno);
return;
}
clients[peer_fd] = (osd_client_t){
.peer_addr = addr,
.peer_port = peer_port,
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTING,
.connect_callback = callback,
.osd_num = osd_num,
};
osd_peer_fds[osd_num] = peer_fd;
// Add FD to epoll (EPOLLOUT for tracking connect() result)
epoll_event ev;
ev.data.fd = peer_fd;
ev.events = EPOLLOUT | EPOLLIN | EPOLLRDHUP | EPOLLET;
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, peer_fd, &ev) < 0)
{
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
}
}
void osd_t::handle_connect_result(int peer_fd)
{
auto & cl = clients[peer_fd];
osd_num_t osd_num = cl.osd_num;
auto callback = cl.connect_callback;
int result = 0;
socklen_t result_len = sizeof(result);
if (getsockopt(peer_fd, SOL_SOCKET, SO_ERROR, &result, &result_len) < 0)
{
result = errno;
}
if (result != 0)
{
stop_client(peer_fd);
callback(osd_num, -result);
return;
}
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
// Disable EPOLLOUT on this fd
cl.connect_callback = NULL;
cl.peer_state = PEER_CONNECTED;
epoll_event ev;
ev.data.fd = peer_fd;
ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
if (epoll_ctl(epoll_fd, EPOLL_CTL_MOD, peer_fd, &ev) < 0)
{
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
}
callback(osd_num, peer_fd);
}
// Peering loop
void osd_t::handle_peers()
{
if (peering_state & OSD_PEERING_PEERS)
{
for (int i = 0; i < peers.size(); i++)
{
if (osd_peer_fds.find(peers[i].osd_num) == osd_peer_fds.end() &&
time(NULL) - peers[i].last_connect_attempt > 5) // FIXME hardcode 5
{
peers[i].last_connect_attempt = time(NULL);
connect_peer(peers[i].osd_num, peers[i].addr.c_str(), peers[i].port, [this](osd_num_t osd_num, int peer_fd)
{
// FIXME: Check peer config after connecting
if (peer_fd < 0)
{
printf("Failed to connect to peer OSD %lu: %s\n", osd_num, strerror(-peer_fd));
return;
}
printf("Connected with peer OSD %lu (fd %d)\n", clients[peer_fd].osd_num, peer_fd);
int i;
for (i = 0; i < peers.size(); i++)
{
if (osd_peer_fds.find(peers[i].osd_num) == osd_peer_fds.end())
break;
}
if (i >= peers.size())
{
// Connected to all peers
peering_state = peering_state & ~OSD_PEERING_PEERS;
}
repeer_pgs(osd_num, true);
});
}
}
}
if (peering_state & OSD_PEERING_PGS)
{
bool still_doing_pgs = false;
for (int i = 0; i < pgs.size(); i++)
bool still = false;
for (auto & p: pgs)
{
if (pgs[i].state == PG_PEERING)
if (p.second.state == PG_PEERING)
{
if (!pgs[i].peering_state->list_ops.size())
if (!p.second.peering_state->list_ops.size())
{
pgs[i].calc_object_states();
p.second.calc_object_states(log_level);
report_pg_state(p.second);
incomplete_objects += p.second.incomplete_objects.size();
misplaced_objects += p.second.misplaced_objects.size();
// FIXME: degraded objects may currently include misplaced, too! Report them separately?
degraded_objects += p.second.degraded_objects.size();
if ((p.second.state & (PG_ACTIVE | PG_HAS_UNCLEAN)) == (PG_ACTIVE | PG_HAS_UNCLEAN))
peering_state = peering_state | OSD_FLUSHING_PGS;
else if (p.second.state & PG_ACTIVE)
peering_state = peering_state | OSD_RECOVERING;
}
else
{
still_doing_pgs = true;
still = true;
}
}
}
if (!still_doing_pgs)
if (!still)
{
// Done all PGs
peering_state = peering_state & ~OSD_PEERING_PGS;
}
}
}
void osd_t::repeer_pgs(osd_num_t osd_num, bool is_connected)
{
// Re-peer affected PGs
// FIXME: We shouldn't rely just on target_set. Other OSDs may also contain PG data.
osd_num_t real_osd = (is_connected ? osd_num : 0);
for (int i = 0; i < pgs.size(); i++)
if ((peering_state & OSD_FLUSHING_PGS) && !readonly)
{
bool repeer = false;
for (int r = 0; r < pgs[i].target_set.size(); r++)
bool still = false;
for (auto & p: pgs)
{
if (pgs[i].target_set[r] == osd_num &&
pgs[i].cur_set[r] != real_osd)
if ((p.second.state & (PG_ACTIVE | PG_HAS_UNCLEAN)) == (PG_ACTIVE | PG_HAS_UNCLEAN))
{
pgs[i].cur_set[r] = real_osd;
repeer = true;
break;
if (!p.second.flush_batch)
{
submit_pg_flush_ops(p.second);
}
still = true;
}
}
if (repeer)
if (!still)
{
// Repeer this pg
printf("Repeer PG %d because of OSD %lu\n", i, osd_num);
start_pg_peering(i);
peering_state |= OSD_PEERING_PGS;
peering_state = peering_state & ~OSD_FLUSHING_PGS | OSD_RECOVERING;
}
}
if ((peering_state & OSD_RECOVERING) && !readonly)
{
if (!continue_recovery())
{
peering_state = peering_state & ~OSD_RECOVERING;
}
}
}
void osd_t::repeer_pgs(osd_num_t peer_osd)
{
// Re-peer affected PGs
for (auto & p: pgs)
{
bool repeer = false;
if (p.second.state & (PG_PEERING | PG_ACTIVE | PG_INCOMPLETE))
{
for (osd_num_t pg_osd: p.second.all_peers)
{
if (pg_osd == peer_osd)
{
repeer = true;
break;
}
}
if (repeer)
{
// Repeer this pg
printf("[PG %u] Repeer because of OSD %lu\n", p.second.pg_num, peer_osd);
start_pg_peering(p.second);
}
}
}
}
// Repeer on each connect/disconnect peer event
void osd_t::start_pg_peering(int pg_idx)
void osd_t::start_pg_peering(pg_t & pg)
{
auto & pg = pgs[pg_idx];
pg.state = PG_PEERING;
this->peering_state |= OSD_PEERING_PGS;
report_pg_state(pg);
// Reset PG state
pg.cur_peers.clear();
pg.state_dict.clear();
pg.obj_states.clear();
incomplete_objects -= pg.incomplete_objects.size();
misplaced_objects -= pg.misplaced_objects.size();
degraded_objects -= pg.degraded_objects.size();
pg.incomplete_objects.clear();
pg.misplaced_objects.clear();
pg.degraded_objects.clear();
pg.flush_actions.clear();
pg.ver_override.clear();
pg.pg_cursize = 0;
for (int role = 0; role < pg.cur_set.size(); role++)
if (pg.flush_batch)
{
delete pg.flush_batch;
}
pg.flush_batch = NULL;
for (auto p: pg.write_queue)
{
cancel_primary_write(p.second);
}
pg.write_queue.clear();
uint64_t pg_stripe_size = st_cli.pool_config[pg.pool_id].pg_stripe_size;
for (auto it = unstable_writes.begin(); it != unstable_writes.end(); )
{
// Forget this PG's unstable writes
if (INODE_POOL(it->first.oid.inode) == pg.pool_id && map_to_pg(it->first.oid, pg_stripe_size) == pg.pg_num)
unstable_writes.erase(it++);
else
it++;
}
dirty_pgs.erase({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
// Drop connections of clients who have this PG in dirty_pgs
if (immediate_commit != IMMEDIATE_ALL)
{
std::vector<int> to_stop;
for (auto & cp: c_cli.clients)
{
if (cp.second.dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) != cp.second.dirty_pgs.end())
{
to_stop.push_back(cp.first);
}
}
for (auto peer_fd: to_stop)
{
c_cli.stop_client(peer_fd);
}
}
// Calculate current write OSD set
pg.pg_cursize = 0;
pg.cur_set.resize(pg.target_set.size());
pg.cur_loc_set.clear();
for (int role = 0; role < pg.target_set.size(); role++)
{
pg.cur_set[role] = pg.target_set[role] == this->osd_num ||
c_cli.osd_peer_fds.find(pg.target_set[role]) != c_cli.osd_peer_fds.end() ? pg.target_set[role] : 0;
if (pg.cur_set[role] != 0)
{
pg.pg_cursize++;
pg.cur_loc_set.push_back({
.role = (uint64_t)role,
.osd_num = pg.cur_set[role],
.outdated = false,
});
}
}
if (pg.target_history.size())
{
// Refuse to start PG if no peers are available from any of the historical OSD sets
// (PG history is kept up to the latest active+clean state)
for (auto & history_set: pg.target_history)
{
bool found = false;
for (auto history_osd: history_set)
{
if (history_osd != 0 && c_cli.osd_peer_fds.find(history_osd) != c_cli.osd_peer_fds.end())
{
found = true;
break;
}
}
if (!found)
{
pg.state = PG_INCOMPLETE;
report_pg_state(pg);
return;
}
}
}
if (pg.pg_cursize < pg.pg_minsize)
{
pg.state = PG_INCOMPLETE;
report_pg_state(pg);
return;
}
std::set<osd_num_t> cur_peers;
for (auto pg_osd: pg.all_peers)
{
if (pg_osd == this->osd_num || c_cli.osd_peer_fds.find(pg_osd) != c_cli.osd_peer_fds.end())
{
cur_peers.insert(pg_osd);
}
else if (c_cli.wanted_peers.find(pg_osd) == c_cli.wanted_peers.end())
{
c_cli.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
}
}
pg.cur_peers.insert(pg.cur_peers.begin(), cur_peers.begin(), cur_peers.end());
if (pg.peering_state)
{
// Adjust the peering operation that's still in progress
for (auto it = pg.peering_state->list_ops.begin(); it != pg.peering_state->list_ops.end(); it++)
// Adjust the peering operation that's still in progress - discard unneeded results
for (auto it = pg.peering_state->list_ops.begin(); it != pg.peering_state->list_ops.end();)
{
int role;
for (role = 0; role < pg.cur_set.size(); role++)
{
if (pg.cur_set[role] == it->first)
break;
}
if (pg.state == PG_INCOMPLETE || role >= pg.cur_set.size())
if (pg.state == PG_INCOMPLETE || cur_peers.find(it->first) == cur_peers.end())
{
// Discard the result after completion, which, chances are, will be unsuccessful
auto list_op = it->second;
if (list_op->peer_fd == 0)
{
// Self
list_op->bs_op->callback = [list_op](blockstore_op_t *bs_op)
{
if (list_op->bs_op->buf)
free(list_op->bs_op->buf);
delete list_op;
};
}
else
{
// Peer
list_op->callback = [](osd_op_t *list_op)
{
delete list_op;
};
}
discard_list_subop(it->second);
pg.peering_state->list_ops.erase(it);
it = pg.peering_state->list_ops.begin();
}
else
it++;
}
for (auto it = pg.peering_state->list_results.begin(); it != pg.peering_state->list_results.end(); it++)
for (auto it = pg.peering_state->list_results.begin(); it != pg.peering_state->list_results.end();)
{
int role;
for (role = 0; role < pg.cur_set.size(); role++)
{
if (pg.cur_set[role] == it->first)
break;
}
if (pg.state == PG_INCOMPLETE || role >= pg.cur_set.size())
if (pg.state == PG_INCOMPLETE || cur_peers.find(it->first) == cur_peers.end())
{
if (it->second.buf)
{
@ -291,6 +237,8 @@ void osd_t::start_pg_peering(int pg_idx)
pg.peering_state->list_results.erase(it);
it = pg.peering_state->list_results.begin();
}
else
it++;
}
}
if (pg.state == PG_INCOMPLETE)
@ -300,107 +248,297 @@ void osd_t::start_pg_peering(int pg_idx)
delete pg.peering_state;
pg.peering_state = NULL;
}
printf("PG %d is incomplete\n", pg.pg_num);
return;
}
if (!pg.peering_state)
{
pg.peering_state = new pg_peering_state_t();
pg.peering_state->pool_id = pg.pool_id;
pg.peering_state->pg_num = pg.pg_num;
}
auto ps = pg.peering_state;
for (int role = 0; role < pg.cur_set.size(); role++)
for (osd_num_t peer_osd: cur_peers)
{
osd_num_t role_osd = pg.cur_set[role];
if (!role_osd)
if (pg.peering_state->list_ops.find(peer_osd) != pg.peering_state->list_ops.end() ||
pg.peering_state->list_results.find(peer_osd) != pg.peering_state->list_results.end())
{
continue;
}
if (ps->list_ops.find(role_osd) != ps->list_ops.end() ||
ps->list_results.find(role_osd) != ps->list_results.end())
{
continue;
}
if (role_osd == this->osd_num)
{
// Self
osd_op_t *op = new osd_op_t();
op->op_type = 0;
op->peer_fd = 0;
op->bs_op = new blockstore_op_t();
op->bs_op->opcode = BS_OP_LIST;
op->bs_op->oid.stripe = parity_block_size;
op->bs_op->len = pg_count,
op->bs_op->offset = pg.pg_num-1,
op->bs_op->callback = [ps, op, role_osd](blockstore_op_t *bs_op)
{
if (op->bs_op->retval < 0)
{
throw std::runtime_error("local OP_LIST failed");
}
printf(
"Got object list from OSD %lu (local): %d object versions (%lu of them stable)\n",
role_osd, bs_op->retval, bs_op->version
);
ps->list_results[role_osd] = {
.buf = (obj_ver_id*)op->bs_op->buf,
.total_count = (uint64_t)op->bs_op->retval,
.stable_count = op->bs_op->version,
};
ps->list_done++;
ps->list_ops.erase(role_osd);
delete op;
};
bs->enqueue_op(op->bs_op);
ps->list_ops[role_osd] = op;
}
else
{
// Peer
auto & cl = clients[osd_peer_fds[role_osd]];
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->send_list.push_back(op->req.buf, OSD_PACKET_SIZE);
op->peer_fd = cl.peer_fd;
op->req = {
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = this->next_subop_id++,
.opcode = OSD_OP_SECONDARY_LIST,
},
.list_pg = pg.pg_num,
.pg_count = pg_count,
.parity_block_size = parity_block_size,
},
};
op->callback = [this, ps, role_osd](osd_op_t *op)
{
if (op->reply.hdr.retval < 0)
{
printf("Failed to get object list from OSD %lu (retval=%ld), disconnecting peer\n", role_osd, op->reply.hdr.retval);
ps->list_ops.erase(role_osd);
stop_client(op->peer_fd);
delete op;
return;
}
printf(
"Got object list from OSD %lu: %ld object versions (%lu of them stable)\n",
role_osd, op->reply.hdr.retval, op->reply.sec_list.stable_count
);
ps->list_results[role_osd] = {
.buf = (obj_ver_id*)op->buf,
.total_count = (uint64_t)op->reply.hdr.retval,
.stable_count = op->reply.sec_list.stable_count,
};
// set op->buf to NULL so it doesn't get freed
op->buf = NULL;
ps->list_done++;
ps->list_ops.erase(role_osd);
delete op;
};
outbox_push(cl, op);
ps->list_ops[role_osd] = op;
}
submit_sync_and_list_subop(peer_osd, pg.peering_state);
}
ringloop->wakeup();
}
void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
{
// Sync before listing, if not readonly
if (readonly)
{
submit_list_subop(role_osd, ps);
}
else if (role_osd == this->osd_num)
{
// Self
osd_op_t *op = new osd_op_t();
op->op_type = 0;
op->peer_fd = 0;
clock_gettime(CLOCK_REALTIME, &op->tv_begin);
op->bs_op = new blockstore_op_t();
op->bs_op->opcode = BS_OP_SYNC;
op->bs_op->callback = [this, ps, op, role_osd](blockstore_op_t *bs_op)
{
if (bs_op->retval < 0)
{
printf("Local OP_SYNC failed: %d (%s)\n", bs_op->retval, strerror(-bs_op->retval));
force_stop(1);
return;
}
add_bs_subop_stats(op);
delete op->bs_op;
op->bs_op = NULL;
delete op;
ps->list_ops.erase(role_osd);
submit_list_subop(role_osd, ps);
};
bs->enqueue_op(op->bs_op);
ps->list_ops[role_osd] = op;
}
else
{
// Peer
auto & cl = c_cli.clients.at(c_cli.osd_peer_fds[role_osd]);
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl.peer_fd;
op->req = {
.sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
},
};
op->callback = [this, ps, role_osd](osd_op_t *op)
{
if (op->reply.hdr.retval < 0)
{
// FIXME: Mark peer as failed and don't reconnect immediately after dropping the connection
printf("Failed to sync OSD %lu: %ld (%s), disconnecting peer\n", role_osd, op->reply.hdr.retval, strerror(-op->reply.hdr.retval));
ps->list_ops.erase(role_osd);
c_cli.stop_client(op->peer_fd);
delete op;
return;
}
delete op;
ps->list_ops.erase(role_osd);
submit_list_subop(role_osd, ps);
};
c_cli.outbox_push(op);
ps->list_ops[role_osd] = op;
}
}
void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
{
if (role_osd == this->osd_num)
{
// Self
osd_op_t *op = new osd_op_t();
op->op_type = 0;
op->peer_fd = 0;
clock_gettime(CLOCK_REALTIME, &op->tv_begin);
op->bs_op = new blockstore_op_t();
op->bs_op->opcode = BS_OP_LIST;
op->bs_op->oid.stripe = st_cli.pool_config[ps->pool_id].pg_stripe_size;
op->bs_op->oid.inode = ((uint64_t)ps->pool_id << (64 - POOL_ID_BITS));
op->bs_op->version = ((uint64_t)(ps->pool_id+1) << (64 - POOL_ID_BITS)) - 1;
op->bs_op->len = pg_counts[ps->pool_id];
op->bs_op->offset = ps->pg_num-1;
op->bs_op->callback = [this, ps, op, role_osd](blockstore_op_t *bs_op)
{
if (op->bs_op->retval < 0)
{
throw std::runtime_error("local OP_LIST failed");
}
add_bs_subop_stats(op);
printf(
"[PG %u] Got object list from OSD %lu (local): %d object versions (%lu of them stable)\n",
ps->pg_num, role_osd, bs_op->retval, bs_op->version
);
ps->list_results[role_osd] = {
.buf = (obj_ver_id*)op->bs_op->buf,
.total_count = (uint64_t)op->bs_op->retval,
.stable_count = op->bs_op->version,
};
ps->list_ops.erase(role_osd);
delete op->bs_op;
op->bs_op = NULL;
delete op;
};
bs->enqueue_op(op->bs_op);
ps->list_ops[role_osd] = op;
}
else
{
// Peer
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = c_cli.osd_peer_fds[role_osd];
op->req = {
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_LIST,
},
.list_pg = ps->pg_num,
.pg_count = pg_counts[ps->pool_id],
.pg_stripe_size = st_cli.pool_config[ps->pool_id].pg_stripe_size,
.min_inode = ((uint64_t)(ps->pool_id) << (64 - POOL_ID_BITS)),
.max_inode = ((uint64_t)(ps->pool_id+1) << (64 - POOL_ID_BITS)) - 1,
},
};
op->callback = [this, ps, role_osd](osd_op_t *op)
{
if (op->reply.hdr.retval < 0)
{
printf("Failed to get object list from OSD %lu (retval=%ld), disconnecting peer\n", role_osd, op->reply.hdr.retval);
ps->list_ops.erase(role_osd);
c_cli.stop_client(op->peer_fd);
delete op;
return;
}
printf(
"[PG %u] Got object list from OSD %lu: %ld object versions (%lu of them stable)\n",
ps->pg_num, role_osd, op->reply.hdr.retval, op->reply.sec_list.stable_count
);
ps->list_results[role_osd] = {
.buf = (obj_ver_id*)op->buf,
.total_count = (uint64_t)op->reply.hdr.retval,
.stable_count = op->reply.sec_list.stable_count,
};
// set op->buf to NULL so it doesn't get freed
op->buf = NULL;
ps->list_ops.erase(role_osd);
delete op;
};
c_cli.outbox_push(op);
ps->list_ops[role_osd] = op;
}
}
void osd_t::discard_list_subop(osd_op_t *list_op)
{
if (list_op->peer_fd == 0)
{
// Self
list_op->bs_op->callback = [list_op](blockstore_op_t *bs_op)
{
if (list_op->bs_op->buf)
free(list_op->bs_op->buf);
delete list_op->bs_op;
list_op->bs_op = NULL;
delete list_op;
};
}
else
{
// Peer
list_op->callback = [](osd_op_t *list_op)
{
delete list_op;
};
}
}
bool osd_t::stop_pg(pg_t & pg)
{
if (pg.peering_state)
{
// Stop peering
for (auto it = pg.peering_state->list_ops.begin(); it != pg.peering_state->list_ops.end();)
{
discard_list_subop(it->second);
}
for (auto it = pg.peering_state->list_results.begin(); it != pg.peering_state->list_results.end();)
{
if (it->second.buf)
{
free(it->second.buf);
}
}
delete pg.peering_state;
pg.peering_state = NULL;
}
if (!(pg.state & PG_ACTIVE))
{
return false;
}
pg.state = pg.state & ~PG_ACTIVE | PG_STOPPING;
if (pg.inflight == 0 && !pg.flush_batch)
{
finish_stop_pg(pg);
}
else
{
report_pg_state(pg);
}
return true;
}
void osd_t::finish_stop_pg(pg_t & pg)
{
pg.state = PG_OFFLINE;
report_pg_state(pg);
}
void osd_t::report_pg_state(pg_t & pg)
{
pg.print_state();
this->pg_state_dirty.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
if (pg.state == PG_ACTIVE && (pg.target_history.size() > 0 || pg.all_peers.size() > pg.target_set.size()))
{
// Clear history of active+clean PGs
pg.history_changed = true;
pg.target_history.clear();
pg.all_peers = pg.target_set;
pg.cur_peers = pg.target_set;
}
else if (pg.state == (PG_ACTIVE|PG_LEFT_ON_DEAD))
{
// Clear history of active+left_on_dead PGs, but leave dead OSDs in all_peers
pg.history_changed = true;
pg.target_history.clear();
std::set<osd_num_t> dead_peers;
for (auto pg_osd: pg.all_peers)
{
dead_peers.insert(pg_osd);
}
for (auto pg_osd: pg.cur_peers)
{
dead_peers.erase(pg_osd);
}
for (auto pg_osd: pg.target_set)
{
if (pg_osd)
{
dead_peers.insert(pg_osd);
}
}
pg.all_peers.clear();
pg.all_peers.insert(pg.all_peers.begin(), dead_peers.begin(), dead_peers.end());
pg.cur_peers.clear();
for (auto pg_osd: pg.target_set)
{
if (pg_osd)
{
pg.cur_peers.push_back(pg_osd);
}
}
}
if (pg.state == PG_OFFLINE && !this->pg_config_applied)
{
apply_pg_config();
}
report_pg_states();
}

View File

@ -1,159 +1,420 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "osd_peering_pg.h"
void pg_t::remember_object(pg_obj_state_check_t &st, std::vector<obj_ver_role> &all)
struct obj_ver_role
{
auto & pg = *this;
// Remember the decision
uint64_t state = 0;
if (st.n_roles == pg.pg_cursize)
object_id oid;
uint64_t version;
uint64_t osd_num;
bool is_stable;
};
inline bool operator < (const obj_ver_role & a, const obj_ver_role & b)
{
// ORDER BY inode ASC, stripe & ~STRIPE_MASK ASC, version DESC, role ASC, osd_num ASC
return a.oid.inode < b.oid.inode || a.oid.inode == b.oid.inode && (
(a.oid.stripe & ~STRIPE_MASK) < (b.oid.stripe & ~STRIPE_MASK) ||
(a.oid.stripe & ~STRIPE_MASK) == (b.oid.stripe & ~STRIPE_MASK) && (
a.version > b.version ||
a.version == b.version && (
a.oid.stripe < b.oid.stripe ||
a.oid.stripe == b.oid.stripe && a.osd_num < b.osd_num
)
)
);
}
struct obj_piece_ver_t
{
uint64_t max_ver = 0;
uint64_t stable_ver = 0;
uint64_t max_target = 0;
};
struct pg_obj_state_check_t
{
pg_t *pg;
bool replicated = false;
std::vector<obj_ver_role> list;
int list_pos;
int obj_start = 0, obj_end = 0, ver_start = 0, ver_end = 0;
object_id oid = { 0 };
uint64_t max_ver = 0;
uint64_t last_ver = 0;
uint64_t target_ver = 0;
uint64_t n_copies = 0, has_roles = 0, n_roles = 0, n_stable = 0, n_mismatched = 0;
uint64_t n_unstable = 0, n_invalid = 0;
pg_osd_set_t osd_set;
int log_level;
void walk();
void start_object();
void handle_version();
void finish_object();
};
void pg_obj_state_check_t::walk()
{
pg->clean_count = 0;
pg->total_count = 0;
pg->state = 0;
for (list_pos = 0; list_pos < list.size(); list_pos++)
{
if (st.n_matched == pg.pg_cursize)
state = OBJ_CLEAN;
if (oid.inode != list[list_pos].oid.inode ||
oid.stripe != (list[list_pos].oid.stripe & ~STRIPE_MASK))
{
if (oid.inode != 0)
{
finish_object();
}
start_object();
}
handle_version();
}
if (oid.inode != 0)
{
finish_object();
}
if (pg->state & PG_HAS_INVALID)
{
// Stop PGs with "invalid" objects
pg->state = PG_INCOMPLETE | PG_HAS_INVALID;
return;
}
if (pg->pg_cursize < pg->pg_size)
{
pg->state |= PG_DEGRADED;
}
pg->state |= PG_ACTIVE;
if (pg->state == PG_ACTIVE && pg->cur_peers.size() < pg->all_peers.size())
{
pg->state |= PG_LEFT_ON_DEAD;
}
}
void pg_obj_state_check_t::start_object()
{
obj_start = list_pos;
oid = { .inode = list[list_pos].oid.inode, .stripe = list[list_pos].oid.stripe & ~STRIPE_MASK };
last_ver = max_ver = list[list_pos].version;
target_ver = 0;
ver_start = list_pos;
has_roles = n_copies = n_roles = n_stable = n_mismatched = 0;
n_unstable = n_invalid = 0;
}
void pg_obj_state_check_t::handle_version()
{
if (!target_ver && last_ver != list[list_pos].version && (n_stable > 0 || n_roles >= pg->pg_minsize))
{
// Version is either stable or recoverable
target_ver = last_ver;
ver_end = list_pos;
}
if (!target_ver)
{
if (last_ver != list[list_pos].version)
{
ver_start = list_pos;
has_roles = n_copies = n_roles = n_stable = n_mismatched = 0;
last_ver = list[list_pos].version;
}
unsigned replica = (list[list_pos].oid.stripe & STRIPE_MASK);
n_copies++;
if (replicated && replica > 0 || replica >= pg->pg_size)
{
n_invalid++;
}
else
{
state = OBJ_MISPLACED;
pg.state = pg.state | PG_HAS_MISPLACED;
if (list[list_pos].is_stable)
{
n_stable++;
}
if (replicated)
{
int i;
for (i = 0; i < pg->cur_set.size(); i++)
{
if (pg->cur_set[i] == list[list_pos].osd_num)
{
break;
}
}
if (i == pg->cur_set.size())
{
n_mismatched++;
}
}
else
{
if (pg->cur_set[replica] != list[list_pos].osd_num)
{
n_mismatched++;
}
if (!(has_roles & (1 << replica)))
{
has_roles = has_roles | (1 << replica);
n_roles++;
}
}
}
}
else if (st.n_roles < pg.pg_minsize)
if (!list[list_pos].is_stable)
{
printf("Object is unfound: inode=%lu stripe=%lu version=%lu/%lu\n", st.oid.inode, st.oid.stripe, st.target_ver, st.max_ver);
n_unstable++;
}
}
void pg_obj_state_check_t::finish_object()
{
if (!target_ver && (n_stable > 0 || n_roles >= pg->pg_minsize))
{
// Version is either stable or recoverable
target_ver = last_ver;
ver_end = list_pos;
}
obj_end = list_pos;
// Remember the decision
uint64_t state = 0;
if (n_invalid > 0)
{
// It's not allowed to change the replication scheme for a pool other than by recreating it
// So we must bring the PG offline
state = OBJ_INCOMPLETE;
pg.state = pg.state | PG_HAS_UNFOUND;
pg->state |= PG_HAS_INVALID;
pg->total_count++;
return;
}
if (n_unstable > 0)
{
pg->state |= PG_HAS_UNCLEAN;
std::unordered_map<obj_piece_id_t, obj_piece_ver_t> pieces;
for (int i = obj_start; i < obj_end; i++)
{
auto & pcs = pieces[(obj_piece_id_t){ .oid = list[i].oid, .osd_num = list[i].osd_num }];
if (!pcs.max_ver)
{
pcs.max_ver = list[i].version;
}
if (list[i].is_stable && !pcs.stable_ver)
{
pcs.stable_ver = list[i].version;
}
if (list[i].version <= target_ver && !pcs.max_target)
{
pcs.max_target = list[i].version;
}
}
for (auto pp: pieces)
{
auto & pcs = pp.second;
if (pcs.stable_ver < pcs.max_ver)
{
auto & act = pg->flush_actions[pp.first];
// osd_set doesn't include rollback/stable states, so don't include them in the state code either
if (pcs.max_ver > target_ver)
{
act.rollback = true;
act.rollback_to = pcs.max_target;
}
if (pcs.stable_ver < (pcs.max_ver > target_ver ? pcs.max_target : pcs.max_ver))
{
act.make_stable = true;
act.stable_to = pcs.max_ver > target_ver ? pcs.max_target : pcs.max_ver;
}
}
}
}
if (!target_ver)
{
return;
}
if (!replicated && n_roles < pg->pg_minsize)
{
if (log_level > 1)
{
printf("Object is incomplete: inode=%lu stripe=%lu version=%lu/%lu\n", oid.inode, oid.stripe, target_ver, max_ver);
}
state = OBJ_INCOMPLETE;
pg->state = pg->state | PG_HAS_INCOMPLETE;
}
else if ((replicated ? n_copies : n_roles) < pg->pg_cursize)
{
if (log_level > 1)
{
printf("Object is degraded: inode=%lu stripe=%lu version=%lu/%lu\n", oid.inode, oid.stripe, target_ver, max_ver);
}
state = OBJ_DEGRADED;
pg->state = pg->state | PG_HAS_DEGRADED;
}
else if (n_mismatched > 0)
{
if (log_level > 1 && (replicated || n_roles >= pg->pg_cursize))
{
printf("Object is misplaced: inode=%lu stripe=%lu version=%lu/%lu\n", oid.inode, oid.stripe, target_ver, max_ver);
}
state |= OBJ_MISPLACED;
pg->state = pg->state | PG_HAS_MISPLACED;
}
if (log_level > 1 && ((replicated ? n_copies : n_roles) < pg->pg_cursize || n_mismatched > 0))
{
if (log_level > 2)
{
for (int i = obj_start; i < obj_end; i++)
{
printf("v%lu present on: osd %lu, role %ld%s\n", list[i].version, list[i].osd_num,
(list[i].oid.stripe & STRIPE_MASK), list[i].is_stable ? " (stable)" : "");
}
}
else
{
for (int i = ver_start; i < ver_end; i++)
{
printf("Target version present on: osd %lu, role %ld%s\n", list[i].osd_num,
(list[i].oid.stripe & STRIPE_MASK), list[i].is_stable ? " (stable)" : "");
}
}
}
pg->total_count++;
if (state != 0 || ver_end < obj_end)
{
osd_set.clear();
for (int i = ver_start; i < ver_end; i++)
{
osd_set.push_back((pg_obj_loc_t){
.role = (list[i].oid.stripe & STRIPE_MASK),
.osd_num = list[i].osd_num,
.outdated = false,
});
}
}
if (ver_end < obj_end)
{
// Check for outdated versions not present in the current target OSD set
for (int i = ver_end; i < obj_end; i++)
{
int j;
for (j = 0; j < osd_set.size(); j++)
{
if (osd_set[j].osd_num == list[i].osd_num)
{
break;
}
}
if (j >= osd_set.size() && pg->cur_set[list[i].oid.stripe & STRIPE_MASK] != list[i].osd_num)
{
osd_set.push_back((pg_obj_loc_t){
.role = (list[i].oid.stripe & STRIPE_MASK),
.osd_num = list[i].osd_num,
.outdated = true,
});
if (!(state & (OBJ_INCOMPLETE | OBJ_DEGRADED)))
{
state |= OBJ_MISPLACED;
pg->state = pg->state | PG_HAS_MISPLACED;
}
}
}
}
if (target_ver < max_ver)
{
pg->ver_override[oid] = target_ver;
}
if (state == 0)
{
pg->clean_count++;
}
else
{
printf("Object is degraded: inode=%lu stripe=%lu version=%lu/%lu\n", st.oid.inode, st.oid.stripe, st.target_ver, st.max_ver);
state = OBJ_DEGRADED;
pg.state = pg.state | PG_HAS_DEGRADED;
}
if (st.n_copies > pg.pg_size)
{
state |= OBJ_OVERCOPIED;
pg.state = pg.state | PG_HAS_UNCLEAN;
}
if (st.n_stable < st.n_copies)
{
state |= OBJ_NEEDS_STABLE;
pg.state = pg.state | PG_HAS_UNCLEAN;
}
if (st.target_ver < st.max_ver || st.has_old_unstable)
{
state |= OBJ_NEEDS_ROLLBACK;
pg.state = pg.state | PG_HAS_UNCLEAN;
pg.ver_override[st.oid] = st.target_ver;
}
if (st.is_buggy)
{
state |= OBJ_BUGGY;
// FIXME: bring pg offline
throw std::runtime_error("buggy object state");
}
if (state != OBJ_CLEAN)
{
st.osd_set.clear();
for (int i = st.ver_start; i < st.ver_end; i++)
{
st.osd_set.push_back((pg_obj_loc_t){
.role = (all[i].oid.stripe & STRIPE_MASK),
.osd_num = all[i].osd_num,
.stable = all[i].is_stable,
});
}
std::sort(st.osd_set.begin(), st.osd_set.end());
auto it = pg.state_dict.find(st.osd_set);
if (it == pg.state_dict.end())
auto it = pg->state_dict.find(osd_set);
if (it == pg->state_dict.end())
{
std::vector<uint64_t> read_target;
read_target.resize(pg.pg_size);
for (int i = 0; i < pg.pg_size; i++)
if (replicated)
{
read_target[i] = 0;
for (auto & o: osd_set)
{
if (!o.outdated)
{
read_target.push_back(o.osd_num);
}
}
while (read_target.size() < pg->pg_size)
{
// FIXME: This is because we then use .data() and assume it's at least <pg_size> long
read_target.push_back(0);
}
}
for (auto & o: st.osd_set)
else
{
read_target[o.role] = o.osd_num;
read_target.resize(pg->pg_size);
for (int i = 0; i < pg->pg_size; i++)
{
read_target[i] = 0;
}
for (auto & o: osd_set)
{
if (!o.outdated)
{
read_target[o.role] = o.osd_num;
}
}
}
pg.state_dict[st.osd_set] = {
pg->state_dict[osd_set] = {
.read_target = read_target,
.osd_set = st.osd_set,
.osd_set = osd_set,
.state = state,
.object_count = 1,
};
it = pg.state_dict.find(st.osd_set);
it = pg->state_dict.find(osd_set);
}
else
{
it->second.object_count++;
}
pg.obj_states[st.oid] = &it->second;
if (st.target_ver < st.max_ver)
if (state & OBJ_INCOMPLETE)
{
pg.ver_override[st.oid] = st.target_ver;
pg->incomplete_objects[oid] = &it->second;
}
if (state & (OBJ_NEEDS_ROLLBACK | OBJ_NEEDS_STABLE))
else if (state & OBJ_DEGRADED)
{
spp::sparse_hash_map<obj_piece_id_t, obj_piece_ver_t> pieces;
for (int i = st.obj_start; i < st.obj_end; i++)
{
auto & pcs = pieces[(obj_piece_id_t){ .oid = all[i].oid, .osd_num = all[i].osd_num }];
if (!pcs.max_ver)
{
pcs.max_ver = all[i].version;
}
if (all[i].is_stable && !pcs.stable_ver)
{
pcs.stable_ver = all[i].version;
}
}
for (auto pp: pieces)
{
auto & pcs = pp.second;
if (pcs.stable_ver < pcs.max_ver)
{
auto & act = obj_stab_actions[pp.first];
if (pcs.max_ver > st.target_ver)
{
act.rollback = true;
act.rollback_to = st.target_ver;
}
else if (pcs.max_ver < st.target_ver && pcs.stable_ver < pcs.max_ver)
{
act.rollback = true;
act.rollback_to = pcs.stable_ver;
}
if (pcs.max_ver >= st.target_ver && pcs.stable_ver < st.target_ver)
{
act.make_stable = true;
act.stable_to = st.target_ver;
}
}
}
pg->degraded_objects[oid] = &it->second;
}
else
{
pg->misplaced_objects[oid] = &it->second;
}
}
else
pg.clean_count++;
pg.total_count++;
}
// FIXME: Write at least some tests for this function
void pg_t::calc_object_states()
void pg_t::calc_object_states(int log_level)
{
auto & pg = *this;
// Copy all object lists into one array
std::vector<obj_ver_role> all;
auto ps = pg.peering_state;
pg_obj_state_check_t st;
st.log_level = log_level;
st.pg = this;
st.replicated = (this->scheme == POOL_SCHEME_REPLICATED);
auto ps = peering_state;
epoch = 0;
for (auto it: ps->list_results)
{
auto nstab = it.second.stable_count;
auto n = it.second.total_count;
auto osd_num = it.first;
uint64_t start = all.size();
all.resize(start + n);
uint64_t start = st.list.size();
st.list.resize(start + n);
obj_ver_id *ov = it.second.buf;
for (uint64_t i = 0; i < n; i++, ov++)
{
all[start+i] = {
if ((ov->version >> (64-PG_EPOCH_BITS)) > epoch)
{
epoch = (ov->version >> (64-PG_EPOCH_BITS));
}
st.list[start+i] = {
.oid = ov->oid,
.version = ov->version,
.osd_num = osd_num,
@ -165,101 +426,33 @@ void pg_t::calc_object_states()
}
ps->list_results.clear();
// Sort
std::sort(all.begin(), all.end());
std::sort(st.list.begin(), st.list.end());
// Walk over it and check object states
pg.clean_count = 0;
pg.total_count = 0;
pg.state = 0;
int replica = 0;
pg_obj_state_check_t st;
for (int i = 0; i < all.size(); i++)
st.walk();
if (this->state & (PG_DEGRADED|PG_LEFT_ON_DEAD))
{
if (st.oid.inode != all[i].oid.inode ||
st.oid.stripe != (all[i].oid.stripe & ~STRIPE_MASK))
{
if (st.oid.inode != 0)
{
// Remember object state
st.obj_end = st.ver_end = i;
remember_object(st, all);
}
st.obj_start = st.ver_start = i;
st.oid = { .inode = all[i].oid.inode, .stripe = all[i].oid.stripe & ~STRIPE_MASK };
st.max_ver = st.target_ver = all[i].version;
st.has_roles = st.n_copies = st.n_roles = st.n_stable = st.n_matched = 0;
st.is_buggy = st.has_old_unstable = false;
}
else if (st.target_ver != all[i].version)
{
if (st.n_stable > 0 || st.n_roles >= pg.pg_minsize)
{
// Last processed version is either recoverable or stable, choose it as target and skip previous versions
st.ver_end = i;
i++;
while (i < all.size() && st.oid.inode == all[i].oid.inode &&
st.oid.stripe == (all[i].oid.stripe & ~STRIPE_MASK))
{
if (!all[i].is_stable)
{
st.has_old_unstable = true;
}
i++;
}
st.obj_end = i;
i--;
continue;
}
else
{
// Last processed version is unstable and unrecoverable
// We'll know that because target_ver < max_ver
st.ver_start = i;
st.target_ver = all[i].version;
st.has_roles = st.n_copies = st.n_roles = st.n_stable = st.n_matched = 0;
}
}
replica = (all[i].oid.stripe & STRIPE_MASK);
st.n_copies++;
if (replica >= pg.pg_size)
{
// FIXME In the future, check it against the PG epoch number to handle replication factor/scheme changes
st.is_buggy = true;
}
else
{
if (all[i].is_stable)
{
st.n_stable++;
}
if (pg.cur_set[replica] == all[i].osd_num)
{
st.n_matched++;
}
if (!(st.has_roles & (1 << replica)))
{
st.has_roles = st.has_roles | (1 << replica);
st.n_roles++;
}
}
assert(epoch != ((1ul << PG_EPOCH_BITS)-1));
epoch++;
}
if (st.oid.inode != 0)
{
// Remember object state
st.obj_end = st.ver_end = all.size();
remember_object(st, all);
}
if (pg.pg_cursize < pg.pg_size)
{
pg.state = pg.state | PG_DEGRADED;
}
printf(
"PG %u is active%s%s%s%s%s (%lu objects)\n", pg.pg_num,
(pg.state & PG_DEGRADED) ? " + degraded" : "",
(pg.state & PG_HAS_UNFOUND) ? " + has_unfound" : "",
(pg.state & PG_HAS_DEGRADED) ? " + has_degraded" : "",
(pg.state & PG_HAS_MISPLACED) ? " + has_misplaced" : "",
(pg.state & PG_HAS_UNCLEAN) ? " + has_unclean" : "",
pg.total_count
);
pg.state = pg.state | PG_ACTIVE;
}
void pg_t::print_state()
{
printf(
"[PG %u] is %s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pg_num,
(state & PG_STARTING) ? "starting" : "",
(state & PG_OFFLINE) ? "offline" : "",
(state & PG_PEERING) ? "peering" : "",
(state & PG_INCOMPLETE) ? "incomplete" : "",
(state & PG_ACTIVE) ? "active" : "",
(state & PG_STOPPING) ? "stopping" : "",
(state & PG_DEGRADED) ? " + degraded" : "",
(state & PG_HAS_INCOMPLETE) ? " + has_incomplete" : "",
(state & PG_HAS_DEGRADED) ? " + has_degraded" : "",
(state & PG_HAS_MISPLACED) ? " + has_misplaced" : "",
(state & PG_HAS_UNCLEAN) ? " + has_unclean" : "",
(state & PG_HAS_INVALID) ? " + has_invalid" : "",
(state & PG_LEFT_ON_DEAD) ? " + left_on_dead" : "",
total_count
);
}

View File

@ -1,43 +1,24 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <map>
#include <unordered_map>
#include <vector>
#include <algorithm>
#include "cpp-btree/btree_map.h"
#include "object_id.h"
#include "osd_ops.h"
#include "pg_states.h"
#include "sparsepp/sparsepp/spp.h"
// Placement group states
// Exactly one of these:
#define PG_OFFLINE (1<<0)
#define PG_PEERING (1<<1)
#define PG_INCOMPLETE (1<<2)
#define PG_ACTIVE (1<<3)
// Plus any of these:
#define PG_DEGRADED (1<<4)
#define PG_HAS_UNFOUND (1<<5)
#define PG_HAS_DEGRADED (1<<6)
#define PG_HAS_MISPLACED (1<<7)
#define PG_HAS_UNCLEAN (1<<8)
// FIXME: Safe default that doesn't depend on parity_block_size of pg_parity_size
#define STRIPE_MASK ((uint64_t)4096 - 1)
// OSD object states
#define OBJ_CLEAN 0x01
#define OBJ_MISPLACED 0x02
#define OBJ_DEGRADED 0x03
#define OBJ_INCOMPLETE 0x04
#define OBJ_NEEDS_STABLE 0x10000
#define OBJ_NEEDS_ROLLBACK 0x20000
#define OBJ_OVERCOPIED 0x40000
#define OBJ_BUGGY 0x80000
#define PG_EPOCH_BITS 48
struct pg_obj_loc_t
{
uint64_t role;
osd_num_t osd_num;
bool stable;
bool outdated;
};
typedef std::vector<pg_obj_loc_t> pg_osd_set_t;
@ -64,28 +45,10 @@ struct osd_op_t;
struct pg_peering_state_t
{
// osd_num -> list result
spp::sparse_hash_map<osd_num_t, osd_op_t*> list_ops;
spp::sparse_hash_map<osd_num_t, pg_list_result_t> list_results;
int list_done = 0;
};
struct pg_obj_state_check_t
{
int obj_start = 0, obj_end = 0, ver_start = 0, ver_end = 0;
object_id oid = { 0 };
uint64_t max_ver = 0;
uint64_t target_ver = 0;
uint64_t n_copies = 0, has_roles = 0, n_roles = 0, n_stable = 0, n_matched = 0;
bool is_buggy = false, has_old_unstable = false;
pg_osd_set_t osd_set;
};
struct obj_ver_role
{
object_id oid;
uint64_t version;
uint64_t osd_num;
bool is_stable;
std::unordered_map<osd_num_t, osd_op_t*> list_ops;
std::unordered_map<osd_num_t, pg_list_result_t> list_results;
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
};
struct obj_piece_id_t
@ -94,60 +57,67 @@ struct obj_piece_id_t
uint64_t osd_num;
};
struct obj_piece_ver_t
{
uint64_t max_ver = 0;
uint64_t stable_ver = 0;
};
struct obj_stab_action_t
struct flush_action_t
{
bool rollback = false, make_stable = false;
uint64_t stable_to = 0, rollback_to = 0;
bool submitted = false;
};
struct pg_flush_batch_t
{
std::map<osd_num_t, std::vector<obj_ver_id>> rollback_lists;
std::map<osd_num_t, std::vector<obj_ver_id>> stable_lists;
int flush_ops = 0, flush_done = 0;
int flush_objects = 0;
};
struct pg_t
{
int state;
uint64_t pg_cursize = 3, pg_size = 3, pg_minsize = 2;
pg_num_t pg_num;
int state = 0;
uint64_t scheme = 0;
uint64_t pg_cursize = 0, pg_size = 0, pg_minsize = 0;
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
uint64_t clean_count = 0, total_count = 0;
// epoch number - should increase with each non-clean activation of the PG
uint64_t epoch = 0, reported_epoch = 0;
// target history and all potential peers
std::vector<std::vector<osd_num_t>> target_history;
std::vector<osd_num_t> all_peers;
bool history_changed = false;
// peer list from the last peering event
std::vector<osd_num_t> cur_peers;
// target_set is the "correct" peer OSD set for this PG
std::vector<osd_num_t> target_set;
// cur_set is the current set of connected peer OSDs for this PG
// cur_set = (role => osd_num or UINT64_MAX if missing). role numbers begin with zero
std::vector<osd_num_t> cur_set;
// same thing in state_dict-like format
pg_osd_set_t cur_loc_set;
// moved object map. by default, each object is considered to reside on the cur_set.
// this map stores all objects that differ.
// it may consume up to ~ (raw storage / object size) * 24 bytes in the worst case scenario
// which is up to ~192 MB per 1 TB in the worst case scenario
std::map<pg_osd_set_t, pg_osd_set_state_t> state_dict;
spp::sparse_hash_map<object_id, pg_osd_set_state_t*> obj_states;
std::map<obj_piece_id_t, obj_stab_action_t> obj_stab_actions;
spp::sparse_hash_map<object_id, uint64_t> ver_override;
btree::btree_map<object_id, pg_osd_set_state_t*> incomplete_objects, misplaced_objects, degraded_objects;
std::map<obj_piece_id_t, flush_action_t> flush_actions;
btree::btree_map<object_id, uint64_t> ver_override;
pg_peering_state_t *peering_state = NULL;
pg_flush_batch_t *flush_batch = NULL;
int inflight = 0; // including write_queue
std::multimap<object_id, osd_op_t*> write_queue;
void calc_object_states();
void remember_object(pg_obj_state_check_t &st, std::vector<obj_ver_role> &all);
void calc_object_states(int log_level);
void print_state();
};
inline bool operator < (const pg_obj_loc_t &a, const pg_obj_loc_t &b)
{
return a.role < b.role || a.role == b.role && a.osd_num < b.osd_num ||
a.role == b.role && a.osd_num == b.osd_num && a.stable < b.stable;
}
inline bool operator < (const obj_ver_role & a, const obj_ver_role & b)
{
// ORDER BY inode ASC, stripe & ~STRIPE_MASK ASC, version DESC, osd_num ASC
return a.oid.inode < b.oid.inode || a.oid.inode == b.oid.inode && (
(a.oid.stripe & ~STRIPE_MASK) < (b.oid.stripe & ~STRIPE_MASK) ||
(a.oid.stripe & ~STRIPE_MASK) == (b.oid.stripe & ~STRIPE_MASK) && (
a.version > b.version || a.version == b.version && a.osd_num < b.osd_num
)
);
return a.outdated < b.outdated ||
a.outdated == b.outdated && a.role < b.role ||
a.outdated == b.outdated && a.role == b.role && a.osd_num < b.osd_num;
}
inline bool operator == (const obj_piece_id_t & a, const obj_piece_id_t & b)
@ -172,7 +142,6 @@ namespace std
// Copy-pasted from spp::hash_combine()
seed ^= (e.role + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (e.osd_num + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= ((e.stable ? 1 : 0) + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
}
return seed;
}

57
osd_peering_pg_test.cpp Normal file
View File

@ -0,0 +1,57 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#define _LARGEFILE64_SOURCE
#include "osd_peering_pg.h"
#define STRIPE_SHIFT 12
/**
* TODO tests for object & pg state calculation.
*
* 1) pg=1,2,3. objects:
* v1=1s,2s,3s -> clean
* v1=1s,2s,3 v2=1s,2s,_ -> degraded + needs_rollback
* v1=1s,2s,_ -> degraded
* v1=1s,2s,3s v2=1,6,_ -> degraded + needs_stabilize
* v1=2s,1s,3s -> misplaced
* v1=4,5,6 -> misplaced + needs_stabilize
* v1=1s,2s,6s -> misplaced
* 2) ...
*/
int main(int argc, char *argv[])
{
pg_t pg = {
.state = PG_PEERING,
.pg_num = 1,
.target_set = { 1, 2, 3 },
.cur_set = { 1, 2, 3 },
.peering_state = new pg_peering_state_t(),
};
for (uint64_t osd_num = 1; osd_num <= 3; osd_num++)
{
pg_list_result_t r = {
.buf = (obj_ver_id*)malloc_or_die(sizeof(obj_ver_id) * 1024*1024*8),
.total_count = 1024*1024*8,
.stable_count = (uint64_t)(1024*1024*8 - (osd_num == 1 ? 10 : 0)),
};
for (uint64_t i = 0; i < r.total_count; i++)
{
r.buf[i] = {
.oid = {
.inode = 1,
.stripe = (i << STRIPE_SHIFT) | (osd_num-1),
},
.version = (uint64_t)(osd_num == 1 && i >= r.total_count - 10 ? 2 : 1),
};
}
pg.peering_state->list_results[osd_num] = r;
}
pg.calc_object_states(0);
printf("deviation variants=%ld clean=%lu\n", pg.state_dict.size(), pg.clean_count);
for (auto it: pg.state_dict)
{
printf("dev: state=%lx\n", it.second.state);
}
return 0;
}

File diff suppressed because it is too large Load Diff

41
osd_primary.h Normal file
View File

@ -0,0 +1,41 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include "osd.h"
#include "osd_rmw.h"
#define SUBMIT_READ 0
#define SUBMIT_RMW_READ 1
#define SUBMIT_WRITE 2
struct unstable_osd_num_t
{
osd_num_t osd_num;
int start, len;
};
struct osd_primary_op_data_t
{
int st = 0;
pg_num_t pg_num;
object_id oid;
uint64_t target_ver;
uint64_t fact_ver = 0;
uint64_t scheme = 0;
int n_subops = 0, done = 0, errors = 0, epipe = 0;
int degraded = 0, pg_size, pg_minsize;
osd_rmw_stripe_t *stripes;
osd_op_t *subops = NULL;
uint64_t *prev_set = NULL;
pg_osd_set_state_t *object_state = NULL;
// for sync. oops, requires freeing
std::vector<unstable_osd_num_t> *unstable_write_osds = NULL;
pool_pg_num_t *dirty_pgs = NULL;
int dirty_pg_count = 0;
osd_num_t *dirty_osds = NULL;
int dirty_osd_count = 0;
obj_ver_id *unstable_writes = NULL;
};

575
osd_primary_subops.cpp Normal file
View File

@ -0,0 +1,575 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "osd_primary.h"
void osd_t::autosync()
{
// FIXME Autosync based on the number of unstable writes to prevent
// "journal_sector_buffer_count is too low for this batch" errors
if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
{
autosync_op = new osd_op_t();
autosync_op->op_type = OSD_OP_IN;
autosync_op->req = {
.sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = 1,
.opcode = OSD_OP_SYNC,
},
},
};
autosync_op->callback = [this](osd_op_t *op)
{
if (op->reply.hdr.retval < 0)
{
printf("Warning: automatic sync resulted in an error: %ld (%s)\n", -op->reply.hdr.retval, strerror(-op->reply.hdr.retval));
}
delete autosync_op;
autosync_op = NULL;
};
exec_op(autosync_op);
}
}
void osd_t::finish_op(osd_op_t *cur_op, int retval)
{
inflight_ops--;
if (cur_op->op_data)
{
if (cur_op->op_data->pg_num > 0)
{
auto & pg = pgs[{ .pool_id = INODE_POOL(cur_op->op_data->oid.inode), .pg_num = cur_op->op_data->pg_num }];
pg.inflight--;
assert(pg.inflight >= 0);
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
{
finish_stop_pg(pg);
}
}
assert(!cur_op->op_data->subops);
assert(!cur_op->op_data->unstable_write_osds);
assert(!cur_op->op_data->unstable_writes);
assert(!cur_op->op_data->dirty_pgs);
free(cur_op->op_data);
cur_op->op_data = NULL;
}
if (!cur_op->peer_fd)
{
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(cur_op->callback)(cur_op);
}
else
{
// FIXME add separate magic number
auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end())
{
cur_op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
cur_op->reply.hdr.id = cur_op->req.hdr.id;
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
cur_op->reply.hdr.retval = retval;
c_cli.outbox_push(cur_op);
}
else
{
delete cur_op;
}
}
}
void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
{
bool wr = submit_type == SUBMIT_WRITE;
osd_primary_op_data_t *op_data = cur_op->op_data;
osd_rmw_stripe_t *stripes = op_data->stripes;
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
// Allocate subops
int n_subops = 0, zero_read = -1;
for (int role = 0; role < pg_size; role++)
{
if (osd_set[role] == this->osd_num || osd_set[role] != 0 && zero_read == -1)
{
zero_read = role;
}
if (osd_set[role] != 0 && (wr || !rep && stripes[role].read_end != 0))
{
n_subops++;
}
}
if (!n_subops && (submit_type == SUBMIT_RMW_READ || rep))
{
n_subops = 1;
}
else
{
zero_read = -1;
}
osd_op_t *subops = new osd_op_t[n_subops];
op_data->fact_ver = 0;
op_data->done = op_data->errors = 0;
op_data->n_subops = n_subops;
op_data->subops = subops;
int i = 0;
for (int role = 0; role < pg_size; role++)
{
// We always submit zero-length writes to all replicas, even if the stripe is not modified
if (!(wr || !rep && stripes[role].read_end != 0 || zero_read == role))
{
continue;
}
osd_num_t role_osd_num = osd_set[role];
if (role_osd_num != 0)
{
int stripe_num = rep ? 0 : role;
if (role_osd_num == this->osd_num)
{
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op;
subops[i].bs_op = new blockstore_op_t({
.opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ),
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
.buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
});
#ifdef OSD_DEBUG
printf(
"Submit %s to local: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read",
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].bs_op->offset, subops[i].bs_op->len
);
#endif
bs->enqueue_op(subops[i].bs_op);
}
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
subops[i].req.sec_rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = (uint64_t)(wr ? (rep ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
},
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
};
#ifdef OSD_DEBUG
printf(
"Submit %s to osd %lu: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read", role_osd_num,
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].req.sec_rw.offset, subops[i].req.sec_rw.len
);
#endif
if (wr)
{
if (stripes[stripe_num].write_end > stripes[stripe_num].write_start)
{
subops[i].iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
}
}
else
{
if (stripes[stripe_num].read_end > stripes[stripe_num].read_start)
{
subops[i].iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
}
}
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->req.hdr.opcode == OSD_OP_SEC_WRITE &&
subop->reply.hdr.retval != subop->req.sec_rw.len ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// write operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
i++;
}
}
}
static uint64_t bs_op_to_osd_op[] = {
0,
OSD_OP_SEC_READ, // BS_OP_READ = 1
OSD_OP_SEC_WRITE, // BS_OP_WRITE = 2
OSD_OP_SEC_WRITE_STABLE, // BS_OP_WRITE_STABLE = 3
OSD_OP_SEC_SYNC, // BS_OP_SYNC = 4
OSD_OP_SEC_STABILIZE, // BS_OP_STABLE = 5
OSD_OP_SEC_DELETE, // BS_OP_DELETE = 6
OSD_OP_SEC_LIST, // BS_OP_LIST = 7
OSD_OP_SEC_ROLLBACK, // BS_OP_ROLLBACK = 8
OSD_OP_TEST_SYNC_STAB_ALL, // BS_OP_SYNC_STAB_ALL = 9
};
void osd_t::handle_primary_bs_subop(osd_op_t *subop)
{
osd_op_t *cur_op = (osd_op_t*)subop->op_type;
blockstore_op_t *bs_op = subop->bs_op;
int expected = bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE
|| bs_op->opcode == BS_OP_WRITE_STABLE ? bs_op->len : 0;
if (bs_op->retval != expected && bs_op->opcode != BS_OP_READ)
{
// die
throw std::runtime_error(
"local blockstore modification failed (opcode = "+std::to_string(bs_op->opcode)+
" retval = "+std::to_string(bs_op->retval)+")"
);
}
add_bs_subop_stats(subop);
subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
subop->reply.hdr.retval = bs_op->retval;
if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
{
subop->req.sec_rw.len = bs_op->len;
subop->reply.sec_rw.version = bs_op->version;
}
delete bs_op;
subop->bs_op = NULL;
handle_primary_subop(subop, cur_op);
}
void osd_t::add_bs_subop_stats(osd_op_t *subop)
{
// Include local blockstore ops in statistics
uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
c_cli.stats.op_stat_count[opcode]++;
if (!c_cli.stats.op_stat_count[opcode])
{
c_cli.stats.op_stat_count[opcode] = 1;
c_cli.stats.op_stat_sum[opcode] = 0;
c_cli.stats.op_stat_bytes[opcode] = 0;
}
c_cli.stats.op_stat_sum[opcode] += (
(tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
);
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
{
c_cli.stats.op_stat_bytes[opcode] += subop->bs_op->len;
}
}
void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
{
uint64_t opcode = subop->req.hdr.opcode;
int retval = subop->reply.hdr.retval;
int expected = opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE
|| opcode == OSD_OP_SEC_WRITE_STABLE ? subop->req.sec_rw.len : 0;
osd_primary_op_data_t *op_data = cur_op->op_data;
if (retval != expected)
{
printf("%s subop failed: retval = %d (expected %d)\n", osd_op_names[opcode], retval, expected);
if (retval == -EPIPE)
{
op_data->epipe++;
}
op_data->errors++;
}
else
{
op_data->done++;
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE || opcode == OSD_OP_SEC_WRITE_STABLE)
{
uint64_t version = subop->reply.sec_rw.version;
#ifdef OSD_DEBUG
uint64_t peer_osd = c_cli.clients.find(subop->peer_fd) != c_cli.clients.end()
? c_cli.clients[subop->peer_fd].osd_num : osd_num;
printf("subop %lu from osd %lu: version = %lu\n", opcode, peer_osd, version);
#endif
if (op_data->fact_ver != 0 && op_data->fact_ver != version)
{
throw std::runtime_error(
"different fact_versions returned from "+std::string(osd_op_names[opcode])+
" subops: "+std::to_string(version)+" vs "+std::to_string(op_data->fact_ver)
);
}
op_data->fact_ver = version;
}
}
if ((op_data->errors + op_data->done) >= op_data->n_subops)
{
delete[] op_data->subops;
op_data->subops = NULL;
op_data->st++;
if (cur_op->req.hdr.opcode == OSD_OP_READ)
{
continue_primary_read(cur_op);
}
else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
{
continue_primary_write(cur_op);
}
else if (cur_op->req.hdr.opcode == OSD_OP_SYNC)
{
continue_primary_sync(cur_op);
}
else if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
{
continue_primary_del(cur_op);
}
else
{
throw std::runtime_error("BUG: unknown opcode");
}
}
}
void osd_t::cancel_primary_write(osd_op_t *cur_op)
{
if (cur_op->op_data && cur_op->op_data->subops)
{
// Primary-write operation is waiting for subops, subops
// are sent to peer OSDs, so we can't just throw them away.
// Mark them with an extra EPIPE.
cur_op->op_data->errors++;
cur_op->op_data->epipe++;
cur_op->op_data->done--; // Caution: `done` must be signed because may become -1 here
}
else
{
finish_op(cur_op, -EPIPE);
}
}
static bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num)
{
for (uint64_t i = 0; i < size; i++)
{
if (osd_set[i] == osd_num)
{
return true;
}
}
return false;
}
void osd_t::submit_primary_del_subops(osd_op_t *cur_op, osd_num_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
int extra_chunks = 0;
// ordered comparison for EC/XOR, unordered for replicated pools
for (auto & chunk: loc_set)
{
if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role]))
{
extra_chunks++;
}
}
op_data->n_subops = extra_chunks;
op_data->done = op_data->errors = 0;
if (!extra_chunks)
{
return;
}
osd_op_t *subops = new osd_op_t[extra_chunks];
op_data->subops = subops;
int i = 0;
for (auto & chunk: loc_set)
{
if (!cur_set || (rep ? !contains_osd(cur_set, set_size, chunk.osd_num) : chunk.osd_num != cur_set[chunk.role]))
{
int stripe_num = op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role;
if (chunk.osd_num == this->osd_num)
{
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op;
subops[i].bs_op = new blockstore_op_t({
.opcode = BS_OP_DELETE,
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
// Same version as write
.version = op_data->fact_ver,
});
bs->enqueue_op(subops[i].bs_op);
}
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
subops[i].req.sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_DELETE,
},
.oid = {
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
// Same version as write
.version = op_data->fact_ver,
};
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// delete operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
i++;
}
}
}
void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
int n_osds = op_data->dirty_osd_count;
osd_op_t *subops = new osd_op_t[n_osds];
op_data->done = op_data->errors = 0;
op_data->n_subops = n_osds;
op_data->subops = subops;
for (int i = 0; i < n_osds; i++)
{
osd_num_t sync_osd = op_data->dirty_osds[i];
if (sync_osd == this->osd_num)
{
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op;
subops[i].bs_op = new blockstore_op_t({
.opcode = BS_OP_SYNC,
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
});
bs->enqueue_op(subops[i].bs_op);
}
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(sync_osd);
subops[i].req.sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
};
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// sync operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
}
}
void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
int n_osds = op_data->unstable_write_osds->size();
osd_op_t *subops = new osd_op_t[n_osds];
op_data->done = op_data->errors = 0;
op_data->n_subops = n_osds;
op_data->subops = subops;
for (int i = 0; i < n_osds; i++)
{
auto & stab_osd = (*(op_data->unstable_write_osds))[i];
if (stab_osd.osd_num == this->osd_num)
{
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op;
subops[i].bs_op = new blockstore_op_t({
.opcode = BS_OP_STABLE,
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
.len = (uint32_t)stab_osd.len,
.buf = (void*)(op_data->unstable_writes + stab_osd.start),
});
bs->enqueue_op(subops[i].bs_op);
}
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(stab_osd.osd_num);
subops[i].req.sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_STABILIZE,
},
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
};
subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// sync operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
}
}
void osd_t::pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval)
{
auto st_it = pg.write_queue.find(oid), it = st_it;
finish_op(first_op, retval);
if (it != pg.write_queue.end() && it->second == first_op)
{
it++;
}
else
{
// Write queue doesn't match the first operation.
// first_op is a leftover operation from the previous peering of the same PG.
return;
}
while (it != pg.write_queue.end() && it->first == oid)
{
finish_op(it->second, retval);
it++;
}
if (st_it != it)
{
pg.write_queue.erase(st_it, it);
}
}

View File

@ -1,204 +0,0 @@
#include "osd.h"
void osd_t::read_requests()
{
for (int i = 0; i < read_ready_clients.size(); i++)
{
int peer_fd = read_ready_clients[i];
auto & cl = clients[peer_fd];
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
read_ready_clients.erase(read_ready_clients.begin(), read_ready_clients.begin() + i);
return;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
if (!cl.read_buf)
{
// no reads in progress
// so this is either a new command or a reply to a previously sent command
if (!cl.read_op)
{
cl.read_op = new osd_op_t;
cl.read_op->peer_fd = peer_fd;
}
cl.read_op->op_type = OSD_OP_IN;
cl.read_buf = &cl.read_op->req.buf;
cl.read_remaining = OSD_PACKET_SIZE;
cl.read_state = CL_READ_OP;
}
cl.read_iov.iov_base = cl.read_buf;
cl.read_iov.iov_len = cl.read_remaining;
cl.read_msg.msg_iov = &cl.read_iov;
cl.read_msg.msg_iovlen = 1;
data->callback = [this, peer_fd](ring_data_t *data) { handle_read(data, peer_fd); };
my_uring_prep_recvmsg(sqe, peer_fd, &cl.read_msg, 0);
}
read_ready_clients.clear();
}
void osd_t::handle_read(ring_data_t *data, int peer_fd)
{
auto cl_it = clients.find(peer_fd);
if (cl_it != clients.end())
{
auto & cl = cl_it->second;
if (data->res == -EAGAIN)
{
cl.read_ready--;
if (cl.read_ready > 0)
read_ready_clients.push_back(peer_fd);
return;
}
else if (data->res < 0)
{
// this is a client socket, so don't panic. just disconnect it
printf("Client %d socket read error: %d (%s). Disconnecting client\n", peer_fd, -data->res, strerror(-data->res));
stop_client(peer_fd);
return;
}
read_ready_clients.push_back(peer_fd);
if (data->res > 0)
{
cl.read_remaining -= data->res;
cl.read_buf += data->res;
if (cl.read_remaining <= 0)
{
cl.read_buf = NULL;
if (cl.read_state == CL_READ_OP)
{
if (cl.read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
{
handle_reply_hdr(&cl);
}
else
{
handle_op_hdr(&cl);
}
}
else if (cl.read_state == CL_READ_DATA)
{
// Operation is ready
exec_op(cl.read_op);
cl.read_op = NULL;
cl.read_state = 0;
}
else if (cl.read_state == CL_READ_REPLY_DATA)
{
// Reply is ready
auto req_it = cl.sent_ops.find(cl.read_reply_id);
osd_op_t *request = req_it->second;
cl.sent_ops.erase(req_it);
cl.read_reply_id = 0;
cl.read_state = 0;
// Measure subop latency
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
subop_stat_count[request->req.hdr.opcode]++;
subop_stat_sum[request->req.hdr.opcode] += (
(tv_end.tv_sec - request->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - request->tv_begin.tv_nsec)/1000
);
request->callback(request);
}
}
}
}
}
void osd_t::handle_op_hdr(osd_client_t *cl)
{
osd_op_t *cur_op = cl->read_op;
if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ)
{
if (cur_op->req.sec_rw.len > 0)
cur_op->buf = memalign(512, cur_op->req.sec_rw.len);
cl->read_remaining = 0;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
{
if (cur_op->req.sec_rw.len > 0)
cur_op->buf = memalign(512, cur_op->req.sec_rw.len);
cl->read_remaining = cur_op->req.sec_rw.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK)
{
if (cur_op->req.sec_stab.len > 0)
cur_op->buf = memalign(512, cur_op->req.sec_stab.len);
cl->read_remaining = cur_op->req.sec_stab.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_READ)
{
if (cur_op->req.rw.len > 0)
cur_op->buf = memalign(512, cur_op->req.rw.len);
cl->read_remaining = 0;
}
else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
{
if (cur_op->req.rw.len > 0)
cur_op->buf = memalign(512, cur_op->req.rw.len);
cl->read_remaining = cur_op->req.rw.len;
}
if (cl->read_remaining > 0)
{
// Read data
cl->read_buf = cur_op->buf;
cl->read_state = CL_READ_DATA;
}
else
{
// Operation is ready
cl->read_op = NULL;
cl->read_state = 0;
exec_op(cur_op);
}
}
void osd_t::handle_reply_hdr(osd_client_t *cl)
{
osd_op_t *cur_op = cl->read_op;
auto req_it = cl->sent_ops.find(cur_op->req.hdr.id);
if (req_it == cl->sent_ops.end())
{
// Command out of sync. Drop connection
printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cur_op->req.hdr.id);
stop_client(cl->peer_fd);
return;
}
osd_op_t *op = req_it->second;
memcpy(op->reply.buf, cur_op->req.buf, OSD_PACKET_SIZE);
if (op->reply.hdr.opcode == OSD_OP_SECONDARY_READ &&
op->reply.hdr.retval > 0)
{
// Read data. In this case we assume that the buffer is preallocated by the caller (!)
assert(op->buf);
cl->read_state = CL_READ_REPLY_DATA;
cl->read_reply_id = op->req.hdr.id;
cl->read_buf = op->buf;
cl->read_remaining = op->reply.hdr.retval;
}
else if (op->reply.hdr.opcode == OSD_OP_SECONDARY_LIST &&
op->reply.hdr.retval > 0)
{
op->buf = memalign(512, sizeof(obj_ver_id) * op->reply.hdr.retval);
cl->read_state = CL_READ_REPLY_DATA;
cl->read_reply_id = op->req.hdr.id;
cl->read_buf = op->buf;
cl->read_remaining = sizeof(obj_ver_id) * op->reply.hdr.retval;
}
else
{
cl->read_state = 0;
cl->sent_ops.erase(req_it);
// Measure subop latency
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
subop_stat_count[op->req.hdr.opcode]++;
subop_stat_sum[op->req.hdr.opcode] += (
(tv_end.tv_sec - op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - op->tv_begin.tv_nsec)/1000
);
op->callback(op);
}
}

View File

@ -1,7 +1,11 @@
#include <malloc.h>
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <string.h>
#include <assert.h>
#include "xor.h"
#include "osd_rmw.h"
#include "malloc_or_die.h"
static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & stripe)
{
@ -55,6 +59,11 @@ static inline void cover_read(uint32_t start, uint32_t end, osd_rmw_stripe_t & s
void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t end, osd_rmw_stripe_t *stripes)
{
if (end == 0)
{
// Zero length request - offset doesn't matter
return;
}
end = start+end;
for (int role = 0; role < pg_minsize; role++)
{
@ -66,7 +75,7 @@ void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start,
}
}
void reconstruct_stripe(osd_rmw_stripe_t *stripes, int pg_size, int role)
void reconstruct_stripe_xor(osd_rmw_stripe_t *stripes, int pg_size, int role)
{
int prev = -2;
for (int other = 0; other < pg_size; other++)
@ -79,18 +88,21 @@ void reconstruct_stripe(osd_rmw_stripe_t *stripes, int pg_size, int role)
}
else if (prev >= 0)
{
assert(stripes[role].read_start >= stripes[prev].read_start &&
stripes[role].read_start >= stripes[other].read_start);
memxor(
stripes[prev].read_buf + (stripes[prev].read_start - stripes[role].read_start),
stripes[other].read_buf + (stripes[other].read_start - stripes[other].read_start),
stripes[prev].read_buf + (stripes[role].read_start - stripes[prev].read_start),
stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
);
prev = -1;
}
else
{
assert(stripes[role].read_start >= stripes[other].read_start);
memxor(
stripes[role].read_buf,
stripes[other].read_buf + (stripes[other].read_start - stripes[role].read_start),
stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
);
}
@ -143,7 +155,7 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
}
}
// Allocate buffer
void *buf = memalign(MEM_ALIGNMENT, buf_size);
void *buf = memalign_or_die(MEM_ALIGNMENT, buf_size);
uint64_t buf_pos = add_size;
for (int role = 0; role < read_pg_size; role++)
{
@ -156,10 +168,11 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
return buf;
}
void* calc_rmw_reads(void *write_buf, osd_rmw_stripe_t *stripes, uint64_t *osd_set, uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize)
void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size)
{
// Generic parity modification (read-modify-write) algorithm
// Reconstruct -> Read -> Calc parity -> Write
// Read -> Reconstruct missing chunks -> Calc parity chunks -> Write
// Now we always read continuous ranges. This means that an update of the beginning
// of one data stripe and the end of another will lead to a read of full paired stripes.
// FIXME: (Maybe) read small individual ranges in that case instead.
@ -174,64 +187,89 @@ void* calc_rmw_reads(void *write_buf, osd_rmw_stripe_t *stripes, uint64_t *osd_s
stripes[role].write_end = stripes[role].req_end;
}
}
for (int role = 0; role < pg_minsize; role++)
{
cover_read(start, end, stripes[role]);
}
int has_parity = 0;
int write_parity = 0;
for (int role = pg_minsize; role < pg_size; role++)
{
if (osd_set[role] != 0)
if (write_osd_set[role] != 0)
{
has_parity++;
write_parity = 1;
stripes[role].write_start = start;
stripes[role].write_end = end;
}
else
stripes[role].missing = true;
}
if (write_parity)
{
for (int role = 0; role < pg_minsize; role++)
{
cover_read(start, end, stripes[role]);
}
}
if (write_osd_set != read_osd_set)
{
pg_cursize = 0;
// Object is degraded/misplaced and will be moved to <write_osd_set>
for (int role = 0; role < pg_size; role++)
{
if (write_osd_set[role] != read_osd_set[role] && write_osd_set[role] != 0)
{
// We need to get data for any moved / recovered chunk
// And we need a continuous write buffer so we'll only optimize
// for the case when the whole chunk is ovewritten in the request
if (stripes[role].req_start != 0 ||
stripes[role].req_end != chunk_size)
{
stripes[role].read_start = 0;
stripes[role].read_end = chunk_size;
// Warning: We don't modify write_start/write_end here, we do it in calc_rmw_parity()
}
}
if (read_osd_set[role] != 0)
{
pg_cursize++;
}
}
}
if (pg_cursize < pg_size)
{
if (has_parity == 0)
// Some stripe(s) are missing, so we need to read parity
for (int role = 0; role < pg_size; role++)
{
// Parity is missing, we don't need to read anything
for (int role = 0; role < pg_minsize; role++)
if (read_osd_set[role] == 0)
{
stripes[role].read_end = 0;
}
}
else
{
// Other stripe(s) are missing
for (int role = 0; role < pg_minsize; role++)
{
if (osd_set[role] == 0 && stripes[role].read_end != 0)
stripes[role].missing = true;
if (stripes[role].read_end != 0)
{
stripes[role].missing = true;
for (int r2 = 0; r2 < pg_size; r2++)
int found = 0;
for (int r2 = 0; r2 < pg_size && found < pg_minsize; r2++)
{
// Read the non-covered range of <role> from all other stripes to reconstruct it
if (r2 != role && osd_set[r2] != 0)
// Read the non-covered range of <role> from at least <minsize> other stripes to reconstruct it
if (read_osd_set[r2] != 0)
{
extend_read(stripes[role].read_start, stripes[role].read_end, stripes[r2]);
found++;
}
}
if (found < pg_minsize)
{
// FIXME Object is incomplete - refuse partial overwrite
assert(0);
}
}
}
}
}
// Allocate read buffers
void *rmw_buf = alloc_read_buffer(stripes, pg_size, has_parity * (end - start));
// Position parity & write buffers
void *rmw_buf = alloc_read_buffer(stripes, pg_size, (write_parity ? pg_size-pg_minsize : 0) * (end - start));
// Position write buffers
uint64_t buf_pos = 0, in_pos = 0;
for (int role = 0; role < pg_size; role++)
{
if (stripes[role].req_end != 0)
{
stripes[role].write_buf = write_buf + in_pos;
stripes[role].write_buf = request_buf + in_pos;
in_pos += stripes[role].req_end - stripes[role].req_start;
}
else if (role >= pg_minsize && osd_set[role] != 0)
else if (role >= pg_minsize && write_osd_set[role] != 0 && end != 0)
{
stripes[role].write_buf = rmw_buf + buf_pos;
buf_pos += end - start;
@ -321,47 +359,94 @@ static void xor_multiple_buffers(buf_len_t *xor1, int n1, buf_len_t *xor2, int n
}
}
void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size)
void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size)
{
if (stripes[pg_size-1].missing)
{
// Parity OSD is unavailable
return;
}
int pg_minsize = pg_size-1;
for (int role = 0; role < pg_size; role++)
{
if (stripes[role].read_end != 0 && stripes[role].missing)
{
// Reconstruct missing stripe (EC k+1)
reconstruct_stripe(stripes, pg_size, role);
// Reconstruct missing stripe (XOR k+1)
reconstruct_stripe_xor(stripes, pg_size, role);
break;
}
}
// Calculate new parity (EC k+1)
int parity = pg_size-1, prev = -2;
auto wr_end = stripes[parity].write_end;
auto wr_start = stripes[parity].write_start;
for (int other = 0; other < pg_size-1; other++)
uint32_t start = 0, end = 0;
if (write_osd_set[pg_minsize] != 0 || write_osd_set != read_osd_set)
{
if (prev == -2)
// Required for the next two if()s
for (int role = 0; role < pg_minsize; role++)
{
prev = other;
}
else
{
int n1 = 0, n2 = 0;
buf_len_t xor1[3], xor2[3];
if (prev == -1)
if (stripes[role].req_end != 0)
{
xor1[n1++] = { .buf = stripes[parity].write_buf, .len = wr_end-wr_start };
start = !end || stripes[role].req_start < start ? stripes[role].req_start : start;
end = std::max(stripes[role].req_end, end);
}
}
}
if (write_osd_set != read_osd_set)
{
for (int role = 0; role < pg_minsize; role++)
{
if (write_osd_set[role] != read_osd_set[role] && write_osd_set[role] != 0 &&
(stripes[role].req_start != 0 || stripes[role].req_end != chunk_size))
{
// Copy modified chunk into the read buffer to write it back
memcpy(
stripes[role].read_buf + stripes[role].req_start,
stripes[role].write_buf,
stripes[role].req_end - stripes[role].req_start
);
stripes[role].write_buf = stripes[role].read_buf;
stripes[role].write_start = 0;
stripes[role].write_end = chunk_size;
}
}
}
if (write_osd_set[pg_minsize] != 0 && end != 0)
{
// Calculate new parity (XOR k+1)
int parity = pg_minsize, prev = -2;
for (int other = 0; other < pg_minsize; other++)
{
if (prev == -2)
{
prev = other;
}
else
{
get_old_new_buffers(stripes[prev], wr_start, wr_end, xor1, n1);
prev = -1;
int n1 = 0, n2 = 0;
buf_len_t xor1[3], xor2[3];
if (prev == -1)
{
xor1[n1++] = { .buf = stripes[parity].write_buf, .len = end-start };
}
else
{
get_old_new_buffers(stripes[prev], start, end, xor1, n1);
prev = -1;
}
get_old_new_buffers(stripes[other], start, end, xor2, n2);
xor_multiple_buffers(xor1, n1, xor2, n2, stripes[parity].write_buf, end-start);
}
}
}
if (write_osd_set != read_osd_set)
{
for (int role = pg_minsize; role < pg_size; role++)
{
if (write_osd_set[role] != read_osd_set[role] && (start != 0 || end != chunk_size))
{
// Copy new parity into the read buffer to write it back
memcpy(
stripes[role].read_buf + start,
stripes[role].write_buf,
end - start
);
stripes[role].write_buf = stripes[role].read_buf;
stripes[role].write_start = 0;
stripes[role].write_end = chunk_size;
}
get_old_new_buffers(stripes[other], wr_start, wr_end, xor2, n2);
xor_multiple_buffers(xor1, n1, xor2, n2, stripes[parity].write_buf, wr_end-wr_start);
}
}
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include <stdint.h>
@ -25,12 +28,13 @@ struct osd_rmw_stripe_t
void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t len, osd_rmw_stripe_t *stripes);
void reconstruct_stripe(osd_rmw_stripe_t *stripes, int pg_size, int role);
void reconstruct_stripe_xor(osd_rmw_stripe_t *stripes, int pg_size, int role);
int extend_missing_stripes(osd_rmw_stripe_t *stripes, osd_num_t *osd_set, int minsize, int size);
void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t add_size);
void* calc_rmw_reads(void *write_buf, osd_rmw_stripe_t *stripes, uint64_t *osd_set, uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize);
void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size);
void calc_rmw_parity(osd_rmw_stripe_t *stripes, int pg_size);
void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size);

View File

@ -1,17 +1,151 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <string.h>
#include "osd_rmw.cpp"
#include "test_pattern.h"
void dump_stripes(osd_rmw_stripe_t *stripes, int pg_size);
void test1();
void test4();
void test5();
void test6();
void test7();
void test8();
void test9();
/***
Cases:
1. split(offset=128K-4K, len=8K)
= [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 0 ] ]
2. read(offset=128K-4K, len=8K, osd_set=[1,0,3])
= { read: [ [ 0, 128K ], [ 0, 4K ], [ 0, 4K ] ] }
3. cover_read(0, 128K, { req: [ 128K-4K, 4K ] })
= { read: [ 0, 128K-4K ] }
4. write(offset=128K-4K, len=8K, osd_set=[1,0,3])
= {
read: [ [ 0, 128K ], [ 4K, 128K ], [ 4K, 128K ] ],
write: [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read0, read1, read2 ],
}
+ check write2 buffer
5. write(offset=0, len=128K+64K, osd_set=[1,0,3])
= {
req: [ [ 0, 128K ], [ 0, 64K ], [ 0, 0 ] ],
read: [ [ 64K, 128K ], [ 64K, 128K ], [ 64K, 128K ] ],
write: [ [ 0, 128K ], [ 0, 64K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read0, read1, read2 ],
}
6. write(offset=0, len=128K+64K, osd_set=[1,2,3])
= {
req: [ [ 0, 128K ], [ 0, 64K ], [ 0, 0 ] ],
read: [ [ 0, 0 ], [ 64K, 128K ], [ 0, 0 ] ],
write: [ [ 0, 128K ], [ 0, 64K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read1 ],
}
7. calc_rmw(offset=128K-4K, len=8K, osd_set=[1,0,3], write_set=[1,2,3])
= {
read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 128K ] ],
write: [ [ 128K-4K, 128K ], [ 0, 4K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read0, read1, read2 ],
}
then, after calc_rmw_parity_xor(): {
write: [ [ 128K-4K, 128K ], [ 0, 128K ], [ 0, 128K ] ],
write1==read1,
}
+ check write1 buffer
+ check write2 buffer
8. calc_rmw(offset=0, len=128K+4K, osd_set=[0,2,3], write_set=[1,2,3])
= {
read: [ [ 0, 0 ], [ 4K, 128K ], [ 0, 0 ] ],
write: [ [ 0, 128K ], [ 0, 4K ], [ 0, 128K ] ],
input buffer: [ write0, write1 ],
rmw buffer: [ write2, read1 ],
}
+ check write2 buffer
9. object recovery case:
calc_rmw(offset=0, len=0, read_osd_set=[0,2,3], write_osd_set=[1,2,3])
= {
read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 128K ] ],
write: [ [ 0, 0 ], [ 0, 0 ], [ 0, 0 ] ],
input buffer: NULL,
rmw buffer: [ read0, read1, read2 ],
}
then, after calc_rmw_parity_xor(): {
write: [ [ 0, 128K ], [ 0, 0 ], [ 0, 0 ] ],
write0==read0,
}
+ check write0 buffer
***/
int main(int narg, char *args[])
{
// Test 1
test1();
// Test 4
test4();
// Test 5
test5();
// Test 6
test6();
// Test 7
test7();
// Test 8
test8();
// Test 9
test9();
// End
printf("all ok\n");
return 0;
}
void dump_stripes(osd_rmw_stripe_t *stripes, int pg_size)
{
printf("request");
for (int i = 0; i < pg_size; i++)
{
printf(" {%uK-%uK}", stripes[i].req_start/1024, stripes[i].req_end/1024);
}
printf("\n");
printf("read");
for (int i = 0; i < pg_size; i++)
{
printf(" {%uK-%uK}", stripes[i].read_start/1024, stripes[i].read_end/1024);
}
printf("\n");
printf("write");
for (int i = 0; i < pg_size; i++)
{
printf(" {%uK-%uK}", stripes[i].write_start/1024, stripes[i].write_end/1024);
}
printf("\n");
}
void test1()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 1
// Test 1.1
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
assert(stripes[0].req_start == 128*1024-4096 && stripes[0].req_end == 128*1024);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 4096);
assert(stripes[2].req_end == 0);
// Test 2
// Test 1.2
for (int i = 0; i < 3; i++)
{
stripes[i].read_start = stripes[i].req_start;
@ -20,18 +154,26 @@ int main(int narg, char *args[])
assert(extend_missing_stripes(stripes, osd_set, 2, 3) == 0);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 4096);
// Test 3
// Test 1.3
stripes[0] = { .req_start = 128*1024-4096, .req_end = 128*1024 };
cover_read(0, 128*1024, stripes[0]);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096);
}
void test4()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 4.1
memset(stripes, 0, sizeof(stripes));
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
void* write_buf = malloc(8192);
void* rmw_buf = calc_rmw_reads(write_buf, stripes, osd_set, 3, 2, 2);
void* rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 4096 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == rmw_buf+128*1024);
assert(stripes[1].read_buf == rmw_buf+128*1024*2);
assert(stripes[2].read_buf == rmw_buf+128*1024*3-4096);
@ -43,24 +185,32 @@ int main(int narg, char *args[])
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
set_pattern(stripes[1].read_buf, 128*1024-4096, UINT64_MAX); // didn't read it, it's missing
set_pattern(stripes[2].read_buf, 128*1024-4096, 0); // old parity = 0
calc_rmw_parity(stripes, 3);
calc_rmw_parity_xor(stripes, 3, osd_set, osd_set, 128*1024);
check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
free(rmw_buf);
free(write_buf);
}
void test5()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 5.1
memset(stripes, 0, sizeof(stripes));
split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 64*1024);
assert(stripes[2].req_end == 0);
// Test 5.2
write_buf = malloc(64*1024*3);
rmw_buf = calc_rmw_reads(write_buf, stripes, osd_set, 3, 2, 2);
void *write_buf = malloc(64*1024*3);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
assert(stripes[0].read_start == 64*1024 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 64*1024 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == rmw_buf+128*1024);
assert(stripes[1].read_buf == rmw_buf+64*3*1024);
assert(stripes[2].read_buf == rmw_buf+64*4*1024);
@ -69,15 +219,22 @@ int main(int narg, char *args[])
assert(stripes[2].write_buf == rmw_buf);
free(rmw_buf);
free(write_buf);
}
void test6()
{
osd_num_t osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 6.1
memset(stripes, 0, sizeof(stripes));
split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
osd_set[1] = 2;
write_buf = malloc(64*1024*3);
rmw_buf = calc_rmw_reads(write_buf, stripes, osd_set, 3, 2, 3);
void *write_buf = malloc(64*1024*3);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, osd_set, 128*1024);
assert(stripes[0].read_end == 0);
assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_end == 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == 0);
assert(stripes[1].read_buf == rmw_buf+128*1024);
assert(stripes[2].read_buf == 0);
@ -86,8 +243,121 @@ int main(int narg, char *args[])
assert(stripes[2].write_buf == rmw_buf);
free(rmw_buf);
free(write_buf);
osd_set[1] = 0;
// End
printf("all ok\n");
return 0;
}
void test7()
{
osd_num_t osd_set[3] = { 1, 0, 3 };
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 7.1
split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
void *write_buf = malloc(8192);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == rmw_buf+128*1024);
assert(stripes[1].read_buf == rmw_buf+128*1024*2);
assert(stripes[2].read_buf == rmw_buf+128*1024*3);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == write_buf+4096);
assert(stripes[2].write_buf == rmw_buf);
// Test 7.2
set_pattern(write_buf, 8192, PATTERN0);
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
set_pattern(stripes[1].read_buf, 128*1024, UINT64_MAX); // didn't read it, it's missing
set_pattern(stripes[2].read_buf, 128*1024, 0); // old parity = 0
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[1].write_buf == stripes[1].read_buf);
check_pattern(stripes[1].write_buf, 4096, PATTERN0);
check_pattern(stripes[1].write_buf+4096, 128*1024-4096, PATTERN1);
check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
free(rmw_buf);
free(write_buf);
}
void test8()
{
osd_num_t osd_set[3] = { 0, 2, 3 };
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 8.1
split_stripes(2, 128*1024, 0, 128*1024+4096, stripes);
void *write_buf = malloc(128*1024+4096);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 0);
assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
assert(stripes[0].read_buf == NULL);
assert(stripes[1].read_buf == rmw_buf+128*1024);
assert(stripes[2].read_buf == NULL);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == write_buf+128*1024);
assert(stripes[2].write_buf == rmw_buf);
// Test 8.2
set_pattern(write_buf, 128*1024+4096, PATTERN0);
set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN1);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); // recheck again
assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096); // recheck again
assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again
assert(stripes[0].write_buf == write_buf); // recheck again
assert(stripes[1].write_buf == write_buf+128*1024); // recheck again
assert(stripes[2].write_buf == rmw_buf); // recheck again
check_pattern(stripes[2].write_buf, 4096, 0); // new parity
check_pattern(stripes[2].write_buf+4096, 128*1024-4096, PATTERN0^PATTERN1); // new parity
free(rmw_buf);
free(write_buf);
}
void test9()
{
osd_num_t osd_set[3] = { 0, 2, 3 };
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = { 0 };
// Test 9.0
split_stripes(2, 128*1024, 64*1024, 0, stripes);
assert(stripes[0].req_start == 0 && stripes[0].req_end == 0);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 0);
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
// Test 9.1
void *write_buf = NULL;
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
assert(stripes[0].read_buf == rmw_buf);
assert(stripes[1].read_buf == rmw_buf+128*1024);
assert(stripes[2].read_buf == rmw_buf+128*1024*2);
assert(stripes[0].write_buf == NULL);
assert(stripes[1].write_buf == NULL);
assert(stripes[2].write_buf == NULL);
// Test 9.2
set_pattern(stripes[1].read_buf, 128*1024, 0);
set_pattern(stripes[2].read_buf, 128*1024, PATTERN1);
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
assert(stripes[0].write_buf == rmw_buf);
assert(stripes[1].write_buf == NULL);
assert(stripes[2].write_buf == NULL);
check_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
check_pattern(stripes[0].write_buf, 128*1024, PATTERN1);
free(rmw_buf);
}

View File

@ -1,64 +1,59 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include "osd.h"
#include "json11/json11.hpp"
void osd_t::secondary_op_callback(osd_op_t *op)
{
inflight_ops--;
auto cl_it = clients.find(op->peer_fd);
if (cl_it != clients.end())
if (op->req.hdr.opcode == OSD_OP_SEC_READ ||
op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = op->bs_op->retval;
if (op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
{
op->reply.sec_rw.version = op->bs_op->version;
}
else if (op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE)
{
op->reply.sec_del.version = op->bs_op->version;
}
if (op->req.hdr.opcode == OSD_OP_SECONDARY_READ &&
op->reply.hdr.retval > 0)
{
op->send_list.push_back(op->buf, op->reply.hdr.retval);
}
else if (op->req.hdr.opcode == OSD_OP_SECONDARY_LIST)
{
// allocated by blockstore
op->buf = op->bs_op->buf;
if (op->reply.hdr.retval > 0)
{
op->send_list.push_back(op->buf, op->reply.hdr.retval * sizeof(obj_ver_id));
}
op->reply.sec_list.stable_count = op->bs_op->version;
}
auto & cl = cl_it->second;
outbox_push(cl, op);
op->reply.sec_rw.version = op->bs_op->version;
}
else
else if (op->req.hdr.opcode == OSD_OP_SEC_DELETE)
{
delete op;
op->reply.sec_del.version = op->bs_op->version;
}
if (op->req.hdr.opcode == OSD_OP_SEC_READ &&
op->bs_op->retval > 0)
{
op->iov.push_back(op->buf, op->bs_op->retval);
}
else if (op->req.hdr.opcode == OSD_OP_SEC_LIST)
{
// allocated by blockstore
op->buf = op->bs_op->buf;
if (op->bs_op->retval > 0)
{
op->iov.push_back(op->buf, op->bs_op->retval * sizeof(obj_ver_id));
}
op->reply.sec_list.stable_count = op->bs_op->version;
}
int retval = op->bs_op->retval;
delete op->bs_op;
op->bs_op = NULL;
finish_op(op, retval);
}
void osd_t::exec_secondary(osd_op_t *cur_op)
{
cur_op->bs_op = new blockstore_op_t();
cur_op->bs_op->callback = [this, cur_op](blockstore_op_t* bs_op) { secondary_op_callback(cur_op); };
cur_op->bs_op->opcode = (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ? BS_OP_READ
: (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE ? BS_OP_WRITE
: (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_SYNC ? BS_OP_SYNC
: (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ? BS_OP_STABLE
: (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK ? BS_OP_ROLLBACK
: (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE ? BS_OP_DELETE
: (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_LIST ? BS_OP_LIST
: -1)))))));
if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_READ ||
cur_op->req.hdr.opcode == OSD_OP_SECONDARY_WRITE)
cur_op->bs_op->opcode = (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ? BS_OP_READ
: (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ? BS_OP_WRITE
: (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ? BS_OP_WRITE_STABLE
: (cur_op->req.hdr.opcode == OSD_OP_SEC_SYNC ? BS_OP_SYNC
: (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ? BS_OP_STABLE
: (cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ? BS_OP_ROLLBACK
: (cur_op->req.hdr.opcode == OSD_OP_SEC_DELETE ? BS_OP_DELETE
: (cur_op->req.hdr.opcode == OSD_OP_SEC_LIST ? BS_OP_LIST
: -1))))))));
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
cur_op->bs_op->oid = cur_op->req.sec_rw.oid;
cur_op->bs_op->version = cur_op->req.sec_rw.version;
@ -69,7 +64,7 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
cur_op->bs_op->retval = cur_op->bs_op->len;
#endif
}
else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_DELETE)
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_DELETE)
{
cur_op->bs_op->oid = cur_op->req.sec_del.oid;
cur_op->bs_op->version = cur_op->req.sec_del.version;
@ -77,8 +72,8 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
cur_op->bs_op->retval = 0;
#endif
}
else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SECONDARY_ROLLBACK)
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
{
cur_op->bs_op->len = cur_op->req.sec_stab.len/sizeof(obj_ver_id);
cur_op->bs_op->buf = cur_op->buf;
@ -86,18 +81,21 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
cur_op->bs_op->retval = 0;
#endif
}
else if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_LIST)
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_LIST)
{
if (cur_op->req.sec_list.pg_count < cur_op->req.sec_list.list_pg)
{
// requested pg number is greater than total pg count
printf("Invalid LIST request: pg count %u < pg number %u\n", cur_op->req.sec_list.pg_count, cur_op->req.sec_list.list_pg);
cur_op->bs_op->retval = -EINVAL;
secondary_op_callback(cur_op);
return;
}
cur_op->bs_op->oid.stripe = cur_op->req.sec_list.parity_block_size;
cur_op->bs_op->oid.stripe = cur_op->req.sec_list.pg_stripe_size;
cur_op->bs_op->len = cur_op->req.sec_list.pg_count;
cur_op->bs_op->offset = cur_op->req.sec_list.list_pg - 1;
cur_op->bs_op->oid.inode = cur_op->req.sec_list.min_inode;
cur_op->bs_op->version = cur_op->req.sec_list.max_inode;
#ifdef OSD_STUB
cur_op->bs_op->retval = 0;
cur_op->bs_op->buf = NULL;
@ -114,15 +112,10 @@ void osd_t::exec_show_config(osd_op_t *cur_op)
{
// FIXME: Send the real config, not its source
std::string cfg_str = json11::Json(config).dump();
cur_op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
cur_op->reply.hdr.id = cur_op->req.hdr.id;
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
cur_op->reply.hdr.retval = cfg_str.size()+1;
cur_op->buf = malloc(cfg_str.size()+1);
cur_op->buf = malloc_or_die(cfg_str.size()+1);
memcpy(cur_op->buf, cfg_str.c_str(), cfg_str.size()+1);
auto & cl = clients[cur_op->peer_fd];
cur_op->send_list.push_back(cur_op->buf, cur_op->reply.hdr.retval);
outbox_push(cl, cur_op);
cur_op->iov.push_back(cur_op->buf, cfg_str.size()+1);
finish_op(cur_op, cfg_str.size()+1);
}
void osd_t::exec_sync_stab_all(osd_op_t *cur_op)

View File

@ -1,131 +0,0 @@
#include "osd.h"
void osd_t::outbox_push(osd_client_t & cl, osd_op_t *cur_op)
{
assert(cur_op->peer_fd);
if (cur_op->op_type == OSD_OP_OUT)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
}
cl.outbox.push_back(cur_op);
if (cl.write_op || cl.outbox.size() > 1 || !try_send(cl))
{
if (cl.write_state == 0)
{
cl.write_state = CL_WRITE_READY;
write_ready_clients.push_back(cur_op->peer_fd);
}
ringloop->wakeup();
}
}
bool osd_t::try_send(osd_client_t & cl)
{
int peer_fd = cl.peer_fd;
io_uring_sqe* sqe = ringloop->get_sqe();
if (!sqe)
{
return false;
}
ring_data_t* data = ((ring_data_t*)sqe->user_data);
if (!cl.write_op)
{
// pick next command
cl.write_op = cl.outbox.front();
cl.outbox.pop_front();
cl.write_state = CL_WRITE_REPLY;
clock_gettime(CLOCK_REALTIME, &cl.write_op->tv_send);
if (cl.write_op->op_type == OSD_OP_IN)
{
// Measure execution latency
timespec tv_end = cl.write_op->tv_send;
op_stat_count[cl.write_op->req.hdr.opcode]++;
op_stat_sum[cl.write_op->req.hdr.opcode] += (
(tv_end.tv_sec - cl.write_op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - cl.write_op->tv_begin.tv_nsec)/1000
);
}
}
cl.write_msg.msg_iov = cl.write_op->send_list.get_iovec();
cl.write_msg.msg_iovlen = cl.write_op->send_list.get_size();
data->callback = [this, peer_fd](ring_data_t *data) { handle_send(data, peer_fd); };
my_uring_prep_sendmsg(sqe, peer_fd, &cl.write_msg, 0);
return true;
}
void osd_t::send_replies()
{
for (int i = 0; i < write_ready_clients.size(); i++)
{
int peer_fd = write_ready_clients[i];
if (!try_send(clients[peer_fd]))
{
write_ready_clients.erase(write_ready_clients.begin(), write_ready_clients.begin() + i);
return;
}
}
write_ready_clients.clear();
}
void osd_t::handle_send(ring_data_t *data, int peer_fd)
{
auto cl_it = clients.find(peer_fd);
if (cl_it != clients.end())
{
auto & cl = cl_it->second;
if (data->res < 0 && data->res != -EAGAIN)
{
// this is a client socket, so don't panic. just disconnect it
printf("Client %d socket write error: %d (%s). Disconnecting client\n", peer_fd, -data->res, strerror(-data->res));
stop_client(peer_fd);
return;
}
if (data->res >= 0)
{
osd_op_t *cur_op = cl.write_op;
while (data->res > 0 && cur_op->send_list.sent < cur_op->send_list.count)
{
iovec & iov = cur_op->send_list.buf[cur_op->send_list.sent];
if (iov.iov_len <= data->res)
{
data->res -= iov.iov_len;
cur_op->send_list.sent++;
}
else
{
iov.iov_len -= data->res;
iov.iov_base += data->res;
break;
}
}
if (cur_op->send_list.sent >= cur_op->send_list.count)
{
// Done
if (cur_op->req.hdr.opcode == OSD_OP_SECONDARY_STABILIZE)
{
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
send_stat_count++;
send_stat_sum += (
(tv_end.tv_sec - cl.write_op->tv_send.tv_sec)*1000000 +
(tv_end.tv_nsec - cl.write_op->tv_send.tv_nsec)/1000
);
}
if (cur_op->op_type == OSD_OP_IN)
{
delete cur_op;
}
else
{
cl.sent_ops[cl.write_op->req.hdr.id] = cl.write_op;
}
cl.write_op = NULL;
cl.write_state = cl.outbox.size() > 0 ? CL_WRITE_READY : 0;
}
}
if (cl.write_state != 0)
{
write_ready_clients.push_back(peer_fd);
}
}
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
@ -19,6 +22,8 @@
int connect_osd(const char *osd_address, int osd_port);
uint64_t test_read(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t version, uint64_t offset, uint64_t len);
uint64_t test_write(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t version, uint64_t pattern);
void* test_primary_read(int connect_fd, uint64_t inode, uint64_t offset, uint64_t len);
@ -29,6 +34,8 @@ void test_primary_sync(int connect_fd);
void test_sync_stab_all(int connect_fd);
void test_list_stab(int connect_fd);
int main0(int narg, char *args[])
{
int connect_fd;
@ -94,7 +101,16 @@ int main2(int narg, char *args[])
return 0;
}
int main(int narg, char *args[])
int main3(int narg, char *args[])
{
int connect_fd;
connect_fd = connect_osd("127.0.0.1", 11203);
test_list_stab(connect_fd);
close(connect_fd);
return 0;
}
int main4(int narg, char *args[])
{
int connect_fd;
// Cluster write (sync not implemented yet)
@ -106,6 +122,15 @@ int main(int narg, char *args[])
return 0;
}
int main(int narg, char *args[])
{
int connect_fd;
connect_fd = connect_osd("192.168.7.2", 43051);
test_read(connect_fd, 1, 1039663104, UINT64_MAX, 0, 128*1024);
close(connect_fd);
return 0;
}
int connect_osd(const char *osd_address, int osd_port)
{
struct sockaddr_in addr;
@ -148,7 +173,7 @@ bool check_reply(int r, osd_any_op_t & op, osd_any_reply_t & reply, int expected
printf("bad reply: magic, id or opcode does not match request\n");
return false;
}
if (reply.hdr.retval != expected)
if (expected >= 0 && reply.hdr.retval != expected)
{
printf("operation failed, retval=%ld\n", reply.hdr.retval);
return false;
@ -156,13 +181,73 @@ bool check_reply(int r, osd_any_op_t & op, osd_any_reply_t & reply, int expected
return true;
}
uint64_t test_read(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t version, uint64_t offset, uint64_t len)
{
osd_any_op_t op;
osd_any_reply_t reply;
op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
op.hdr.id = 1;
op.hdr.opcode = OSD_OP_SEC_READ;
op.sec_rw.oid = {
.inode = inode,
.stripe = stripe,
};
op.sec_rw.version = version;
op.sec_rw.offset = offset;
op.sec_rw.len = len;
void *data = memalign(MEM_ALIGNMENT, op.sec_rw.len);
write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE);
int r = read_blocking(connect_fd, reply.buf, OSD_PACKET_SIZE);
if (!check_reply(r, op, reply, op.sec_rw.len))
{
free(data);
return 0;
}
r = read_blocking(connect_fd, data, len);
if (r != len)
{
free(data);
perror("read data");
return 0;
}
free(data);
printf("Read %lx:%lx v%lu = v%lu\n", inode, stripe, version, reply.sec_rw.version);
op.hdr.opcode = OSD_OP_SEC_LIST;
op.sec_list.list_pg = 1;
op.sec_list.pg_count = 1;
op.sec_list.pg_stripe_size = 4*1024*1024;
write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE);
r = read_blocking(connect_fd, reply.buf, OSD_PACKET_SIZE);
if (reply.hdr.retval < 0 || !check_reply(r, op, reply, reply.hdr.retval))
{
return 0;
}
data = memalign(MEM_ALIGNMENT, sizeof(obj_ver_id)*reply.hdr.retval);
r = read_blocking(connect_fd, data, sizeof(obj_ver_id)*reply.hdr.retval);
if (r != sizeof(obj_ver_id)*reply.hdr.retval)
{
free(data);
perror("read data");
return 0;
}
obj_ver_id *ov = (obj_ver_id*)data;
for (int i = 0; i < reply.hdr.retval; i++)
{
if (ov[i].oid.inode == inode && (ov[i].oid.stripe & ~(4096-1)) == (stripe & ~(4096-1)))
{
printf("list: %lx:%lx v%lu stable=%d\n", ov[i].oid.inode, ov[i].oid.stripe, ov[i].version, i < reply.sec_list.stable_count ? 1 : 0);
}
}
return 0;
}
uint64_t test_write(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t version, uint64_t pattern)
{
osd_any_op_t op;
osd_any_reply_t reply;
op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
op.hdr.id = 1;
op.hdr.opcode = OSD_OP_SECONDARY_WRITE;
op.hdr.opcode = OSD_OP_SEC_WRITE;
op.sec_rw.oid = {
.inode = inode,
.stripe = stripe,
@ -170,7 +255,7 @@ uint64_t test_write(int connect_fd, uint64_t inode, uint64_t stripe, uint64_t ve
op.sec_rw.version = version;
op.sec_rw.offset = 0;
op.sec_rw.len = 128*1024;
void *data = memalign(512, op.sec_rw.len);
void *data = memalign(MEM_ALIGNMENT, op.sec_rw.len);
for (int i = 0; i < (op.sec_rw.len)/sizeof(uint64_t); i++)
((uint64_t*)data)[i] = pattern;
write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE);
@ -205,7 +290,7 @@ void* test_primary_read(int connect_fd, uint64_t inode, uint64_t offset, uint64_
op.rw.inode = inode;
op.rw.offset = offset;
op.rw.len = len;
void *data = memalign(512, len);
void *data = memalign(MEM_ALIGNMENT, len);
write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE);
int r = read_blocking(connect_fd, reply.buf, OSD_PACKET_SIZE);
if (!check_reply(r, op, reply, len))
@ -233,7 +318,7 @@ void test_primary_write(int connect_fd, uint64_t inode, uint64_t offset, uint64_
op.rw.inode = inode;
op.rw.offset = offset;
op.rw.len = len;
void *data = memalign(512, len);
void *data = memalign(MEM_ALIGNMENT, len);
set_pattern(data, len, pattern);
write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE);
write_blocking(connect_fd, data, len);
@ -265,3 +350,40 @@ void test_sync_stab_all(int connect_fd)
int r = read_blocking(connect_fd, reply.buf, OSD_PACKET_SIZE);
assert(check_reply(r, op, reply, 0));
}
void test_list_stab(int connect_fd)
{
osd_any_op_t op;
osd_any_reply_t reply;
op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
op.hdr.id = 1;
op.hdr.opcode = OSD_OP_SEC_LIST;
op.sec_list.pg_count = 0;
assert(write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE) == OSD_PACKET_SIZE);
int r = read_blocking(connect_fd, reply.buf, OSD_PACKET_SIZE);
assert(check_reply(r, op, reply, -1));
int total_count = reply.hdr.retval;
int stable_count = reply.sec_list.stable_count;
obj_ver_id *data = (obj_ver_id*)malloc(total_count * sizeof(obj_ver_id));
assert(data);
assert(read_blocking(connect_fd, data, total_count * sizeof(obj_ver_id)) == (total_count * sizeof(obj_ver_id)));
int last_start = stable_count;
for (int i = stable_count; i <= total_count; i++)
{
// Stabilize in portions of 32 entries
if (i - last_start >= 32 || i == total_count)
{
op.hdr.opcode = OSD_OP_SEC_STABILIZE;
op.sec_stab.len = sizeof(obj_ver_id) * (i - last_start);
assert(write_blocking(connect_fd, op.buf, OSD_PACKET_SIZE) == OSD_PACKET_SIZE);
assert(write_blocking(connect_fd, data + last_start, op.sec_stab.len) == op.sec_stab.len);
r = read_blocking(connect_fd, reply.buf, OSD_PACKET_SIZE);
assert(check_reply(r, op, reply, 0));
last_start = i;
}
}
obj_ver_id *data2 = (obj_ver_id*)malloc(sizeof(obj_ver_id) * 32);
assert(data2);
free(data2);
free(data);
}

38
pg_states.cpp Normal file
View File

@ -0,0 +1,38 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "pg_states.h"
const int pg_state_bit_count = 14;
const int pg_state_bits[14] = {
PG_STARTING,
PG_PEERING,
PG_INCOMPLETE,
PG_ACTIVE,
PG_STOPPING,
PG_OFFLINE,
PG_DEGRADED,
PG_HAS_INCOMPLETE,
PG_HAS_DEGRADED,
PG_HAS_MISPLACED,
PG_HAS_UNCLEAN,
PG_HAS_INVALID,
PG_LEFT_ON_DEAD,
};
const char *pg_state_names[14] = {
"starting",
"peering",
"incomplete",
"active",
"stopping",
"offline",
"degraded",
"has_incomplete",
"has_degraded",
"has_misplaced",
"has_unclean",
"has_invalid",
"left_on_dead",
};

37
pg_states.h Normal file
View File

@ -0,0 +1,37 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
// Placement group states
// STARTING -> [acquire lock] -> PEERING -> INCOMPLETE|ACTIVE -> STOPPING -> OFFLINE -> [release lock]
// Exactly one of these:
#define PG_STARTING (1<<0)
#define PG_PEERING (1<<1)
#define PG_INCOMPLETE (1<<2)
#define PG_ACTIVE (1<<3)
#define PG_STOPPING (1<<4)
#define PG_OFFLINE (1<<5)
// Plus any of these:
#define PG_DEGRADED (1<<6)
#define PG_HAS_INCOMPLETE (1<<7)
#define PG_HAS_DEGRADED (1<<8)
#define PG_HAS_MISPLACED (1<<9)
#define PG_HAS_UNCLEAN (1<<10)
#define PG_HAS_INVALID (1<<11)
#define PG_LEFT_ON_DEAD (1<<12)
// Lower bits that represent object role (EC 0/1/2... or always 0 with replication)
// 12 bits is a safe default that doesn't depend on pg_stripe_size or pg_block_size
#define STRIPE_MASK ((uint64_t)4096 - 1)
// OSD object states
#define OBJ_DEGRADED 0x02
#define OBJ_INCOMPLETE 0x04
#define OBJ_MISPLACED 0x08
#define OBJ_NEEDS_STABLE 0x10000
#define OBJ_NEEDS_ROLLBACK 0x20000
extern const int pg_state_bits[];
extern const char *pg_state_names[];
extern const int pg_state_bit_count;

400
qemu_driver.c Normal file
View File

@ -0,0 +1,400 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// QEMU block driver
#define _GNU_SOURCE
#include "qemu/osdep.h"
#include "qemu/units.h"
#include "block/block_int.h"
#include "block/qdict.h"
#include "qapi/error.h"
#include "qapi/qmp/qdict.h"
#include "qapi/qmp/qerror.h"
#include "qemu/uri.h"
#include "qemu/error-report.h"
#include "qemu/module.h"
#include "qemu/option.h"
#include "qemu/cutils.h"
#include "qemu_proxy.h"
typedef struct VitastorClient
{
void *proxy;
char *etcd_host;
char *etcd_prefix;
uint64_t inode;
uint64_t pool;
uint64_t size;
int readonly;
QemuMutex mutex;
} VitastorClient;
typedef struct VitastorRPC
{
BlockDriverState *bs;
Coroutine *co;
QEMUIOVector *iov;
int ret;
int complete;
} VitastorRPC;
static char *qemu_rbd_next_tok(char *src, char delim, char **p)
{
char *end;
*p = NULL;
for (end = src; *end; ++end)
{
if (*end == delim)
break;
if (*end == '\\' && end[1] != '\0')
end++;
}
if (*end == delim)
{
*p = end + 1;
*end = '\0';
}
return src;
}
static void qemu_rbd_unescape(char *src)
{
char *p;
for (p = src; *src; ++src, ++p)
{
if (*src == '\\' && src[1] != '\0')
src++;
*p = *src;
}
*p = '\0';
}
// vitastor[:key=value]*
// vitastor:etcd_host=127.0.0.1:inode=1:pool=1
static void vitastor_parse_filename(const char *filename, QDict *options, Error **errp)
{
const char *start;
char *p, *buf;
if (!strstart(filename, "vitastor:", &start))
{
error_setg(errp, "File name must start with 'vitastor:'");
return;
}
buf = g_strdup(start);
p = buf;
// The following are all key/value pairs
while (p)
{
char *name, *value;
name = qemu_rbd_next_tok(p, '=', &p);
if (!p)
{
error_setg(errp, "conf option %s has no value", name);
break;
}
qemu_rbd_unescape(name);
value = qemu_rbd_next_tok(p, ':', &p);
qemu_rbd_unescape(value);
if (!strcmp(name, "inode") || !strcmp(name, "pool") || !strcmp(name, "size"))
{
unsigned long long num_val;
if (parse_uint_full(value, &num_val, 0))
{
error_setg(errp, "Illegal %s: %s", name, value);
goto out;
}
qdict_put_int(options, name, num_val);
}
else
{
qdict_put_str(options, name, value);
}
}
if (!qdict_get_try_int(options, "inode", 0))
{
error_setg(errp, "inode is missing");
goto out;
}
if (!(qdict_get_try_int(options, "inode", 0) >> (64-POOL_ID_BITS)) &&
!qdict_get_try_int(options, "pool", 0))
{
error_setg(errp, "pool number is missing");
goto out;
}
if (!qdict_get_try_int(options, "size", 0))
{
error_setg(errp, "size is missing");
goto out;
}
if (!qdict_get_str(options, "etcd_host"))
{
error_setg(errp, "etcd_host is missing");
goto out;
}
out:
g_free(buf);
return;
}
static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, Error **errp)
{
VitastorClient *client = bs->opaque;
int64_t ret = 0;
client->etcd_host = g_strdup(qdict_get_try_str(options, "etcd_host"));
client->etcd_prefix = g_strdup(qdict_get_try_str(options, "etcd_prefix"));
client->inode = qdict_get_int(options, "inode");
client->pool = qdict_get_int(options, "pool");
if (client->pool)
client->inode = (client->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
client->size = qdict_get_int(options, "size");
client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
client->proxy = vitastor_proxy_create(bdrv_get_aio_context(bs), client->etcd_host, client->etcd_prefix);
//client->aio_context = bdrv_get_aio_context(bs);
bs->total_sectors = client->size / BDRV_SECTOR_SIZE;
qdict_del(options, "etcd_host");
qdict_del(options, "etcd_prefix");
qdict_del(options, "inode");
qdict_del(options, "pool");
qdict_del(options, "size");
qemu_mutex_init(&client->mutex);
return ret;
}
static void vitastor_close(BlockDriverState *bs)
{
VitastorClient *client = bs->opaque;
vitastor_proxy_destroy(client->proxy);
qemu_mutex_destroy(&client->mutex);
g_free(client->etcd_host);
if (client->etcd_prefix)
g_free(client->etcd_prefix);
}
static int vitastor_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
{
bsz->phys = 4096;
bsz->log = 4096;
return 0;
}
static int coroutine_fn vitastor_co_create_opts(
#if QEMU_VERSION_MAJOR >= 4
BlockDriver *drv,
#endif
const char *url, QemuOpts *opts, Error **errp)
{
QDict *options;
int ret;
options = qdict_new();
vitastor_parse_filename(url, options, errp);
if (*errp)
{
ret = -1;
goto out;
}
// inodes don't require creation in Vitastor. FIXME: They will when there will be some metadata
ret = 0;
out:
qobject_unref(options);
return ret;
}
static int coroutine_fn vitastor_co_truncate(BlockDriverState *bs, int64_t offset,
#if QEMU_VERSION_MAJOR >= 4
bool exact,
#endif
PreallocMode prealloc, Error **errp)
{
VitastorClient *client = bs->opaque;
if (prealloc != PREALLOC_MODE_OFF)
{
error_setg(errp, "Unsupported preallocation mode '%s'", PreallocMode_str(prealloc));
return -ENOTSUP;
}
// TODO: Resize inode to <offset> bytes
client->size = offset / BDRV_SECTOR_SIZE;
return 0;
}
static int vitastor_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
{
bdi->cluster_size = 4096;
return 0;
}
static int64_t vitastor_getlength(BlockDriverState *bs)
{
VitastorClient *client = bs->opaque;
return client->size;
}
static void vitastor_refresh_limits(BlockDriverState *bs, Error **errp)
{
bs->bl.request_alignment = 4096;
bs->bl.min_mem_alignment = 4096;
bs->bl.opt_mem_alignment = 4096;
}
static int64_t vitastor_get_allocated_file_size(BlockDriverState *bs)
{
return 0;
}
static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task)
{
*task = (VitastorRPC) {
.co = qemu_coroutine_self(),
.bs = bs,
};
}
static void vitastor_co_generic_bh_cb(int retval, void *opaque)
{
VitastorRPC *task = opaque;
task->ret = retval;
task->complete = 1;
if (qemu_coroutine_self() != task->co)
{
aio_co_wake(task->co);
}
}
static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags)
{
VitastorClient *client = bs->opaque;
VitastorRPC task;
vitastor_co_init_task(bs, &task);
task.iov = iov;
qemu_mutex_lock(&client->mutex);
vitastor_proxy_rw(0, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
qemu_mutex_unlock(&client->mutex);
while (!task.complete)
{
qemu_coroutine_yield();
}
return task.ret;
}
static int coroutine_fn vitastor_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags)
{
VitastorClient *client = bs->opaque;
VitastorRPC task;
vitastor_co_init_task(bs, &task);
task.iov = iov;
qemu_mutex_lock(&client->mutex);
vitastor_proxy_rw(1, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
qemu_mutex_unlock(&client->mutex);
while (!task.complete)
{
qemu_coroutine_yield();
}
return task.ret;
}
static int coroutine_fn vitastor_co_flush(BlockDriverState *bs)
{
VitastorClient *client = bs->opaque;
VitastorRPC task;
vitastor_co_init_task(bs, &task);
qemu_mutex_lock(&client->mutex);
vitastor_proxy_sync(client->proxy, vitastor_co_generic_bh_cb, &task);
qemu_mutex_unlock(&client->mutex);
while (!task.complete)
{
qemu_coroutine_yield();
}
return task.ret;
}
static QemuOptsList vitastor_create_opts = {
.name = "vitastor-create-opts",
.head = QTAILQ_HEAD_INITIALIZER(vitastor_create_opts.head),
.desc = {
{
.name = BLOCK_OPT_SIZE,
.type = QEMU_OPT_SIZE,
.help = "Virtual disk size"
},
{ /* end of list */ }
}
};
static const char *vitastor_strong_runtime_opts[] = {
"inode",
"pool",
"etcd_host",
"etcd_prefix",
NULL
};
static BlockDriver bdrv_vitastor = {
.format_name = "vitastor",
.protocol_name = "vitastor",
.instance_size = sizeof(VitastorClient),
.bdrv_parse_filename = vitastor_parse_filename,
.bdrv_has_zero_init = bdrv_has_zero_init_1,
#if QEMU_VERSION_MAJOR >= 4
.bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
#endif
.bdrv_get_info = vitastor_get_info,
.bdrv_getlength = vitastor_getlength,
.bdrv_probe_blocksizes = vitastor_probe_blocksizes,
.bdrv_refresh_limits = vitastor_refresh_limits,
// FIXME: Implement it along with per-inode statistics
//.bdrv_get_allocated_file_size = vitastor_get_allocated_file_size,
.bdrv_file_open = vitastor_file_open,
.bdrv_close = vitastor_close,
// Option list for the create operation
.create_opts = &vitastor_create_opts,
// For qmp_blockdev_create(), used by the qemu monitor / QAPI
// Requires patching QAPI IDL, thus unimplemented
//.bdrv_co_create = vitastor_co_create,
// For bdrv_create(), used by qemu-img
.bdrv_co_create_opts = vitastor_co_create_opts,
.bdrv_co_truncate = vitastor_co_truncate,
.bdrv_co_preadv = vitastor_co_preadv,
.bdrv_co_pwritev = vitastor_co_pwritev,
.bdrv_co_flush_to_disk = vitastor_co_flush,
#if QEMU_VERSION_MAJOR >= 4
.strong_runtime_opts = vitastor_strong_runtime_opts,
#endif
};
static void vitastor_block_init(void)
{
bdrv_register(&bdrv_vitastor);
}
block_init(vitastor_block_init);

130
qemu_proxy.cpp Normal file
View File

@ -0,0 +1,130 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
// C-C++ proxy for the QEMU driver
// (QEMU headers don't compile with g++)
#include <sys/epoll.h>
#include "cluster_client.h"
typedef void* AioContext;
#include "qemu_proxy.h"
extern "C"
{
// QEMU
typedef void IOHandler(void *opaque);
void aio_set_fd_handler(AioContext *ctx, int fd, int is_external, IOHandler *fd_read, IOHandler *fd_write, void *poll_fn, void *opaque);
}
struct QemuProxyData
{
int fd;
std::function<void(int, int)> callback;
};
class QemuProxy
{
std::map<int, QemuProxyData> handlers;
public:
timerfd_manager_t *tfd;
cluster_client_t *cli;
AioContext *ctx;
QemuProxy(AioContext *ctx, const char *etcd_host, const char *etcd_prefix)
{
this->ctx = ctx;
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(etcd_host) },
{ "etcd_prefix", std::string(etcd_prefix ? etcd_prefix : "/vitastor") },
};
tfd = new timerfd_manager_t([this](int fd, bool wr, std::function<void(int, int)> callback) { set_fd_handler(fd, wr, callback); });
cli = new cluster_client_t(NULL, tfd, cfg);
}
~QemuProxy()
{
cli->stop();
delete cli;
delete tfd;
}
void set_fd_handler(int fd, bool wr, std::function<void(int, int)> callback)
{
if (callback != NULL)
{
handlers[fd] = { .fd = fd, .callback = callback };
aio_set_fd_handler(ctx, fd, false, &QemuProxy::read_handler, wr ? &QemuProxy::write_handler : NULL, NULL, &handlers[fd]);
}
else
{
handlers.erase(fd);
aio_set_fd_handler(ctx, fd, false, NULL, NULL, NULL, NULL);
}
}
static void read_handler(void *opaque)
{
QemuProxyData *data = (QemuProxyData *)opaque;
data->callback(data->fd, EPOLLIN);
}
static void write_handler(void *opaque)
{
QemuProxyData *data = (QemuProxyData *)opaque;
data->callback(data->fd, EPOLLOUT);
}
};
extern "C" {
void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix)
{
QemuProxy *p = new QemuProxy(ctx, etcd_host, etcd_prefix);
return p;
}
void vitastor_proxy_destroy(void *client)
{
QemuProxy *p = (QemuProxy*)client;
delete p;
}
void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque)
{
QemuProxy *p = (QemuProxy*)client;
cluster_op_t *op = new cluster_op_t;
op->opcode = write ? OSD_OP_WRITE : OSD_OP_READ;
op->inode = inode;
op->offset = offset;
op->len = len;
for (int i = 0; i < iovcnt; i++)
{
op->iov.push_back(iov[i].iov_base, iov[i].iov_len);
}
op->callback = [cb, opaque](cluster_op_t *op)
{
cb(op->retval, opaque);
delete op;
};
p->cli->execute(op);
}
void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque)
{
QemuProxy *p = (QemuProxy*)client;
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_SYNC;
op->callback = [cb, opaque](cluster_op_t *op)
{
cb(op->retval, opaque);
delete op;
};
p->cli->execute(op);
}
}

29
qemu_proxy.h Normal file
View File

@ -0,0 +1,29 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#ifndef VITASTOR_QEMU_PROXY_H
#define VITASTOR_QEMU_PROXY_H
#ifndef POOL_ID_BITS
#define POOL_ID_BITS 16
#endif
#include <stdint.h>
#include <sys/uio.h>
#ifdef __cplusplus
extern "C" {
#endif
// Our exports
typedef void VitastorIOHandler(int retval, void *opaque);
void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix);
void vitastor_proxy_destroy(void *client);
void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque);
void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque);
#ifdef __cplusplus
}
#endif
#endif

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include "ringloop.h"
ring_loop_t::ring_loop_t(int qd)
@ -18,6 +21,7 @@ ring_loop_t::ring_loop_t(int qd)
{
free_ring_data[i] = i;
}
wait_sqe_id = 1;
}
ring_loop_t::~ring_loop_t()
@ -27,11 +31,10 @@ ring_loop_t::~ring_loop_t()
io_uring_queue_exit(&ring);
}
int ring_loop_t::register_consumer(ring_consumer_t & consumer)
void ring_loop_t::register_consumer(ring_consumer_t *consumer)
{
consumer.number = consumers.size();
unregister_consumer(consumer);
consumers.push_back(consumer);
return consumer.number;
}
void ring_loop_t::wakeup()
@ -39,12 +42,15 @@ void ring_loop_t::wakeup()
loop_again = true;
}
void ring_loop_t::unregister_consumer(ring_consumer_t & consumer)
void ring_loop_t::unregister_consumer(ring_consumer_t *consumer)
{
if (consumer.number >= 0 && consumer.number < consumers.size())
for (int i = 0; i < consumers.size(); i++)
{
consumers[consumer.number].loop = NULL;
consumer.number = -1;
if (consumers[i] == consumer)
{
consumers.erase(consumers.begin()+i, consumers.begin()+i+1);
break;
}
}
}
@ -62,12 +68,17 @@ void ring_loop_t::loop()
free_ring_data[free_ring_data_ptr++] = d - ring_datas;
io_uring_cqe_seen(&ring, cqe);
}
while (get_sqe_queue.size() > 0)
{
(get_sqe_queue[0].second)();
get_sqe_queue.erase(get_sqe_queue.begin());
}
do
{
loop_again = false;
for (int i = 0; i < consumers.size(); i++)
{
consumers[i].loop();
consumers[i]->loop();
}
} while (loop_again);
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#ifndef _LARGEFILE64_SOURCE
@ -113,23 +116,24 @@ struct ring_data_t
struct ring_consumer_t
{
int number;
std::function<void(void)> loop;
};
class ring_loop_t
{
std::vector<ring_consumer_t> consumers;
std::vector<std::pair<int,std::function<void()>>> get_sqe_queue;
std::vector<ring_consumer_t*> consumers;
struct ring_data_t *ring_datas;
int *free_ring_data;
int wait_sqe_id;
unsigned free_ring_data_ptr;
bool loop_again;
struct io_uring ring;
public:
ring_loop_t(int qd);
~ring_loop_t();
int register_consumer(ring_consumer_t & consumer);
void unregister_consumer(ring_consumer_t & consumer);
void register_consumer(ring_consumer_t *consumer);
void unregister_consumer(ring_consumer_t *consumer);
inline struct io_uring_sqe* get_sqe()
{
@ -140,6 +144,21 @@ public:
io_uring_sqe_set_data(sqe, ring_datas + free_ring_data[--free_ring_data_ptr]);
return sqe;
}
inline int wait_sqe(std::function<void()> cb)
{
get_sqe_queue.push_back({ wait_sqe_id, cb });
return wait_sqe_id++;
}
inline void cancel_wait_sqe(int wait_id)
{
for (int i = 0; i < get_sqe_queue.size(); i++)
{
if (get_sqe_queue[i].first == wait_id)
{
get_sqe_queue.erase(get_sqe_queue.begin()+i, get_sqe_queue.begin()+i+1);
}
}
}
inline int submit()
{
return io_uring_submit(&ring);
@ -153,7 +172,7 @@ public:
{
return free_ring_data_ptr;
}
inline bool get_loop_again()
inline bool has_work()
{
return loop_again;
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>
@ -51,6 +54,40 @@ int write_blocking(int fd, void *write_buf, size_t remaining)
return done;
}
int readv_blocking(int fd, iovec *iov, int iovcnt)
{
int v = 0;
int done = 0;
while (v < iovcnt)
{
ssize_t r = readv(fd, iov, iovcnt);
if (r < 0)
{
if (errno != EAGAIN && errno != EPIPE)
{
perror("writev");
exit(1);
}
continue;
}
while (v < iovcnt)
{
if (iov[v].iov_len > r)
{
iov[v].iov_len -= r;
iov[v].iov_base += r;
break;
}
else
{
v++;
}
}
done += r;
}
return done;
}
int writev_blocking(int fd, iovec *iov, int iovcnt)
{
int v = 0;

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <unistd.h>
@ -5,4 +8,5 @@
int read_blocking(int fd, void *read_buf, size_t remaining);
int write_blocking(int fd, void *write_buf, size_t remaining);
int readv_blocking(int fd, iovec *iov, int iovcnt);
int writev_blocking(int fd, iovec *iov, int iovcnt);

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
/**
* Stub benchmarker
*/
@ -25,20 +28,37 @@ int connect_stub(const char *server_address, int server_port);
void run_bench(int peer_fd);
static uint64_t read_sum = 0, read_count = 0;
static uint64_t write_sum = 0, write_count = 0;
static uint64_t sync_sum = 0, sync_count = 0;
void handle_sigint(int sig)
{
printf("4k randwrite: %lu us avg\n", write_sum/write_count);
printf("sync: %lu us avg\n", sync_sum/sync_count);
printf("4k randread: %lu us avg\n", read_count ? read_sum/read_count : 0);
printf("4k randwrite: %lu us avg\n", write_count ? write_sum/write_count : 0);
printf("sync: %lu us avg\n", sync_count ? sync_sum/sync_count : 0);
exit(0);
}
int main(int narg, char *args[])
{
if (narg < 2)
{
printf("USAGE: %s SERVER_IP [PORT]\n", args[0]);
return 1;
}
int port = 11203;
if (narg >= 3)
{
port = atoi(args[2]);
if (port <= 0 || port >= 65536)
{
printf("Bad port number\n");
return 1;
}
}
signal(SIGINT, handle_sigint);
int peer_fd = connect_stub("127.0.0.1", 11203);
int peer_fd = connect_stub(args[1], port);
run_bench(peer_fd);
close(peer_fd);
return 0;
@ -98,14 +118,41 @@ void run_bench(int peer_fd)
osd_any_reply_t reply;
void *buf = NULL;
int r;
iovec iov[2];
timespec tv_begin, tv_end;
clock_gettime(CLOCK_REALTIME, &tv_begin);
while (1)
{
// read
op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
op.hdr.id = 1;
op.hdr.opcode = OSD_OP_SEC_READ;
op.sec_rw.oid.inode = 3;
op.sec_rw.oid.stripe = (rand() << 17) % (1 << 29); // 512 MB
op.sec_rw.version = 0;
op.sec_rw.len = 4096;
op.sec_rw.offset = (rand() * op.sec_rw.len) % (1 << 17);
r = write_blocking(peer_fd, op.buf, OSD_PACKET_SIZE) == OSD_PACKET_SIZE;
if (!r)
break;
buf = malloc(op.sec_rw.len);
iov[0] = { reply.buf, OSD_PACKET_SIZE };
iov[1] = { buf, op.sec_rw.len };
r = readv_blocking(peer_fd, iov, 2) == (OSD_PACKET_SIZE + op.sec_rw.len);
free(buf);
if (!r || !check_reply(OSD_PACKET_SIZE, op, reply, op.sec_rw.len))
break;
clock_gettime(CLOCK_REALTIME, &tv_end);
read_count++;
read_sum += (
(tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
tv_end.tv_nsec/1000 - tv_begin.tv_nsec/1000
);
tv_begin = tv_end;
// write
op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
op.hdr.id = 1;
op.hdr.opcode = OSD_OP_SECONDARY_WRITE;
op.hdr.opcode = OSD_OP_SEC_WRITE;
op.sec_rw.oid.inode = 3;
op.sec_rw.oid.stripe = (rand() << 17) % (1 << 29); // 512 MB
op.sec_rw.version = 0;
@ -113,9 +160,9 @@ void run_bench(int peer_fd)
op.sec_rw.offset = (rand() * op.sec_rw.len) % (1 << 17);
buf = malloc(op.sec_rw.len);
memset(buf, rand() % 255, op.sec_rw.len);
r = write_blocking(peer_fd, op.buf, OSD_PACKET_SIZE) == OSD_PACKET_SIZE;
if (r)
r = write_blocking(peer_fd, buf, op.sec_rw.len) == op.sec_rw.len;
iov[0] = { op.buf, OSD_PACKET_SIZE };
iov[1] = { buf, op.sec_rw.len };
r = writev_blocking(peer_fd, iov, 2) == (OSD_PACKET_SIZE + op.sec_rw.len);
free(buf);
if (!r)
break;
@ -128,6 +175,7 @@ void run_bench(int peer_fd)
(tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
tv_end.tv_nsec/1000 - tv_begin.tv_nsec/1000
);
tv_begin = tv_end;
// sync/stab
op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
op.hdr.id = 1;
@ -138,11 +186,12 @@ void run_bench(int peer_fd)
r = read_blocking(peer_fd, reply.buf, OSD_PACKET_SIZE);
if (!check_reply(r, op, reply, 0))
break;
clock_gettime(CLOCK_REALTIME, &tv_begin);
clock_gettime(CLOCK_REALTIME, &tv_end);
sync_count++;
sync_sum += (
(tv_begin.tv_sec - tv_end.tv_sec)*1000000 +
tv_begin.tv_nsec/1000 - tv_end.tv_nsec/1000
(tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
tv_end.tv_nsec/1000 - tv_begin.tv_nsec/1000
);
tv_begin = tv_end;
}
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
/**
* Stub "OSD" to test & compare network performance with sync read/write and io_uring
*
@ -130,7 +133,7 @@ void run_stub(int peer_fd)
reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
reply.hdr.id = op.hdr.id;
reply.hdr.opcode = op.hdr.opcode;
if (op.hdr.opcode == OSD_OP_SECONDARY_READ)
if (op.hdr.opcode == OSD_OP_SEC_READ)
{
reply.hdr.retval = op.sec_rw.len;
buf = malloc(op.sec_rw.len);
@ -141,7 +144,7 @@ void run_stub(int peer_fd)
if (r < op.sec_rw.len)
break;
}
else if (op.hdr.opcode == OSD_OP_SECONDARY_WRITE)
else if (op.hdr.opcode == OSD_OP_SEC_WRITE || op.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
buf = malloc(op.sec_rw.len);
r = read_blocking(peer_fd, buf, op.sec_rw.len);

131
stub_uring_osd.cpp Normal file
View File

@ -0,0 +1,131 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
/**
* Stub "OSD" implemented on top of osd_messenger to test & compare
* network performance with sync read/write and io_uring
*/
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <stdexcept>
#include "ringloop.h"
#include "epoll_manager.h"
#include "messenger.h"
int bind_stub(const char *bind_address, int bind_port);
void stub_exec_op(osd_messenger_t *msgr, osd_op_t *op);
int main(int narg, char *args[])
{
ring_consumer_t looper;
ring_loop_t *ringloop = new ring_loop_t(512);
epoll_manager_t *epmgr = new epoll_manager_t(ringloop);
osd_messenger_t *msgr = new osd_messenger_t();
msgr->osd_num = 1351;
msgr->tfd = epmgr->tfd;
msgr->ringloop = ringloop;
msgr->repeer_pgs = [](osd_num_t) {};
msgr->exec_op = [msgr](osd_op_t *op) { stub_exec_op(msgr, op); };
// Accept new connections
int listen_fd = bind_stub("0.0.0.0", 11203);
epmgr->set_fd_handler(listen_fd, false, [listen_fd, msgr](int fd, int events)
{
msgr->accept_connections(listen_fd);
});
looper.loop = [msgr, ringloop]()
{
msgr->read_requests();
msgr->send_replies();
ringloop->submit();
};
ringloop->register_consumer(&looper);
printf("stub_uring_osd: waiting for clients\n");
while (true)
{
ringloop->loop();
ringloop->wait();
}
delete msgr;
delete epmgr;
delete ringloop;
return 0;
}
int bind_stub(const char *bind_address, int bind_port)
{
int listen_backlog = 128;
int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
if (listen_fd < 0)
{
throw std::runtime_error(std::string("socket: ") + strerror(errno));
}
int enable = 1;
setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable));
sockaddr_in addr;
int r;
if ((r = inet_pton(AF_INET, bind_address, &addr.sin_addr)) != 1)
{
close(listen_fd);
throw std::runtime_error("bind address "+std::string(bind_address)+(r == 0 ? " is not valid" : ": no ipv4 support"));
}
addr.sin_family = AF_INET;
addr.sin_port = htons(bind_port);
if (bind(listen_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
{
close(listen_fd);
throw std::runtime_error(std::string("bind: ") + strerror(errno));
}
if (listen(listen_fd, listen_backlog) < 0)
{
close(listen_fd);
throw std::runtime_error(std::string("listen: ") + strerror(errno));
}
fcntl(listen_fd, F_SETFL, fcntl(listen_fd, F_GETFL, 0) | O_NONBLOCK);
return listen_fd;
}
void stub_exec_op(osd_messenger_t *msgr, osd_op_t *op)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
if (op->req.hdr.opcode == OSD_OP_SEC_READ)
{
op->reply.hdr.retval = op->req.sec_rw.len;
op->buf = malloc(op->req.sec_rw.len);
op->iov.push_back(op->buf, op->req.sec_rw.len);
}
else if (op->req.hdr.opcode == OSD_OP_SEC_WRITE || op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
op->reply.hdr.retval = op->req.sec_rw.len;
}
else if (op->req.hdr.opcode == OSD_OP_TEST_SYNC_STAB_ALL)
{
op->reply.hdr.retval = 0;
}
else
{
printf("client %d: unsupported stub opcode: %lu\n", op->peer_fd, op->req.hdr.opcode);
op->reply.hdr.retval = -EINVAL;
}
msgr->outbox_push(op);
}

328
test.cpp
View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#define _LARGEFILE64_SOURCE
#include <sys/types.h>
#include <sys/ioctl.h>
@ -13,6 +16,7 @@
#include <assert.h>
#include <stdio.h>
#include <liburing.h>
#include <math.h>
#include <sys/socket.h>
#include <sys/epoll.h>
@ -61,24 +65,6 @@ static void test_write(struct io_uring *ring, int fd)
free(buf);
}
class obj_ver_hash
{
public:
size_t operator()(const obj_ver_id &s) const
{
size_t seed = 0;
spp::hash_combine(seed, s.oid.inode);
spp::hash_combine(seed, s.oid.stripe);
spp::hash_combine(seed, s.version);
return seed;
}
};
inline bool operator == (const obj_ver_id & a, const obj_ver_id & b)
{
return a.oid == b.oid && a.version == b.version;
}
int main00(int argc, char *argv[])
{
// queue with random removal: vector is best :D
@ -170,9 +156,9 @@ int main0(int argc, char *argv[])
// btree_map 5M entries monotone -> 0.458s, random -> 5.429s
// absl::btree_map 5M entries random -> 5.09s
// sparse_hash_map 5M entries -> 2.193s, random -> 2.586s
//btree::btree_map<obj_ver_id, dirty_entry> dirty_db;
btree::btree_map<obj_ver_id, dirty_entry> dirty_db;
//std::map<obj_ver_id, dirty_entry> dirty_db;
spp::sparse_hash_map<obj_ver_id, dirty_entry, obj_ver_hash> dirty_db;
//spp::sparse_hash_map<obj_ver_id, dirty_entry, obj_ver_hash> dirty_db;
for (int i = 0; i < 5000000; i++)
{
dirty_db[(obj_ver_id){
@ -182,7 +168,7 @@ int main0(int argc, char *argv[])
},
.version = 1,
}] = (dirty_entry){
.state = ST_D_META_SYNCED,
.state = ST_D_SYNCED,
.flags = 0,
.location = (uint64_t)i << 17,
.offset = 0,
@ -337,87 +323,253 @@ int main04(int argc, char *argv[])
return 0;
}
int main05(int argc, char *argv[])
uint64_t jumphash(uint64_t key, int count)
{
// FIXME extract this into a test
pg_t pg = {
.state = PG_PEERING,
.pg_num = 1,
.target_set = { 1, 2, 3 },
.cur_set = { 1, 2, 3 },
.peering_state = new pg_peering_state_t(),
};
for (uint64_t osd_num = 1; osd_num <= 3; osd_num++)
uint64_t b = 0;
uint64_t seed = key;
for (int j = 1; j < count; j++)
{
pg_list_result_t r = {
.buf = (obj_ver_id*)malloc(sizeof(obj_ver_id) * 1024*1024*8),
.total_count = 1024*1024*8,
.stable_count = (uint64_t)(1024*1024*8 - (osd_num == 1 ? 10 : 0)),
};
for (uint64_t i = 0; i < r.total_count; i++)
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
if (seed < (UINT64_MAX / (j+1)))
{
r.buf[i] = {
.oid = {
.inode = 1,
.stripe = (i << STRIPE_SHIFT) | (osd_num-1),
},
.version = (uint64_t)(osd_num == 1 && i >= r.total_count - 10 ? 2 : 1),
};
b = j;
}
pg.peering_state->list_results[osd_num] = r;
}
pg.calc_object_states();
printf("deviation variants=%ld clean=%lu\n", pg.state_dict.size(), pg.clean_count);
for (auto it: pg.state_dict)
return b;
}
void jumphash_prepare(int count, uint64_t *out_weights, uint64_t *in_weights)
{
if (count <= 0)
{
printf("dev: state=%lx\n", it.second.state);
return;
}
uint64_t total_weight = in_weights[0];
out_weights[0] = UINT64_MAX;
for (int j = 1; j < count; j++)
{
total_weight += in_weights[j];
out_weights[j] = UINT64_MAX / total_weight * in_weights[j];
}
}
uint64_t jumphash_weights(uint64_t key, int count, uint64_t *prepared_weights)
{
uint64_t b = 0;
uint64_t seed = key;
for (int j = 1; j < count; j++)
{
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
if (seed < prepared_weights[j])
{
b = j;
}
}
return b;
}
void jumphash3(uint64_t key, int count, uint64_t *weights, uint64_t *r)
{
r[0] = 0;
r[1] = 1;
r[2] = 2;
uint64_t total_weight = weights[0]+weights[1]+weights[2];
uint64_t seed = key;
for (int j = 3; j < count; j++)
{
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
total_weight += weights[j];
if (seed < UINT64_MAX*1.0*weights[j]/total_weight)
r[0] = j;
else
{
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
if (seed < UINT64_MAX*1.0*weights[j]/total_weight)
r[1] = j;
else
{
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
if (seed < UINT64_MAX*1.0*weights[j]/total_weight)
r[2] = j;
}
}
}
}
uint64_t crush(uint64_t key, int count, uint64_t *weights)
{
uint64_t b = 0;
uint64_t seed = 0;
uint64_t max = 0;
for (int j = 0; j < count; j++)
{
seed = (key + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (j + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
seed = -log(((double)seed) / (1ul << 32) / (1ul << 32)) * weights[j];
if (seed > max)
{
max = seed;
b = j;
}
}
return b;
}
void crush3(uint64_t key, int count, uint64_t *weights, uint64_t *r, uint64_t total_weight)
{
uint64_t seed = 0;
uint64_t max = 0;
for (int k1 = 0; k1 < count; k1++)
{
for (int k2 = k1+1; k2 < count; k2++)
{
if (k2 == k1)
{
continue;
}
for (int k3 = k2+1; k3 < count; k3++)
{
if (k3 == k1 || k3 == k2)
{
continue;
}
seed = (key + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (k1 + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (k2 + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (k3 + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
//seed = ((double)seed) / (1ul << 32) / (1ul << 32) * (weights[k1] + weights[k2] + weights[k3]);
seed = ((double)seed) / (1ul << 32) / (1ul << 32) * (1 -
(1 - 1.0*weights[k1]/total_weight)*
(1 - 1.0*weights[k2]/total_weight)*
(1 - 1.0*weights[k3]/total_weight)
) * UINT64_MAX;
if (seed > max)
{
r[0] = k1;
r[1] = k2;
r[2] = k3;
max = seed;
}
}
}
}
return 0;
}
int main(int argc, char *argv[])
{
timeval fill_start, fill_end, filter_end;
spp::sparse_hash_map<object_id, clean_entry> clean_db;
//std::map<object_id, clean_entry> clean_db;
//btree::btree_map<object_id, clean_entry> clean_db;
gettimeofday(&fill_start, NULL);
printf("filling\n");
uint64_t total = 1024*1024*8*4;
clean_db.resize(total);
for (uint64_t i = 0; i < total; i++)
int host_count = 6;
uint64_t host_weights[] = {
34609*3,
34931*3,
35850+36387+35859,
36387,
36387*2,
36387,
};
/*int osd_count[] = { 3, 3, 3, 1, 2 };
uint64_t osd_weights[][3] = {
{ 34609, 34609, 34609 },
{ 34931, 34931, 34931 },
{ 35850, 36387, 35859 },
{ 36387 },
{ 36387, 36387 },
};*/
uint64_t total_weight = 0;
for (int i = 0; i < host_count; i++)
{
clean_db[(object_id){
.inode = 1,
//.stripe = (i << STRIPE_SHIFT),
.stripe = (((367*i) % total) << STRIPE_SHIFT),
}] = (clean_entry){
.version = 1,
.location = i << DEFAULT_ORDER,
};
total_weight += host_weights[i];
}
gettimeofday(&fill_end, NULL);
// no resize():
// spp = 17.87s (seq), 41.81s (rand), 3.29s (seq+resize), 8.3s (rand+resize), ~1.3G RAM in all cases
// std::unordered_map = 6.14 sec, ~2.3G RAM
// std::map = 13 sec (seq), 5.54 sec (rand), ~2.5G RAM
// cpp-btree = 2.47 sec (seq) ~1.2G RAM, 20.6 sec (pseudo-random 367*i % total) ~1.5G RAM
printf("filled %.2f sec\n", (fill_end.tv_sec - fill_start.tv_sec) + (fill_end.tv_usec - fill_start.tv_usec) / 1000000.0);
for (int pg = 0; pg < 100; pg++)
uint64_t host_weights_prepared[host_count];
jumphash_prepare(host_count, host_weights_prepared, host_weights);
uint64_t total_pgs[host_count] = { 0 };
int pg_count = 256;
double uniformity[pg_count] = { 0 };
for (uint64_t pg = 1; pg <= pg_count; pg++)
{
obj_ver_id* buf1 = (obj_ver_id*)malloc(sizeof(obj_ver_id) * ((total+99)/100));
int j = 0;
for (auto it: clean_db)
if ((it.first % 100) == pg)
buf1[j++] = { .oid = it.first, .version = it.second.version };
free(buf1);
printf("filtered %d\n", j);
uint64_t r[3];
/*
// Select first host
//r[0] = jumphash_weights(pg, host_count, host_weights_prepared);
r[0] = crush(pg, host_count, host_weights);
// Select second host
uint64_t seed = pg;
r[1] = r[0];
while (r[1] == r[0])
{
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
//r[1] = jumphash_weights(seed, host_count, host_weights_prepared);
r[1] = crush(seed, host_count, host_weights);
}
// Select third host
seed = pg;
r[2] = r[0];
while (r[2] == r[0] || r[2] == r[1])
{
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
//r[2] = jumphash_weights(seed, host_count, host_weights_prepared);
r[2] = crush(seed, host_count, host_weights);
}
*/
/*
// Select second host
uint64_t host_weights1[host_count];
for (int i = 0; i < r[0]; i++)
host_weights1[i] = host_weights[i];
for (int i = r[0]+1; i < host_count; i++)
host_weights1[i-1] = host_weights[i];
r[1] = crush(pg, host_count-1, host_weights1);
// Select third host
for (int i = r[1]+1; i < host_count-1; i++)
host_weights1[i-1] = host_weights[i];
r[2] = crush(pg, host_count-2, host_weights1);
// Transform numbers
r[2] = r[2] >= r[1] ? 1+r[2] : r[2];
r[2] = r[2] >= r[0] ? 1+r[2] : r[2];
r[1] = r[1] >= r[0] ? 1+r[1] : r[1];
*/
crush3(pg, host_count, host_weights, r, total_weight);
uint64_t shift = (2862933555777941757ull*pg + 3037000493ull) % host_count;
if (shift == 1)
{
uint64_t tmp;
tmp = r[0];
r[0] = r[1];
r[1] = r[2];
r[2] = tmp;
}
else if (shift == 2)
{
uint64_t tmp;
tmp = r[0];
r[0] = r[2];
r[2] = r[1];
r[1] = tmp;
}
total_pgs[r[0]]++;
total_pgs[r[1]]++;
total_pgs[r[2]]++;
double u = 0;
for (int i = 0; i < host_count; i++)
{
double d = abs(1 - total_pgs[i]/3.0/pg * total_weight/host_weights[i]);
u += d;
}
uniformity[pg-1] = u/host_count;
printf("pg %lu: hosts %lu, %lu, %lu ; avg deviation = %.2f\n", pg, r[0], r[1], r[2], u/host_count);
}
gettimeofday(&filter_end, NULL);
// spp = 42.15 sec / 60 sec (rand)
// std::unordered_map = 43.7 sec
// std::map = 156.13 sec
// cpp-btree = 21.87 sec (seq), 44.33 sec (rand)
printf("100 times filter %.2f sec\n", (filter_end.tv_sec - fill_end.tv_sec) + (filter_end.tv_usec - fill_end.tv_usec) / 1000000.0);
printf("total PGs: ");
for (int i = 0; i < host_count; i++)
{
printf(i > 0 ? ", %lu (%.2f)" : "%lu (%.2f)", total_pgs[i], total_pgs[i]/3.0/pg_count * total_weight/host_weights[i]);
}
printf("\n");
return 0;
}

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <stdio.h>
#include "allocator.h"

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#include <malloc.h>
#include "timerfd_interval.h"
#include "blockstore.h"
@ -115,7 +118,7 @@ int main(int narg, char *args[])
}
};
ringloop->register_consumer(main_cons);
ringloop->register_consumer(&main_cons);
while (1)
{
ringloop->loop();

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 (see README.md for details)
#pragma once
#include <assert.h>

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.0 or GNU GPL-2.0+ (see README.md for details)
#include <sys/timerfd.h>
#include <sys/poll.h>
#include <unistd.h>
@ -20,14 +23,14 @@ timerfd_interval::timerfd_interval(ring_loop_t *ringloop, int seconds, std::func
throw std::runtime_error(std::string("timerfd_settime: ") + strerror(errno));
}
consumer.loop = [this]() { loop(); };
ringloop->register_consumer(consumer);
ringloop->register_consumer(&consumer);
this->ringloop = ringloop;
this->callback = cb;
}
timerfd_interval::~timerfd_interval()
{
ringloop->unregister_consumer(consumer);
ringloop->unregister_consumer(&consumer);
close(timerfd);
}

Some files were not shown because too many files have changed in this diff Show More