Vitaliy Filippov
82e6aff17b
Support mapping NBD by the image name
2021-04-17 17:39:55 +03:00
Vitaliy Filippov
57e2c503f7
Rename osd_t::c_cli to msgr
2021-04-17 16:32:09 +03:00
Vitaliy Filippov
715bc8d53d
Release 0.6.2
...
- Fix a possible crash during SYNC when journal fsyncs are enabled
- Fix a memory leak in the chained read implementation
2021-04-15 23:40:06 +03:00
Vitaliy Filippov
0af077701c
Fix a possible crash during SYNC when journal fsyncs are enabled
2021-04-15 02:01:50 +03:00
Vitaliy Filippov
cac976ce25
Fix a memory leak in the chained read implementation
2021-04-15 01:42:18 +03:00
Vitaliy Filippov
acf0646542
Build common sources once
2021-04-15 01:13:34 +03:00
Vitaliy Filippov
ede1c1d667
Release 0.6.1
...
A bugfix for the new "chained read from snapshot" feature
2021-04-14 22:32:23 +03:00
Vitaliy Filippov
38bd51c97f
Remove the aio_context assertion, as it seems to be unneeded
2021-04-14 22:32:15 +03:00
Vitaliy Filippov
966fb763ca
Oooops, fix chained reads
2021-04-13 16:19:21 +03:00
Vitaliy Filippov
0b41ffc08d
Release 0.6.0
...
Warning: upgrading from 0.5.x is currently not supported!
Please create an issue if you really need upgrade capability.
New features:
- Snapshots and Copy-on-Write clones
- Inode (image) names
- Inode I/O and space statistics
- Write throttling for smoothing random write workloads in SSD+HDD configurations
2021-04-11 00:49:18 +03:00
Vitaliy Filippov
64eeb79051
Prevent 0.6.x OSDs from talking to 0.5.x
...
The new protocol is almost compatible: it has bitmaps, but it also has
a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients
compatible, but for now I just assume nobody needs it.
If I'm wrong and anybody asks to upgrade their production 0.5.x system
to 0.6.x, I'll fix it.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov
2a02f3c4c7
Add metadata superblock and check it on start
...
Refuse to start if the superblock is missing or has a bad version;
zero out the metadata area when initializing the superblock.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov
f684d9101a
Refuse to start with old journal version
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
a1f2f19489
Do not increment inode statistics if the object already exists
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
82c1a7ec67
Fix statistics reporting, split inode number into pool & inode
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
2ab423d4ef
Implement journaled write throttling for the SSD+HDD case
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
4694811eab
Add microsecond accuracy to set_timer
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
6b988de17d
Remove timerfd_interval
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
37efdc2a83
Fix bitmap_set for replicated pools
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
591cad09c9
Fix bitmaps for objects larger than 128K
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
b907ad50aa
Oops, forgot to add external bitmaps to blockstore in some places
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
5f5b6ef150
Enable chained reads in the client
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
38a3df4a0e
Implement chained (optimized) read in the primary OSD code
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
6950b8e3a0
Watch inode metadata revisions
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
0cea3576fb
Add "read bitmaps" operation to secondary OSD protocol
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
f01eea07d3
Add simplified interface to read blockstore bitmaps synchronously
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
2c2f08aca2
Shorten some structure names
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
d6524670e1
Introduce data distribution locality
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
7aeb2cbac7
Capture all by value in qemu_proxy
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
2612d3198a
Introduce image names and metadata storage in etcd
...
Each inode has: image name, parent inode number & pool, size and readonly flag
Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-04-10 17:44:12 +03:00
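The snapshot scheme described in commit 2612d3198a can be sketched roughly as follows. This is a minimal in-memory model, not the project's code: the `inode_meta` and `image_index` types, the `create_snapshot` function, the single-pool simplification and the `image@snapshot` naming of the old inode are all assumptions made for illustration, and nothing here talks to etcd.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, simplified model of the per-inode metadata listed in the
// commit message; names are illustrative, not the project's actual ones.
struct inode_meta
{
    std::string name;          // image name
    uint64_t parent_pool = 0;  // pool of the parent inode (0 = no parent)
    uint64_t parent_inode = 0; // parent inode number (0 = no parent)
    uint64_t size = 0;         // image size in bytes
    bool readonly = false;
};

// Assumption: all inodes live in one pool to keep the sketch short
struct image_index
{
    uint64_t pool = 1;
    std::map<uint64_t, inode_meta> inodes;   // inode number -> metadata
    std::map<std::string, uint64_t> by_name; // image name -> inode number
};

// Snapshot = point the image name at a new inode and make the old inode
// its read-only parent. Returns the new (writable) inode number.
uint64_t create_snapshot(image_index & idx, const std::string & image,
    const std::string & snap_name, uint64_t new_inode)
{
    if (new_inode == 0 || idx.inodes.count(new_inode))
        throw std::invalid_argument("new inode number is invalid or in use");
    uint64_t old_inode = idx.by_name.at(image); // throws if the image is unknown
    assert(idx.inodes.count(old_inode) == 1);   // the name index must be consistent
    inode_meta & snap = idx.inodes.at(old_inode);
    inode_meta head;
    head.name = image;
    head.parent_pool = idx.pool;
    head.parent_inode = old_inode;
    head.size = snap.size;
    // The old inode keeps the data and becomes the read-only snapshot layer
    snap.name = image + "@" + snap_name;
    snap.readonly = true;
    idx.by_name[snap.name] = old_inode;
    idx.by_name[image] = new_inode;
    idx.inodes[new_inode] = head;
    return new_inode;
}
```

In this model a read of the new inode that finds no data would fall through to `parent_inode`, which is the job the later "chained read" commits appear to address.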
Vitaliy Filippov
ab39ce2bbb
Switch back to clean_entry_bitmap_size instead of entry_attr_size because of the changed bitmap handling
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
d0c2e31312
Add a test for snapshots, fix bugs. Now the test passes
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
9038d42327
Fix several snapshot I/O bugs
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
691f066055
Actual snapshot support (untested)
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
ffe1cd4c79
Report inode I/O statistics, aggregate it in the monitor
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
4ae1b84c67
Report inode space usage statistics to etcd, aggregate it in the monitor
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
c35963967f
Add inode space usage statistics tracking to blockstore
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
0aa2dd2890
Send bitmaps with primary-reads, actually read bitmaps for READ ops
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
6bf88883ac
Allocate bitmaps along with stripes to avoid memory fragmentation
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
004f265393
Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
860ac24762
Add "external" bitmap support to the secondary OSD protocol
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
6107a4d07b
Add "external" bitmap support to blockstore
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
95c29b9dc3
Add "external" bitmap support to osd_rmw
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
6909807068
Allow starting the OSD just to flush the journal completely
2021-04-10 17:44:12 +03:00
Vitaliy Filippov
18c72f4835
Correct reenterability fix (now verified with a test)
...
It's rather funny, but 0.5.12 has to be re-published
2021-04-09 12:10:16 +03:00
Vitaliy Filippov
40b7c21fb1
Followup to 307c1731c1
- fix mark_stable
2021-04-08 15:47:18 +03:00
Vitaliy Filippov
efb3678606
Fix qemu-img broken in 0.5.11
...
Caused by the lack of reenterability of the main cluster_client function
2021-04-08 14:59:20 +03:00
Vitaliy Filippov
8d87e32175
Fix msgr_op.h includes
2021-04-08 01:18:46 +03:00
Vitaliy Filippov
b0b2e7df3c
Fix use-after-free in keepalive_timer and rework stop_client()
...
The bug was reproducible when fio was temporarily stopped with SIGSTOP
during a write test and then resumed after 10 seconds. In this case
"pings" failed for all clients and the fio process crashed with a
'use-after-free' in keepalive_timer. It happened because keepalive_timer
called stop_client while holding a live iterator to the map.
2021-04-07 11:06:31 +03:00
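The class of bug described in commit b0b2e7df3c (erasing from a map while an iterator to it is still live) can be illustrated with a small sketch. The `messenger` and `client_t` types and the `check_pings` function below are hypothetical stand-ins rather than the project's real classes, and the commit does not say that the actual fix looks like this; the sketch only shows one common way to avoid the problem, by collecting the keys first and stopping the clients afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a connected client
struct client_t
{
    bool ping_failed = false;
};

struct messenger
{
    std::map<int, client_t> clients; // keyed by a file descriptor-like id

    // Removes the client; this invalidates any iterator pointing at it
    void stop_client(int fd)
    {
        clients.erase(fd);
    }

    // Unsafe variant (shown as a comment only, it is undefined behaviour):
    //   for (auto it = clients.begin(); it != clients.end(); ++it)
    //       if (it->second.ping_failed)
    //           stop_client(it->first); // `it` dangles, then ++it is UB
    //
    // Safe variant: remember the ids first, then stop the clients
    // once no iterator into the map is alive any more.
    std::size_t check_pings()
    {
        std::vector<int> failed;
        for (const auto & kv: clients)
        {
            if (kv.second.ping_failed)
                failed.push_back(kv.first);
        }
        for (int fd: failed)
            stop_client(fd);
        assert(failed.size() <= failed.capacity());
        return failed.size();
    }
};
```

Collecting the ids costs a temporary vector, but it stays correct even if `stop_client` later grows side effects that touch other entries of the map.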
Vitaliy Filippov
97efb9e299
Do not crash on PG re-peering events when operations are in progress
2021-04-07 11:06:31 +03:00
Vitaliy Filippov
f6d705383a
Fix client connection recovery bugs, add dirty_ops limit
2021-04-07 11:06:31 +03:00
Vitaliy Filippov
68567c0e1f
Fix messenger possibly trying to connect to the same OSD twice
2021-04-07 01:30:38 +03:00
Vitaliy Filippov
04b00003e9
Log ping failures
2021-04-07 01:30:38 +03:00
Vitaliy Filippov
307c1731c1
Forget all dirty_entries before stable big_write or delete during initialisation
...
This fixes a 'double_alloc' assertion in the following case:
- big_write object #1 v1 to block #100
- big_write object #1 v2 to block #101
- big_write object #2 v1 to block #100
2021-04-07 01:30:38 +03:00
Vitaliy Filippov
a48e2bbf18
Fix write replay ordering when immediate_commit != all
...
The previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage.
Also rework the cluster_client queueing logic and add tests to verify the correct behaviour.
2021-04-03 14:51:52 +03:00
Vitaliy Filippov
688821665a
Remove stoull_full() from etcd_state_client.cpp
2021-04-03 14:36:04 +03:00
Vitaliy Filippov
3e162d95a0
Remove http_client.h include from etcd_state_client.h
2021-04-03 14:36:04 +03:00
Vitaliy Filippov
829381b335
Extract some definitions to msgr_op.{cpp,h}
2021-04-03 14:36:04 +03:00
Vitaliy Filippov
54f2353f24
Use bitmap granularity for alignment checks
2021-04-03 14:36:04 +03:00
Vitaliy Filippov
e47f6fba60
Remove cluster_client_t::stop()
2021-04-03 14:35:42 +03:00
Vitaliy Filippov
883bf84a16
Fix build
2021-04-03 01:47:15 +03:00
Vitaliy Filippov
52097c4856
Stop flushing when less than min_flusher_count operations are available (unless a trim is forced)
2021-04-03 00:53:28 +03:00
Vitaliy Filippov
e1355cbc74
Report failed operation name in cluster_client
2021-04-03 00:53:28 +03:00
Vitaliy Filippov
8f8b90be7a
Add min_flusher_count configuration
2021-04-03 00:53:28 +03:00
Vitaliy Filippov
ad9f619370
Skip double allocs when reading journal
2021-04-03 00:53:28 +03:00
Vitaliy Filippov
f4769ba7c7
Collapse create+delete journal entry pairs if they're already flushed
...
Old journal replay mechanism could lead to a double allocation of the same
block and a "Fatal error: tried to overwrite non-zero metadata entry"
2021-04-03 00:53:28 +03:00
Vitaliy Filippov
843b7052d2
Add an assertion when clearing deleted metadata entries, add debug details when freeing blocks
2021-04-03 00:53:28 +03:00
Vitaliy Filippov
4095bcc558
Do not ignore object deletion journal entries when they are preceded by a big write
2021-03-25 11:00:10 +03:00
Vitaliy Filippov
564d64e271
Add some details for debug prints
2021-03-25 11:00:10 +03:00
Vitaliy Filippov
cf54741c95
Followup to 05db1308aa
...
Don't do anything with the object state after errors because
it's freed by PG re-peer in this case
2021-03-25 11:00:10 +03:00
Vitaliy Filippov
18a5fafa2a
Fix rollback
2021-03-25 02:41:58 +03:00
Vitaliy Filippov
06f4978085
Fix fsync check in blockstore_flush (data fsyncs were disabled instead of journal fsyncs)
2021-03-25 02:41:58 +03:00
Vitaliy Filippov
7ebf1588c5
Check for immediate_commit==small in the OSD code
2021-03-25 02:41:58 +03:00
Vitaliy Filippov
b0ad1e1e6d
Remember writes as "unsynced" only after completing them
...
Previously BS_OP_SYNC could take unfinished writes and add them into the journal before
they were actually completed. This was leading to crashes with the message
"BUG: Unexpected dirty_entry 2000000000001:9f2a0000 v3 unstable state during flush: 338"
2021-03-25 02:41:58 +03:00
Vitaliy Filippov
0949f08407
Extract osd_primary write and sync code into separate files
2021-03-24 14:20:56 +03:00
Vitaliy Filippov
04a1f18fa5
Assign .req as a whole to always zero out the remaining part
...
Also clear .reply before processing the operation
2021-03-24 14:20:56 +03:00
Vitaliy Filippov
cf9a641d66
Skip disconnected OSDs during sync
2021-03-24 14:20:56 +03:00
Vitaliy Filippov
05db1308aa
Fix two potential read/write ordering problems (even though not yet seen in tests)
...
- Write operations could be 'stabilized' and previous versions could be
purged from OSDs before the removal of version_override, so subsequent
reads could potentially hit a different version in EC pools
- The object was marked clean after completing the delete during recovery, so
reads could in theory hit a deleted version and return nothing
2021-03-24 14:20:56 +03:00
Vitaliy Filippov
98b54ca948
Don't try to "recover" misplaced objects if it would make them degraded
2021-03-21 01:37:23 +03:00
Vitaliy Filippov
23225c5e62
Do not run ping on clients that are not yet connected
2021-03-21 01:37:23 +03:00
Vitaliy Filippov
435045751d
Delete objects only after a SYNC during rebalance in the non-immediate_commit mode
...
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects
2021-03-16 12:48:26 +03:00
Vitaliy Filippov
c5fb1d5987
Do not duplicate blockstore operations when io_uring fills up
...
This bug was leading to OSDs dying with "Assertion `fulfilled == read_op->len' failed"
when testing fio -rw=randread -numjobs=8 -iodepth=128
2021-03-16 12:48:26 +03:00
Vitaliy Filippov
9ac7e75178
Allow specifying etcd URLs for OSDs with http://, do not die with a strange error if the -etcd option is missing for fio
2021-03-16 12:48:26 +03:00
Vitaliy Filippov
88671cf745
Fix a bug causing all flushers to wait for an fsync without actually trying to do it
...
This happened because flusher_count became dynamic and fsync_batch() was comparing the number
of flushers currently ready to do an fsync with the maximum number of flushers. The number
also wasn't rechecked on every loop iteration, which was incorrect as well.
Now the interrupted_rebalance test passes even without IMMEDIATE_COMMIT=1.
2021-03-13 17:27:29 +03:00
Vitaliy Filippov
ceb9c28de7
Set default log_level before passing config to etcd_state_client
2021-03-13 17:19:45 +03:00
Vitaliy Filippov
299d7d7c95
Use common macro for get_sqe
2021-03-13 17:19:45 +03:00
Vitaliy Filippov
d1526b415f
Correctly resume writes when OSD is full to return an error
2021-03-13 17:19:45 +03:00
Vitaliy Filippov
f49fd53d55
Fix a bug where the allocator was unable to allocate up to the last (n%64) blocks, add tests for it
2021-03-13 02:19:02 +03:00
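Commit f49fd53d55 does not describe the cause of the allocator bug, so the following is only a generic illustration of how the last (n%64) blocks can be handled in a bitmap allocator built from 64-bit words. The `bitmap_allocator` type is hypothetical and much simpler than a production allocator: it is a flat bitmap that rounds the word count up and pre-marks the unused tail bits of the last word as taken, so that every real block stays allocatable and no block past the end is ever returned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat bitmap allocator; a set bit means "block in use"
struct bitmap_allocator
{
    uint64_t total;             // number of real blocks
    std::vector<uint64_t> mask; // one bit per block, 64 blocks per word

    explicit bitmap_allocator(uint64_t blocks):
        total(blocks), mask((blocks + 63) / 64, 0) // round the word count up
    {
        // Bits past the end of the last word do not map to real blocks,
        // so mark them as used; the first (blocks % 64) bits stay free
        if (blocks % 64)
            mask.back() = ~uint64_t(0) << (blocks % 64);
    }

    // Returns the lowest free block number, or UINT64_MAX when full
    uint64_t allocate()
    {
        for (std::size_t w = 0; w < mask.size(); w++)
        {
            if (mask[w] == ~uint64_t(0))
                continue; // this word is completely used
            for (int bit = 0; bit < 64; bit++)
            {
                if (!(mask[w] & (uint64_t(1) << bit)))
                {
                    mask[w] |= uint64_t(1) << bit;
                    uint64_t block = uint64_t(w) * 64 + bit;
                    assert(block < total); // tail bits are never handed out
                    return block;
                }
            }
        }
        return UINT64_MAX;
    }
};
```

With 70 blocks this uses two words: the second word has six usable bits, and a version that truncated the word count to `blocks / 64` would lose exactly those six blocks.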
Vitaliy Filippov
b44f49aab2
Ignore zero OSDs in history osd_sets
2021-03-12 12:40:15 +03:00
Vitaliy Filippov
af5155fcd9
Implement "no_recovery" and "no_rebalance" flags
2021-03-11 00:36:31 +03:00
Vitaliy Filippov
c4ba24c305
Do not print ping op latency
2021-03-10 02:01:44 +03:00
Vitaliy Filippov
bd178ac20f
Fix history osd_set check - local OSD is always available!
2021-03-09 02:18:18 +03:00
Vitaliy Filippov
ad577c4aac
Add PING operation and timeouts to detect OSD failures when a host goes down
2021-03-09 02:15:38 +03:00
Vitaliy Filippov
e91ff2a9ec
Only forget offline PGs if their state is not changed during reporting
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
086667f568
Do not check PG state key ownership if it doesn't exist yet
...
This fixes the bug where OSDs sometimes kept trying to report updated PG states
endlessly and unsuccessfully when PGs transitioned from 'starting' to 'peering' too fast
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
1be94da437
Check & remove extra chunks for degraded / incomplete objects, too
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
80e12358a2
Use pg_data_size instead of pg_minsize for object state calculation
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
36c935ace6
Use std::vector for the blockstore submission queue
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
0d8b5e2ef9
Remove unused enqueue_op_first()
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
98f1e2c277
Rework write/sync ordering
...
Make syncs wait for all previous writes because it's the only way
to make sure that OSDs do not receive incomplete writes in LIST results
during peering when some writes are still in progress.
Also simplify blockstore submission queue logic.
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
21e7686037
Fix possible "assertion failed: pg.inflight >= 0" error during PG stop
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
ab21a1908b
Check for the dirty PG flag when trying to continue to stop it after sync
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
30d1ccd43e
Fix an infinite loop when discarding list operations during stop_pg()
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
8bdd6d8d78
Reset PG state when stopping them
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
09b3e4e789
Fix OSDs being unable to stop PGs that are 'peering', not 'active'
...
This was sometimes leading to incorrect misplaced and degraded object count statistics
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
bc742ccf8c
Fix a small memory leak in etcd_state_client
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
314b20437b
Do not break subsequent small writes badly when a big write is canceled
2021-03-08 17:04:10 +03:00
Vitaliy Filippov
29d8ac8b1b
Do not report statistics for the empty operation
2021-03-01 16:20:57 +03:00
Vitaliy Filippov
6155b23a7e
Replace pgs[id] with pgs.at(id) to prevent accidental auto-vivification
2021-02-28 19:36:59 +03:00
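The "auto-vivification" mentioned in commit 6155b23a7e is standard `std::map` behaviour: `operator[]` inserts a default-constructed value when the key is missing, while `at()` throws `std::out_of_range`. A minimal sketch follows; the `pg_t` type and both helper functions are hypothetical and exist only to show the difference.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>

// Hypothetical placeholder for a placement group record
struct pg_t
{
    int state = 0;
};

// Takes the map by value on purpose: reading pgs[id] for a missing id
// silently inserts an empty pg_t, so the returned size grows by one
std::size_t size_after_subscript(std::map<int, pg_t> pgs, int id)
{
    (void)pgs[id].state;
    return pgs.size();
}

// at() never inserts (it even works on a const map) and reports
// a missing key by throwing std::out_of_range
bool lookup_throws(const std::map<int, pg_t> & pgs, int id)
{
    try
    {
        (void)pgs.at(id).state;
        return false;
    }
    catch (const std::out_of_range &)
    {
        return true;
    }
}
```

A phantom entry created by `operator[]` looks like a real PG in a default state, which is harder to debug than an exception raised at the faulty lookup.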
Vitaliy Filippov
46e79f3306
Wait for PGs to become clean before stopping them
2021-02-28 19:36:59 +03:00
Vitaliy Filippov
41fd14e024
Fix deletes not increasing write_iodepth
2021-02-28 19:36:59 +03:00
Vitaliy Filippov
2d73b19a6c
Fix online PG count change bugs
2021-02-25 23:59:33 +03:00
Vitaliy Filippov
c974cb539c
Make flusher_count adaptive and limit write iodepth
2021-02-25 23:59:33 +03:00
Vitaliy Filippov
bf9a175efc
Move C/C++ sources to src subdirectory
2021-02-25 23:59:03 +03:00