Another bunch of fixes, including important ones. Now OSDs are stable in SSD+HDD
configurations and everything is mostly ready for the merge of master branch.
Features:
- Add min_flusher_count configuration (good for HDDs)
- Shuffle PGs for better data device utilisation
- Make OSDs benefit from the immediate_commit=small setting if it's applicable
Bug fixes:
- Rework client code to fix write ordering during operation replay
- Rework error handling code so OSDs don't crash in reaction to a crash of their peer OSDs
- Fix several block layer problems related to the journal, some of which
were leading to double allocations of the same block during journal replay
- Fix monitors crashing during the removal of OSD keys from etcd
- Fix data fsyncs being incorrectly disabled when only disable_journal_fsync was set
- Always zero out unused part of request/reply headers
- Fix some theoretically possible read/write ordering issues
- Don't try to "recover" misplaced objects if it would make them degraded
- Fix heartbeats sometimes preventing OSD to establish connections
The bug reproduced if fio was temporarily stopped with SIGSTOP
during write test and then resumed after 10 seconds. In this case
"pings" were failed for all clients and fio process crashed with
'use-after-free' in keepalive_timer. It happened because it called
stop_client while having a live iterator to the map.
This fixes a 'double_alloc' assertion in the following case:
- big_write object #1 v1 to block #100
- big_write object #1 v2 to block #101
- big_write object #2 v1 to block #100
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage
Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
Previously BS_OP_SYNC could take unfinished writes and add them into the journal before
they were actually completed. This was leading to crashes with the message
"BUG: Unexpected dirty_entry 2000000000001:9f2a0000 v3 unstable state during flush: 338"
- Write operations could be 'stabilized' and previous versions could be
purged from OSDs before the removal of version_override and following
reads could potentially hit different version in EC pools
- Object was marked clean after completing the delete during recovery, so
reads could in theory hit a deleted version and return nothing
The version seems to be stable after this bunch of fixes :)
- Fix delete & write operation ordering during rebalance to not lose objects in the immediate_commit=off mode
- Fix a possible crash caused by very high iodepths
- Re-distribute PG primaries over OSDs that come up after a short downtime
- Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio
- Fix a journal flushing deadlock which sometimes occurred in the immediate_commit=off mode
- Fix a bug where OSDs could hang if the data device filled up
- Fix an allocator bug where it was unable to allocate up to last (n%64) data device blocks
- Fix monitor crash that occurred on removal of some etcd keys
- Fix a bug where PGs could remain incomplete due to incorrect PG history with just zeroes in osd_sets
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects