Commit Graph

451 Commits (lrc-matrix)

Author SHA1 Message Date
Vitaliy Filippov 515a2e6e33 Only die when detecting a real race condition, not just a CAS failure 2022-01-05 17:05:25 +03:00
Vitaliy Filippov 9c6168bf17 Remove fill_parsed_response 2022-01-03 20:08:26 +03:00
Vitaliy Filippov 5473d5b4a2 Rework HTTP client to use keepalive, move getifaddr_list to addr_util 2022-01-03 14:52:01 +03:00
Vitaliy Filippov c3304bce27
Merge pull request #38 from mirrorll/master
journal check_available error
2021-12-31 12:45:16 +03:00
Vitaliy Filippov b9f5c2a823 Support zero-copy send in fio_sec_osd to allow testing it
Prelimilary results:
- CPU usage drops significantly. For example, in T1Q8 128K write test against
  stub_uring_osd with 10G network and Athlon X4 860k CPU it drops from 100% to 30%
- Latency becomes slightly worse. In T1Q1 4K write test in the same environment
  latency increases from 56 to 63 us.
- Small write throughput also becomes slightly worse. In T1Q128 4K write test
  against stub iops decreases from 138k to ~110k (unstable, fluctuates 100k..120k).
  Note that this is without io_uring, of course.
2021-12-27 02:12:44 +03:00
Vitaliy Filippov e9d2f79aa7 Support reading bitmaps in fio_sec_osd 2021-12-27 02:12:44 +03:00
Vitaliy Filippov 0785bdf8b3 Release 0.6.11
- Slightly reduce journaling write amplification (requires no_same_sector_overwrites=false)
- Fix listen_backlog (it was 0) because it could more than halve OSD socket send speed
- Support IPv6 OSD addresses
- Do not try to initialize client in simple-offsets
- Fix OSDs sometimes marking PGs incomplete instead of trying to connect with peers
- Allow to configure OSD placement in node_placement
- Allow to run with 4k sector size block devices. Natural, but it was forbidden
2021-12-26 21:11:24 +03:00
Vitaliy Filippov b57e44748b Send 4 byte bitmap in stub_uring_osd 2021-12-25 11:38:13 +03:00
Vitaliy Filippov 1bbe62f29c Fix uninitialized listen_backlog which was leading to REALLY SLOW send speeds!!! 2021-12-25 11:38:13 +03:00
lihai 3061c30132 journal check_available error 2021-12-21 09:39:58 +08:00
Vitaliy Filippov 20a4406acc Support IPv6 OSD addresses 2021-12-19 10:42:17 +03:00
Vitaliy Filippov f93491bc6c Implement journal write batching and slightly refactor journal writes
Slightly reduces WA. For example, in 4K T1Q128 replicated randwrite tests
WA is reduced from ~3.6 to ~3.1, in T1Q64 from ~3.8 to ~3.4.

Only effective without no_same_sector_overwrites.
2021-12-16 00:27:17 +03:00
Vitaliy Filippov 999bed8514 Fix opening regular files as blockstore 2021-12-15 02:08:58 +03:00
Vitaliy Filippov 3f33095fd7 Do not try to initialize client in simple-offsets 2021-12-15 02:07:27 +03:00
Vitaliy Filippov dd74c5ce1b Fix OSDs marking PGs incomplete instead of trying to connect with peers 2021-12-14 01:57:51 +03:00
Vitaliy Filippov c6d104ecd6 Print object version on fatal overwrite 2021-12-14 01:57:04 +03:00
Vitaliy Filippov e544aef7d0 Fix test rw_blocking 2021-12-12 23:24:50 +03:00
Vitaliy Filippov 616c18c786 Fix stub_uring_osd 2021-12-12 23:06:11 +03:00
Vitaliy Filippov 2c7556e536 Allow to run with 4k sector size. Natural, but it was forbidden 2021-12-11 22:03:16 +00:00
Vitaliy Filippov 2020608a39 Release 0.6.10
- Implement a storage plugin for Proxmox. Now you can use Vitastor with Proxmox!
- Implement `vitastor-cli df` (pool space usage statistics) command
- Add glob pattern support for `vitastor-cli ls`
- Fix several bugs in other CLI commands (resize, create --parent, modify --readonly)
- Use 512 byte logical block size in QEMU driver by default (and thus don't require to set it in QEMU options)
2021-12-10 21:40:12 +03:00
Vitaliy Filippov f54ff6ad5d Do not crash in simple-offsets when some options are empty, too 2021-12-10 12:27:25 +03:00
Vitaliy Filippov b376ef2ed9 Do not crash on empty matched_addrs 2021-12-10 11:40:59 +03:00
Vitaliy Filippov 5a234588b9 Do not die when invoked via `vita` symlink 2021-12-10 02:45:16 +03:00
Vitaliy Filippov 0ee5e0a7fe Implement vitastor-cli df command 2021-12-10 02:37:02 +03:00
Vitaliy Filippov 3482bb0860 Fix readonly/readwrite option parsing 2021-12-10 00:52:59 +03:00
Vitaliy Filippov 526995f486 Do not skip empty iops in listings 2021-12-10 00:52:59 +03:00
Vitaliy Filippov 8dfbd7943c Use logical block size = 512 bytes by default 2021-12-08 23:43:40 +03:00
Vitaliy Filippov 3a83a32cb7 Aaand now fix create --parent :D 2021-12-08 23:00:34 +03:00
Vitaliy Filippov 20d5ed799a Add glob pattern matching for ls 2021-12-08 23:00:34 +03:00
Vitaliy Filippov b262938bca Fix naggy "Failed to get RDMA device list: Unknown error -38" 2021-12-08 02:02:30 +03:00
Vitaliy Filippov c3c2e68cc1 Now fix resize command :D 2021-12-05 01:38:08 +03:00
Vitaliy Filippov aa1e21dd99 Release 0.6.9
New features:
- Build Vitastor driver as part of QEMU
- Implement renaming images in CLI (vitastor-cli modify --rename)
- Add vitastor-cli alloc-osd and simple-offsets commands and use them in make-osd,
  thus removing the dependency on etcdctl
- Make monitor remove stale deleted inode statistics from etcd automatically
- Implement OSD address selection from a subnet, thus removing the need to specify
  OSD addresses in startup scripts explicitly

Bug fixes:
- Fix client failover in case of etcd shutdown or crash (make client survive etcd failures)
- Stick to the last live etcd in OSD and mon to prevent random failures when one of etcds is down
- Fix incorrect copying of data from journal to the data device which could lead to data corruption
- Prefer local etcd IPs in OSD
- Remove the total PG count restriction in optimize_change which was sometimes leading
  to inability to redistribute PGs over OSDs
- Fix error response parsing on a failed pg state report
- Fix slow linear writes with RDMA by changing default buffer settings
- Fix possible 'TypeError' in openstack nova when using Vitastor cinder driver
- Fix bugs in vitastor-cli create, ls, rm, modify commands

Patch changes:
- Add a patch for libvirt 7.6
- Add patches for QEMU 6.0 and 6.1
- Fix config file path XML location parsing in libvirt patches
- Replace _ with - in QEMU options
- Fix possible 'TypeError' in openstack nova when using Vitastor cinder driver
- Fix possible crashes of QEMU block driver in case of incorrect options
2021-12-03 10:58:54 +03:00
Vitaliy Filippov 9fca01dc62 Add a forgotten return statement 2021-12-03 00:41:49 +03:00
Vitaliy Filippov 0bd3a94efd Use qdict_get_try_int because qdict_get_int may segfault on a missing key 2021-12-03 00:22:17 +03:00
Vitaliy Filippov 5fe3a40416 More fixes for QEMU 2.x :) 2021-12-02 02:25:50 +03:00
Vitaliy Filippov 15957b7d13 Update QEMU 4.2 patch and CentOS 7 QEMU 4.2 spec patch 2021-12-02 01:03:19 +03:00
Vitaliy Filippov 5859f913fc Fix client failover in case of etcd shutdown or crash 2021-12-01 00:33:02 +03:00
Vitaliy Filippov 92362027a8 Build vitastor driver as part of the QEMU package by default
Old behaviour can be restored with cmake var WITH_QEMU=true
2021-11-29 02:05:26 +03:00
Vitaliy Filippov c4aeeda143 Fix index removal in vitastor-cli rm 2021-11-29 02:00:05 +03:00
Vitaliy Filippov 24f0f8278a Fix modify --readwrite 2021-11-29 01:52:21 +03:00
Vitaliy Filippov 95496d0845 Implement renaming images in CLI (vitastor-cli modify --rename) 2021-11-28 22:38:57 +03:00
Vitaliy Filippov 94b1f09ef2 Create snapshots in the same pool by default 2021-11-28 21:50:42 +03:00
Vitaliy Filippov 7a0b5212fe Exit if unable to restart watches
FIXME: It's probably not OK for the client to exit in this case
2021-11-28 01:43:31 +03:00
Vitaliy Filippov ce5b6253ab Make OSDs stick to the last successful etcd address
Previously OSDs were selecting a new random etcd from the cluster
on every request so they were failing randomly when part of etcds was down
2021-11-27 23:48:56 +03:00
Vitaliy Filippov 8398ad0117 Fix #36 - Fix old version data sometimes overriding new version data
Reproduction case:
- v3 = (offset 4kb, length 16kb)
- v2 = (offset 24kb, length 16kb)
- v1 = (offset 16kb, length 16kb)
- At the third step it was inserting 16..24kb instead of 20..24kb
2021-11-27 01:17:45 +03:00
Vitaliy Filippov fea451b4db Prefer local etcd in OSD 2021-11-27 00:36:53 +03:00
Vitaliy Filippov 7b7f20fb89
Merge pull request #34 from mirrorll/master
report pg state failed
2021-11-25 10:26:42 +03:00
Vitaliy Filippov 300d507026 Fix capture of out in alloc_osd 2021-11-25 10:20:01 +03:00
harley 6886171289
report pg state failed
after report pg state failed parse response error
2021-11-25 09:34:34 +08:00
Vitaliy Filippov 43f8ea47a0 Ok, something is not allowed somewhere in C99 2021-11-24 11:28:10 +03:00
Vitaliy Filippov 6e0e172e15 Implement OSD address selection from a specified subnet 2021-11-23 21:59:26 +03:00
Vitaliy Filippov 879fe9b2b4 Add a patch for qemu 6.1 and replace _ with - in qemu options 2021-11-21 16:24:30 +03:00
Vitaliy Filippov 660c3f7b0d Change default RDMA settings to 128x 129K buffers
129K to leave extra space for the header

The problem with 8x 1M buffers is that the following happens with,
for example, 2 OSDs and 4M T1Q1 write:
- Server posts 8 receives
- Client posts 8 sends
- WRs are processed by the RDMA stack, but the OSD doesn't have the time
  to handle them and doesn't refill buffers
- Client posts 1 more send
- RNR retransmission happens and performance drops to zero

Overall it seems that RDMA support should be reworked to use real 'RDMA'
operations i.e. operations writing into remote memory. This has an
additional advantage of avoiding a copy at the receive side of the OSD.
2021-11-21 12:05:52 +03:00
Vitaliy Filippov f0ebfae3b8 Fix vitastor-cli alloc-osd, use vitastor-cli in make-osd.sh 2021-11-21 00:01:03 +03:00
Vitaliy Filippov eb7ad2c114 Fix empty size syntax, use C version of simple-offsets in tests 2021-11-20 23:51:26 +03:00
Vitaliy Filippov cd21ff0b6a Rewrite simple-offsets.js in C/C++ 2021-11-19 02:39:56 +03:00
Vitaliy Filippov d3903f039c Implement alloc-osd (allocate a new OSD number) command 2021-11-19 02:39:37 +03:00
Vitaliy Filippov c5029961ea Oops. Fix vitastor-cli ls 2021-11-16 12:39:41 +03:00
Vitaliy Filippov 920345f7b6 Release 0.6.8
- Build separate packages for OSD, monitor, client, C header, fio and QEMU drivers
  instead of one package which included everything
2021-11-15 00:49:21 +03:00
Vitaliy Filippov 75b47a6298 Generate pkg-config file 2021-11-15 00:49:21 +03:00
Vitaliy Filippov 7eabc364bf Release 0.6.7
- Implement CLI commands for listing, viewing I/O statistics, creating,
  snapshotting, cloning, resizing and modifying images. All these operations
  are covered by 3 commands: ls, create, modify
- Implement an important fix to prior OSD set tracking for PGs. The previous
  version had an issue which could lead to data loss due to an OSD with older
  copy of the data thinking it has the newest copy
- Fix I/O statistics aggregation in the monitor
- Several minor fixes for Cinder driver
- Fix QEMU driver to be compatible with QEMU 2.x > 2.0
- Fix stalls sometimes possible in configurations without immediate_commit due
  to insufficient amount of automatic internal fsync operations
- Add `vita` alias for `vitastor-cli`
2021-11-13 23:23:55 +03:00
Vitaliy Filippov a346f84c69 Allow to show only specific images in listing 2021-11-13 23:23:55 +03:00
Vitaliy Filippov 71a0c1a7b9 Fix list sorting 2021-11-13 23:23:55 +03:00
Vitaliy Filippov 110b39900b Rename the new "set" command to "modify" 2021-11-13 22:39:17 +03:00
Vitaliy Filippov 42479b4590 Fix vitastor-nbd list, add ls alias 2021-11-13 22:39:17 +03:00
Vitaliy Filippov 6e82044e84 Add `vita` symlink 2021-11-13 22:39:17 +03:00
Vitaliy Filippov 2cb3e84882 Implement CLI set (resize, change readonly status) command 2021-11-13 22:39:17 +03:00
Vitaliy Filippov aa436027c8 Report pg/history from OSD on every degraded activation
Required to prevent data loss due to activation of an OSD with older data
when PG OSD set change doesn't occur. I.e. fixes the simplest case:
- Run 2 OSDs with 1 PG
- Start writing into the PG
- Stop OSD 2
- Stop OSD 1
- Start OSD 2

After this change the PG will refuse to start after the last step.
2021-11-13 22:39:17 +03:00
Vitaliy Filippov 577a563b91 Allow to disable colored output 2021-11-11 01:41:58 +03:00
Vitaliy Filippov e4efa2c08a Improve vitastor-cli ls - show I/O statistics, allow to sort & limit output 2021-11-11 01:41:58 +03:00
Vitaliy Filippov d528cd77f1 Fix install_symlink 2021-11-09 16:42:29 +03:00
Vitaliy Filippov 4d43774cbb Use 5s etcd_report_interval by default 2021-11-09 01:27:12 +03:00
Vitaliy Filippov a1488f7217 Fix qemu_driver to build with QEMU 2.x (previously it was only correct for QEMU 2.0) 2021-11-08 23:07:31 +03:00
Vitaliy Filippov 404e07d365 Implement image/snapshot/clone creation and listing by pool 2021-11-07 01:01:07 +03:00
Vitaliy Filippov b3dcee0d43 Also print "bare" inodes with missing config if they occupy space 2021-11-06 14:56:41 +03:00
Vitaliy Filippov 609bd4eb59 Remove naggy RDMA messages when log level is zero 2021-11-06 14:36:23 +03:00
Vitaliy Filippov 8e445ddc9a Begin to implement CLI: implement listing, add help, add create stub 2021-11-06 14:32:19 +03:00
Tân Lê e889ac4209
Fix building QEMU 3.1 2021-11-05 13:45:51 +07:00
Vitaliy Filippov cfe8de9b84 Autosync based on number of unstable ops to prevent journal stalls 2021-10-30 14:26:48 +03:00
Vitaliy Filippov fb2f7a0d3c Release 0.6.6
- New command-line tool: vitastor-cli
- Implement layer (snapshot/clone) merge and delete
- Remove 'bool' from the C header
- Fix a very rare flusher stall
- More diagnostics now printed for slow ops in the log
2021-10-19 02:26:37 +03:00
Vitaliy Filippov 38d85da19a Fix build for older gcc 2021-10-19 02:26:37 +03:00
Vitaliy Filippov 89dcda1fed Remove "bool" from the C header 2021-10-18 01:49:07 +03:00
Vitaliy Filippov 1526e2055e Do not crash with RDMA when receiving garbage, free RDMA buffers when connection is closed 2021-10-15 23:56:22 +03:00
Vitaliy Filippov 74cb3911db Rebase children of the "inverse" child when it is removed, change /index/image/%s keys during metadata ops 2021-09-26 13:41:48 +03:00
Vitaliy Filippov d5efbbb6b9 Rename commands and add CLI help 2021-09-26 13:14:36 +03:00
Vitaliy Filippov 4319091bd3 Implement "inverse merge" optimisation 2021-09-26 12:59:04 +03:00
Vitaliy Filippov 6d307d5391 Ignore "readonly" flag when merging snapshots 2021-09-26 11:32:42 +03:00
Vitaliy Filippov 065dfef683 Rename vitastor-cmd to vitastor-cli 2021-09-26 00:52:05 +03:00
Vitaliy Filippov 4d6b85fe67 Split one big cmd.cpp into multiple files 2021-09-26 00:48:08 +03:00
Vitaliy Filippov 2dd2f29f46 Move get_inode_cfg to cli_tool_t 2021-09-25 23:36:45 +03:00
Vitaliy Filippov fc3a1e076a Fix minor bugs in snapshot removal, check it in tests 2021-09-25 19:30:29 +03:00
Vitaliy Filippov 3a3e168c42 Implement high-level snapshot flatten and remove commands 2021-09-25 01:36:44 +03:00
Vitaliy Filippov 95c55da0ad Implement merge with CAS 2021-08-01 20:06:05 +03:00
Vitaliy Filippov 5cf1157f16 Return real version on CAS failure 2021-08-01 20:05:19 +03:00
Vitaliy Filippov acf637950c Implement layer merge
A new command merges multiple snapshot/clone layers into one of them,
so merged layers can be deleted after this procedure
2021-07-31 00:23:30 +03:00
Vitaliy Filippov a02b02eb04 Use new listing methods in rm_inode 2021-07-20 00:19:34 +03:00
Vitaliy Filippov 7d3d696110 Implement object listing with controllable parallelism in cluster_client 2021-07-20 00:19:34 +03:00
Vitaliy Filippov 712576ca75
Merge pull request #13 from lnsyyj/wip-vitastor-debug
fix BLOCKSTORE_DEBUG, error: ‘dirty_it’ was not declared in this scope
2021-07-18 01:25:05 +03:00
Vitaliy Filippov 28bd94d2c2 Make diagnostics slightly better 2021-07-18 01:24:38 +03:00
Vitaliy Filippov 148ff04aa8 Do not lose flusher queue entries when an "older object rescan" happens in parallel with flushing of an older version of another object 2021-07-18 01:20:54 +03:00
JiangYu e86df4a2a2 fix BLOCKSTORE_DEBUG, error: ‘dirty_it’ was not declared in this scope
Signed-off-by: JiangYu <lnsyyj@hotmail.com>
2021-07-18 00:46:05 +08:00
Vitaliy Filippov e74af9745e Print journal flusher diagnostics on slow ops 2021-07-17 16:13:41 +03:00
Vitaliy Filippov 0e0509e3da Dump op states in slow operation log 2021-07-16 01:58:50 +03:00
Vitaliy Filippov cb282d25e0 Release 0.6.5
- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header
2021-07-10 11:01:21 +03:00
Vitaliy Filippov b52dd6843a Rename qemu_rbd_unescape and qemu_rbd_next_tok to *_vitastor_* 2021-07-03 23:14:44 +03:00
Vitaliy Filippov b66160a7ad Aggregate per-pool statistics in mon 2021-07-03 23:14:44 +03:00
Vitaliy Filippov aad7792d3f Check for loops in parent inode chains 2021-06-20 00:23:03 +03:00
Vitaliy Filippov 6ca8afffe5 Add CAS version parameter to the C wrapper 2021-06-19 01:00:52 +03:00
Vitaliy Filippov 511a89948b Rework qemu_proxy into a C wrapper library with public header 2021-06-19 00:39:11 +03:00
Vitaliy Filippov 3de553ecd7 Add a test for CAS write operation 2021-06-15 00:12:35 +03:00
Vitaliy Filippov 891250d355 Implement CAS writes
From now on, reads will return the server-side object version numbers
and writes and deletes will have an additional "version" parameter
which, if set to a non-zero value, will be atomically compared with
the current version of the object plus 1 and the modification will
fail if it doesn't match.

This feature opens the road to correct online flattening of snapshot
layers and other interesting things.
2021-06-15 00:12:35 +03:00
Vitaliy Filippov f9fe72d40a Release 0.6.4
- Implement a basic Kubernetes CSI driver
- Minor fixes for vitastor-nbd
- Fix build without RDMA broken in 0.6.3
2021-05-16 01:38:01 +03:00
Vitaliy Filippov eaac1fc5d1 Log to stderr in etcd_state_client, too 2021-05-16 01:09:25 +03:00
Vitaliy Filippov 57be1923d3 Daemonize NBD_DO_IT process, correctly cleanup unmounted NBD clients 2021-05-16 01:09:25 +03:00
Vitaliy Filippov c467acc388 Fix /v3 appendage to etcd URLs without /v3 2021-05-15 19:22:24 +03:00
Vitaliy Filippov bf591ba3ee Fix nbd module load check 2021-05-15 19:22:24 +03:00
Vitaliy Filippov 699a0fbbc7 Log to stderr instead of stdout in client 2021-05-15 19:22:24 +03:00
Vitaliy Filippov 6b2dd50f27 Fix build without RDMA 2021-05-08 18:20:43 +03:00
Vitaliy Filippov caf2f3c56f Release 0.6.3
- RDMA support
- Client performance optimisations (4k randread ~120k -> ~180k on 1 core)
- JSON configuration file (/etc/vitastor/vitastor.conf) support
- Bug fixes
2021-05-02 17:47:43 +03:00
Vitaliy Filippov 9174f188b1 Build packages with libibverbs
For CentOS 7 it also requires newer rdma-core as CentOS 7's native version doesn't have
implicit ODP support. The updated version is already uploaded into the vitastor repo.
2021-05-02 17:47:16 +03:00
Vitaliy Filippov d3978c6d0e Do not print RDMA connection messages when log_level=0
By the way, it's 1 by default in the OSD, so these messages will still be there in OSD logs
2021-05-01 00:26:09 +03:00
Vitaliy Filippov 4a7365660d Do not wait for down OSDs during sync
Fixes a hang introduced in 0.5.11 in the non-immediate_commit mode
2021-05-01 00:26:07 +03:00
Vitaliy Filippov 818ae5d61d Some config parsing fixes 2021-05-01 00:20:01 +03:00
Vitaliy Filippov f6f35f4127 Pass options correctly to not override /etc/vitastor/vitastor.conf 2021-04-30 01:17:44 +03:00
Vitaliy Filippov 72aa2fd819 Make OSD and client read common configuration from /etc/vitastor/vitastor.conf 2021-04-30 01:11:27 +03:00
Vitaliy Filippov 5010b0dd75 Use json11 instead of blockstore_config_t 2021-04-30 00:52:46 +03:00
Vitaliy Filippov 483c5ab380 Negotiate max_msg instead of max_sge, make buffer settings more conservative :-) 2021-04-29 11:10:35 +03:00
Vitaliy Filippov 6a6fd6544d Add RDMA options to the QEMU driver 2021-04-29 11:02:49 +03:00
Vitaliy Filippov 971aa4ae4f Implement RDMA receive with memory copying (send remains zero-copy)
This is the simplest and, as usual, the best implementation :)

100% zero-copy implementation is also possible (see rdma-zerocopy branch),
but it requires to create A LOT of queues (~128 per client) to use QPN as a 'tag'
because of the lack of receive tags and the server may simply run out of queues.
Hardware limit is 262144 on Mellanox ConnectX-4 which amounts to only 2048
'connections' per host. And even with that amount of queues it's still less optimal
than the non-zerocopy one.

In fact, newest hardware like Mellanox ConnectX-5 does have Tag Matching
support, but it's still unsuitable for us because it doesn't support scatter/gather
(tm_caps.max_sge=1).
2021-04-29 02:34:45 +03:00
Vitaliy Filippov 9e6cbc6ebc Negotiate max_sge between RDMA client & server 2021-04-29 02:15:20 +03:00
Vitaliy Filippov ce777319c3 WIP RDMA support
Basic naive implementation works, but it's highly non-optimal as
RNR retransmissions occur all the time. RDMA expects the receiver
to always have place for incoming WRs...
2021-04-29 02:03:54 +03:00
Vitaliy Filippov f8ff39b0ab Rework continue_ops() to remove a CPU hot spot
This rework increases fio -rw=randread -iodepth=128 result from ~120k to ~180k iops :)
2021-04-29 01:50:13 +03:00
Vitaliy Filippov d749159585 Linked list experiment
Rework client operation queue from a vector to a linked list.
This is required to rework continue_ops() as its current implementation
consumes ~25% of client process CPU.
2021-04-29 01:47:33 +03:00
Vitaliy Filippov 9703773a63 Fix has_flushes setting 2021-04-28 23:40:44 +03:00
Vitaliy Filippov 5d8d486f7c Add SOVERSION 2021-04-20 01:01:32 +03:00
Vitaliy Filippov 2b546cdd55 Link vitastor_blk with vitastor_common for timerfd_manager_t
Not really required to operate, but fixes a verify-elf error
2021-04-20 00:51:53 +03:00
Vitaliy Filippov bd7b177707 Report sensitive configuration values instead of the configuration source 2021-04-17 23:11:16 +03:00
Vitaliy Filippov 82e6aff17b Support mapping NBD by the image name 2021-04-17 17:39:55 +03:00
Vitaliy Filippov 57e2c503f7 Rename osd_t::c_cli to msgr 2021-04-17 16:32:09 +03:00
Vitaliy Filippov 715bc8d53d Release 0.6.2
- Fix a possible crash during SYNC when journal fsyncs are enabled
- Fix a memory leak in the chained read implementation
2021-04-15 23:40:06 +03:00
Vitaliy Filippov 0af077701c Fix a possible crash during SYNC when journal fsyncs are enabled 2021-04-15 02:01:50 +03:00
Vitaliy Filippov cac976ce25 Fix a memory leak in the chained read implementation 2021-04-15 01:42:18 +03:00
Vitaliy Filippov acf0646542 Build common sources once 2021-04-15 01:13:34 +03:00
Vitaliy Filippov ede1c1d667 Release 0.6.1
A bugfix for the new "chained read from snapshot" feature
2021-04-14 22:32:23 +03:00
Vitaliy Filippov 38bd51c97f Remove aio_context assertion, it seems it is unneeded 2021-04-14 22:32:15 +03:00
Vitaliy Filippov 966fb763ca Oooops, fix chained reads 2021-04-13 16:19:21 +03:00
Vitaliy Filippov 0b41ffc08d Release 0.6.0
Warning: upgrading from 0.5.x is currently not supported!
Please create an issue if you really need upgrade capability.

New features:
- Snapshots and Copy-on-Write clones
- Inode (image) names
- Inode I/O and space statistics
- Write throttling for smoothing random write workloads in SSD+HDD configurations
2021-04-11 00:49:18 +03:00
Vitaliy Filippov 64eeb79051 Prevent 0.6.x OSDs from talking to 0.5.x
The new protocol is almost compatible - it has bitmaps, but also it has
a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients
compatible, but for now I just assume nobody needs it.

If I'm wrong and anybody requests to upgrade their production 0.5.x system
to 0.6.x I'll fix it.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov 2a02f3c4c7 Add metadata superblock and check it on start
Refuse to start if the superblock is missing or bad version;
zero out the metadata area when initializing superblock.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov f684d9101a Refuse to start with old journal version 2021-04-10 17:44:12 +03:00
Vitaliy Filippov a1f2f19489 Do not increment inode statistics if the object already exists 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 82c1a7ec67 Fix statistics reporting, split inode number into pool & inode 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2ab423d4ef Implement journaled write throttling for the SSD+HDD case 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 4694811eab Add microsecond accuracy to set_timer 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6b988de17d Remove timerfd_interval 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 37efdc2a83 Fix bitmap_set for replicated pools 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 591cad09c9 Fix bitmaps for objects larger than 128K 2021-04-10 17:44:12 +03:00
Vitaliy Filippov b907ad50aa Oops, forgot to add external bitmaps to blockstore in some places 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 5f5b6ef150 Enable chained reads in the client 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 38a3df4a0e Implement chained (optimized) read in the primary OSD code 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6950b8e3a0 Watch inode metadata revisions 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 0cea3576fb Add "read bitmaps" operation to secondary OSD protocol 2021-04-10 17:44:12 +03:00
Vitaliy Filippov f01eea07d3 Add simplified interface to read blockstore bitmaps synchronously 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2c2f08aca2 Shorten some structure names 2021-04-10 17:44:12 +03:00
Vitaliy Filippov d6524670e1 Introduce data distribution locality 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 7aeb2cbac7 Capture all by value in qemu_proxy 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2612d3198a Introduce image names and metadata storage in etcd
Each inode has: image name, parent inode number & pool, size and readonly flag

Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-04-10 17:44:12 +03:00
Vitaliy Filippov ab39ce2bbb Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling 2021-04-10 17:44:12 +03:00
Vitaliy Filippov d0c2e31312 Add a test for snapshots, fix bugs. Now the test passes 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 9038d42327 Fix several snapshot I/O bugs 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 691f066055 Actual snapshot support (untested) 2021-04-10 17:44:12 +03:00
Vitaliy Filippov ffe1cd4c79 Report inode I/O statistics, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 4ae1b84c67 Report inode space usage statistics to etcd, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
Vitaliy Filippov c35963967f Add inode space usage statistics tracking to blockstore 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 0aa2dd2890 Send bitmaps with primary-reads, actually read bitmaps for READ ops 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6bf88883ac Allocate bitmaps along with stripes to avoid memory fragmentation 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 004f265393 Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 860ac24762 Add "external" bitmap support to the secondary OSD protocol 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6107a4d07b Add "external" bitmap support to blockstore 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 95c29b9dc3 Add "external" bitmap support to osd_rmw 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6909807068 Allow to start the OSD just to flush the journal completely 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 18c72f4835 Correct reenterability fix (now verified with a test)
It's rather funny but 0.5.12 has to be re-published again
2021-04-09 12:10:16 +03:00
Vitaliy Filippov 40b7c21fb1 Followup to 307c1731c1 - fix mark_stable 2021-04-08 15:47:18 +03:00
Vitaliy Filippov efb3678606 Fix qemu-img broken in 0.5.11
Caused by the lack of reenterability of the main cluster_client function
2021-04-08 14:59:20 +03:00
Vitaliy Filippov 8d87e32175 Fix msgr_op.h includes 2021-04-08 01:18:46 +03:00
Vitaliy Filippov b0b2e7df3c Fix use-after-free in keepalive_timer and rework stop_client()
The bug reproduced if fio was temporarily stopped with SIGSTOP
during write test and then resumed after 10 seconds. In this case
"pings" were failed for all clients and fio process crashed with
'use-after-free' in keepalive_timer. It happened because it called
stop_client while having a live iterator to the map.
2021-04-07 11:06:31 +03:00
Vitaliy Filippov 97efb9e299 Do not crash on PG re-peering events when operations are in progress 2021-04-07 11:06:31 +03:00
Vitaliy Filippov f6d705383a Fix client connection recovery bugs, add dirty_ops limit 2021-04-07 11:06:31 +03:00
Vitaliy Filippov 68567c0e1f Fix messenger possibly trying to connect to the same OSD twice 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 04b00003e9 Log ping failures 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 307c1731c1 Forget all dirty_entries before stable big_write or delete during initialisation
This fixes a 'double_alloc' assertion in the following case:
- big_write object #1 v1 to block #100
- big_write object #1 v2 to block #101
- big_write object #2 v1 to block #100
2021-04-07 01:30:38 +03:00
Vitaliy Filippov a48e2bbf18 Fix write replay ordering when immediate_commit != all
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage

Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
2021-04-03 14:51:52 +03:00
Vitaliy Filippov 688821665a Remove stoull_full() from etcd_state_client.cpp 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 3e162d95a0 Remove http_client.h include from etcd_state_client.h 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 829381b335 Extract some definitions to msgr_op.{cpp,h} 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 54f2353f24 Use bitmap granularity for alignment checks 2021-04-03 14:36:04 +03:00
Vitaliy Filippov e47f6fba60 Remove cluster_client_t::stop() 2021-04-03 14:35:42 +03:00
Vitaliy Filippov 883bf84a16 Fix build 2021-04-03 01:47:15 +03:00
Vitaliy Filippov 52097c4856 Stop flushing when less than min_flusher_count operations are available (unless a trim is forced) 2021-04-03 00:53:28 +03:00
Vitaliy Filippov e1355cbc74 Report failed operation name in cluster_client 2021-04-03 00:53:28 +03:00
Vitaliy Filippov 8f8b90be7a Add min_flusher_count configuration 2021-04-03 00:53:28 +03:00
Vitaliy Filippov ad9f619370 Skip double allocs when reading journal 2021-04-03 00:53:28 +03:00
Vitaliy Filippov f4769ba7c7 Collapse create+delete journal entry pairs if they're already flushed
Old journal replay mechanism could lead to a double allocation of the same
block and a "Fatal error: tried to overwrite non-zero metadata entry"
2021-04-03 00:53:28 +03:00
Vitaliy Filippov 843b7052d2 Add an assertion when clearing deleted metadata entries, add debug details when freeing blocks 2021-04-03 00:53:28 +03:00
Vitaliy Filippov 4095bcc558 Do not ignore object deletion journal entries when they are preceded by a big write 2021-03-25 11:00:10 +03:00
Vitaliy Filippov 564d64e271 Add some details for debug prints 2021-03-25 11:00:10 +03:00
Vitaliy Filippov cf54741c95 Followup to 05db1308aa
Don't do anything with the object state after errors because
it's freed by PG re-peer in this case
2021-03-25 11:00:10 +03:00
Vitaliy Filippov 18a5fafa2a Fix rollback 2021-03-25 02:41:58 +03:00
Vitaliy Filippov 06f4978085 Fix fsync check in blockstore_flush (data fsyncs were disabled instead of journal fsyncs) 2021-03-25 02:41:58 +03:00
Vitaliy Filippov 7ebf1588c5 Check for immediate_commit==small in the OSD code 2021-03-25 02:41:58 +03:00
Vitaliy Filippov b0ad1e1e6d Remember writes as "unsynced" only after completing them
Previously BS_OP_SYNC could take unfinished writes and add them into the journal before
they were actually completed. This was leading to crashes with the message
"BUG: Unexpected dirty_entry 2000000000001:9f2a0000 v3 unstable state during flush: 338"
2021-03-25 02:41:58 +03:00
Vitaliy Filippov 0949f08407 Extract osd_primary write and sync code into separate files 2021-03-24 14:20:56 +03:00
Vitaliy Filippov 04a1f18fa5 Assign .req as a whole to always zero out the remaining part
Also clear .reply before processing the operation
2021-03-24 14:20:56 +03:00
Vitaliy Filippov cf9a641d66 Skip disconnected OSDs during sync 2021-03-24 14:20:56 +03:00
Vitaliy Filippov 05db1308aa Fix two potential read/write ordering problems (even though not yet seen in tests)
- Write operations could be 'stabilized' and previous versions could be
  purged from OSDs before the removal of version_override and following
  reads could potentially hit different version in EC pools
- Object was marked clean after completing the delete during recovery, so
  reads could in theory hit a deleted version and return nothing
2021-03-24 14:20:56 +03:00
Vitaliy Filippov 98b54ca948 Don't try to "recover" misplaced objects if it would make them degraded 2021-03-21 01:37:23 +03:00
Vitaliy Filippov 23225c5e62 Do not run ping on clients that are not yet connected 2021-03-21 01:37:23 +03:00
Vitaliy Filippov 435045751d Delete objects only after a SYNC during rebalance in the non-immediate_commit mode
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects
2021-03-16 12:48:26 +03:00
Vitaliy Filippov c5fb1d5987 Do not duplicate blockstore operations when io_uring fills up
This bug was leading to OSDs dying with "Assertion `fulfilled == read_op->len' failed"
when testing fio -rw=randread -numjobs=8 -iodepth=128
2021-03-16 12:48:26 +03:00
Vitaliy Filippov 9ac7e75178 Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio 2021-03-16 12:48:26 +03:00
Vitaliy Filippov 88671cf745 Fix a bug causing all flushers to wait for an fsync without actually trying to do it
This happened because flusher_count became dynamic and fsync_batch() was comparing the number
of flushers currently ready to do an fsync with the maximum number of flushers. Also the number
wasn't rechecked on every loop which was also incorrect.

Now the interrupted_rebalance test passes even without IMMEDIATE_COMMIT=1.
2021-03-13 17:27:29 +03:00
Vitaliy Filippov ceb9c28de7 Set default log_level before passing config to etcd_state_client 2021-03-13 17:19:45 +03:00
Vitaliy Filippov 299d7d7c95 Use common macro for get_sqe 2021-03-13 17:19:45 +03:00
Vitaliy Filippov d1526b415f Correctly resume writes when OSD is full to return an error 2021-03-13 17:19:45 +03:00
Vitaliy Filippov f49fd53d55 Fix a bug where allocator was unable to allocate up to last (n%64) blocks, add tests for it 2021-03-13 02:19:02 +03:00
Vitaliy Filippov b44f49aab2 Ignore zero OSDs in history osd_sets 2021-03-12 12:40:15 +03:00
Vitaliy Filippov af5155fcd9 Implement "no_recovery" and "no_rebalance" flags 2021-03-11 00:36:31 +03:00
Vitaliy Filippov c4ba24c305 Do not print ping op latency 2021-03-10 02:01:44 +03:00
Vitaliy Filippov bd178ac20f Fix history osd_set check - local OSD is always available! 2021-03-09 02:18:18 +03:00
Vitaliy Filippov ad577c4aac Add PING operation and timeouts to detect OSD failures when a host goes down 2021-03-09 02:15:38 +03:00
Vitaliy Filippov e91ff2a9ec Only forget offline PGs if their state is not changed during reporting 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 086667f568 Do not check PG state key ownership if it doesn't exist yet
This fixes the bug where OSDs were sometimes trying to report updated PG states
infinitely without luck when PGs transitioned from 'starting' to 'peering' too fast
2021-03-08 17:04:10 +03:00
Vitaliy Filippov 1be94da437 Check & remove extra chunks for degraded / incomplete objects, too 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 80e12358a2 Use pg_data_size instead of pg_minsize for object state calculation 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 36c935ace6 Use std::vector for the blockstore submission queue 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 0d8b5e2ef9 Remove unused enqueue_op_first() 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 98f1e2c277 Rework write/sync ordering
Make syncs wait for all previous writes because it's the only way
to make sure that OSDs do not receive incomplete writes in LIST results
during peering when some writes are still in progress.

Also simplify blockstore submission queue logic.
2021-03-08 17:04:10 +03:00
Vitaliy Filippov 21e7686037 Fix possible "assertion failed: pg.inflight >= 0" error during PG stop 2021-03-08 17:04:10 +03:00
Vitaliy Filippov ab21a1908b Check for the dirty PG flag when trying to continue to stop it after sync 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 30d1ccd43e Fix an infinite loop when discarding list operations during stop_pg() 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 8bdd6d8d78 Reset PG state when stopping them 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 09b3e4e789 Fix OSDs being unable to stop PGs that are 'peering', not 'active'
This was sometimes leading to incorrect misplaced and degraded object count statistics
2021-03-08 17:04:10 +03:00
Vitaliy Filippov bc742ccf8c Fix a small memory leak in etcd_state_client 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 314b20437b Do not break subsequent small writes badly when a big write is canceled 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 29d8ac8b1b Do not report statistics for the empty operation 2021-03-01 16:20:57 +03:00
Vitaliy Filippov 6155b23a7e Replace pgs[id] with pgs.at(id) to prevent accidental auto-vivification 2021-02-28 19:36:59 +03:00
Vitaliy Filippov 46e79f3306 Wait for PGs to become clean before stopping them 2021-02-28 19:36:59 +03:00
Vitaliy Filippov 41fd14e024 Fix deletes not increasing write_iodepth 2021-02-28 19:36:59 +03:00
Vitaliy Filippov 2d73b19a6c Fix online PG count change bugs 2021-02-25 23:59:33 +03:00
Vitaliy Filippov c974cb539c Make flusher_count adaptive and limit write iodepth 2021-02-25 23:59:33 +03:00