Commit Graph

23 Commits (nbd-vmsplice)

Author SHA1 Message Date
Vitaliy Filippov 660c3f7b0d Change default RDMA settings to 128x 129K buffers
129K to leave extra space for the header

The problem with 8x 1M buffers is that the following happens with,
for example, 2 OSDs and a 4M T1Q1 write:
- Server posts 8 receives
- Client posts 8 sends
- WRs are processed by the RDMA stack, but the OSD doesn't have the time
  to handle them and doesn't refill buffers
- Client posts 1 more send
- RNR retransmission happens and performance drops to zero

Overall it seems that RDMA support should be reworked to use real 'RDMA'
operations, i.e. operations writing into remote memory. This has the
additional advantage of avoiding a copy at the receive side of the OSD.
2 years ago
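
The failure sequence in the commit message above can be modeled with a simple buffer counter. This is only an illustrative sketch under stated assumptions: `recv_queue_model` and `count_rnr` are hypothetical names that belong neither to Vitastor nor to the verbs API, and the model ignores message sizes and retransmission timers. It only shows that a send arriving when no receive buffer is posted is the point where an RNR condition would occur.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a receive queue: every incoming send consumes
// one posted receive buffer. Not Vitastor code and not a verbs API.
struct recv_queue_model
{
    size_t posted;          // receive buffers currently posted
    size_t rnr_events = 0;  // sends that found no posted buffer

    explicit recv_queue_model(size_t initial): posted(initial) {}

    // Returns true if the send found a buffer, false if it would hit RNR
    bool on_send()
    {
        if (posted == 0)
        {
            rnr_events++;
            return false;
        }
        posted--;
        return true;
    }

    // The receiver handled some messages and reposted n buffers
    void refill(size_t n)
    {
        posted += n;
    }
};

// Count RNR conditions for a burst of sends. If receiver_refills is
// false, the receiver is assumed to be too busy to repost buffers.
size_t count_rnr(size_t buffers, size_t sends, bool receiver_refills)
{
    recv_queue_model q(buffers);
    for (size_t i = 0; i < sends; i++)
    {
        q.on_send();
        if (receiver_refills)
            q.refill(1);
    }
    return q.rnr_events;
}
```

With 8 buffers and a receiver that does not refill, the ninth send is the first one to miss a buffer. A larger pool of smaller buffers, as in this commit, gives the receiver more slack before that happens.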
Vitaliy Filippov 609bd4eb59 Remove naggy RDMA messages when log level is zero 2 years ago
Vitaliy Filippov fc3a1e076a Fix minor bugs in snapshot removal, check it in tests 2 years ago
Vitaliy Filippov 891250d355 Implement CAS writes
From now on, reads will return the server-side object version numbers
and writes and deletes will have an additional "version" parameter
which, if set to a non-zero value, will be atomically compared with
the current version of the object plus 1 and the modification will
fail if it doesn't match.

This feature opens the road to correct online flattening of snapshot
layers and other interesting things.
2 years ago
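
The version check described in the commit message above can be sketched as follows. This is a minimal single-threaded illustration, not the Vitastor implementation: `versioned_store`, `entry`, `write` and `read_version` are hypothetical names, and treating a missing object as having version 0 is an assumption. The only rule taken from the message is that a non-zero "version" must equal the current version plus 1.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical object record: server-side version plus payload
struct entry
{
    uint64_t version;
    std::string data;
};

// Hypothetical in-memory store used only to illustrate the CAS rule
struct versioned_store
{
    std::map<uint64_t, entry> objects;

    // Assumption: an object that does not exist yet has version 0
    uint64_t read_version(uint64_t oid) const
    {
        auto it = objects.find(oid);
        return it == objects.end() ? 0 : it->second.version;
    }

    // version == 0: unconditional write.
    // version != 0: succeeds only if version == current version + 1.
    bool write(uint64_t oid, const std::string & data, uint64_t version)
    {
        uint64_t current = read_version(oid);
        if (version != 0 && version != current + 1)
            return false;
        entry & e = objects[oid];
        e.version = current + 1;
        e.data = data;
        return true;
    }
};
```

A client would read an object, remember the returned version `v`, and then write with `v + 1`. If another writer got there first, the check fails and the client can re-read and retry.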
Vitaliy Filippov 699a0fbbc7 Log to stderr instead of stdout in client 2 years ago
Vitaliy Filippov 6b2dd50f27 Fix build without RDMA 2 years ago
Vitaliy Filippov d3978c6d0e Do not print RDMA connection messages when log_level=0
By the way, it's 1 by default in the OSD, so these messages will still be there in OSD logs
2 years ago
Vitaliy Filippov 818ae5d61d Some config parsing fixes 2 years ago
Vitaliy Filippov 72aa2fd819 Make OSD and client read common configuration from /etc/vitastor/vitastor.conf 2 years ago
Vitaliy Filippov 483c5ab380 Negotiate max_msg instead of max_sge, make buffer settings more conservative :-) 2 years ago
Vitaliy Filippov 971aa4ae4f Implement RDMA receive with memory copying (send remains zero-copy)
This is the simplest and, as usual, the best implementation :)

A 100% zero-copy implementation is also possible (see rdma-zerocopy branch),
but it requires creating A LOT of queues (~128 per client) to use QPN as a 'tag'
because of the lack of receive tags, and the server may simply run out of queues.
The hardware limit is 262144 on Mellanox ConnectX-4, which amounts to only 2048
'connections' per host. And even with that number of queues it's still less optimal
than the non-zerocopy one.

In fact, newer hardware like Mellanox ConnectX-5 does have Tag Matching
support, but it's still unsuitable for us because it doesn't support scatter/gather.
2 years ago
Vitaliy Filippov 9e6cbc6ebc Negotiate max_sge between RDMA client & server 2 years ago
Vitaliy Filippov ce777319c3 WIP RDMA support
A basic naive implementation works, but it's highly non-optimal as
RNR retransmissions occur all the time. RDMA expects the receiver
to always have room for incoming WRs...
2 years ago
Vitaliy Filippov 64eeb79051 Prevent 0.6.x OSDs from talking to 0.5.x
The new protocol is almost compatible - it has bitmaps, but also it has
a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients
compatible, but for now I just assume nobody needs it.

If I'm wrong and anybody requests to upgrade their production 0.5.x system
to 0.6.x I'll fix it.
2 years ago
Vitaliy Filippov b0b2e7df3c Fix use-after-free in keepalive_timer and rework stop_client()
The bug was reproducible if fio was temporarily stopped with SIGSTOP
during a write test and then resumed after 10 seconds. In this case
"pings" failed for all clients and the fio process crashed with a
'use-after-free' in keepalive_timer. It happened because the timer called
stop_client while holding a live iterator to the map.
2 years ago
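
One common way to avoid this class of bug is to avoid removing map entries while a loop still holds an iterator into the same map. The sketch below shows that pattern in isolation. It is not the actual fix from this commit: `client_t`, its fields and `stop_failed_clients` are hypothetical names, and a plain `erase` stands in for whatever stop_client really does.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical client record, not the real Vitastor structure
struct client_t
{
    int fd;
    bool ping_failed;
};

// Pass 1 only reads the map and collects keys of failed clients.
// Pass 2 removes them by key, when no iterator into the map is alive.
std::vector<int> stop_failed_clients(std::map<int, client_t> & clients)
{
    std::vector<int> failed;
    for (const auto & kv: clients)
    {
        if (kv.second.ping_failed)
            failed.push_back(kv.first);
    }
    for (int fd: failed)
    {
        // Stand-in for stop_client(fd): erasing here is safe because
        // the loop iterates over the vector, not over the map
        clients.erase(fd);
    }
    return failed;
}
```

Collecting keys first costs one extra vector, but it stays correct even if stopping one client has side effects that remove other entries from the map.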
Vitaliy Filippov f6d705383a Fix client connection recovery bugs, add dirty_ops limit 2 years ago
Vitaliy Filippov 68567c0e1f Fix messenger possibly trying to connect to the same OSD twice 2 years ago
Vitaliy Filippov 04b00003e9 Log ping failures 2 years ago
Vitaliy Filippov a48e2bbf18 Fix write replay ordering when immediate_commit != all
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage

Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
2 years ago
Vitaliy Filippov 829381b335 Extract some definitions to msgr_op.{cpp,h} 2 years ago
Vitaliy Filippov 23225c5e62 Do not run ping on clients that are not yet connected 2 years ago
Vitaliy Filippov ad577c4aac Add PING operation and timeouts to detect OSD failures when a host goes down 2 years ago
Vitaliy Filippov bf9a175efc Move C/C++ sources to src subdirectory 2 years ago