[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Network Protocol Parameters

-----

[Read in Russian](network.ru.md)

# Network Protocol Parameters

These parameters apply to clients and OSDs and affect network connection logic
between clients, OSDs and etcd.

- [tcp_header_buffer_size](#tcp_header_buffer_size)
- [use_sync_send_recv](#use_sync_send_recv)
- [use_rdma](#use_rdma)
- [rdma_device](#rdma_device)
- [rdma_port_num](#rdma_port_num)
- [rdma_gid_index](#rdma_gid_index)
- [rdma_mtu](#rdma_mtu)
- [rdma_max_sge](#rdma_max_sge)
- [rdma_max_msg](#rdma_max_msg)
- [rdma_max_recv](#rdma_max_recv)
- [rdma_max_send](#rdma_max_send)
- [rdma_odp](#rdma_odp)
- [peer_connect_interval](#peer_connect_interval)
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
- [etcd_keepalive_timeout](#etcd_keepalive_timeout)
- [etcd_ws_keepalive_interval](#etcd_ws_keepalive_interval)

## tcp_header_buffer_size

- Type: integer
- Default: 65536

Size of the buffer used to read data with an additional copy. Vitastor
packet headers are 128 bytes and the payload is always at least 4 KB, so it
is usually beneficial to try to read multiple packets at once, even though
this requires copying the data an additional time. The rest of each packet
is received without an additional copy. You can experiment with this
parameter to see how it affects random iops and linear bandwidth.
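
To get a feel for the sizes involved, the sketch below works out how many
headers, or complete minimum-size packets, fit into one buffer. The 128-byte
header and the 4 KB minimum payload come from the paragraph above; the function
names are hypothetical, and reading "4 KB" as 4096 bytes is an assumption.

```python
HEADER_SIZE = 128        # bytes, packet header size stated above
MIN_PAYLOAD_SIZE = 4096  # bytes, assuming "4 KB" means 4096 bytes


def headers_per_buffer(buffer_size: int = 65536) -> int:
    """Upper bound on the number of bare headers that fit in the buffer."""
    return buffer_size // HEADER_SIZE


def full_packets_per_buffer(buffer_size: int = 65536) -> int:
    """Number of complete minimum-size packets (header plus payload) that fit."""
    return buffer_size // (HEADER_SIZE + MIN_PAYLOAD_SIZE)


print(headers_per_buffer(), full_packets_per_buffer())
```

With the default of 65536 bytes this gives 512 headers or 15 complete
minimum-size packets per read, which is only a rough guide when experimenting
with other values.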

## use_sync_send_recv

- Type: boolean
- Default: false

If true, synchronous send/recv syscalls are used instead of io_uring for
socket communication. This has no benefit for OSDs because they require
io_uring anyway, but it may be required for clients running old kernel
versions.

## use_rdma

- Type: boolean
- Default: true

Try to use RDMA for communication if it's available. Disable if you don't
want Vitastor to use RDMA. TCP-only clients can also talk to an RDMA-enabled
cluster, so disabling RDMA may be needed if clients have RDMA devices that
are not connected to the cluster.

## rdma_device

- Type: string

RDMA device name to use for Vitastor OSD communications (for example,
"rocep5s0f0"). Vitastor now supports all adapters, including ones without
ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.

Versions up to Vitastor 1.2.0 required ODP, which is only present in
Mellanox ConnectX-4 and newer. See also [rdma_odp](#rdma_odp).

Run `ibv_devinfo -v` as root to list available RDMA devices and their
features.

Remember that you also have to configure your network switches if you use
RoCE/RoCEv2, otherwise you may experience unstable performance. Refer to
the manual of your network vendor for details about setting up the switch
for RoCEv2 correctly. Usually it means setting up Lossless Ethernet with
PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).

## rdma_port_num

- Type: integer
- Default: 1

RDMA device port number to use. Only relevant for devices that have more
than one port. See `phys_port_cnt` in `ibv_devinfo -v` output to determine
how many ports your device has.

## rdma_gid_index

- Type: integer
- Default: 0

Global identifier (GID) index of the RDMA device to use. Different GID
indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
Search for "GID" in `ibv_devinfo -v` output to determine which GID index
you need.

**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
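
The RDMA selection parameters described so far could be combined as in the
sketch below. It assumes the parameters are set as keys in a JSON configuration
file, which this page does not state. The values are illustrative only: the
device name is the example from [rdma_device](#rdma_device), and the GID index
is the "usual" RoCEv2 over IPv4 value, which should be checked against
`ibv_devinfo -v` output on your own hosts.

```python
import json

# Illustrative values only; take the real ones from `ibv_devinfo -v`.
rdma_settings = {
    "use_rdma": True,
    "rdma_device": "rocep5s0f0",  # example device name from the text above
    "rdma_port_num": 1,           # default port number
    "rdma_gid_index": 3,          # "usually" RoCEv2 over IPv4, per the note above
}

# Serialize to a JSON fragment (assumed configuration format).
config_fragment = json.dumps(rdma_settings, indent=2)
print(config_fragment)
```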

## rdma_mtu

- Type: integer
- Default: 4096

RDMA Path MTU to use. Must be 1024, 2048 or 4096. There is usually no
reason to change it from the default of 4096.

## rdma_max_sge

- Type: integer
- Default: 128

Maximum number of scatter/gather entries to use for RDMA. OSDs negotiate
the actual value when establishing a connection anyway, so it's usually not
required to change this parameter.

## rdma_max_msg

- Type: integer
- Default: 132096

Maximum size of a single RDMA send or receive operation in bytes.

## rdma_max_recv

- Type: integer
- Default: 16

Maximum number of RDMA receive buffers per connection (RDMA requires
preallocated buffers to receive data). Each buffer is `rdma_max_msg` bytes
in size. So this setting directly affects memory usage: a single Vitastor
RDMA client uses `rdma_max_recv * rdma_max_msg * OSD_COUNT` bytes of memory.
Default is roughly 2 MB * number of OSDs.
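
The memory formula above can be checked with a few lines of arithmetic. This is
only a sketch of the stated formula; the function name is hypothetical and the
defaults are the ones listed on this page.

```python
def rdma_recv_buffer_bytes(osd_count: int,
                           rdma_max_recv: int = 16,
                           rdma_max_msg: int = 132096) -> int:
    """Receive buffer memory of one RDMA client: max_recv * max_msg * OSD count."""
    return rdma_max_recv * rdma_max_msg * osd_count


per_osd = rdma_recv_buffer_bytes(1)
print(per_osd, round(per_osd / 2**20, 2))
```

With the defaults this is 2113536 bytes per OSD, which is where the "roughly
2 MB" figure comes from. A client connected to 100 OSDs would preallocate about
200 MB.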

## rdma_max_send

- Type: integer
- Default: 8

Maximum number of outstanding RDMA send operations per connection. Should be
less than `rdma_max_recv` so the receiving side doesn't run out of buffers.
It doesn't affect memory usage because no additional memory is allocated for
send operations.
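
The constraints stated for the RDMA tuning parameters can be collected into a
small sanity check. This is a hypothetical helper, not part of Vitastor. It only
encodes two rules from this page: `rdma_mtu` must be 1024, 2048 or 4096, and
`rdma_max_send` should be less than `rdma_max_recv`.

```python
ALLOWED_RDMA_MTU = (1024, 2048, 4096)


def check_rdma_settings(rdma_mtu: int = 4096,
                        rdma_max_send: int = 8,
                        rdma_max_recv: int = 16) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if rdma_mtu not in ALLOWED_RDMA_MTU:
        problems.append(f"rdma_mtu must be one of {ALLOWED_RDMA_MTU}, got {rdma_mtu}")
    if rdma_max_send >= rdma_max_recv:
        problems.append("rdma_max_send should be less than rdma_max_recv")
    return problems


print(check_rdma_settings())                   # defaults pass
print(check_rdma_settings(rdma_max_send=16))   # send == recv is flagged
```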

## rdma_odp

- Type: boolean
- Default: false

Use RDMA with On-Demand Paging. ODP is currently only available on Mellanox
ConnectX-4 and newer adapters. ODP makes it possible to skip explicit memory
registration for the RDMA adapter, which in turn makes it possible to avoid
copying memory when sending. One would think this should improve performance,
but **in reality** RDMA performance with ODP is **drastically** worse. An
example 3-node cluster with 8 NVMe drives in each node and a 2x25 Gbit/s
ConnectX-6 RDMA network pushes 3950000 read iops without ODP, but only 239000
iops with ODP.

This happens because the Mellanox ODP implementation seems to be based on
message retransmissions when the adapter doesn't know about the buffer yet.
It likely uses standard "RNR retransmissions" (RNR = receiver not ready),
which are generally slow in RDMA/RoCE networks. Here's a presentation about
it from the ISPASS-2021 conference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf

ODP support is retained in the code just in case a good ODP implementation
appears one day.
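
To put the two benchmark figures quoted above into proportion, the following
lines compute the slowdown factor. Only the two iops numbers come from this
page; the variable names are made up for the example.

```python
iops_without_odp = 3_950_000  # read iops without ODP, from the example above
iops_with_odp = 239_000       # read iops with ODP, from the example above

slowdown = iops_without_odp / iops_with_odp
remaining_share = iops_with_odp / iops_without_odp

print(f"{slowdown:.1f}x slower, {remaining_share:.1%} of the non-ODP result")
```

In that example cluster, enabling ODP leaves roughly 6% of the read performance,
a slowdown of about 16.5 times.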

## peer_connect_interval

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

Interval before attempting to reconnect to an unavailable OSD.

## peer_connect_timeout

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

Timeout for OSD connection attempts.

## osd_idle_timeout

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

OSD connection inactivity time after which clients and other OSDs send
keepalive requests to check the state of the connection.

## osd_ping_timeout

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond
within this time, the connection to it is dropped and a reconnection attempt
is scheduled.
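
Taken together, `osd_idle_timeout` and `osd_ping_timeout` suggest a rough upper
bound on how long a silently dead connection can go unnoticed. The sketch below
assumes the keepalive request is sent as soon as the idle timeout expires and
ignores timer granularity, so treat the result as an estimate rather than a
guarantee. The function name is hypothetical.

```python
def dead_peer_detection_seconds(osd_idle_timeout: int = 5,
                                osd_ping_timeout: int = 5) -> int:
    """Idle period before the ping plus the wait for the ping response."""
    return osd_idle_timeout + osd_ping_timeout


print(dead_peer_detection_seconds())
```

With the defaults the estimate is about 10 seconds, after which the reconnection
logic described under [peer_connect_interval](#peer_connect_interval) applies.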

## max_etcd_attempts

- Type: integer
- Default: 5
- Can be changed online: yes

Maximum number of attempts for etcd requests which can't be retried
indefinitely.
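
A bounded retry loop of the kind this parameter describes might look like the
sketch below. It is a generic illustration, not Vitastor's actual retry code:
the `call_with_attempts` helper, the use of `ConnectionError` as the retryable
failure, and the `RuntimeError` raised at the end are all assumptions.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_attempts(request: Callable[[], T], max_attempts: int = 5) -> T:
    """Call `request` until it succeeds or `max_attempts` calls have failed."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return request()
        except ConnectionError as err:  # assumed retryable failure type
            last_error = err
    raise RuntimeError(f"request failed after {max_attempts} attempts") from last_error


# Usage: a stand-in request that fails twice and then succeeds.
calls = {"count": 0}


def flaky_request() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_attempts(flaky_request))
```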

## etcd_quick_timeout

- Type: milliseconds
- Default: 1000
- Can be changed online: yes

Timeout for etcd requests which should complete quickly, like lease refresh.

## etcd_slow_timeout

- Type: milliseconds
- Default: 5000
- Can be changed online: yes

Timeout for etcd requests which are allowed to wait for some time.

## etcd_keepalive_timeout

- Type: seconds
- Default: max(30, etcd_report_interval*2)
- Can be changed online: yes

Timeout for etcd connection HTTP Keep-Alive. Should be higher than
etcd_report_interval to guarantee that keepalive actually works.
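
The default expression above is easy to evaluate for a given
`etcd_report_interval`. The sketch below only restates that formula; the
function name is hypothetical and the intervals passed in are example inputs.

```python
def default_etcd_keepalive_timeout(etcd_report_interval: int) -> int:
    """Default from this page: max(30, etcd_report_interval*2), in seconds."""
    return max(30, etcd_report_interval * 2)


for interval in (5, 15, 20):
    print(interval, default_etcd_keepalive_timeout(interval))
```

The default stays at 30 seconds until `etcd_report_interval` exceeds 15, and is
twice the report interval after that, so it always remains higher than the
report interval.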

## etcd_ws_keepalive_interval

- Type: seconds
- Default: 5
- Can be changed online: yes

etcd WebSocket ping interval required to keep the connection alive and
detect disconnections quickly.