vitastor/docs/config/network.en.md

256 lines
7.4 KiB
Markdown
Raw Normal View History

2022-01-29 23:43:22 +03:00
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Network Protocol Parameters
-----
[Читать на русском](network.ru.md)
# Network Protocol Parameters
These parameters apply to clients and OSDs and affect network connection logic
between clients, OSDs and etcd.
- [tcp_header_buffer_size](#tcp_header_buffer_size)
- [use_sync_send_recv](#use_sync_send_recv)
- [use_rdma](#use_rdma)
- [rdma_device](#rdma_device)
- [rdma_port_num](#rdma_port_num)
- [rdma_gid_index](#rdma_gid_index)
- [rdma_mtu](#rdma_mtu)
- [rdma_max_sge](#rdma_max_sge)
- [rdma_max_msg](#rdma_max_msg)
- [rdma_max_recv](#rdma_max_recv)
- [rdma_max_send](#rdma_max_send)
- [rdma_odp](#rdma_odp)
2022-01-29 23:43:22 +03:00
- [peer_connect_interval](#peer_connect_interval)
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
- [etcd_keepalive_timeout](#etcd_keepalive_timeout)
- [etcd_ws_keepalive_interval](#etcd_ws_keepalive_interval)
2022-01-29 23:43:22 +03:00
## tcp_header_buffer_size
- Type: integer
- Default: 65536
Size of the buffer used to read data using an additional copy. Vitastor
packet headers are 128 bytes, payload is always at least 4 KB, so it is
usually beneficial to try to read multiple packets at once even though
it requires to copy the data an additional time. The rest of each packet
is received without an additional copy. You can try to play with this
parameter and see how it affects random iops and linear bandwidth if you
want.
## use_sync_send_recv
- Type: boolean
- Default: false
If true, synchronous send/recv syscalls are used instead of io_uring for
socket communication. Useless for OSDs because they require io_uring anyway,
but may be required for clients with old kernel versions.
## use_rdma
- Type: boolean
- Default: true
Try to use RDMA for communication if it's available. Disable if you don't
want Vitastor to use RDMA. TCP-only clients can also talk to an RDMA-enabled
cluster, so disabling RDMA may be needed if clients have RDMA devices,
but they are not connected to the cluster.
## rdma_device
- Type: string
RDMA device name to use for Vitastor OSD communications (for example,
"rocep5s0f0"). Now Vitastor supports all adapters, even ones without
ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
Versions up to Vitastor 1.2.0 required ODP which is only present in
Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
Run `ibv_devinfo -v` as root to list available RDMA devices and their
features.
2022-01-29 23:43:22 +03:00
Remember that you also have to configure your network switches if you use
RoCE/RoCEv2, otherwise you may experience unstable performance. Refer to
the manual of your network vendor for details about setting up the switch
for RoCEv2 correctly. Usually it means setting up Lossless Ethernet with
PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
2022-01-29 23:43:22 +03:00
## rdma_port_num
- Type: integer
- Default: 1
RDMA device port number to use. Only for devices that have more than 1 port.
See `phys_port_cnt` in `ibv_devinfo -v` output to determine how many ports
your device has.
## rdma_gid_index
- Type: integer
- Default: 0
Global address identifier index of the RDMA device to use. Different GID
indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
Search for "GID" in `ibv_devinfo -v` output to determine which GID index
you need.
**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
## rdma_mtu
- Type: integer
- Default: 4096
RDMA Path MTU to use. Must be 1024, 2048 or 4096. There is usually no
sense to change it from the default 4096.
## rdma_max_sge
- Type: integer
- Default: 128
Maximum number of scatter/gather entries to use for RDMA. OSDs negotiate
the actual value when establishing connection anyway, so it's usually not
required to change this parameter.
## rdma_max_msg
- Type: integer
- Default: 132096
2022-01-29 23:43:22 +03:00
Maximum size of a single RDMA send or receive operation in bytes.
## rdma_max_recv
- Type: integer
- Default: 16
Maximum number of RDMA receive buffers per connection (RDMA requires
preallocated buffers to receive data). Each buffer is `rdma_max_msg` bytes
in size. So this setting directly affects memory usage: a single Vitastor
RDMA client uses `rdma_max_recv * rdma_max_msg * OSD_COUNT` bytes of memory.
Default is roughly 2 MB * number of OSDs.
## rdma_max_send
2022-01-29 23:43:22 +03:00
- Type: integer
- Default: 8
Maximum number of outstanding RDMA send operations per connection. Should be
less than `rdma_max_recv` so the receiving side doesn't run out of buffers.
Doesn't affect memory usage - additional memory isn't allocated for send
operations.
2022-01-29 23:43:22 +03:00
## rdma_odp
- Type: boolean
- Default: false
Use RDMA with On-Demand Paging. ODP is currently only available on Mellanox
ConnectX-4 and newer adapters. ODP allows to not register memory explicitly
for RDMA adapter to be able to use it. This, in turn, allows to skip memory
copying during sending. One would think this should improve performance, but
**in reality** RDMA performance with ODP is **drastically** worse. Example
3-node cluster with 8 NVMe in each node and 2*25 GBit/s ConnectX-6 RDMA network
without ODP pushes 3950000 read iops, but only 239000 iops with ODP...
This happens because Mellanox ODP implementation seems to be based on
message retransmissions when the adapter doesn't know about the buffer yet -
it likely uses standard "RNR retransmissions" (RNR = receiver not ready)
which is generally slow in RDMA/RoCE networks. Here's a presentation about
it from ISPASS-2021 conference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf
ODP support is retained in the code just in case a good ODP implementation
appears one day.
2022-01-29 23:43:22 +03:00
## peer_connect_interval
- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
Interval before attempting to reconnect to an unavailable OSD.
## peer_connect_timeout
- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
Timeout for OSD connection attempts.
## osd_idle_timeout
- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
OSD connection inactivity time after which clients and other OSDs send
keepalive requests to check state of the connection.
## osd_ping_timeout
- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond
within this time, the connection to it is dropped and a reconnection attempt
is scheduled.
## max_etcd_attempts
- Type: integer
- Default: 5
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
Maximum number of attempts for etcd requests which can't be retried
indefinitely.
## etcd_quick_timeout
- Type: milliseconds
- Default: 1000
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
Timeout for etcd requests which should complete quickly, like lease refresh.
## etcd_slow_timeout
- Type: milliseconds
- Default: 5000
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
Timeout for etcd requests which are allowed to wait for some time.
## etcd_keepalive_timeout
- Type: seconds
- Default: max(30, etcd_report_interval*2)
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
Timeout for etcd connection HTTP Keep-Alive. Should be higher than
etcd_report_interval to guarantee that keepalive actually works.
## etcd_ws_keepalive_interval
2022-01-29 23:43:22 +03:00
- Type: seconds
- Default: 30
- Can be changed online: yes
2022-01-29 23:43:22 +03:00
etcd websocket ping interval required to keep the connection alive and
detect disconnections quickly.