[Читать на русском](network.ru.md)

# Network Protocol Parameters

These parameters apply to clients and OSDs and affect network connection logic between clients, OSDs and etcd.

- [tcp_header_buffer_size](#tcp_header_buffer_size)
- [use_sync_send_recv](#use_sync_send_recv)
- [use_rdma](#use_rdma)
- [rdma_device](#rdma_device)
- [rdma_port_num](#rdma_port_num)
- [rdma_gid_index](#rdma_gid_index)
- [rdma_mtu](#rdma_mtu)
- [rdma_max_sge](#rdma_max_sge)
- [rdma_max_msg](#rdma_max_msg)
- [rdma_max_recv](#rdma_max_recv)
- [rdma_max_send](#rdma_max_send)
- [rdma_odp](#rdma_odp)
- [peer_connect_interval](#peer_connect_interval)
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
- [etcd_keepalive_timeout](#etcd_keepalive_timeout)
- [etcd_ws_keepalive_interval](#etcd_ws_keepalive_interval)

## tcp_header_buffer_size

- Type: integer
- Default: 65536

Size of the buffer used to read data using an additional copy. Vitastor packet headers are 128 bytes and payloads are always at least 4 KB, so it is usually beneficial to read multiple packets at once, even though this requires copying the data one extra time. The rest of each packet is received without an additional copy. You can experiment with this parameter to see how it affects random IOPS and linear bandwidth.

## use_sync_send_recv

- Type: boolean
- Default: false

If true, synchronous send/recv syscalls are used instead of io_uring for socket communication. Useless for OSDs because they require io_uring anyway, but may be required for clients with old kernel versions.

## use_rdma

- Type: boolean
- Default: true

Try to use RDMA for communication if it's available.
Disable if you don't want Vitastor to use RDMA. TCP-only clients can also talk to an RDMA-enabled cluster, so disabling RDMA may be needed if clients have RDMA devices that are not connected to the cluster.

## rdma_device

- Type: string

RDMA device name to use for Vitastor OSD communications (for example, "rocep5s0f0"). Vitastor now supports all adapters, even ones without ODP support, like Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor 1.2.0 required ODP, which is only present in Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).

Run `ibv_devinfo -v` as root to list available RDMA devices and their features.

Remember that you also have to configure your network switches if you use RoCE/RoCEv2, otherwise you may experience unstable performance. Refer to the manual of your network vendor for details about setting up the switch for RoCEv2 correctly. Usually this means setting up Lossless Ethernet with PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).

## rdma_port_num

- Type: integer
- Default: 1

RDMA device port number to use. Only relevant for devices that have more than one port. See `phys_port_cnt` in `ibv_devinfo -v` output to determine how many ports your device has.

## rdma_gid_index

- Type: integer
- Default: 0

Global identifier (GID) index of the RDMA device to use. Different GID indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP. Search for "GID" in `ibv_devinfo -v` output to determine which GID index you need.

**IMPORTANT:** If you want to use RoCEv2 (as recommended), the correct `rdma_gid_index` is usually 1 (IPv6) or 3 (IPv4).

## rdma_mtu

- Type: integer
- Default: 4096

RDMA Path MTU to use. Must be 1024, 2048 or 4096. There is usually no reason to change it from the default 4096.

## rdma_max_sge

- Type: integer
- Default: 128

Maximum number of scatter/gather entries to use for RDMA.
OSDs negotiate the actual value when establishing a connection anyway, so it's usually not required to change this parameter.

## rdma_max_msg

- Type: integer
- Default: 132096

Maximum size of a single RDMA send or receive operation in bytes.

## rdma_max_recv

- Type: integer
- Default: 16

Maximum number of RDMA receive buffers per connection (RDMA requires preallocated buffers to receive data). Each buffer is `rdma_max_msg` bytes in size. This setting therefore directly affects memory usage: a single Vitastor RDMA client uses `rdma_max_recv * rdma_max_msg * OSD_COUNT` bytes of memory. With the defaults, this is roughly 2 MB * number of OSDs.

## rdma_max_send

- Type: integer
- Default: 8

Maximum number of outstanding RDMA send operations per connection. Should be less than `rdma_max_recv` so the receiving side doesn't run out of buffers. Doesn't affect memory usage: no additional memory is allocated for send operations.

## rdma_odp

- Type: boolean
- Default: false

Use RDMA with On-Demand Paging. ODP is currently only available on Mellanox ConnectX-4 and newer adapters. With ODP, memory doesn't have to be registered explicitly for the RDMA adapter to be able to use it. This, in turn, allows memory copying to be skipped during sending. One would think this should improve performance, but **in reality** RDMA performance with ODP is **drastically** worse. For example, a 3-node cluster with 8 NVMe drives in each node and a 2x25 Gbit/s ConnectX-6 RDMA network pushes 3950000 read iops without ODP, but only 239000 iops with ODP.

This happens because the Mellanox ODP implementation seems to be based on message retransmissions when the adapter doesn't know about the buffer yet: it likely uses standard "RNR retransmissions" (RNR = receiver not ready), which are generally slow in RDMA/RoCE networks. Here's a presentation about it from the ISPASS-2021 conference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf

ODP support is retained in the code just in case a good ODP implementation appears one day.
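
The memory formula given for `rdma_max_recv` and the rules stated for `rdma_mtu` and `rdma_max_send` can be checked before rolling out a configuration. The following is a minimal Python sketch, not part of Vitastor: `RdmaSettings`, `recv_buffer_bytes` and `check_rdma` are hypothetical names, and the code only encodes the rules written on this page, not the actual negotiation logic of OSDs.

```python
from dataclasses import dataclass

# Allowed rdma_mtu values, as listed in the rdma_mtu description.
VALID_RDMA_MTU = (1024, 2048, 4096)


@dataclass
class RdmaSettings:
    # Hypothetical container; defaults copied from the parameter descriptions.
    rdma_mtu: int = 4096
    rdma_max_msg: int = 132096
    rdma_max_recv: int = 16
    rdma_max_send: int = 8


def recv_buffer_bytes(s: RdmaSettings, osd_count: int) -> int:
    """Receive-buffer memory of one RDMA client:
    rdma_max_recv * rdma_max_msg * OSD_COUNT."""
    return s.rdma_max_recv * s.rdma_max_msg * osd_count


def check_rdma(s: RdmaSettings) -> list[str]:
    """Return human-readable problems; an empty list means the rules hold."""
    problems = []
    if s.rdma_mtu not in VALID_RDMA_MTU:
        problems.append(f"rdma_mtu must be one of {VALID_RDMA_MTU}, got {s.rdma_mtu}")
    # Fewer outstanding sends than receive buffers, so the peer doesn't run out.
    if s.rdma_max_send >= s.rdma_max_recv:
        problems.append("rdma_max_send should be less than rdma_max_recv")
    return problems


if __name__ == "__main__":
    s = RdmaSettings()
    per_osd = recv_buffer_bytes(s, 1)
    print(per_osd, per_osd / 2**20)  # bytes and MiB per OSD connection
    print(check_rdma(s))
```

With the defaults, `recv_buffer_bytes(RdmaSettings(), 1)` is 2113536 bytes (about 2.02 MiB), which matches the "roughly 2 MB per OSD" estimate; a client talking to N OSDs needs N times that, so `rdma_max_recv` and `rdma_max_msg` are the knobs that reduce client memory.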

## peer_connect_interval

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

Interval before attempting to reconnect to an unavailable OSD.

## peer_connect_timeout

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

Timeout for OSD connection attempts.

## osd_idle_timeout

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

OSD connection inactivity time after which clients and other OSDs send keepalive requests to check the state of the connection.

## osd_ping_timeout

- Type: seconds
- Default: 5
- Minimum: 1
- Can be changed online: yes

Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond within this time, the connection to it is dropped and a reconnection attempt is scheduled.

## max_etcd_attempts

- Type: integer
- Default: 5
- Can be changed online: yes

Maximum number of attempts for etcd requests which can't be retried indefinitely.

## etcd_quick_timeout

- Type: milliseconds
- Default: 1000
- Can be changed online: yes

Timeout for etcd requests which should complete quickly, like lease refresh.

## etcd_slow_timeout

- Type: milliseconds
- Default: 5000
- Can be changed online: yes

Timeout for etcd requests which are allowed to wait for some time.

## etcd_keepalive_timeout

- Type: seconds
- Default: max(30, etcd_report_interval*2)
- Can be changed online: yes

Timeout for etcd connection HTTP Keep-Alive. Should be higher than `etcd_report_interval` to guarantee that keepalive actually works.

## etcd_ws_keepalive_interval

- Type: seconds
- Default: 30
- Can be changed online: yes

etcd websocket ping interval required to keep the connection alive and detect disconnections quickly.
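
The timeout descriptions above combine into a few simple rules: each OSD connection timeout has a minimum of 1 second, a silent peer is dropped roughly `osd_idle_timeout + osd_ping_timeout` seconds after its last activity (idle period, then the wait for the keepalive reply), and `etcd_keepalive_timeout` defaults to `max(30, etcd_report_interval*2)`. The Python sketch below encodes only these rules. `OsdTimeouts`, `check_minimums`, `silent_peer_drop_time`, `default_etcd_keepalive_timeout` and `etcd_keepalive_ok` are hypothetical helpers, not Vitastor APIs. `etcd_report_interval` is taken as an input because it isn't documented on this page. The drop-time estimate is a simplification: the actual timing depends on how often the implementation checks connections, which this page doesn't specify.

```python
from dataclasses import dataclass


@dataclass
class OsdTimeouts:
    # Hypothetical container; values in seconds, defaults copied from the list above.
    peer_connect_interval: float = 5
    peer_connect_timeout: float = 5
    osd_idle_timeout: float = 5
    osd_ping_timeout: float = 5


def check_minimums(t: OsdTimeouts) -> list[str]:
    """Each of these parameters has a documented minimum of 1 second."""
    return [
        f"{name}={value} is below the minimum of 1 second"
        for name, value in vars(t).items()
        if value < 1
    ]


def silent_peer_drop_time(t: OsdTimeouts) -> float:
    """Simplified estimate: inactivity period before a keepalive is sent,
    plus the maximum wait for the keepalive response."""
    return t.osd_idle_timeout + t.osd_ping_timeout


def default_etcd_keepalive_timeout(etcd_report_interval: float) -> float:
    # Documented default of etcd_keepalive_timeout: max(30, etcd_report_interval*2)
    return max(30, etcd_report_interval * 2)


def etcd_keepalive_ok(etcd_keepalive_timeout: float, etcd_report_interval: float) -> bool:
    # The keepalive timeout should be higher than the report interval.
    return etcd_keepalive_timeout > etcd_report_interval


if __name__ == "__main__":
    t = OsdTimeouts()
    print(check_minimums(t))
    print(silent_peer_drop_time(t))
    print(default_etcd_keepalive_timeout(20))
```

With all defaults, the estimate gives about 10 seconds before a silent OSD connection is dropped; lowering `osd_idle_timeout` and `osd_ping_timeout` shortens it at the cost of more keepalive traffic. Note how the etcd default adapts: with a report interval of 20 seconds it becomes 40 instead of 30, which keeps it above the report interval.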