Implement simple "flow control" for RDMA (to check hypothesis about slowdowns)

Fix vitastor-cli create syntax
Allow to start OSDs without local store (only for tests)
16 changed files with 84 additions and 22 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 2.8)
+cmake_minimum_required(VERSION 2.8.12)

 project(vitastor)

--- a/csi/src/controllerserver.go
+++ b/csi/src/controllerserver.go
@@ -6,6 +6,7 @@ package vitastor
 import (
    "context"
    "encoding/json"
+    "fmt"
    "strings"
    "bytes"
    "strconv"
@@ -178,7 +179,7 @@ func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVol
    }

    // Create image using vitastor-cli
-    _, err := invokeCLI(ctxVars, []string{ "create", volName, "-s", string(volSize), "--pool", string(poolId) })
+    _, err := invokeCLI(ctxVars, []string{ "create", volName, "-s", fmt.Sprintf("%v", volSize), "--pool", fmt.Sprintf("%v", poolId) })
    if (err != nil)
    {
        if (strings.Index(err.Error(), "already exists") > 0)
--- a/docs/performance/theoretical.en.md
+++ b/docs/performance/theoretical.en.md
@@ -35,15 +35,24 @@ Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
 If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
 lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.

+Implemented NVDIMM support can basically eliminate WA at all - all extra writes will
+go to DRAM memory. But this requires a test cluster with NVDIMM - please contact me
+if you want to provide me with such cluster for tests.
+
 Lazy fsync also reduces WA for parallel workloads because journal blocks are only
 written when they fill up or fsync is requested.

 ## In Practice

-In practice, using tests from [Understanding Performance](understanding.en.md)
-and good server-grade SSD/NVMe drives, you should head for:
+In practice, using tests from [Understanding Performance](understanding.en.md), decent TCP network,
+good server-grade SSD/NVMe drives and disabled CPU power saving, you should head for:
 - At least 5000 T1Q1 replicated read and write iops (maximum 0.2ms latency)
+- At least 5000 T1Q1 EC read IOPS and at least 2200 EC write IOPS (maximum 0.45ms latency)
 - At least ~80k parallel read iops or ~30k write iops per 1 core (1 OSD)
 - Disk-speed or wire-speed linear reads and writes, whichever is the bottleneck in your case

 Lower results may mean that you have bad drives, bad network or some kind of misconfiguration.
+
+Current latency records:
+- 9668 T1Q1 replicated write iops (0.103 ms latency) with TCP and NVMe
+- 9143 T1Q1 replicated read iops (0.109 ms latency) with TCP and NVMe
--- a/docs/performance/theoretical.ru.md
+++ b/docs/performance/theoretical.ru.md
@@ -36,6 +36,25 @@ WA (мультипликатор записи) для 4 КБ блоков в Vit
 Если вы найдёте SSD, хорошо работающий с 512-байтными блоками данных (Optane?),
 то 1, 3 и 4 можно снизить до 512 байт (1/8 от размера данных) и получить WA всего 2.375.

+Если реализовать поддержку NVDIMM, то WA можно, условно говоря, ликвидировать вообще - все
+дополнительные операции записи смогут обслуживаться DRAM памятью. Но для этого необходим
+тестовый кластер с NVDIMM - пишите, если готовы предоставить такой для тестов.
+
 Кроме того, WA снижается при использовании отложенного/ленивого сброса при параллельной
 нагрузке, т.к. блоки журнала записываются на диск только когда они заполняются или явным
 образом запрашивается fsync.
+
+## На практике
+
+На практике, используя тесты fio со страницы [Понимание сути производительности систем хранения](understanding.ru.md),
+нормальную TCP-сеть, хорошие серверные SSD/NVMe, при отключённом энергосбережении процессоров вы можете рассчитывать на:
+- От 5000 IOPS в 1 поток (T1Q1) и на чтение, и на запись при использовании репликации (задержка до 0.2мс)
+- От 5000 IOPS в 1 поток (T1Q1) на чтение и 2200 IOPS в 1 поток на запись при использовании EC (задержка до 0.45мс)
+- От 80000 IOPS на чтение в параллельном режиме на 1 ядро, от 30000 IOPS на запись на 1 ядро (на 1 OSD)
+- Скорость параллельного линейного чтения и записи, равная меньшему значению из скорости дисков или сети
+
+Худшие результаты означают, что у вас либо медленные диски, либо медленная сеть, либо что-то неправильно настроено.
+
+Зафиксированный на данный момент рекорд задержки:
+- 9668 IOPS (0.103 мс задержка) в 1 поток (T1Q1) на запись с TCP и NVMe при использовании репликации
+- 9143 IOPS (0.109 мс задержка) в 1 поток (T1Q1) на чтение с TCP и NVMe при использовании репликации
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 2.8)
+cmake_minimum_required(VERSION 2.8.12)

 project(vitastor)

--- a/src/disk_tool_journal.cpp
+++ b/src/disk_tool_journal.cpp
@@ -281,7 +281,7 @@ void disk_tool_t::dump_journal_entry(int num, journal_entry *je, bool json)
        if (je->big_write.size > sizeof(journal_entry_big_write))
        {
            printf(json ? ",\"bitmap\":\"" : " (bitmap: ");
-            for (int i = sizeof(journal_entry_big_write); i < je->small_write.size; i++)
+            for (int i = sizeof(journal_entry_big_write); i < je->big_write.size; i++)
            {
                printf("%02x", ((uint8_t*)je)[i]);
            }
--- a/src/disk_tool_meta.cpp
+++ b/src/disk_tool_meta.cpp
@@ -26,7 +26,7 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v1_t *)>
        buf_size = dsk.meta_len;
    void *data = memalign_or_die(MEM_ALIGNMENT, buf_size);
    lseek64(dsk.meta_fd, dsk.meta_offset, 0);
-    read_blocking(dsk.meta_fd, data, buf_size);
+    read_blocking(dsk.meta_fd, data, dsk.meta_block_size);
    // Check superblock
    blockstore_meta_header_v1_t *hdr = (blockstore_meta_header_v1_t *)data;
    if (hdr->zero == 0 &&
@@ -41,8 +41,11 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v1_t *)>
            if (buf_size % dsk.meta_block_size)
            {
                buf_size = 8*dsk.meta_block_size;
+                void *new_data = memalign_or_die(MEM_ALIGNMENT, buf_size);
+                memcpy(new_data, data, dsk.meta_block_size);
                free(data);
-                data = memalign_or_die(MEM_ALIGNMENT, buf_size);
+                data = new_data;
+                hdr = (blockstore_meta_header_v1_t *)data;
            }
        }
        dsk.bitmap_granularity = hdr->bitmap_granularity;
--- a/src/msgr_rdma.cpp
+++ b/src/msgr_rdma.cpp
@@ -353,8 +353,10 @@ static void try_send_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
        .wr_id = (uint64_t)(cl->peer_fd*2+1),
        .sg_list = sge,
        .num_sge = op_sge,
-        .opcode = IBV_WR_SEND,
+        .opcode = cl->rdma_conn->avail_recv > 0 ? IBV_WR_SEND_WITH_IMM : IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
+        // Notify peer about our available incoming buffers
+        .imm_data = (uint32_t)cl->rdma_conn->avail_recv,
    };
    int err = ibv_post_send(cl->rdma_conn->qp, &wr, &bad_wr);
    if (err || bad_wr)
@@ -363,15 +365,27 @@ static void try_send_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
        exit(1);
    }
    cl->rdma_conn->cur_send++;
+    cl->rdma_conn->avail_send--;
+    cl->rdma_conn->avail_recv = 0;
 }

 bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
 {
    auto rc = cl->rdma_conn;
-    if (!cl->send_list.size() || rc->cur_send >= rc->max_send)
+    if (rc->cur_send >= rc->max_send)
    {
        return true;
    }
+    if (!cl->send_list.size() || rc->use_flow_control && rc->avail_send <= 0)
+    {
+        if (rc->avail_recv)
+        {
+            // Only notify about available buffers so 2 peers don't lock each other
+            rc->send_sizes.push_back(0);
+            try_send_rdma_wr(cl, NULL, 0);
+        }
+        return true;
+    }
    uint64_t op_size = 0, op_sge = 0;
    ibv_sge sge[rc->max_sge];
    while (rc->send_pos < cl->send_list.size())
@@ -431,6 +445,7 @@ static void try_recv_rdma_wr(osd_client_t *cl, void *buf)
        exit(1);
    }
    cl->rdma_conn->cur_recv++;
+    cl->rdma_conn->avail_recv++;
 }

 bool osd_messenger_t::try_recv_rdma(osd_client_t *cl)
@@ -492,6 +507,11 @@ void osd_messenger_t::handle_rdma_events()
            if (!is_send)
            {
                rc->cur_recv--;
+                if ((wc[i].wc_flags & IBV_WC_WITH_IMM) && wc[i].imm_data > 0)
+                {
+                    rc->avail_send += wc[i].imm_data;
+                    rc->use_flow_control = true;
+                }
                if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf], wc[i].byte_len))
                {
                    // handle_read_buffer may stop the client
@@ -499,6 +519,7 @@ void osd_messenger_t::handle_rdma_events()
                }
                try_recv_rdma_wr(cl, rc->recv_buffers[rc->next_recv_buf]);
                rc->next_recv_buf = (rc->next_recv_buf+1) % rc->recv_buffers.size();
+                try_send_rdma(cl);
            }
            else
            {
--- a/src/msgr_rdma.h
+++ b/src/msgr_rdma.h
@@ -45,7 +45,8 @@ struct msgr_rdma_connection_t
    ibv_qp *qp = NULL;
    msgr_rdma_address_t addr;
    int max_send = 0, max_recv = 0, max_sge = 0;
-    int cur_send = 0, cur_recv = 0;
+    int cur_send = 0, cur_recv = 0, avail_recv = 0, avail_send = 0;
+    bool use_flow_control = false;
    uint64_t max_msg = 0;

    int send_pos = 0, send_buf_pos = 0;
--- a/src/osd.cpp
+++ b/src/osd.cpp
@@ -44,9 +44,10 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
    // FIXME: Use timerfd_interval based directly on io_uring
    this->tfd = epmgr->tfd;

-    auto bs_cfg = json_to_bs(this->config);
-    this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
+    if (!json_is_true(this->config["disable_blockstore"]))
    {
+        auto bs_cfg = json_to_bs(this->config);
+        this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
        // Autosync based on the number of unstable writes to prevent stalls due to insufficient journal space
        uint64_t max_autosync = bs->get_journal_size() / bs->get_block_size() / 2;
        if (autosync_writes > max_autosync)
@@ -93,7 +94,8 @@ osd_t::~osd_t()
 {
    ringloop->unregister_consumer(&consumer);
    delete epmgr;
-    delete bs;
+    if (bs)
+        delete bs;
    close(listen_fd);
    free(zero_buffer);
 }
@@ -475,7 +477,7 @@ void osd_t::print_slow()
            }
        }
    }
-    if (has_slow)
+    if (has_slow && bs)
    {
        bs->dump_diagnostics();
    }
--- a/src/osd.h
+++ b/src/osd.h
@@ -152,7 +152,7 @@ class osd_t

    bool stopping = false;
    int inflight_ops = 0;
-    blockstore_t *bs;
+    blockstore_t *bs = NULL;
    void *zero_buffer = NULL;
    uint64_t zero_buffer_size = 0;
    uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size;
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@@ -182,10 +182,10 @@ json11::Json osd_t::get_statistics()
    char time_str[50] = { 0 };
    sprintf(time_str, "%ld.%03ld", ts.tv_sec, ts.tv_nsec/1000000);
    st["time"] = time_str;
-    st["blockstore_ready"] = bs->is_started();
-    st["data_block_size"] = (uint64_t)bs->get_block_size();
    if (bs)
    {
+        st["blockstore_ready"] = bs->is_started();
+        st["data_block_size"] = (uint64_t)bs->get_block_size();
        st["size"] = bs->get_block_count() * bs->get_block_size();
        st["free"] = bs->get_free_block_count() * bs->get_block_size();
    }
@@ -233,7 +233,8 @@ void osd_t::report_statistics()
    json11::Json::object inode_space;
    json11::Json::object last_stat;
    pool_id_t last_pool = 0;
-    auto & bs_inode_space = bs->get_inode_space_stats();
+    std::map<uint64_t, uint64_t> bs_empty_space;
+    auto & bs_inode_space = bs ? bs->get_inode_space_stats() : bs_empty_space;
    for (auto kv: bs_inode_space)
    {
        pool_id_t pool_id = INODE_POOL(kv.first);
--- a/src/osd_primary_subops.cpp
+++ b/src/osd_primary_subops.cpp
@@ -53,7 +53,10 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
        inode_stats[cur_op->req.rw.inode].op_count[inode_st_op]++;
        inode_stats[cur_op->req.rw.inode].op_sum[inode_st_op] += usec;
        if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
-            inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->op_data->pg_data_size * bs_block_size;
+        {
+            if (cur_op->op_data)
+                inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->op_data->pg_data_size * bs_block_size;
+        }
        else
            inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->req.rw.len;
    }
--- a/src/osd_test.cpp
+++ b/src/osd_test.cpp
@@ -150,6 +150,7 @@ int connect_osd(const char *osd_address, int osd_port)
    if (connect(connect_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
    {
        perror("connect");
+        close(connect_fd);
        return -1;
    }
    int one = 1;
--- a/src/rw_blocking.cpp
+++ b/src/rw_blocking.cpp
@@ -15,7 +15,7 @@ int read_blocking(int fd, void *read_buf, size_t remaining)
    size_t done = 0;
    while (done < remaining)
    {
-        size_t r = read(fd, read_buf, remaining-done);
+        ssize_t r = read(fd, read_buf, remaining-done);
        if (r <= 0)
        {
            if (!errno)
@@ -41,7 +41,7 @@ int write_blocking(int fd, void *write_buf, size_t remaining)
    size_t done = 0;
    while (done < remaining)
    {
-        size_t r = write(fd, write_buf, remaining-done);
+        ssize_t r = write(fd, write_buf, remaining-done);
        if (r < 0)
        {
            if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
--- a/src/stub_bench.cpp
+++ b/src/stub_bench.cpp
@@ -83,6 +83,7 @@ int connect_stub(const char *server_address, int server_port)
    if (connect(connect_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
    {
        perror("connect");
+        close(connect_fd);
        return -1;
    }
    int one = 1;
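
Note on the rw_blocking.cpp change above: read() and write() return ssize_t and report errors as -1. Stored in a size_t, that -1 wraps to a huge positive value, so `r < 0` is never true and `r <= 0` only catches 0. The following is a minimal sketch of the same idea, not the project's code: `read_exact` is a hypothetical helper, and it advances the buffer pointer by the number of bytes already read, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not part of the patch): read exactly `len` bytes.
// Returns 0 on success, -1 on a read error or premature end of file.
int read_exact(int fd, void *buf, size_t len)
{
    assert(buf != nullptr || len == 0);
    size_t done = 0;
    while (done < len)
    {
        // ssize_t keeps the -1 error return negative;
        // a size_t would turn it into SIZE_MAX and hide the error
        ssize_t r = read(fd, (char*)buf + done, len - done);
        if (r < 0)
        {
            if (errno == EINTR)
                continue; // interrupted by a signal, retry
            return -1;
        }
        if (r == 0)
            return -1; // end of file before `len` bytes arrived
        done += (size_t)r;
    }
    return 0;
}
```

With a signed return type the error branch is reachable: calling `read_exact(-1, buf, 5)` on an invalid descriptor returns -1 instead of adding a wrapped value to `done`.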
Author	SHA1	Message	Date
Vitaliy Filippov	8aec63fddd	Implement simple "flow control" for RDMA (to check hypothesis about slowdowns)	2023-03-24 01:57:32 +03:00
Vitaliy Filippov	3bbc46543d	Fix vitastor-cli create syntax	2023-03-17 11:12:58 +03:00
Vitaliy Filippov	2fb0c85618	Allow to start OSDs without local store (only for tests)	2023-03-15 01:13:59 +03:00
Vitaliy Filippov	d81a6c04fc	Update cmake min version so it does not complain about deprecation	2023-03-15 01:08:23 +03:00
Vitaliy Filippov	7b35801647	Fix possible bad realloc in disk_tool_meta for non-standard metadata block sizes	2023-03-15 01:08:23 +03:00
Vitaliy Filippov	f3228d5c07	Fix typo (did not affect execution though)	2023-03-15 01:08:23 +03:00
Vitaliy Filippov	18366f5055	Fix read/write return type in rw_blocking	2023-03-15 01:08:14 +03:00
Vitaliy Filippov	851507c147	Add missing close() in test stubs	2023-03-15 00:23:56 +03:00
Vitaliy Filippov	9aaad28488	Fix "null pointer exception" for unhandled OSD_OP_DELETEs (when pool is not loaded yet)	2023-03-02 11:16:39 +03:00
Vitaliy Filippov	dd57d086fe	Add a missing part of the "theoretical performance" to the Russian version	2023-03-01 00:24:54 +03:00
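
Note on the RDMA "flow control" commit (8aec63fddd): the patch appears to implement a credit scheme in which each side advertises newly posted receive buffers in the immediate data of its next send, and the other side stops sending payloads once its credits run out. The model below is a sketch of that bookkeeping only. It uses no real RDMA verbs, `Peer`, `Message`, `post_recv`, `try_send`, `on_receive` and `pending` are hypothetical names, and only the counters `avail_send`, `avail_recv` and `use_flow_control` are taken from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for one RDMA send: optional payload plus the immediate value
struct Message
{
    bool has_payload;
    uint32_t credits; // models imm_data: receive buffers newly available at the sender
};

struct Peer
{
    int avail_send = 0;            // credits granted by the remote side
    int avail_recv = 0;            // local receive buffers not yet advertised
    bool use_flow_control = false; // enabled after the first credit grant arrives
    int pending = 0;               // queued outgoing payloads (stands in for send_list)
    int received = 0;              // payloads delivered to this peer

    // Post n receive buffers; they are advertised with the next send
    void post_recv(int n)
    {
        assert(n > 0);
        avail_recv += n;
    }

    // Try to send one message. Returns false if nothing was sent.
    bool try_send(Message &out)
    {
        bool blocked = use_flow_control && avail_send <= 0;
        if (pending == 0 || blocked)
        {
            if (avail_recv == 0)
                return false;
            // Credit-only notification so that two peers do not lock each other
            out = { false, (uint32_t)avail_recv };
        }
        else
        {
            out = { true, (uint32_t)avail_recv };
            pending--;
        }
        avail_send--;   // every send consumes one remote receive buffer
        avail_recv = 0; // everything posted so far has now been advertised
        return true;
    }

    // Handle one incoming message and repost the receive buffer it consumed
    void on_receive(const Message &m)
    {
        if (m.credits > 0)
        {
            avail_send += m.credits;
            use_flow_control = true;
        }
        if (m.has_payload)
            received++;
        post_recv(1);
    }
};
```

In this model a sender with queued payloads and no credits returns false from `try_send` until the receiver sends a message carrying new credits, possibly an empty one. As in the patch, the credit-only notification is sent even when `avail_send` is already zero or negative, which is what keeps both sides from waiting on each other.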