Compare commits

...

61 Commits

Author SHA1 Message Date
d258a6e76b Add --restart_interval to tests 2023-05-10 01:51:02 +03:00
77155ab7bd Add self-restart support to monitor (mainly for tests) 2023-05-10 01:51:02 +03:00
a409598b16 Wait for free space again, but count on big_write flushes instead of just flusher activity 2023-05-10 01:51:02 +03:00
f4c6765522 Ignore ENOENT in epoll_ctl 2023-05-08 20:39:20 +03:00
ad2916068a Fix test_add_osd rebalance timeout check 2023-05-08 20:39:20 +03:00
321cb435a6 Fix monitor incorrectly changing PG count when last_clean_pgs contains less PGs than the new number 2023-05-08 20:39:20 +03:00
cfcf4f4355 Support checking /dev/nbdX nodes in Docker 2023-05-08 20:39:20 +03:00
e0fb17bfee Make etcd more stable in tests (add ionice and raise timeout) 2023-05-08 20:36:00 +03:00
5b9031fecc Fix monitor possibly applying incorrect PG history under heavy load
Monitor could deceive itself by immediately saving PG configuration changes
which weren't applied to etcd yet in memory, and apply incorrect PG history
changes next time if the first update fails.

This usually only happened under heavy load and was caught in CI. :-)
2023-05-07 23:23:00 +03:00
5da1d8e1b5 Fix EC just-bitmap reads (len=0) (fixes SCHEME=ec test_snapshot.sh) 2023-05-07 14:00:08 +03:00
44f86f1999 Add a basic EC 2+2 recovery test (not really required, but let it be there) 2023-05-07 11:26:27 +03:00
2d9a80c6f6 Implement missing bitmap recovery with ISA-L \(°□°)/ 2023-05-07 11:25:51 +03:00
5e295e346e Do not make vitastor-mon part of vitastor.target 2023-04-29 00:17:47 +03:00
d9c0898b7c Notes about config and vitastor-disk cache status 2023-04-29 00:08:24 +03:00
04cfb48361 Add a note about PVE 7.4 2023-04-28 11:37:11 +03:00
ab615849d6 Release 0.8.8
- Fix vitastor-cli rm/rm-data broken in 0.8.6 (missing messenger initialization)
- Prepare OSD read handler for upcoming version with scrub - allow "secondary reads" to return errors
- Fix OSDs re-peering PGs infinitely with a big number of PGs (reproduced in test_add_osd)
- Fix another variant of flusher sync-waiting stall (reproduced in test_write)
- Fix other tests in tests/ (will add them to Gitea CI soon)
- Add patches for QEMU 6.2-8.0
- Fix QEMU driver compatibility with QEMU 8.0
- Build packages for RHEL 9 clones (based on AlmaLinux 9)
2023-04-28 11:22:00 +03:00
38be9a49c0 Add AlmaLinux 9 build to documentation 2023-04-28 02:00:52 +03:00
7d6bf84a3e Add scripts/meson-buildoptions.sh to QEMU patches 2023-04-28 01:43:22 +03:00
41a40a4123 Add QEMU spec patch for Alma/Rocky/RH 9 2023-04-28 01:32:06 +03:00
b94587ef0e Fix some build warnings 2023-04-28 00:44:27 +03:00
2a2f4f6738 Add Almalinux 9 build 2023-04-28 00:40:50 +03:00
c768a9015f Fix QEMU driver compatibility with QEMU 8.0 2023-04-25 11:20:21 +03:00
0d9e10cf96 Add patches for QEMU 6.2-8.0 2023-04-25 11:20:21 +03:00
b74ccb613c Fix another variant of flusher sync-waiting stall 2023-04-24 00:44:41 +03:00
5052174918 Fix test_write_no_same (too large image) 2023-04-24 00:44:41 +03:00
eec9cf5575 Fix test_snapshot.sh - qemu-img requires explicit backing_fmt 2023-04-24 00:44:41 +03:00
a04dab0840 Initialize messenger in cluster_client listings 2023-04-24 00:44:41 +03:00
160863f707 Print op pointer values in slow log 2023-04-23 17:54:00 +03:00
2f16c32eb4 Fix test_minsize_1 (left_on_dead) 2023-04-23 17:54:00 +03:00
2877cd0adb Allow OP_SEC_READ to return errors (do not hang the connection) 2023-04-23 17:54:00 +03:00
480509f5b9 Fix pg_data_size > 1 for replicas (harmless bug) 2023-04-23 01:50:42 +03:00
46462da45e Preload own PG history updates to fix PG state loop possibly applying the old metadata version 2023-04-23 01:50:30 +03:00
024c8658f6 Fix missing } in quick start documentation 2023-04-12 18:26:38 +03:00
7e958afeda Release 0.8.7
This release includes a bunch of important bugfixes for erasure-coded setups
with disabled immediate_commit. After these fixes, "test_heal" OSD killing test
now passes fine with EC:

- Fix cluster write stalls with "Error while doing flush on OSD xx: -16 (Device or resource busy)"
  in OSD logs possible in EC setups with disabled immediate_commit by selectively
  syncing nonsynced objects on STABILIZE/ROLLBACK (https://github.com/vitalif/vitastor/issues/51)
- Fix other EC + disabled immediate_commit problems:
  - Fix "opcode=5 retval=-2" errors happening on SYNC retries
  - Fix non-working "pagination" during PG dirty object flushing
  - Fix write operations not continued correctly after dirty object flushing
- Fix incorrect parity read-modify-write calculation when writing into a lost chunk
- Fix OSDs losing left_on_dead PG state of non-clean PGs and thus not removing junk data in the cluster
- Fix a small memory leak caused by bad indexing of EC recovery matrices
- Fix a rare use-after-free in cluster_client caused by a reenterability issue
- Fix vitastor-cli create command syntax in the CSI driver
- Allow to start OSDs without local store for tests
- Fix memory allocation error in disk_tool_meta for non-standard metadata block sizes
- Fix delete operations received before loading pool metadata crashing OSDs with "null pointer exception"
- Improve "theoretical performance" Russian documentation

New features:

- Implement online configuration update for some parameters. Documentation is coming soon :)
2023-04-11 02:11:57 +03:00
2f5e769a29 Fix a small memory leak caused by bad indexing of EC recovery matrices 2023-04-11 00:30:36 +03:00
28d5e53c6c Add test_heal to run_tests 2023-04-09 02:10:42 +03:00
d9f55f11d8 More logs (log_level 10), append to log instead of overwriting on restart in tests 2023-04-09 02:06:10 +03:00
3237014608 Fix incorrect parity read-modify-write calculation when writing into a lost chunk 2023-04-09 02:06:10 +03:00
baaf8f6f44 Fix write operations not continued correctly after flush 2023-04-09 02:06:10 +03:00
1d83fdcd17 Add debug logs to osd_flush 2023-04-09 02:06:10 +03:00
0ddd787c38 Fix non-working "pagination" during PG dirty object flushing 2023-04-08 02:44:02 +03:00
6eff3a60a5 Do not lose left_on_dead PG state of non-clean PGs 2023-04-08 02:44:02 +03:00
888a6975ab Fix a rare use-after-free in cluster_client caused by a reenterability issue 2023-04-08 02:44:02 +03:00
cd1e890bd4 Fix "opcode=5 retval=-2" errors sometimes possible with EC 2023-04-08 02:44:02 +03:00
0fbf4c6a08 Selectively sync nonsynced objects on STABILIZE/ROLLBACK (fix for github issue #51) 2023-04-08 02:44:02 +03:00
d06ed2b0e7 Implement online config update 2023-03-26 19:21:50 +03:00
3bbc46543d Fix vitastor-cli create syntax 2023-03-17 11:12:58 +03:00
2fb0c85618 Allow to start OSDs without local store (only for tests) 2023-03-15 01:13:59 +03:00
d81a6c04fc Update cmake min version so it does not complain about deprecation 2023-03-15 01:08:23 +03:00
7b35801647 Fix possible bad realloc in disk_tool_meta for non-standard metadata block sizes 2023-03-15 01:08:23 +03:00
f3228d5c07 Fix typo (did not affect execution though) 2023-03-15 01:08:23 +03:00
18366f5055 Fix read/write return type in rw_blocking 2023-03-15 01:08:14 +03:00
851507c147 Add missing close() in test stubs 2023-03-15 00:23:56 +03:00
9aaad28488 Fix "null pointer exception" for unhandled OSD_OP_DELETEs (when pool is not loaded yet) 2023-03-02 11:16:39 +03:00
dd57d086fe Add a missing part of the "theoretical performance" to the Russian version 2023-03-01 00:24:54 +03:00
8810eae8fb Release 0.8.6
Important fixes:

- Fix possibly incorrect EC parity chunk updates with EC n+k, k > 1 and when
  the first parity chunk is missing

Minor fixes and improvements:

- Fix incorrect EC free space statistics in vitastor-cli df output
- Speedup vitastor-cli startup in clusters with RDMA
- Remove unused PG "peered" state (previously used to update PG epoch)
- Use sfdisk with just --json in vitastor-disk (--dump --json isn't needed)
- Allow trailing comma in sfdisk output (fixes sfdisk 2.36 compatibility)
- Slightly improve RDMA send/receive code
- Reduce RDMA memory consumption by default (rdma_max_recv/send = 16/8)
- Use vitastor-cli instead of direct etcd interaction in the CSI driver
2023-02-28 11:18:48 +03:00
c1365f46c9 Use vitastor-cli instead of direct etcd interaction in the CSI driver 2023-02-28 11:02:50 +03:00
14d6acbcba Set default rdma_max_recv/send to 16/8, fix documentation 2023-02-28 11:00:56 +03:00
1e307069bc Fix missing parity chunk calculation for EC n+k, k > 1 and first parity chunk missing 2023-02-28 02:40:19 +03:00
c3e80abad7 Allow to send more than 1 operation at a time 2023-02-26 02:01:04 +03:00
138ffe4032 Reuse incoming RDMA buffers 2023-02-26 00:55:01 +03:00
103 changed files with 2855 additions and 700 deletions

View File

@@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 2.8) cmake_minimum_required(VERSION 2.8.12)
project(vitastor) project(vitastor)
set(VERSION "0.8.5") set(VERSION "0.8.8")
add_subdirectory(src) add_subdirectory(src)

View File

@@ -1,4 +1,4 @@
VERSION ?= v0.8.5 VERSION ?= v0.8.8
all: build push all: build push

View File

@@ -49,7 +49,7 @@ spec:
capabilities: capabilities:
add: ["SYS_ADMIN"] add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.8.5 image: vitalif/vitastor-csi:v0.8.8
args: args:
- "--node=$(NODE_ID)" - "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)" - "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -116,7 +116,7 @@ spec:
privileged: true privileged: true
capabilities: capabilities:
add: ["SYS_ADMIN"] add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.8.5 image: vitalif/vitastor-csi:v0.8.8
args: args:
- "--node=$(NODE_ID)" - "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)" - "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -5,7 +5,7 @@ package vitastor
const ( const (
vitastorCSIDriverName = "csi.vitastor.io" vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.8.5" vitastorCSIDriverVersion = "0.8.8"
) )
// Config struct fills the parameters of request or user input // Config struct fills the parameters of request or user input

View File

@@ -6,11 +6,11 @@ package vitastor
import ( import (
"context" "context"
"encoding/json" "encoding/json"
"fmt"
"strings" "strings"
"bytes" "bytes"
"strconv" "strconv"
"time" "time"
"fmt"
"os" "os"
"os/exec" "os/exec"
"io/ioutil" "io/ioutil"
@@ -21,8 +21,6 @@ import (
"google.golang.org/grpc/codes" "google.golang.org/grpc/codes"
"google.golang.org/grpc/status" "google.golang.org/grpc/status"
"go.etcd.io/etcd/clientv3"
"github.com/container-storage-interface/spec/lib/go/csi" "github.com/container-storage-interface/spec/lib/go/csi"
) )
@@ -114,6 +112,34 @@ func GetConnectionParams(params map[string]string) (map[string]string, []string,
return ctxVars, etcdUrl, etcdPrefix return ctxVars, etcdUrl, etcdPrefix
} }
func invokeCLI(ctxVars map[string]string, args []string) ([]byte, error)
{
if (ctxVars["etcdUrl"] != "")
{
args = append(args, "--etcd_address", ctxVars["etcdUrl"])
}
if (ctxVars["etcdPrefix"] != "")
{
args = append(args, "--etcd_prefix", ctxVars["etcdPrefix"])
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-cli", args...)
var stdout, stderr bytes.Buffer
c.Stdout = &stdout
c.Stderr = &stderr
err := c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-cli %s failed: %s, status %s\n", strings.Join(args, " "), stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
return stdout.Bytes(), nil
}
// Create the volume // Create the volume
func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
{ {
@@ -146,128 +172,41 @@ func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVol
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
} }
// FIXME: The following should PROBABLY be implemented externally in a management tool ctxVars, etcdUrl, _ := GetConnectionParams(req.Parameters)
ctxVars, etcdUrl, etcdPrefix := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0) if (len(etcdUrl) == 0)
{ {
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf") return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
} }
// Connect to etcd // Create image using vitastor-cli
cli, err := clientv3.New(clientv3.Config{ _, err := invokeCLI(ctxVars, []string{ "create", volName, "-s", fmt.Sprintf("%v", volSize), "--pool", fmt.Sprintf("%v", poolId) })
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil) if (err != nil)
{ {
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error()) if (strings.Index(err.Error(), "already exists") > 0)
}
defer cli.Close()
var imageId uint64 = 0
for
{
// Check if the image exists
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{ {
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error()) stat, err := invokeCLI(ctxVars, []string{ "ls", "--json", volName })
}
if (len(resp.Kvs) > 0)
{
kv := resp.Kvs[0]
var v InodeIndex
err := json.Unmarshal(kv.Value, &v)
if (err != nil) if (err != nil)
{ {
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error()) return nil, err
} }
poolId = v.PoolId var inodeCfg []InodeConfig
imageId = v.Id err = json.Unmarshal(stat, &inodeCfg)
inodeCfgKey := fmt.Sprintf("/config/inode/%d/%d", poolId, imageId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+inodeCfgKey)
cancel()
if (err != nil) if (err != nil)
{ {
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error()) return nil, status.Error(codes.Internal, "Invalid JSON in vitastor-cli ls: "+err.Error())
} }
if (len(resp.Kvs) == 0) if (len(inodeCfg) == 0)
{ {
return nil, status.Error(codes.Internal, "missing "+inodeCfgKey+" key in etcd") return nil, status.Error(codes.Internal, "vitastor-cli create said that image already exists, but ls can't find it")
} }
var inodeCfg InodeConfig if (inodeCfg[0].Size < uint64(volSize))
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
if (inodeCfg.Size < uint64(volSize))
{ {
return nil, status.Error(codes.Internal, "image "+volName+" is already created, but size is less than expected") return nil, status.Error(codes.Internal, "image "+volName+" is already created, but size is less than expected")
} }
} }
else else
{ {
// Find a free ID return nil, err
// Create image metadata in a transaction verifying that the image doesn't exist yet AND ID is still free
maxIdKey := fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, maxIdKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
var modRev int64
var nextId uint64
if (len(resp.Kvs) > 0)
{
var err error
nextId, err = strconv.ParseUint(string(resp.Kvs[0].Value), 10, 64)
if (err != nil)
{
return nil, status.Error(codes.Internal, maxIdKey+" contains invalid ID")
}
modRev = resp.Kvs[0].ModRevision
nextId++
}
else
{
nextId = 1
}
inodeIdxJson, _ := json.Marshal(InodeIndex{
Id: nextId,
PoolId: poolId,
})
inodeCfgJson, _ := json.Marshal(InodeConfig{
Name: volName,
Size: uint64(volSize),
})
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).If(
clientv3.Compare(clientv3.ModRevision(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)), "=", modRev),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)), "=", 0),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId)), "=", 0),
).Then(
clientv3.OpPut(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId), fmt.Sprintf("%d", nextId)),
clientv3.OpPut(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName), string(inodeIdxJson)),
clientv3.OpPut(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId), string(inodeCfgJson)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to commit transaction in etcd: "+err.Error())
}
if (txnResp.Succeeded)
{
imageId = nextId
break
}
// Start over if the transaction fails
} }
} }
@@ -299,97 +238,12 @@ func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVol
} }
volName := ctxVars["name"] volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars) ctxVars, _, _ = GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
cli, err := clientv3.New(clientv3.Config{ _, err = invokeCLI(ctxVars, []string{ "rm", volName })
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil) if (err != nil)
{ {
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error()) return nil, err
}
defer cli.Close()
// Find inode by name
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var idx InodeIndex
err = json.Unmarshal(resp.Kvs[0].Value, &idx)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
// Get inode config
inodeCfgKey := fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err = cli.Get(ctx, inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
// Delete inode data by invoking vitastor-cli
args := []string{
"rm-data", "--etcd_address", strings.Join(etcdUrl, ","),
"--pool", fmt.Sprintf("%d", idx.PoolId),
"--inode", fmt.Sprintf("%d", idx.Id),
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-cli", args...)
var stderr bytes.Buffer
c.Stdout = nil
c.Stderr = &stderr
err = c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-cli rm-data failed: %s, status %s\n", stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
// Delete inode config in etcd
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).Then(
clientv3.OpDelete(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)),
clientv3.OpDelete(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: "+err.Error())
}
if (!txnResp.Succeeded)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: transaction failed")
} }
return &csi.DeleteVolumeResponse{}, nil return &csi.DeleteVolumeResponse{}, nil

4
debian/changelog vendored
View File

@@ -1,10 +1,10 @@
vitastor (0.8.5-1) unstable; urgency=medium vitastor (0.8.8-1) unstable; urgency=medium
* Bugfixes * Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300 -- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300
vitastor (0.8.5-1) unstable; urgency=medium vitastor (0.8.8-1) unstable; urgency=medium
* Implement NFS proxy * Implement NFS proxy
* Add documentation * Add documentation

View File

@@ -34,8 +34,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \ mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \ rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.8.5; \ cp -r /root/vitastor vitastor-0.8.8; \
cd vitastor-0.8.5; \ cd vitastor-0.8.8; \
ln -s /root/fio-build/fio-*/ ./fio; \ ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \ ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -48,8 +48,8 @@ RUN set -e -x; \
rm -rf a b; \ rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \ echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.8.5.orig.tar.xz vitastor-0.8.5; \ tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.8.8.orig.tar.xz vitastor-0.8.8; \
cd vitastor-0.8.5; \ cd vitastor-0.8.8; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \ DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \ DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -17,14 +17,16 @@ Configuration parameters can be set in 3 places:
- Configuration file (`/etc/vitastor/vitastor.conf` or other path) - Configuration file (`/etc/vitastor/vitastor.conf` or other path)
- etcd key `/vitastor/config/global`. Most variables can be set there, but etcd - etcd key `/vitastor/config/global`. Most variables can be set there, but etcd
connection parameters should obviously be set in the configuration file. connection parameters should obviously be set in the configuration file.
- Command line of Vitastor components: OSD, mon, fio and QEMU options, - Command line of Vitastor components: OSD (when you run it without vitastor-disk),
OpenStack/Proxmox/etc configuration. The latter doesn't allow to set all mon, fio and QEMU options, OpenStack/Proxmox/etc configuration. The latter
variables directly, but it allows to override the configuration file and doesn't allow to set all variables directly, but it allows to override the
set everything you need inside it. configuration file and set everything you need inside it.
- OSD superblocks created by [vitastor-disk](../usage/disk.en.md) contain
primarily disk layout parameters of specific OSDs. In fact, these parameters
are automatically passed into the command line of vitastor-osd process, so
they have the same "status" as command-line parameters.
In the future, additional configuration methods may be added: In the future, additional configuration methods may be added:
- OSD superblock which will, by design, contain parameters related to the disk
layout and to one specific OSD.
- OSD-specific keys in etcd like `/vitastor/config/osd/<number>`. - OSD-specific keys in etcd like `/vitastor/config/osd/<number>`.
## Parameter Reference ## Parameter Reference

View File

@@ -19,14 +19,17 @@
- Ключе в etcd `/vitastor/config/global`. Большая часть параметров может - Ключе в etcd `/vitastor/config/global`. Большая часть параметров может
задаваться там, кроме, естественно, самих параметров соединения с etcd, задаваться там, кроме, естественно, самих параметров соединения с etcd,
которые должны задаваться в файле конфигурации которые должны задаваться в файле конфигурации
- В командной строке компонентов Vitastor: OSD, монитора, опциях fio и QEMU, - В командной строке компонентов Vitastor: OSD (при ручном запуске без vitastor-disk),
настроек OpenStack, Proxmox и т.п. Последние, как правило, не включают полный монитора, опциях fio и QEMU, настроек OpenStack, Proxmox и т.п. Последние,
набор параметров напрямую, но разрешают определить путь к файлу конфигурации как правило, не включают полный набор параметров напрямую, но позволяют
и задать любые параметры в нём. определить путь к файлу конфигурации и задать любые параметры в нём.
- В суперблоке OSD, записываемом [vitastor-disk](../usage/disk.ru.md) - параметры,
связанные с дисковым форматом и с этим конкретным OSD. На самом деле,
при запуске OSD эти параметры автоматически передаются в командную строку
процесса vitastor-osd, то есть по "статусу" они эквивалентны параметрам
командной строки OSD.
В будущем также могут быть добавлены другие способы конфигурации: В будущем также могут быть добавлены другие способы конфигурации:
- Суперблок OSD, в котором будут храниться параметры OSD, связанные с дисковым
форматом и с этим конкретным OSD.
- OSD-специфичные ключи в etcd типа `/vitastor/config/osd/<номер>`. - OSD-специфичные ключи в etcd типа `/vitastor/config/osd/<номер>`.
## Список параметров ## Список параметров

View File

@@ -19,6 +19,7 @@ between clients, OSDs and etcd.
- [rdma_max_sge](#rdma_max_sge) - [rdma_max_sge](#rdma_max_sge)
- [rdma_max_msg](#rdma_max_msg) - [rdma_max_msg](#rdma_max_msg)
- [rdma_max_recv](#rdma_max_recv) - [rdma_max_recv](#rdma_max_recv)
- [rdma_max_send](#rdma_max_send)
- [peer_connect_interval](#peer_connect_interval) - [peer_connect_interval](#peer_connect_interval)
- [peer_connect_timeout](#peer_connect_timeout) - [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout) - [osd_idle_timeout](#osd_idle_timeout)
@@ -74,6 +75,12 @@ to work. For example, Mellanox ConnectX-3 and older adapters don't have
Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as
root to list available RDMA devices and their features. root to list available RDMA devices and their features.
Remember that you also have to configure your network switches if you use
RoCE/RoCEv2, otherwise you may experience unstable performance. Refer to
the manual of your network vendor for details about setting up the switch
for RoCEv2 correctly. Usually it means setting up Lossless Ethernet with
PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
## rdma_port_num ## rdma_port_num
- Type: integer - Type: integer
@@ -116,20 +123,30 @@ required to change this parameter.
## rdma_max_msg ## rdma_max_msg
- Type: integer - Type: integer
- Default: 1048576 - Default: 132096
Maximum size of a single RDMA send or receive operation in bytes. Maximum size of a single RDMA send or receive operation in bytes.
## rdma_max_recv ## rdma_max_recv
- Type: integer
- Default: 16
Maximum number of RDMA receive buffers per connection (RDMA requires
preallocated buffers to receive data). Each buffer is `rdma_max_msg` bytes
in size. So this setting directly affects memory usage: a single Vitastor
RDMA client uses `rdma_max_recv * rdma_max_msg * OSD_COUNT` bytes of memory.
Default is roughly 2 MB * number of OSDs.
## rdma_max_send
- Type: integer - Type: integer
- Default: 8 - Default: 8
Maximum number of parallel RDMA receive operations. Note that this number Maximum number of outstanding RDMA send operations per connection. Should be
of receive buffers `rdma_max_msg` in size are allocated for each client, less than `rdma_max_recv` so the receiving side doesn't run out of buffers.
so this setting actually affects memory usage. This is because RDMA receive Doesn't affect memory usage - additional memory isn't allocated for send
operations are (sadly) still not zero-copy in Vitastor. It may be fixed in operations.
later versions.
## peer_connect_interval ## peer_connect_interval

View File

@@ -19,6 +19,7 @@
- [rdma_max_sge](#rdma_max_sge) - [rdma_max_sge](#rdma_max_sge)
- [rdma_max_msg](#rdma_max_msg) - [rdma_max_msg](#rdma_max_msg)
- [rdma_max_recv](#rdma_max_recv) - [rdma_max_recv](#rdma_max_recv)
- [rdma_max_send](#rdma_max_send)
- [peer_connect_interval](#peer_connect_interval) - [peer_connect_interval](#peer_connect_interval)
- [peer_connect_timeout](#peer_connect_timeout) - [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout) - [osd_idle_timeout](#osd_idle_timeout)
@@ -78,6 +79,13 @@ Implicit On-Demand Paging (Implicit ODP) и Scatter/Gather (SG). Наприме
суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их
параметры и возможности. параметры и возможности.
Обратите внимание, что если вы используете RoCE/RoCEv2, вам также необходимо
правильно настроить для него коммутаторы, иначе вы можете столкнуться с
нестабильной производительностью. Подробную информацию о настройке
коммутатора для RoCEv2 ищите в документации производителя. Обычно это
подразумевает настройку сети без потерь на основе PFC (Priority Flow
Control) и ECN (Explicit Congestion Notification).
## rdma_port_num ## rdma_port_num
- Тип: целое число - Тип: целое число
@@ -121,22 +129,32 @@ OSD в любом случае согласовывают реальное зн
## rdma_max_msg ## rdma_max_msg
- Тип: целое число - Тип: целое число
- Значение по умолчанию: 1048576 - Значение по умолчанию: 132096
Максимальный размер одной RDMA-операции отправки или приёма. Максимальный размер одной RDMA-операции отправки или приёма.
## rdma_max_recv ## rdma_max_recv
- Тип: целое число
- Значение по умолчанию: 16
Максимальное число буферов для RDMA-приёма данных на одно соединение
(RDMA требует заранее выделенных буферов для приёма данных). Каждый буфер
имеет размер `rdma_max_msg` байт. Таким образом, настройка прямо влияет на
потребление памяти - один Vitastor-клиент с RDMA использует
`rdma_max_recv * rdma_max_msg * ЧИСЛО_OSD` байт памяти, по умолчанию -
примерно 2 МБ * число OSD.
## rdma_max_send
- Тип: целое число - Тип: целое число
- Значение по умолчанию: 8 - Значение по умолчанию: 8
Максимальное число параллельных RDMA-операций получения данных. Следует Максимальное число RDMA-операций отправки, отправляемых в очередь одного
иметь в виду, что данное число буферов размером `rdma_max_msg` выделяется соединения. Желательно, чтобы оно было меньше `rdma_max_recv`, чтобы
для каждого подключённого клиентского соединения, так что данная настройка у принимающей стороны в процессе работы не заканчивались буферы на приём.
влияет на потребление памяти. Это так потому, что RDMA-приём данных в Не влияет на потребление памяти - дополнительная память на операции отправки
Vitastor, увы, всё равно не является zero-copy, т.е. всё равно 1 раз не выделяется.
копирует данные в памяти. Данная особенность, возможно, будет исправлена в
более новых версиях Vitastor.
## peer_connect_interval ## peer_connect_interval

View File

@@ -53,6 +53,12 @@
to work. For example, Mellanox ConnectX-3 and older adapters don't have to work. For example, Mellanox ConnectX-3 and older adapters don't have
Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as
root to list available RDMA devices and their features. root to list available RDMA devices and their features.
Remember that you also have to configure your network switches if you use
RoCE/RoCEv2, otherwise you may experience unstable performance. Refer to
the manual of your network vendor for details about setting up the switch
for RoCEv2 correctly. Usually it means setting up Lossless Ethernet with
PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
info_ru: | info_ru: |
Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0"). Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
Имейте в виду, что поддержка RDMA в Vitastor требует функций устройства Имейте в виду, что поддержка RDMA в Vitastor требует функций устройства
@@ -61,6 +67,13 @@
потому не поддерживаются в Vitastor. Запустите `ibv_devinfo -v` от имени потому не поддерживаются в Vitastor. Запустите `ibv_devinfo -v` от имени
суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их
параметры и возможности. параметры и возможности.
Обратите внимание, что если вы используете RoCE/RoCEv2, вам также необходимо
правильно настроить для него коммутаторы, иначе вы можете столкнуться с
нестабильной производительностью. Подробную информацию о настройке
коммутатора для RoCEv2 ищите в документации производителя. Обычно это
подразумевает настройку сети без потерь на основе PFC (Priority Flow
Control) и ECN (Explicit Congestion Notification).
- name: rdma_port_num - name: rdma_port_num
type: int type: int
default: 1 default: 1
@@ -114,26 +127,39 @@
так что менять этот параметр обычно не нужно. так что менять этот параметр обычно не нужно.
- name: rdma_max_msg - name: rdma_max_msg
type: int type: int
default: 1048576 default: 132096
info: Maximum size of a single RDMA send or receive operation in bytes. info: Maximum size of a single RDMA send or receive operation in bytes.
info_ru: Максимальный размер одной RDMA-операции отправки или приёма. info_ru: Максимальный размер одной RDMA-операции отправки или приёма.
- name: rdma_max_recv - name: rdma_max_recv
type: int
default: 16
info: |
Maximum number of RDMA receive buffers per connection (RDMA requires
preallocated buffers to receive data). Each buffer is `rdma_max_msg` bytes
in size. So this setting directly affects memory usage: a single Vitastor
RDMA client uses `rdma_max_recv * rdma_max_msg * OSD_COUNT` bytes of memory.
Default is roughly 2 MB * number of OSDs.
info_ru: |
Максимальное число буферов для RDMA-приёма данных на одно соединение
(RDMA требует заранее выделенных буферов для приёма данных). Каждый буфер
имеет размер `rdma_max_msg` байт. Таким образом, настройка прямо влияет на
потребление памяти - один Vitastor-клиент с RDMA использует
`rdma_max_recv * rdma_max_msg * ЧИСЛО_OSD` байт памяти, по умолчанию -
примерно 2 МБ * число OSD.
- name: rdma_max_send
type: int type: int
default: 8 default: 8
info: | info: |
Maximum number of parallel RDMA receive operations. Note that this number Maximum number of outstanding RDMA send operations per connection. Should be
of receive buffers `rdma_max_msg` in size are allocated for each client, less than `rdma_max_recv` so the receiving side doesn't run out of buffers.
so this setting actually affects memory usage. This is because RDMA receive Doesn't affect memory usage - additional memory isn't allocated for send
operations are (sadly) still not zero-copy in Vitastor. It may be fixed in operations.
later versions.
info_ru: | info_ru: |
Максимальное число параллельных RDMA-операций получения данных. Следует Максимальное число RDMA-операций отправки, отправляемых в очередь одного
иметь в виду, что данное число буферов размером `rdma_max_msg` выделяется соединения. Желательно, чтобы оно было меньше `rdma_max_recv`, чтобы
для каждого подключённого клиентского соединения, так что данная настройка у принимающей стороны в процессе работы не заканчивались буферы на приём.
влияет на потребление памяти. Это так потому, что RDMA-приём данных в Не влияет на потребление памяти - дополнительная память на операции отправки
Vitastor, увы, всё равно не является zero-copy, т.е. всё равно 1 раз не выделяется.
копирует данные в памяти. Данная особенность, возможно, будет исправлена в
более новых версиях Vitastor.
- name: peer_connect_interval - name: peer_connect_interval
type: sec type: sec
min: 1 min: 1

View File

@@ -22,14 +22,17 @@
- Add Vitastor package repository: - Add Vitastor package repository:
- CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release.rpm` - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release.rpm`
- CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release.rpm` - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release.rpm`
- AlmaLinux 9 and other RHEL 9 clones (Rocky, Oracle...): `dnf install https://vitastor.io/rpms/centos/9/vitastor-release.rpm`
- Enable EPEL: `yum/dnf install epel-release` - Enable EPEL: `yum/dnf install epel-release`
- Enable additional CentOS repositories: - Enable additional CentOS repositories:
- CentOS 7: `yum install centos-release-scl` - CentOS 7: `yum install centos-release-scl`
- CentOS 8: `dnf install centos-release-advanced-virtualization` - CentOS 8: `dnf install centos-release-advanced-virtualization`
- RHEL 9 clones: not required
- Enable elrepo-kernel: - Enable elrepo-kernel:
- CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm` - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
- CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm` - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
- Install packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm` - RHEL 9 clones: optional, not required: `dnf install https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm`
- Install packages: `yum/dnf install vitastor lpsolve etcd qemu-kvm` and optionally `kernel-ml` if you use elrepo-kernel
## Installation requirements ## Installation requirements

View File

@@ -6,10 +6,10 @@
# Proxmox VE # Proxmox VE
To enable Vitastor support in Proxmox Virtual Environment (6.4-7.3 are supported): To enable Vitastor support in Proxmox Virtual Environment (6.4-7.4 are supported):
- Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts: - Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts:
buster for 6.4, bullseye for 7.3, pve7.1 for 7.1, pve7.2 for 7.2 buster for 6.4, bullseye for 7.4, pve7.1 for 7.1, pve7.2 for 7.2, pve7.3 for 7.3
- Install vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from Vitastor repository - Install vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from Vitastor repository
- Define storage in `/etc/pve/storage.cfg` (see below) - Define storage in `/etc/pve/storage.cfg` (see below)
- Block network access from VMs to Vitastor network (to OSDs and etcd), - Block network access from VMs to Vitastor network (to OSDs and etcd),

View File

@@ -6,10 +6,10 @@
# Proxmox # Proxmox
Чтобы подключить Vitastor к Proxmox Virtual Environment (поддерживаются версии 6.4-7.3): Чтобы подключить Vitastor к Proxmox Virtual Environment (поддерживаются версии 6.4-7.4):
- Добавьте соответствующий Debian-репозиторий Vitastor в sources.list на хостах Proxmox: - Добавьте соответствующий Debian-репозиторий Vitastor в sources.list на хостах Proxmox:
buster для 6.4, bullseye для 7.3, pve7.1 для 7.1, pve7.2 для 7.2 buster для 6.4, bullseye для 7.4, pve7.1 для 7.1, pve7.2 для 7.2, pve7.3 для 7.3
- Установите пакеты vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* или см. сноску) из репозитория Vitastor - Установите пакеты vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* или см. сноску) из репозитория Vitastor
- Определите тип хранилища в `/etc/pve/storage.cfg` (см. ниже) - Определите тип хранилища в `/etc/pve/storage.cfg` (см. ниже)
- Обязательно заблокируйте доступ от виртуальных машин к сети Vitastor (OSD и etcd), т.к. Vitastor (пока) не поддерживает аутентификацию - Обязательно заблокируйте доступ от виртуальных машин к сети Vitastor (OSD и etcd), т.к. Vitastor (пока) не поддерживает аутентификацию

View File

@@ -45,7 +45,9 @@ On the monitor hosts:
} }
``` ```
- Initialize OSDs: - Initialize OSDs:
- SSD-only: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]` - SSD-only: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]`. You can add
`--disable_data_fsync off` to leave disk cache enabled if you use desktop
SSDs without capacitors.
- Hybrid, SSD+HDD: `vitastor-disk prepare --hybrid /dev/sdXXX [/dev/sdYYY ...]`. - Hybrid, SSD+HDD: `vitastor-disk prepare --hybrid /dev/sdXXX [/dev/sdYYY ...]`.
Pass all your devices (HDD and SSD) to this script &mdash; it will partition disks and initialize journals on its own. Pass all your devices (HDD and SSD) to this script &mdash; it will partition disks and initialize journals on its own.
This script skips HDDs which are already partitioned so if you want to use non-empty disks for This script skips HDDs which are already partitioned so if you want to use non-empty disks for
@@ -53,7 +55,9 @@ On the monitor hosts:
but some free unpartitioned space must be available because the script creates new partitions for journals. but some free unpartitioned space must be available because the script creates new partitions for journals.
- You can change OSD configuration in units or in `vitastor.conf`. - You can change OSD configuration in units or in `vitastor.conf`.
Check [Configuration Reference](../config.en.md) for parameter descriptions. Check [Configuration Reference](../config.en.md) for parameter descriptions.
- If all your drives have capacitors, create global configuration in etcd: \ - If all your drives have capacitors, and even if not, but if you ran `vitastor-disk`
without `--disable_data_fsync off` at the first step, then put the following
setting into etcd: \
`etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'` `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
- Start all OSDs: `systemctl start vitastor.target` - Start all OSDs: `systemctl start vitastor.target`
@@ -70,11 +74,15 @@ For EC pools the configuration should look like the following:
``` ```
etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool", etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
"scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}' "scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
``` ```
After you do this, one of the monitors will configure PGs and OSDs will start them. After you do this, one of the monitors will configure PGs and OSDs will start them.
If you use HDDs you should also add `"block_size": 1048576` to pool configuration.
The other option is to add it into /vitastor/config/global, in this case it will
apply to all pools by default.
## Check cluster status ## Check cluster status
`vitastor-cli status` `vitastor-cli status`

View File

@@ -45,7 +45,9 @@
} }
``` ```
- Инициализуйте OSD: - Инициализуйте OSD:
- SSD: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]` - SSD: `vitastor-disk prepare /dev/sdXXX [/dev/sdYYY ...]`. Если вы используете
десктопные SSD без конденсаторов, можете оставить кэш включённым, добавив
опцию `--disable_data_fsync off`.
- Гибридные, SSD+HDD: `vitastor-disk prepare --hybrid /dev/sdXXX [/dev/sdYYY ...]`. - Гибридные, SSD+HDD: `vitastor-disk prepare --hybrid /dev/sdXXX [/dev/sdYYY ...]`.
Передайте все ваши SSD и HDD скрипту в командной строке подряд, скрипт автоматически выделит Передайте все ваши SSD и HDD скрипту в командной строке подряд, скрипт автоматически выделит
разделы под журналы на SSD и данные на HDD. Скрипт пропускает HDD, на которых уже есть разделы разделы под журналы на SSD и данные на HDD. Скрипт пропускает HDD, на которых уже есть разделы
@@ -54,8 +56,11 @@
для журналов, на SSD должно быть доступно свободное нераспределённое место. для журналов, на SSD должно быть доступно свободное нераспределённое место.
- Вы можете менять параметры OSD в юнитах systemd или в `vitastor.conf`. Описания параметров - Вы можете менять параметры OSD в юнитах systemd или в `vitastor.conf`. Описания параметров
смотрите в [справке по конфигурации](../config.ru.md). смотрите в [справке по конфигурации](../config.ru.md).
- Если все ваши диски - серверные с конденсаторами, пропишите это в глобальную конфигурацию в etcd: \ - Если все ваши диски - серверные с конденсаторами, и даже если нет, но при этом
`etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'` вы не добавляли опцию `--disable_data_fsync off` на первом шаге, а `vitastor-disk`
не ругался на невозможность отключения кэша дисков, пропишите следующую настройку
в глобальную конфигурацию в etcd: \
`etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`.
- Запустите все OSD: `systemctl start vitastor.target` - Запустите все OSD: `systemctl start vitastor.target`
## Создайте пул ## Создайте пул
@@ -71,11 +76,15 @@ etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
``` ```
etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool", etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
"scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}' "scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
``` ```
После этого один из мониторов должен сконфигурировать PG, а OSD должны запустить их. После этого один из мониторов должен сконфигурировать PG, а OSD должны запустить их.
Если вы используете HDD-диски, то добавьте в конфигурацию пулов опцию `"block_size": 1048576`.
Также эту опцию можно добавить в /vitastor/config/global, в этом случае она будет
применяться ко всем пулам по умолчанию.
## Проверьте состояние кластера ## Проверьте состояние кластера
`vitastor-cli status` `vitastor-cli status`

View File

@@ -35,15 +35,24 @@ Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375. lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.
Implemented NVDIMM support can basically eliminate WA at all - all extra writes will
go to DRAM memory. But this requires a test cluster with NVDIMM - please contact me
if you want to provide me with such cluster for tests.
Lazy fsync also reduces WA for parallel workloads because journal blocks are only Lazy fsync also reduces WA for parallel workloads because journal blocks are only
written when they fill up or fsync is requested. written when they fill up or fsync is requested.
## In Practice ## In Practice
In practice, using tests from [Understanding Performance](understanding.en.md) In practice, using tests from [Understanding Performance](understanding.en.md), decent TCP network,
and good server-grade SSD/NVMe drives, you should head for: good server-grade SSD/NVMe drives and disabled CPU power saving, you should head for:
- At least 5000 T1Q1 replicated read and write iops (maximum 0.2ms latency) - At least 5000 T1Q1 replicated read and write iops (maximum 0.2ms latency)
- At least 5000 T1Q1 EC read IOPS and at least 2200 EC write IOPS (maximum 0.45ms latency)
- At least ~80k parallel read iops or ~30k write iops per 1 core (1 OSD) - At least ~80k parallel read iops or ~30k write iops per 1 core (1 OSD)
- Disk-speed or wire-speed linear reads and writes, whichever is the bottleneck in your case - Disk-speed or wire-speed linear reads and writes, whichever is the bottleneck in your case
Lower results may mean that you have bad drives, bad network or some kind of misconfiguration. Lower results may mean that you have bad drives, bad network or some kind of misconfiguration.
Current latency records:
- 9668 T1Q1 replicated write iops (0.103 ms latency) with TCP and NVMe
- 9143 T1Q1 replicated read iops (0.109 ms latency) with TCP and NVMe

View File

@@ -36,6 +36,25 @@ WA (мультипликатор записи) для 4 КБ блоков в Vit
Если вы найдёте SSD, хорошо работающий с 512-байтными блоками данных (Optane?), Если вы найдёте SSD, хорошо работающий с 512-байтными блоками данных (Optane?),
то 1, 3 и 4 можно снизить до 512 байт (1/8 от размера данных) и получить WA всего 2.375. то 1, 3 и 4 можно снизить до 512 байт (1/8 от размера данных) и получить WA всего 2.375.
Если реализовать поддержку NVDIMM, то WA можно, условно говоря, ликвидировать вообще - все
дополнительные операции записи смогут обслуживаться DRAM памятью. Но для этого необходим
тестовый кластер с NVDIMM - пишите, если готовы предоставить такой для тестов.
Кроме того, WA снижается при использовании отложенного/ленивого сброса при параллельной Кроме того, WA снижается при использовании отложенного/ленивого сброса при параллельной
нагрузке, т.к. блоки журнала записываются на диск только когда они заполняются или явным нагрузке, т.к. блоки журнала записываются на диск только когда они заполняются или явным
образом запрашивается fsync. образом запрашивается fsync.
## На практике
На практике, используя тесты fio со страницы [Понимание сути производительности систем хранения](understanding.ru.md),
нормальную TCP-сеть, хорошие серверные SSD/NVMe, при отключённом энергосбережении процессоров вы можете рассчитывать на:
- От 5000 IOPS в 1 поток (T1Q1) и на чтение, и на запись при использовании репликации (задержка до 0.2мс)
- От 5000 IOPS в 1 поток (T1Q1) на чтение и 2200 IOPS в 1 поток на запись при использовании EC (задержка до 0.45мс)
- От 80000 IOPS на чтение в параллельном режиме на 1 ядро, от 30000 IOPS на запись на 1 ядро (на 1 OSD)
- Скорость параллельного линейного чтения и записи, равная меньшему значению из скорости дисков или сети
Худшие результаты означают, что у вас либо медленные диски, либо медленная сеть, либо что-то неправильно настроено.
Зафиксированный на данный момент рекорд задержки:
- 9668 IOPS (0.103 мс задержка) в 1 поток (T1Q1) на запись с TCP и NVMe при использовании репликации
- 9143 IOPS (0.109 мс задержка) в 1 поток (T1Q1) на чтение с TCP и NVMe при использовании репликации

View File

@@ -1,4 +1,4 @@
[Documentation](../../README.md#documentation) → Usage → Disk Tool [Documentation](../../README.md#documentation) → Usage → Disk management tool
----- -----

View File

@@ -1,4 +1,4 @@
[Документация](../../README-ru.md#документация) → Использование → Управление дисками [Документация](../../README-ru.md#документация) → Использование → Инструмент управления дисками
----- -----

View File

@@ -43,16 +43,16 @@ function finish_pg_history(merged_history)
merged_history.all_peers = Object.values(merged_history.all_peers); merged_history.all_peers = Object.values(merged_history.all_peers);
} }
function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count) function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{ {
const old_pg_count = prev_pgs.length; const old_pg_count = real_prev_pgs.length;
// Add all possibly intersecting PGs to the history of new PGs // Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count)) if (!(new_pg_count % old_pg_count))
{ {
// New PG count is a multiple of old PG count // New PG count is a multiple of old PG count
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
{ {
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count); add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
finish_pg_history(new_pg_history[i]); finish_pg_history(new_pg_history[i]);
} }
} }
@@ -64,7 +64,7 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{ {
for (let j = 0; j < mul; j++) for (let j = 0; j < mul; j++)
{ {
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count); add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
} }
finish_pg_history(new_pg_history[i]); finish_pg_history(new_pg_history[i]);
} }
@@ -76,7 +76,7 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
let merged_history = {}; let merged_history = {};
for (let i = 0; i < old_pg_count; i++) for (let i = 0; i < old_pg_count; i++)
{ {
add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i); add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
} }
finish_pg_history(merged_history[1]); finish_pg_history(merged_history[1]);
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
@@ -90,15 +90,15 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
new_pg_history[i] = null; new_pg_history[i] = null;
} }
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (old_pg_count < new_pg_count) if (prev_pgs.length < new_pg_count)
{ {
for (let i = old_pg_count; i < new_pg_count; i++) for (let i = prev_pgs.length; i < new_pg_count; i++)
{ {
prev_pgs[i] = prev_pgs[i % old_pg_count]; prev_pgs[i] = prev_pgs[i % prev_pgs.length];
} }
} }
else if (old_pg_count > new_pg_count) else if (prev_pgs.length > new_pg_count)
{ {
prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count); prev_pgs.splice(new_pg_count, prev_pgs.length-new_pg_count);
} }
} }

View File

@@ -13,7 +13,7 @@ for (let i = 2; i < process.argv.length; i++)
{ {
console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' [--verbose 1]'+ console.error('USAGE: '+process.argv[0]+' '+process.argv[1]+' [--verbose 1]'+
' [--etcd_address "http://127.0.0.1:2379,..."] [--config_path /etc/vitastor/vitastor.conf]'+ ' [--etcd_address "http://127.0.0.1:2379,..."] [--config_path /etc/vitastor/vitastor.conf]'+
' [--etcd_prefix "/vitastor"] [--etcd_start_timeout 5]'); ' [--etcd_prefix "/vitastor"] [--etcd_start_timeout 5] [--restart_interval 5]');
process.exit(); process.exit();
} }
else if (process.argv[i].substr(0, 2) == '--') else if (process.argv[i].substr(0, 2) == '--')

View File

@@ -51,8 +51,9 @@ const etcd_tree = {
// THIS IS JUST A POOR MAN'S CONFIG DOCUMENTATION // THIS IS JUST A POOR MAN'S CONFIG DOCUMENTATION
// etcd connection // etcd connection
config_path: "/etc/vitastor/vitastor.conf", config_path: "/etc/vitastor/vitastor.conf",
etcd_address: "10.0.115.10:2379/v3",
etcd_prefix: "/vitastor", etcd_prefix: "/vitastor",
// etcd connection - configurable online
etcd_address: "10.0.115.10:2379/v3",
// mon // mon
etcd_mon_ttl: 30, // min: 10 etcd_mon_ttl: 30, // min: 10
etcd_mon_timeout: 1000, // ms. min: 0 etcd_mon_timeout: 1000, // ms. min: 0
@@ -70,14 +71,15 @@ const etcd_tree = {
rdma_gid_index: 0, rdma_gid_index: 0,
rdma_mtu: 4096, rdma_mtu: 4096,
rdma_max_sge: 128, rdma_max_sge: 128,
rdma_max_send: 32, rdma_max_send: 8,
rdma_max_recv: 8, rdma_max_recv: 16,
rdma_max_msg: 1048576, rdma_max_msg: 132096,
log_level: 0,
block_size: 131072, block_size: 131072,
disk_alignment: 4096, disk_alignment: 4096,
bitmap_granularity: 4096, bitmap_granularity: 4096,
immediate_commit: false, // 'all' or 'small' immediate_commit: false, // 'all' or 'small'
// client and osd - configurable online
log_level: 0,
client_dirty_limit: 33554432, client_dirty_limit: 33554432,
peer_connect_interval: 5, // seconds. min: 1 peer_connect_interval: 5, // seconds. min: 1
peer_connect_timeout: 5, // seconds. min: 1 peer_connect_timeout: 5, // seconds. min: 1
@@ -95,18 +97,19 @@ const etcd_tree = {
osd_network: null, // "192.168.7.0/24" or an array of masks osd_network: null, // "192.168.7.0/24" or an array of masks
bind_address: "0.0.0.0", bind_address: "0.0.0.0",
bind_port: 0, bind_port: 0,
readonly: false,
osd_memlock: false,
// osd - configurable online
autosync_interval: 5, autosync_interval: 5,
autosync_writes: 128, autosync_writes: 128,
client_queue_depth: 128, // unused client_queue_depth: 128, // unused
recovery_queue_depth: 4, recovery_queue_depth: 4,
recovery_sync_batch: 16, recovery_sync_batch: 16,
readonly: false,
no_recovery: false, no_recovery: false,
no_rebalance: false, no_rebalance: false,
print_stats_interval: 3, print_stats_interval: 3,
slow_log_interval: 10, slow_log_interval: 10,
inode_vanish_time: 60, inode_vanish_time: 60,
osd_memlock: false,
// blockstore - fixed in superblock // blockstore - fixed in superblock
block_size, block_size,
disk_alignment, disk_alignment,
@@ -125,14 +128,15 @@ const etcd_tree = {
meta_offset, meta_offset,
disable_meta_fsync, disable_meta_fsync,
disable_device_lock, disable_device_lock,
// blockstore - configurable // blockstore - configurable offline
max_write_iodepth,
min_flusher_count: 1,
max_flusher_count: 256,
inmemory_metadata, inmemory_metadata,
inmemory_journal, inmemory_journal,
journal_sector_buffer_count, journal_sector_buffer_count,
journal_no_same_sector_overwrites, journal_no_same_sector_overwrites,
// blockstore - configurable online
max_write_iodepth,
min_flusher_count: 1,
max_flusher_count: 256,
throttle_small_writes: false, throttle_small_writes: false,
throttle_target_iops: 100, throttle_target_iops: 100,
throttle_target_mbs: 100, throttle_target_mbs: 100,
@@ -557,7 +561,7 @@ class Mon
} }
if (!this.ws) if (!this.ws)
{ {
this.die('Failed to open etcd watch websocket'); await this.die('Failed to open etcd watch websocket');
} }
const cur_addr = this.selected_etcd_url; const cur_addr = this.selected_etcd_url;
this.ws_alive = true; this.ws_alive = true;
@@ -724,7 +728,7 @@ class Mon
const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries); const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
if (!res.result.TTL) if (!res.result.TTL)
{ {
this.die('Lease expired'); await this.die('Lease expired');
} }
}, this.config.etcd_mon_timeout); }, this.config.etcd_mon_timeout);
if (!this.signals_set) if (!this.signals_set)
@@ -737,9 +741,32 @@ class Mon
async on_stop(status) async on_stop(status)
{ {
clearInterval(this.lease_timer); if (this.ws_keepalive_timer)
await this.etcd_call('/lease/revoke', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries); {
process.exit(status); clearInterval(this.ws_keepalive_timer);
this.ws_keepalive_timer = null;
}
if (this.lease_timer)
{
clearInterval(this.lease_timer);
this.lease_timer = null;
}
if (this.etcd_lease_id)
{
const lease_id = this.etcd_lease_id;
this.etcd_lease_id = null;
await this.etcd_call('/lease/revoke', { ID: lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
}
if (!status || !this.initConfig.restart_interval)
{
process.exit(status);
}
else
{
console.log('Restarting after '+this.initConfig.restart_interval+' seconds');
await new Promise(ok => setTimeout(ok, this.initConfig.restart_interval*1000));
await this.start();
}
} }
async become_master() async become_master()
@@ -952,7 +979,7 @@ class Mon
return alive_set[this.rng() % alive_set.length]; return alive_set[this.rng() % alive_set.length];
} }
save_new_pgs_txn(request, pool_id, up_osds, osd_tree, prev_pgs, new_pgs, pg_history) save_new_pgs_txn(save_to, request, pool_id, up_osds, osd_tree, prev_pgs, new_pgs, pg_history)
{ {
const aff_osds = this.get_affinity_osds(this.state.config.pools[pool_id], up_osds, osd_tree); const aff_osds = this.get_affinity_osds(this.state.config.pools[pool_id], up_osds, osd_tree);
const pg_items = {}; const pg_items = {};
@@ -1005,14 +1032,14 @@ class Mon
}); });
} }
} }
this.state.config.pgs.items = this.state.config.pgs.items || {}; save_to.items = save_to.items || {};
if (!new_pgs.length) if (!new_pgs.length)
{ {
delete this.state.config.pgs.items[pool_id]; delete save_to.items[pool_id];
} }
else else
{ {
this.state.config.pgs.items[pool_id] = pg_items; save_to.items[pool_id] = pg_items;
} }
} }
@@ -1156,6 +1183,7 @@ class Mon
if (this.state.config.pgs.hash != tree_hash) if (this.state.config.pgs.hash != tree_hash)
{ {
// Something has changed // Something has changed
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
const etcd_request = { compare: [], success: [] }; const etcd_request = { compare: [], success: [] };
for (const pool_id in (this.state.config.pgs||{}).items||{}) for (const pool_id in (this.state.config.pgs||{}).items||{})
{ {
@@ -1176,7 +1204,7 @@ class Mon
etcd_request.success.push({ requestDeleteRange: { etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id), key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
} }); } });
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []); this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
} }
} }
for (const pool_id in this.state.config.pools) for (const pool_id in this.state.config.pools)
@@ -1230,7 +1258,7 @@ class Mon
return; return;
} }
const new_pg_history = []; const new_pg_history = [];
PGUtil.scale_pg_count(prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count); PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
pg_history = new_pg_history; pg_history = new_pg_history;
} }
for (const pg of prev_pgs) for (const pg of prev_pgs)
@@ -1283,14 +1311,15 @@ class Mon
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id), key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])), value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} }); } });
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history); this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
} }
this.state.config.pgs.hash = tree_hash; new_config_pgs.hash = tree_hash;
await this.save_pg_config(etcd_request); await this.save_pg_config(new_config_pgs, etcd_request);
} }
else else
{ {
// Nothing changed, but we still want to recheck the distribution of primaries // Nothing changed, but we still want to recheck the distribution of primaries
let new_config_pgs;
let changed = false; let changed = false;
for (const pool_id in this.state.config.pools) for (const pool_id in this.state.config.pools)
{ {
@@ -1310,31 +1339,35 @@ class Mon
const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds, aff_osds); const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds, aff_osds);
if (pg_cfg.primary != new_primary) if (pg_cfg.primary != new_primary)
{ {
if (!new_config_pgs)
{
new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
}
console.log( console.log(
`Moving pool ${pool_id} (${pool_cfg.name || 'unnamed'}) PG ${pg_num}`+ `Moving pool ${pool_id} (${pool_cfg.name || 'unnamed'}) PG ${pg_num}`+
` primary OSD from ${pg_cfg.primary} to ${new_primary}` ` primary OSD from ${pg_cfg.primary} to ${new_primary}`
); );
changed = true; changed = true;
pg_cfg.primary = new_primary; new_config_pgs.items[pool_id][pg_num].primary = new_primary;
} }
} }
} }
} }
if (changed) if (changed)
{ {
await this.save_pg_config(); await this.save_pg_config(new_config_pgs);
} }
} }
} }
async save_pg_config(etcd_request = { compare: [], success: [] }) async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
{ {
etcd_request.compare.push( etcd_request.compare.push(
{ key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id }, { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
{ key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' }, { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
); );
etcd_request.success.push( etcd_request.success.push(
{ requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(this.state.config.pgs)) } }, { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
); );
const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0); const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
if (!res.succeeded) if (!res.succeeded)
@@ -1761,14 +1794,13 @@ class Mon
return res.json; return res.json;
} }
} }
this.die(); await this.die();
} }
_die(err) async _die(err)
{ {
// In fact we can just try to rejoin
console.error(new Error(err || 'Cluster connection failed')); console.error(new Error(err || 'Cluster connection failed'));
process.exit(1); await this.on_stop(1);
} }
local_ips(all) local_ips(all)
@@ -1813,6 +1845,7 @@ function POST(url, body, timeout)
clearTimeout(timer_id); clearTimeout(timer_id);
let res_body = ''; let res_body = '';
res.setEncoding('utf8'); res.setEncoding('utf8');
res.on('error', no);
res.on('data', chunk => { res_body += chunk; }); res.on('data', chunk => { res_body += chunk; });
res.on('end', () => res.on('end', () =>
{ {
@@ -1832,6 +1865,8 @@ function POST(url, body, timeout)
} }
}); });
}); });
req.on('error', no);
req.on('close', () => no(new Error('Connection closed prematurely')));
req.write(body_text); req.write(body_text);
req.end(); req.end();
}); });

View File

@@ -15,4 +15,4 @@ StartLimitInterval=0
RestartSec=10 RestartSec=10
[Install] [Install]
WantedBy=vitastor.target WantedBy=multi-user.target

View File

@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver from cinder.volume import driver
from cinder.volume import volume_utils from cinder.volume import volume_utils
VERSION = '0.8.5' VERSION = '0.8.8'
LOG = logging.getLogger(__name__) LOG = logging.getLogger(__name__)

View File

@@ -0,0 +1,169 @@
Index: pve-qemu-kvm-7.2.0/block/meson.build
===================================================================
--- pve-qemu-kvm-7.2.0.orig/block/meson.build
+++ pve-qemu-kvm-7.2.0/block/meson.build
@@ -113,6 +113,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
Index: pve-qemu-kvm-7.2.0/meson.build
===================================================================
--- pve-qemu-kvm-7.2.0.orig/meson.build
+++ pve-qemu-kvm-7.2.0/meson.build
@@ -1026,6 +1026,26 @@ if not get_option('rbd').auto() or have_
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1865,6 +1885,7 @@ config_host_data.set('CONFIG_NUMA', numa
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_PROFILER', get_option('profiler'))
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
@@ -3957,6 +3978,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
Index: pve-qemu-kvm-7.2.0/meson_options.txt
===================================================================
--- pve-qemu-kvm-7.2.0.orig/meson_options.txt
+++ pve-qemu-kvm-7.2.0/meson_options.txt
@@ -169,6 +169,8 @@ option('lzo', type : 'feature', value :
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
Index: pve-qemu-kvm-7.2.0/qapi/block-core.json
===================================================================
--- pve-qemu-kvm-7.2.0.orig/qapi/block-core.json
+++ pve-qemu-kvm-7.2.0/qapi/block-core.json
@@ -3213,7 +3213,7 @@
'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
'pbs',
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
{ 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4223,6 +4223,28 @@
'*server': ['InetSocketAddressBase'] } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4671,6 +4693,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'virtio-blk-vfio-pci':
{ 'type': 'BlockdevOptionsVirtioBlkVfioPci',
'if': 'CONFIG_BLKIO' },
@@ -5072,6 +5095,17 @@
'*encrypt' : 'RbdEncryptionCreateOptions' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -5269,6 +5303,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
Index: pve-qemu-kvm-7.2.0/scripts/ci/org.centos/stream/8/x86_64/configure
===================================================================
--- pve-qemu-kvm-7.2.0.orig/scripts/ci/org.centos/stream/8/x86_64/configure
+++ pve-qemu-kvm-7.2.0/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -31,7 +31,7 @@
--with-git=meson \
--with-git-submodules=update \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -179,6 +179,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \

View File

@@ -0,0 +1,169 @@
diff --git a/block/meson.build b/block/meson.build
index deb73ca389..e269f599a1 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -78,6 +78,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index 96de1a6ef9..2e3994777d 100644
--- a/meson.build
+++ b/meson.build
@@ -838,6 +838,26 @@ if not get_option('rbd').auto() or have_block
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1455,6 +1475,7 @@ config_host_data.set('CONFIG_LINUX_AIO', libaio.found())
config_host_data.set('CONFIG_LINUX_IO_URING', linux_io_uring.found())
config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
config_host_data.set('CONFIG_SECCOMP', seccomp.found())
@@ -3412,6 +3433,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'xfsctl support': config_host.has_key('CONFIG_XFS')}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
diff --git a/meson_options.txt b/meson_options.txt
index e392323732..5b56007475 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -121,6 +121,8 @@ option('lzo', type : 'feature', value : 'auto',
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('gtk', type : 'feature', value : 'auto',
description: 'GTK+ user interface')
option('sdl', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1d3dd9cb48..88453405e5 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2930,7 +2930,7 @@
'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor', 'vmdk', 'vpc', 'vvfat' ] }
##
# @BlockdevOptionsFile:
@@ -3864,6 +3864,28 @@
'*key-secret': 'str',
'*server': ['InetSocketAddressBase'] } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -4259,6 +4281,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'vmdk': 'BlockdevOptionsGenericCOWFormat',
'vpc': 'BlockdevOptionsGenericFormat',
'vvfat': 'BlockdevOptionsVVFAT'
@@ -4647,6 +4670,17 @@
'*cluster-size' : 'size',
'*encrypt' : 'RbdEncryptionCreateOptions' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVmdkSubformat:
#
@@ -4846,6 +4880,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 7a17ff4218..cdddbf32aa 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -69,6 +69,7 @@ meson_options_help() {
printf "%s\n" ' oss OSS sound support'
printf "%s\n" ' pa PulseAudio sound support'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' sdl SDL user interface'
printf "%s\n" ' sdl-image SDL Image support for icons'
printf "%s\n" ' seccomp seccomp support'
@@ -210,6 +211,8 @@ _meson_option_parse() {
--disable-pa) printf "%s" -Dpa=disabled ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-sdl) printf "%s" -Dsdl=enabled ;;
--disable-sdl) printf "%s" -Dsdl=disabled ;;
--enable-sdl-image) printf "%s" -Dsdl_image=enabled ;;

View File

@@ -0,0 +1,190 @@
diff --git a/block/meson.build b/block/meson.build
index 0b2a60c99b..d923713804 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -98,6 +98,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index 861de93c4f..272f72af11 100644
--- a/meson.build
+++ b/meson.build
@@ -884,6 +884,26 @@ if not get_option('rbd').auto() or have_block
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1546,6 +1566,7 @@ config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
config_host_data.set('CONFIG_NUMA', numa.found())
config_host_data.set('CONFIG_PROFILER', get_option('profiler'))
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
config_host_data.set('CONFIG_SECCOMP', seccomp.found())
@@ -3709,6 +3730,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
diff --git a/meson_options.txt b/meson_options.txt
index 52b11cead4..d8d0868174 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -149,6 +149,8 @@ option('lzo', type : 'feature', value : 'auto',
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('gtk', type : 'feature', value : 'auto',
description: 'GTK+ user interface')
option('sdl', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index beeb91952a..1c98dc0e12 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2929,7 +2929,7 @@
'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor', 'vmdk', 'vpc', 'vvfat' ] }
##
# @BlockdevOptionsFile:
@@ -3863,6 +3863,28 @@
'*key-secret': 'str',
'*server': ['InetSocketAddressBase'] } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -4277,6 +4299,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'vmdk': 'BlockdevOptionsGenericCOWFormat',
'vpc': 'BlockdevOptionsGenericFormat',
'vvfat': 'BlockdevOptionsVVFAT'
@@ -4665,6 +4688,17 @@
'*cluster-size' : 'size',
'*encrypt' : 'RbdEncryptionCreateOptions' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVmdkSubformat:
#
@@ -4864,6 +4898,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index 9850dd4444..72b1287520 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -31,7 +31,7 @@
--with-git=meson \
--with-git-submodules=update \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -181,6 +181,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 1e26f4571e..370898d48c 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -98,6 +98,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
printf "%s\n" ' sdl-image SDL Image support for icons'
@@ -289,6 +290,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;
--disable-replication) printf "%s" -Dreplication=disabled ;;
--enable-rng-none) printf "%s" -Drng_none=true ;;

View File

@@ -0,0 +1,190 @@
diff --git a/block/meson.build b/block/meson.build
index 60bc305597..89a042216f 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -98,6 +98,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index 20fddbd707..600db4e2fb 100644
--- a/meson.build
+++ b/meson.build
@@ -967,6 +967,26 @@ if not get_option('rbd').auto() or have_block
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1799,6 +1819,7 @@ config_host_data.set('CONFIG_NUMA', numa.found())
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_PROFILER', get_option('profiler'))
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
@@ -3954,6 +3975,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
diff --git a/meson_options.txt b/meson_options.txt
index e58e158396..9747b38fd0 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -167,6 +167,8 @@ option('lzo', type : 'feature', value : 'auto',
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2173e7734a..5a4900b322 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2955,7 +2955,7 @@
'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor', 'vmdk', 'vpc', 'vvfat' ] }
##
# @BlockdevOptionsFile:
@@ -3883,6 +3883,28 @@
'*key-secret': 'str',
'*server': ['InetSocketAddressBase'] } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -4327,6 +4349,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'vmdk': 'BlockdevOptionsGenericCOWFormat',
'vpc': 'BlockdevOptionsGenericFormat',
'vvfat': 'BlockdevOptionsVVFAT'
@@ -4717,6 +4740,17 @@
'*cluster-size' : 'size',
'*encrypt' : 'RbdEncryptionCreateOptions' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVmdkSubformat:
#
@@ -4915,6 +4949,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index a7f92aff90..53dc55be2e 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -31,7 +31,7 @@
--with-git=meson \
--with-git-submodules=update \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -179,6 +179,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 359b04e0e6..f5b85ba78c 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -135,6 +135,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' rdma Enable RDMA-based migration'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
@@ -370,6 +371,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-rdma) printf "%s" -Drdma=enabled ;;
--disable-rdma) printf "%s" -Drdma=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;

View File

@@ -0,0 +1,190 @@
diff --git a/block/meson.build b/block/meson.build
index b7c68b83a3..95d8a6f15d 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -100,6 +100,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index 5c6b5a1c75..f31f73612e 100644
--- a/meson.build
+++ b/meson.build
@@ -1026,6 +1026,26 @@ if not get_option('rbd').auto() or have_block
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1861,6 +1881,7 @@ config_host_data.set('CONFIG_NUMA', numa.found())
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_PROFILER', get_option('profiler'))
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
@@ -3945,6 +3966,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
diff --git a/meson_options.txt b/meson_options.txt
index 4b749ca549..6b37bd6b77 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -169,6 +169,8 @@ option('lzo', type : 'feature', value : 'auto',
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 95ac4fa634..7a240827e4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2959,7 +2959,7 @@
'parallels', 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum',
'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
{ 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -3957,6 +3957,28 @@
'*key-secret': 'str',
'*server': ['InetSocketAddressBase'] } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -4405,6 +4427,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'virtio-blk-vfio-pci':
{ 'type': 'BlockdevOptionsVirtioBlkVfioPci',
'if': 'CONFIG_BLKIO' },
@@ -4804,6 +4827,17 @@
'*cluster-size' : 'size',
'*encrypt' : 'RbdEncryptionCreateOptions' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVmdkSubformat:
#
@@ -5002,6 +5036,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index a7f92aff90..53dc55be2e 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -31,7 +31,7 @@
--with-git=meson \
--with-git-submodules=update \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -179,6 +179,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index aa6e30ea91..c45d21c40f 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -135,6 +135,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' rdma Enable RDMA-based migration'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
@@ -376,6 +377,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-rdma) printf "%s" -Drdma=enabled ;;
--disable-rdma) printf "%s" -Drdma=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;

View File

@@ -0,0 +1,190 @@
diff --git a/block/meson.build b/block/meson.build
index 382bec0e7d..af6207dbce 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -101,6 +101,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index c44d05a13f..ebedb42843 100644
--- a/meson.build
+++ b/meson.build
@@ -1028,6 +1028,26 @@ if not get_option('rbd').auto() or have_block
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'), kwargs: static_kwargs)
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -1878,6 +1898,7 @@ endif
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_PROFILER', get_option('profiler'))
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SDL', sdl.found())
config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
@@ -4002,6 +4023,7 @@ if spice_protocol.found()
summary_info += {' spice server support': spice}
endif
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
diff --git a/meson_options.txt b/meson_options.txt
index fc9447d267..c4ac55c283 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -173,6 +173,8 @@ option('lzo', type : 'feature', value : 'auto',
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index c05ad0c07e..f5eb701604 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3054,7 +3054,7 @@
'parallels', 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum',
'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
{ 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4073,6 +4073,28 @@
'*key-secret': 'str',
'*server': ['InetSocketAddressBase'] } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -4521,6 +4543,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'virtio-blk-vfio-pci':
{ 'type': 'BlockdevOptionsVirtioBlkVfioPci',
'if': 'CONFIG_BLKIO' },
@@ -4920,6 +4943,17 @@
'*cluster-size' : 'size',
'*encrypt' : 'RbdEncryptionCreateOptions' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVmdkSubformat:
#
@@ -5118,6 +5152,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index 6e8983f39c..1b0b9fcf3e 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -32,7 +32,7 @@
--with-git=meson \
--with-git-submodules=update \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -179,6 +179,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 009fab1515..95914e6ebc 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -142,6 +142,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' rdma Enable RDMA-based migration'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
@@ -388,6 +389,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-rdma) printf "%s" -Drdma=enabled ;;
--disable-rdma) printf "%s" -Drdma=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;

View File

@@ -7,13 +7,12 @@ set -e
VITASTOR=$(dirname $0) VITASTOR=$(dirname $0)
VITASTOR=$(realpath "$VITASTOR/..") VITASTOR=$(realpath "$VITASTOR/..")
if [ -d /opt/rh/gcc-toolset-9 ]; then EL=$(rpm --eval '%dist')
if [ "$EL" = ".el8" ]; then
# CentOS 8 # CentOS 8
EL=8
. /opt/rh/gcc-toolset-9/enable . /opt/rh/gcc-toolset-9/enable
else elif [ "$EL" = ".el7" ]; then
# CentOS 7 # CentOS 7
EL=7
. /opt/rh/devtoolset-9/enable . /opt/rh/devtoolset-9/enable
fi fi
cd ~/rpmbuild/SPECS cd ~/rpmbuild/SPECS
@@ -25,4 +24,4 @@ rm fio
mv fio-copy fio mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'` FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.8.5/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.8.5$(rpm --eval '%dist').tar.gz * tar --transform 's#^#vitastor-0.8.8/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.8.8$(rpm --eval '%dist').tar.gz *

View File

@@ -0,0 +1,93 @@
--- qemu-kvm.spec.orig 2023-02-28 08:04:06.000000000 +0000
+++ qemu-kvm.spec 2023-04-27 22:29:18.094878829 +0000
@@ -100,8 +100,6 @@
%endif
%global target_list %{kvm_target}-softmmu
-%global block_drivers_rw_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,compress
-%global block_drivers_ro_list vdi,vmdk,vhdx,vpc,https
%define qemudocdir %{_docdir}/%{name}
%global firmwaredirs "%{_datadir}/qemu-firmware:%{_datadir}/ipxe/qemu:%{_datadir}/seavgabios:%{_datadir}/seabios"
@@ -129,6 +127,7 @@ Requires: %{name}-device-usb-host = %{ep
Requires: %{name}-device-usb-redirect = %{epoch}:%{version}-%{release} \
%endif \
Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
+Requires: %{name}-block-vitastor = %{epoch}:%{version}-%{release}\
Requires: %{name}-audio-pa = %{epoch}:%{version}-%{release}
# Since SPICE is removed from RHEL-9, the following Obsoletes:
@@ -151,7 +150,7 @@ Obsoletes: %{name}-block-ssh <= %{epoch}
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 7.0.0
-Release: 13%{?rcrel}%{?dist}%{?cc_suffix}.2
+Release: 13.vitastor%{?rcrel}%{?dist}%{?cc_suffix}
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
# Epoch 15 used for RHEL 8
# Epoch 17 used for RHEL 9 (due to release versioning offset in RHEL 8.5)
@@ -174,6 +173,7 @@ Source28: 95-kvm-memlock.conf
Source30: kvm-s390x.conf
Source31: kvm-x86.conf
Source36: README.tests
+Source37: qemu-vitastor.c
Patch0004: 0004-Initial-redhat-build.patch
@@ -498,6 +498,7 @@ Patch171: kvm-i386-do-kvm_put_msr_featur
Patch172: kvm-target-i386-kvm-fix-kvmclock_current_nsec-Assertion-.patch
# For bz#2168221 - while live-migrating many instances concurrently, libvirt sometimes return internal error: migration was active, but no RAM info was set [rhel-9.1.0.z]
Patch173: kvm-migration-Read-state-once.patch
+Patch174: qemu-7.0-vitastor.patch
# Source-git patches
@@ -531,6 +532,7 @@ BuildRequires: libcurl-devel
%if %{have_block_rbd}
BuildRequires: librbd-devel
%endif
+BuildRequires: vitastor-client-devel
# We need both because the 'stap' binary is probed for by configure
BuildRequires: systemtap
BuildRequires: systemtap-sdt-devel
@@ -718,6 +720,14 @@ using the rbd protocol.
%endif
+%package block-vitastor
+Summary: QEMU Vitastor block driver
+Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
+
+%description block-vitastor
+This package provides the additional Vitastor block driver for QEMU.
+
+
%package audio-pa
Summary: QEMU PulseAudio audio driver
Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
@@ -811,6 +821,7 @@ This package provides usbredir support.
%prep
%setup -q -n qemu-%{version}%{?rcstr}
%autopatch -p1
+cp %{SOURCE37} ./block/vitastor.c
%global qemu_kvm_build qemu_kvm_build
mkdir -p %{qemu_kvm_build}
@@ -1032,6 +1043,7 @@ run_configure \
%if %{have_block_rbd}
--enable-rbd \
%endif
+ --enable-vitastor \
%if %{have_librdma}
--enable-rdma \
%endif
@@ -1511,6 +1523,9 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%files block-rbd
%{_libdir}/%{name}/block-rbd.so
%endif
+%files block-vitastor
+%{_libdir}/%{name}/block-vitastor.so
+
%files audio-pa
%{_libdir}/%{name}/audio-pa.so

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.8.5.el7.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.8.8.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.8.5 Version: 0.8.8
Release: 1%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.8.5.el7.tar.gz Source0: vitastor-0.8.8.el7.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.8.5.el8.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.8.8.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.8.5 Version: 0.8.8
Release: 1%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.8.5.el8.tar.gz Source0: vitastor-0.8.8.el8.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel

View File

@@ -0,0 +1,28 @@
# Build packages for AlmaLinux 9 inside a container
# cd ..; podman build -t vitastor-el9 -v `pwd`/packages:/root/packages -f rpm/vitastor-el9.Dockerfile .
FROM almalinux:9
WORKDIR /root
RUN sed -i 's/enabled=0/enabled=1/' /etc/yum.repos.d/*.repo
RUN dnf -y install epel-release dnf-plugins-core
RUN dnf -y install https://vitastor.io/rpms/centos/9/vitastor-release-1.0-1.el9.noarch.rpm
RUN dnf -y install gcc-c++ gperftools-devel fio nodejs rpm-build jerasure-devel libisa-l-devel gf-complete-devel rdma-core-devel libarchive liburing-devel cmake
RUN dnf download --source fio
RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --spec fio.spec
ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.8.8.el9.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \
mkdir -p /root/packages/vitastor-el9; \
rm -rf /root/packages/vitastor-el9/*; \
cp ~/rpmbuild/RPMS/*/vitastor* /root/packages/vitastor-el9/; \
cp ~/rpmbuild/SRPMS/vitastor* /root/packages/vitastor-el9/

158
rpm/vitastor-el9.spec Normal file
View File

@@ -0,0 +1,158 @@
Name: vitastor
Version: 0.8.8
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.8.8.el9.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
BuildRequires: gcc-c++
BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel
BuildRequires: libisa-l-devel
BuildRequires: gf-complete-devel
BuildRequires: rdma-core-devel
BuildRequires: cmake
Requires: vitastor-osd = %{version}-%{release}
Requires: vitastor-mon = %{version}-%{release}
Requires: vitastor-client = %{version}-%{release}
Requires: vitastor-client-devel = %{version}-%{release}
Requires: vitastor-fio = %{version}-%{release}
%description
Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
architecturally similar to Ceph which means strong consistency, primary-replication,
symmetric clustering and automatic data distribution over any number of drives of any
size with configurable redundancy (replication or erasure codes/XOR).
%package -n vitastor-osd
Summary: Vitastor - OSD
Requires: vitastor-client = %{version}-%{release}
Requires: util-linux
Requires: parted
%description -n vitastor-osd
Vitastor object storage daemon, i.e. server program that stores data.
%package -n vitastor-mon
Summary: Vitastor - monitor
Requires: nodejs >= 10
Requires: lpsolve
%description -n vitastor-mon
Vitastor monitor, i.e. server program responsible for watching cluster state and
scheduling cluster-level operations.
%package -n vitastor-client
Summary: Vitastor - client
%description -n vitastor-client
Vitastor client library and command-line interface.
%package -n vitastor-client-devel
Summary: Vitastor - development files
Group: Development/Libraries
Requires: vitastor-client = %{version}-%{release}
%description -n vitastor-client-devel
Vitastor library headers for development.
%package -n vitastor-fio
Summary: Vitastor - fio drivers
Group: Development/Libraries
Requires: vitastor-client = %{version}-%{release}
Requires: fio = 3.27-7.el9
%description -n vitastor-fio
Vitastor fio drivers for benchmarking.
%prep
%setup -q
%build
%cmake
%cmake_build
%install
rm -rf $RPM_BUILD_ROOT
%cmake_install
cd mon
npm install
cd ..
mkdir -p %buildroot/usr/lib/vitastor
cp -r mon %buildroot/usr/lib/vitastor
mkdir -p %buildroot/lib/systemd/system
cp mon/vitastor.target mon/vitastor-mon.service mon/vitastor-osd@.service %buildroot/lib/systemd/system
mkdir -p %buildroot/lib/udev/rules.d
cp mon/90-vitastor.rules %buildroot/lib/udev/rules.d
%files
%doc GPL-2.0.txt VNPL-1.1.txt README.md README-ru.md
%files -n vitastor-osd
%_bindir/vitastor-osd
%_bindir/vitastor-disk
%_bindir/vitastor-dump-journal
/lib/systemd/system/vitastor-osd@.service
/lib/systemd/system/vitastor.target
/lib/udev/rules.d/90-vitastor.rules
%pre -n vitastor-osd
groupadd -r -f vitastor 2>/dev/null ||:
useradd -r -g vitastor -s /sbin/nologin -c "Vitastor daemons" -M -d /nonexistent vitastor 2>/dev/null ||:
install -o vitastor -g vitastor -d /var/log/vitastor
mkdir -p /etc/vitastor
%files -n vitastor-mon
/usr/lib/vitastor/mon
/lib/systemd/system/vitastor-mon.service
%pre -n vitastor-mon
groupadd -r -f vitastor 2>/dev/null ||:
useradd -r -g vitastor -s /sbin/nologin -c "Vitastor daemons" -M -d /nonexistent vitastor 2>/dev/null ||:
mkdir -p /etc/vitastor
%files -n vitastor-client
%_bindir/vitastor-nbd
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vita
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%files -n vitastor-client-devel
%_includedir/vitastor_c.h
%_libdir/pkgconfig
%files -n vitastor-fio
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%changelog

View File

@@ -1,4 +1,4 @@
cmake_minimum_required(VERSION 2.8) cmake_minimum_required(VERSION 2.8.12)
project(vitastor) project(vitastor)
@@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}") set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif() endif()
add_definitions(-DVERSION="0.8.5") add_definitions(-DVERSION="0.8.8")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src) add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN}) if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer) add_definitions(-fsanitize=address -fno-omit-frame-pointer)

View File

@@ -13,6 +13,11 @@ blockstore_t::~blockstore_t()
delete impl; delete impl;
} }
void blockstore_t::parse_config(blockstore_config_t & config)
{
impl->parse_config(config, false);
}
void blockstore_t::loop() void blockstore_t::loop()
{ {
impl->loop(); impl->loop();

View File

@@ -107,7 +107,7 @@ Input:
- buf = pre-allocated obj_ver_id array <len> units long - buf = pre-allocated obj_ver_id array <len> units long
Output: Output:
- retval = 0 or negative error number (-EINVAL, -ENOENT if no such version or -EBUSY if not synced) - retval = 0 or negative error number (-ENOENT if no such version for stabilize)
## BS_OP_SYNC_STAB_ALL ## BS_OP_SYNC_STAB_ALL
@@ -165,6 +165,9 @@ public:
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd); blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_t(); ~blockstore_t();
// Update configuration
void parse_config(blockstore_config_t & config);
// Event loop // Event loop
void loop(); void loop();

View File

@@ -932,7 +932,7 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
resume_1: resume_1:
if (!cur_sync->state) if (!cur_sync->state)
{ {
if (flusher->syncing_flushers >= flusher->cur_flusher_count || !flusher->flush_queue.size()) if (flusher->syncing_flushers >= flusher->active_flushers || !flusher->flush_queue.size())
{ {
// Sync batch is ready. Do it. // Sync batch is ready. Do it.
await_sqe(0); await_sqe(0);

View File

@@ -11,7 +11,7 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
ring_consumer.loop = [this]() { loop(); }; ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&ring_consumer); ringloop->register_consumer(&ring_consumer);
initialized = 0; initialized = 0;
parse_config(config); parse_config(config, true);
zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, dsk.data_block_size); zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, dsk.data_block_size);
try try
{ {
@@ -171,7 +171,7 @@ void blockstore_impl_t::loop()
// Can't submit SYNC before previous writes // Can't submit SYNC before previous writes
continue; continue;
} }
wr_st = continue_sync(op, false); wr_st = continue_sync(op);
if (wr_st != 2) if (wr_st != 2)
{ {
has_writes = wr_st > 0 ? 1 : 2; has_writes = wr_st > 0 ? 1 : 2;
@@ -307,6 +307,18 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
} }
PRIV(op)->wait_for = 0; PRIV(op)->wait_for = 0;
} }
else if (PRIV(op)->wait_for == WAIT_FREE)
{
if (!data_alloc->get_free_count() && big_to_flush > 0)
{
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for free space on the data device\n");
#endif
return;
}
flusher->release_trim();
PRIV(op)->wait_for = 0;
}
else else
{ {
throw std::runtime_error("BUG: op->wait_for value is unexpected"); throw std::runtime_error("BUG: op->wait_for value is unexpected");
@@ -371,13 +383,18 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op)
ringloop->set_immediate([op]() { std::function<void (blockstore_op_t*)>(op->callback)(op); }); ringloop->set_immediate([op]() { std::function<void (blockstore_op_t*)>(op->callback)(op); });
return; return;
} }
init_op(op);
submit_queue.push_back(op);
ringloop->wakeup();
}
void blockstore_impl_t::init_op(blockstore_op_t *op)
{
// Call constructor without allocating memory. We'll call destructor before returning op back // Call constructor without allocating memory. We'll call destructor before returning op back
new ((void*)op->private_data) blockstore_op_private_t; new ((void*)op->private_data) blockstore_op_private_t;
PRIV(op)->wait_for = 0; PRIV(op)->wait_for = 0;
PRIV(op)->op_state = 0; PRIV(op)->op_state = 0;
PRIV(op)->pending_ops = 0; PRIV(op)->pending_ops = 0;
submit_queue.push_back(op);
ringloop->wakeup();
} }
static bool replace_stable(object_id oid, uint64_t version, int search_start, int search_end, obj_ver_id* list) static bool replace_stable(object_id oid, uint64_t version, int search_start, int search_end, obj_ver_id* list)

View File

@@ -160,6 +160,8 @@ struct __attribute__((__packed__)) dirty_entry
#define WAIT_JOURNAL 3 #define WAIT_JOURNAL 3
// Suspend operation until the next journal sector buffer is free // Suspend operation until the next journal sector buffer is free
#define WAIT_JOURNAL_BUFFER 4 #define WAIT_JOURNAL_BUFFER 4
// Suspend operation until there is some free space on the data device
#define WAIT_FREE 5
struct fulfill_read_t struct fulfill_read_t
{ {
@@ -216,6 +218,11 @@ struct pool_shard_settings_t
uint32_t pg_stripe_size; uint32_t pg_stripe_size;
}; };
#define STAB_SPLIT_DONE 1
#define STAB_SPLIT_WAIT 2
#define STAB_SPLIT_SYNC 3
#define STAB_SPLIT_TODO 4
class blockstore_impl_t class blockstore_impl_t
{ {
blockstore_disk_t dsk; blockstore_disk_t dsk;
@@ -258,6 +265,7 @@ class blockstore_impl_t
struct journal_t journal; struct journal_t journal;
journal_flusher_t *flusher; journal_flusher_t *flusher;
int big_to_flush = 0;
int write_iodepth = 0; int write_iodepth = 0;
bool live = false, queue_stall = false; bool live = false, queue_stall = false;
@@ -277,7 +285,6 @@ class blockstore_impl_t
friend class journal_flusher_t; friend class journal_flusher_t;
friend class journal_flusher_co; friend class journal_flusher_co;
void parse_config(blockstore_config_t & config);
void calc_lengths(); void calc_lengths();
void open_data(); void open_data();
void open_meta(); void open_meta();
@@ -299,6 +306,7 @@ class blockstore_impl_t
blockstore_init_journal* journal_init_reader; blockstore_init_journal* journal_init_reader;
void check_wait(blockstore_op_t *op); void check_wait(blockstore_op_t *op);
void init_op(blockstore_op_t *op);
// Read // Read
int dequeue_read(blockstore_op_t *read_op); int dequeue_read(blockstore_op_t *read_op);
@@ -318,7 +326,7 @@ class blockstore_impl_t
void handle_write_event(ring_data_t *data, blockstore_op_t *op); void handle_write_event(ring_data_t *data, blockstore_op_t *op);
// Sync // Sync
int continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync); int continue_sync(blockstore_op_t *op);
void ack_sync(blockstore_op_t *op); void ack_sync(blockstore_op_t *op);
// Stabilize // Stabilize
@@ -326,6 +334,8 @@ class blockstore_impl_t
int continue_stable(blockstore_op_t *op); int continue_stable(blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov, bool forget_dirty = false); void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
void stabilize_object(object_id oid, uint64_t max_ver); void stabilize_object(object_id oid, uint64_t max_ver);
blockstore_op_t* selective_sync(blockstore_op_t *op);
int split_stab_op(blockstore_op_t *op, std::function<int(obj_ver_id v)> decider);
// Rollback // Rollback
int dequeue_rollback(blockstore_op_t *op); int dequeue_rollback(blockstore_op_t *op);
@@ -341,6 +351,8 @@ public:
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd); blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_impl_t(); ~blockstore_impl_t();
void parse_config(blockstore_config_t & config, bool init);
// Event loop // Event loop
void loop(); void loop();

View File

@@ -4,8 +4,54 @@
#include <sys/file.h> #include <sys/file.h>
#include "blockstore_impl.h" #include "blockstore_impl.h"
void blockstore_impl_t::parse_config(blockstore_config_t & config) void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
{ {
// Online-configurable options:
max_flusher_count = strtoull(config["max_flusher_count"].c_str(), NULL, 10);
if (!max_flusher_count)
{
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
}
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
if (!max_flusher_count)
{
max_flusher_count = 256;
}
if (!min_flusher_count || journal.flush_journal)
{
min_flusher_count = 1;
}
if (!max_write_iodepth)
{
max_write_iodepth = 128;
}
if (!throttle_target_iops)
{
throttle_target_iops = 100;
}
if (!throttle_target_mbs)
{
throttle_target_mbs = 100;
}
if (!throttle_target_parallelism)
{
throttle_target_parallelism = 1;
}
if (!throttle_threshold_us)
{
throttle_threshold_us = 50;
}
if (!init)
{
return;
}
// Offline-configurable options:
// Common disk options // Common disk options
dsk.parse_config(config); dsk.parse_config(config);
// Parse // Parse
@@ -44,29 +90,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
journal.no_same_sector_overwrites = config["journal_no_same_sector_overwrites"] == "true" || journal.no_same_sector_overwrites = config["journal_no_same_sector_overwrites"] == "true" ||
config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes"; config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes";
journal.inmemory = config["inmemory_journal"] != "false"; journal.inmemory = config["inmemory_journal"] != "false";
max_flusher_count = strtoull(config["max_flusher_count"].c_str(), NULL, 10);
if (!max_flusher_count)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
// Validate // Validate
if (!max_flusher_count)
{
max_flusher_count = 256;
}
if (!min_flusher_count || journal.flush_journal)
{
min_flusher_count = 1;
}
if (!max_write_iodepth)
{
max_write_iodepth = 128;
}
if (journal.sector_count < 2) if (journal.sector_count < 2)
{ {
journal.sector_count = 32; journal.sector_count = 32;
@@ -91,22 +115,6 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{ {
throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync"); throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
} }
if (!throttle_target_iops)
{
throttle_target_iops = 100;
}
if (!throttle_target_mbs)
{
throttle_target_mbs = 100;
}
if (!throttle_target_parallelism)
{
throttle_target_parallelism = 1;
}
if (!throttle_threshold_us)
{
throttle_threshold_us = 50;
}
// init some fields // init some fields
journal.block_size = dsk.journal_block_size; journal.block_size = dsk.journal_block_size;
journal.next_free = dsk.journal_block_size; journal.next_free = dsk.journal_block_size;

View File

@@ -9,48 +9,39 @@ int blockstore_impl_t::dequeue_rollback(blockstore_op_t *op)
{ {
return continue_rollback(op); return continue_rollback(op);
} }
obj_ver_id *v, *nv; int r = split_stab_op(op, [this](obj_ver_id ov)
int i, todo = op->len;
for (i = 0, v = (obj_ver_id*)op->buf, nv = (obj_ver_id*)op->buf; i < op->len; i++, v++, nv++)
{ {
if (nv != v)
{
*nv = *v;
}
// Check that there are some versions greater than v->version (which may be zero), // Check that there are some versions greater than v->version (which may be zero),
// check that they're unstable, synced, and not currently written to // check that they're unstable, synced, and not currently written to
auto dirty_it = dirty_db.lower_bound((obj_ver_id){ auto dirty_it = dirty_db.lower_bound((obj_ver_id){
.oid = v->oid, .oid = ov.oid,
.version = UINT64_MAX, .version = UINT64_MAX,
}); });
if (dirty_it == dirty_db.begin()) if (dirty_it == dirty_db.begin())
{ {
skip_ov:
// Already rolled back, skip this object version // Already rolled back, skip this object version
todo--; return STAB_SPLIT_DONE;
nv--;
continue;
} }
else else
{ {
dirty_it--; dirty_it--;
if (dirty_it->first.oid != v->oid || dirty_it->first.version < v->version) if (dirty_it->first.oid != ov.oid || dirty_it->first.version < ov.version)
{ {
goto skip_ov; // Already rolled back, skip this object version
return STAB_SPLIT_DONE;
} }
while (dirty_it->first.oid == v->oid && dirty_it->first.version > v->version) while (dirty_it->first.oid == ov.oid && dirty_it->first.version > ov.version)
{ {
if (IS_IN_FLIGHT(dirty_it->second.state)) if (IS_IN_FLIGHT(dirty_it->second.state))
{ {
// Object write is still in progress. Wait until the write request completes // Object write is still in progress. Wait until the write request completes
return 0; return STAB_SPLIT_WAIT;
} }
else if (!IS_SYNCED(dirty_it->second.state) || else if (!IS_SYNCED(dirty_it->second.state) ||
IS_STABLE(dirty_it->second.state)) IS_STABLE(dirty_it->second.state))
{ {
op->retval = -EBUSY; // Sync the object
FINISH_OP(op); return STAB_SPLIT_SYNC;
return 2;
} }
if (dirty_it == dirty_db.begin()) if (dirty_it == dirty_db.begin())
{ {
@@ -58,19 +49,16 @@ skip_ov:
} }
dirty_it--; dirty_it--;
} }
return STAB_SPLIT_TODO;
} }
} });
op->len = todo; if (r != 1)
if (!todo)
{ {
// Already rolled back return r;
op->retval = 0;
FINISH_OP(op);
return 2;
} }
// Check journal space // Check journal space
blockstore_journal_check_t space_check(this); blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, todo, sizeof(journal_entry_rollback), 0)) if (!space_check.check_available(op, op->len, sizeof(journal_entry_rollback), 0))
{ {
return 0; return 0;
} }
@@ -78,7 +66,8 @@ skip_ov:
BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write); BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
// Prepare and submit journal entries // Prepare and submit journal entries
int s = 0; int s = 0;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++) auto v = (obj_ver_id*)op->buf;
for (int i = 0; i < op->len; i++, v++)
{ {
if (!journal.entry_fits(sizeof(journal_entry_rollback)) && if (!journal.entry_fits(sizeof(journal_entry_rollback)) &&
journal.sector_info[journal.cur_sector].dirty) journal.sector_info[journal.cur_sector].dirty)
@@ -212,6 +201,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
} }
while (1) while (1)
{ {
if ((IS_BIG_WRITE(dirty_it->second.state) || IS_DELETE(dirty_it->second.state)) &&
IS_STABLE(dirty_it->second.state))
{
big_to_flush--;
}
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc && if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc &&
dirty_it->second.location != UINT64_MAX) dirty_it->second.location != UINT64_MAX)
{ {

View File

@@ -41,60 +41,309 @@
// 4) after a while it takes his synced object list and sends stabilize requests // 4) after a while it takes his synced object list and sends stabilize requests
// to peers and to its own blockstore, thus freeing the old version // to peers and to its own blockstore, thus freeing the old version
int blockstore_impl_t::dequeue_stable(blockstore_op_t *op) struct ver_vector_t
{ {
if (PRIV(op)->op_state) obj_ver_id *items = NULL;
uint64_t alloc = 0, size = 0;
};
static void init_versions(ver_vector_t & vec, obj_ver_id *start, obj_ver_id *end, uint64_t len)
{
if (!vec.items)
{ {
return continue_stable(op); vec.alloc = len;
vec.items = (obj_ver_id*)malloc_or_die(sizeof(obj_ver_id) * vec.alloc);
for (auto sv = start; sv < end; sv++)
{
vec.items[vec.size++] = *sv;
}
} }
}
static void append_version(ver_vector_t & vec, obj_ver_id ov)
{
if (vec.size >= vec.alloc)
{
vec.alloc = !vec.alloc ? 4 : vec.alloc*2;
vec.items = (obj_ver_id*)realloc_or_die(vec.items, sizeof(obj_ver_id) * vec.alloc);
}
vec.items[vec.size++] = ov;
}
static bool check_unsynced(std::vector<obj_ver_id> & check, obj_ver_id ov, std::vector<obj_ver_id> & to, int *count)
{
bool found = false;
int j = 0, k = 0;
while (j < check.size())
{
if (check[j] == ov)
found = true;
if (check[j].oid == ov.oid && check[j].version <= ov.version)
{
to.push_back(check[j++]);
if (count)
(*count)--;
}
else
check[k++] = check[j++];
}
check.resize(k);
return found;
}
blockstore_op_t* blockstore_impl_t::selective_sync(blockstore_op_t *op)
{
unsynced_big_write_count -= unsynced_big_writes.size();
unsynced_big_writes.swap(PRIV(op)->sync_big_writes);
unsynced_big_write_count += unsynced_big_writes.size();
unsynced_small_writes.swap(PRIV(op)->sync_small_writes);
// Create a sync operation, insert into the end of the queue
// And move ourselves into the end too!
// Rather hacky but that's what we need...
blockstore_op_t *sync_op = new blockstore_op_t;
sync_op->opcode = BS_OP_SYNC;
sync_op->buf = NULL;
sync_op->callback = [this](blockstore_op_t *sync_op)
{
delete sync_op;
};
init_op(sync_op);
int sync_res = continue_sync(sync_op);
if (sync_res != 2)
{
// Put SYNC into the queue if it's not finished yet
submit_queue.push_back(sync_op);
}
// Restore unsynced_writes
unsynced_small_writes.swap(PRIV(op)->sync_small_writes);
unsynced_big_write_count -= unsynced_big_writes.size();
unsynced_big_writes.swap(PRIV(op)->sync_big_writes);
unsynced_big_write_count += unsynced_big_writes.size();
if (sync_res == 2)
{
// Sync is immediately completed
return NULL;
}
return sync_op;
}
// Returns: 2 = stop processing and dequeue, 0 = stop processing and do not dequeue, 1 = proceed with op itself
int blockstore_impl_t::split_stab_op(blockstore_op_t *op, std::function<int(obj_ver_id v)> decider)
{
bool add_sync = false;
ver_vector_t good_vers, bad_vers;
obj_ver_id* v; obj_ver_id* v;
int i, todo = 0; int i, todo = 0;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++) for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{ {
auto dirty_it = dirty_db.find(*v); int action = decider(*v);
if (dirty_it == dirty_db.end()) if (action < 0)
{ {
auto & clean_db = clean_db_shard(v->oid); // Rollback changes
auto clean_it = clean_db.find(v->oid); for (auto & ov: PRIV(op)->sync_big_writes)
if (clean_it == clean_db.end() || clean_it->second.version < v->version)
{ {
// No such object version unsynced_big_writes.push_back(ov);
op->retval = -ENOENT; unsynced_big_write_count++;
FINISH_OP(op);
return 2;
} }
else for (auto & ov: PRIV(op)->sync_small_writes)
{ {
// Already stable unsynced_small_writes.push_back(ov);
} }
} free(good_vers.items);
else if (IS_IN_FLIGHT(dirty_it->second.state)) good_vers.items = NULL;
{ free(bad_vers.items);
// Object write is still in progress. Wait until the write request completes bad_vers.items = NULL;
return 0; // Error
} op->retval = action;
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet. Caller must sync it first
op->retval = -EBUSY;
FINISH_OP(op); FINISH_OP(op);
return 2; return 2;
} }
else if (!IS_STABLE(dirty_it->second.state)) else if (action == STAB_SPLIT_DONE)
{ {
// Already done
init_versions(good_vers, (obj_ver_id*)op->buf, v, op->len);
}
else if (action == STAB_SPLIT_WAIT)
{
// Already in progress, we just have to wait until it finishes
init_versions(good_vers, (obj_ver_id*)op->buf, v, op->len);
append_version(bad_vers, *v);
}
else if (action == STAB_SPLIT_SYNC)
{
// Needs a SYNC, we have to send a SYNC if not already in progress
//
// If the object is not present in unsynced_(big|small)_writes then
// it's currently being synced. If it's present then we can initiate
// its sync ourselves.
init_versions(good_vers, (obj_ver_id*)op->buf, v, op->len);
append_version(bad_vers, *v);
if (!add_sync)
{
PRIV(op)->sync_big_writes.clear();
PRIV(op)->sync_small_writes.clear();
add_sync = true;
}
check_unsynced(unsynced_small_writes, *v, PRIV(op)->sync_small_writes, NULL);
check_unsynced(unsynced_big_writes, *v, PRIV(op)->sync_big_writes, &unsynced_big_write_count);
}
else /* if (action == STAB_SPLIT_TODO) */
{
if (good_vers.items)
{
// If we're selecting versions then append it
// Main idea is that 99% of the time all versions passed to BS_OP_STABLE are synced
// And we don't want to select/allocate anything in that optimistic case
append_version(good_vers, *v);
}
todo++; todo++;
} }
} }
if (!todo) // In a pessimistic scenario, an operation may be split into 3:
// - Stabilize synced entries
// - Sync unsynced entries
// - Continue for unsynced entries after sync
add_sync = add_sync && (PRIV(op)->sync_big_writes.size() || PRIV(op)->sync_small_writes.size());
if (!todo && !bad_vers.size)
{ {
// Already stable // Already stable
op->retval = 0; op->retval = 0;
FINISH_OP(op); FINISH_OP(op);
return 2; return 2;
} }
op->retval = 0;
if (!todo && !add_sync)
{
// Only wait for inflight writes or current in-progress syncs
return 0;
}
blockstore_op_t *sync_op = NULL, *split_stab_op = NULL;
if (add_sync)
{
// Initiate a selective sync for PRIV(op)->sync_(big|small)_writes
sync_op = selective_sync(op);
}
if (bad_vers.size)
{
// Split part of the request into a separate operation
split_stab_op = new blockstore_op_t;
split_stab_op->opcode = op->opcode;
split_stab_op->buf = bad_vers.items;
split_stab_op->len = bad_vers.size;
init_op(split_stab_op);
submit_queue.push_back(split_stab_op);
}
if (sync_op || split_stab_op || good_vers.items)
{
void *orig_buf = op->buf;
if (good_vers.items)
{
op->buf = good_vers.items;
op->len = good_vers.size;
}
// Make a wrapped callback
int *split_op_counter = (int*)malloc_or_die(sizeof(int));
*split_op_counter = (sync_op ? 1 : 0) + (split_stab_op ? 1 : 0) + (todo ? 1 : 0);
auto cb = [this, op, good_items = good_vers.items,
bad_items = bad_vers.items, split_op_counter,
orig_buf, real_cb = op->callback](blockstore_op_t *split_op)
{
if (split_op->retval != 0)
op->retval = split_op->retval;
(*split_op_counter)--;
assert((*split_op_counter) >= 0);
if (op != split_op)
delete split_op;
if (!*split_op_counter)
{
free(good_items);
free(bad_items);
free(split_op_counter);
op->buf = orig_buf;
real_cb(op);
}
};
if (sync_op)
{
sync_op->callback = cb;
}
if (split_stab_op)
{
split_stab_op->callback = cb;
}
op->callback = cb;
}
if (!todo)
{
// All work is postponed
op->callback = NULL;
return 2;
}
return 1;
}
int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
{
if (PRIV(op)->op_state)
{
return continue_stable(op);
}
int r = split_stab_op(op, [this](obj_ver_id ov)
{
auto dirty_it = dirty_db.find(ov);
if (dirty_it == dirty_db.end())
{
auto & clean_db = clean_db_shard(ov.oid);
auto clean_it = clean_db.find(ov.oid);
if (clean_it == clean_db.end() || clean_it->second.version < ov.version)
{
// No such object version
printf("Error: %lx:%lx v%lu not found while stabilizing\n", ov.oid.inode, ov.oid.stripe, ov.version);
return -ENOENT;
}
else
{
// Already stable
return STAB_SPLIT_DONE;
}
}
else if (IS_IN_FLIGHT(dirty_it->second.state))
{
// Object write is still in progress. Wait until the write request completes
return STAB_SPLIT_WAIT;
}
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet - sync it
// In previous versions we returned EBUSY here and required
// the caller (OSD) to issue a global sync first. But a global sync
// waits for all writes in the queue including inflight writes. And
// inflight writes may themselves be blocked by unstable writes being
// still present in the journal and not flushed away from it.
// So we must sync specific objects here.
//
// Even more, we have to process "stabilize" request in parts. That is,
// we must stabilize all objects which are already synced. Otherwise
// they may block objects which are NOT synced yet.
return STAB_SPLIT_SYNC;
}
else if (IS_STABLE(dirty_it->second.state))
{
// Already stable
return STAB_SPLIT_DONE;
}
else
{
return STAB_SPLIT_TODO;
}
});
if (r != 1)
{
return r;
}
// Check journal space // Check journal space
blockstore_journal_check_t space_check(this); blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, todo, sizeof(journal_entry_stable), 0)) if (!space_check.check_available(op, op->len, sizeof(journal_entry_stable), 0))
{ {
return 0; return 0;
} }
@@ -102,9 +351,9 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write); BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
// Prepare and submit journal entries // Prepare and submit journal entries
int s = 0; int s = 0;
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++) auto v = (obj_ver_id*)op->buf;
for (int i = 0; i < op->len; i++, v++)
{ {
// FIXME: Only stabilize versions that aren't stable yet
if (!journal.entry_fits(sizeof(journal_entry_stable)) && if (!journal.entry_fits(sizeof(journal_entry_stable)) &&
journal.sector_info[journal.cur_sector].dirty) journal.sector_info[journal.cur_sector].dirty)
{ {
@@ -197,6 +446,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
{ {
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size; inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
} }
big_to_flush++;
} }
else if (IS_DELETE(dirty_it->second.state)) else if (IS_DELETE(dirty_it->second.state))
{ {
@@ -205,6 +455,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
sp -= dsk.data_block_size; sp -= dsk.data_block_size;
else else
inode_space_stats.erase(dirty_it->first.oid.inode); inode_space_stats.erase(dirty_it->first.oid.inode);
big_to_flush++;
} }
} }
if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) || if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) ||

View File

@@ -12,7 +12,7 @@
#define SYNC_JOURNAL_SYNC_SENT 7 #define SYNC_JOURNAL_SYNC_SENT 7
#define SYNC_DONE 8 #define SYNC_DONE 8
int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync) int blockstore_impl_t::continue_sync(blockstore_op_t *op)
{ {
if (immediate_commit == IMMEDIATE_ALL) if (immediate_commit == IMMEDIATE_ALL)
{ {
@@ -145,7 +145,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
PRIV(op)->op_state = SYNC_DONE; PRIV(op)->op_state = SYNC_DONE;
} }
} }
if (PRIV(op)->op_state == SYNC_DONE && !queue_has_in_progress_sync) if (PRIV(op)->op_state == SYNC_DONE)
{ {
ack_sync(op); ack_sync(op);
return 2; return 2;

View File

@@ -271,6 +271,13 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (loc == UINT64_MAX) if (loc == UINT64_MAX)
{ {
// no space // no space
if (big_to_flush > 0)
{
// hope that some space will be available after flush
flusher->request_trim();
PRIV(op)->wait_for = WAIT_FREE;
return 0;
}
cancel_all_writes(op, dirty_it, -ENOSPC); cancel_all_writes(op, dirty_it, -ENOSPC);
return 2; return 2;
} }

View File

@@ -278,7 +278,7 @@ struct rm_osd_t
if (rsp["response_delete_range"]["deleted"].uint64_value() > 0) if (rsp["response_delete_range"]["deleted"].uint64_value() > 0)
{ {
// Wait for mon_change_timeout before updating PG history, or the monitor's change will likely interfere with ours // Wait for mon_change_timeout before updating PG history, or the monitor's change will likely interfere with ours
retry_wait = parent->cli->merged_config["mon_change_timeout"].uint64_value(); retry_wait = parent->cli->config["mon_change_timeout"].uint64_value();
if (!retry_wait) if (!retry_wait)
retry_wait = 1000; retry_wait = 1000;
retry_wait += etcd_tx_retry_ms; retry_wait += etcd_tx_retry_ms;

View File

@@ -198,9 +198,9 @@ resume_2:
} }
pgs_by_state_str += std::to_string(kv.second)+" "+kv.first; pgs_by_state_str += std::to_string(kv.second)+" "+kv.first;
} }
bool readonly = json_is_true(parent->cli->merged_config["readonly"]); bool readonly = json_is_true(parent->cli->config["readonly"]);
bool no_recovery = json_is_true(parent->cli->merged_config["no_recovery"]); bool no_recovery = json_is_true(parent->cli->config["no_recovery"]);
bool no_rebalance = json_is_true(parent->cli->merged_config["no_rebalance"]); bool no_rebalance = json_is_true(parent->cli->config["no_rebalance"]);
if (parent->json_output) if (parent->json_output)
{ {
// JSON output // JSON output

View File

@@ -18,11 +18,12 @@
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config) cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{ {
config = osd_messenger_t::read_config(config); cli_config = config.object_items();
file_config = osd_messenger_t::read_config(config);
config = osd_messenger_t::merge_configs(cli_config, file_config, etcd_global_config, {});
this->ringloop = ringloop; this->ringloop = ringloop;
this->tfd = tfd; this->tfd = tfd;
this->config = config;
msgr.osd_num = 0; msgr.osd_num = 0;
msgr.tfd = tfd; msgr.tfd = tfd;
@@ -58,7 +59,7 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
msgr.stop_client(op->peer_fd); msgr.stop_client(op->peer_fd);
delete op; delete op;
}; };
msgr.parse_config(this->config); msgr.parse_config(config);
st_cli.tfd = tfd; st_cli.tfd = tfd;
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); }; st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@@ -276,13 +277,10 @@ restart:
continuing_ops = 0; continuing_ops = 0;
} }
void cluster_client_t::on_load_config_hook(json11::Json::object & config) void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_config)
{ {
this->merged_config = config; this->etcd_global_config = etcd_global_config;
for (auto & kv: this->config.object_items()) config = osd_messenger_t::merge_configs(cli_config, file_config, etcd_global_config, {});
{
this->merged_config[kv.first] = kv.second;
}
if (config.find("client_max_dirty_bytes") != config.end()) if (config.find("client_max_dirty_bytes") != config.end())
{ {
client_max_dirty_bytes = config["client_max_dirty_bytes"].uint64_value(); client_max_dirty_bytes = config["client_max_dirty_bytes"].uint64_value();
@@ -292,14 +290,13 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
// Old name // Old name
client_max_dirty_bytes = config["client_dirty_limit"].uint64_value(); client_max_dirty_bytes = config["client_dirty_limit"].uint64_value();
} }
if (config.find("client_max_dirty_ops") != config.end()) else
{ client_max_dirty_bytes = 0;
client_max_dirty_ops = config["client_max_dirty_ops"].uint64_value();
}
if (!client_max_dirty_bytes) if (!client_max_dirty_bytes)
{ {
client_max_dirty_bytes = DEFAULT_CLIENT_MAX_DIRTY_BYTES; client_max_dirty_bytes = DEFAULT_CLIENT_MAX_DIRTY_BYTES;
} }
client_max_dirty_ops = config["client_max_dirty_ops"].uint64_value();
if (!client_max_dirty_ops) if (!client_max_dirty_ops)
{ {
client_max_dirty_ops = DEFAULT_CLIENT_MAX_DIRTY_OPS; client_max_dirty_ops = DEFAULT_CLIENT_MAX_DIRTY_OPS;
@@ -314,7 +311,7 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
up_wait_retry_interval = 50; up_wait_retry_interval = 50;
} }
msgr.parse_config(config); msgr.parse_config(config);
msgr.parse_config(this->config); st_cli.parse_config(config);
st_cli.load_pgs(); st_cli.load_pgs();
} }
@@ -1121,6 +1118,24 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
if (part->op.reply.hdr.retval != expected) if (part->op.reply.hdr.retval != expected)
{ {
// Operation failed, retry // Operation failed, retry
part->flags |= PART_ERROR;
if (!op->retval || op->retval == -EPIPE)
{
// Don't overwrite other errors with -EPIPE
op->retval = part->op.reply.hdr.retval;
}
int stop_fd = -1;
if (op->retval != -EINTR && op->retval != -EIO)
{
stop_fd = part->op.peer_fd;
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
}
// All next things like timer, continue_sync/rw and stop_client may affect the operation again
// So do all these things after modifying operation state, otherwise we may hit reenterability bugs
// FIXME postpone such things to set_immediate here to avoid bugs
if (part->op.reply.hdr.retval == -EPIPE) if (part->op.reply.hdr.retval == -EPIPE)
{ {
// Mark op->up_wait = true before stopping the client // Mark op->up_wait = true before stopping the client
@@ -1134,20 +1149,17 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
}); });
} }
} }
if (!op->retval || op->retval == -EPIPE) if (op->inflight_count == 0)
{ {
// Don't overwrite other errors with -EPIPE if (op->opcode == OSD_OP_SYNC)
op->retval = part->op.reply.hdr.retval; continue_sync(op);
else
continue_rw(op);
} }
if (op->retval != -EINTR && op->retval != -EIO) if (stop_fd >= 0)
{ {
fprintf( msgr.stop_client(stop_fd);
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
msgr.stop_client(part->op.peer_fd);
} }
part->flags |= PART_ERROR;
} }
else else
{ {
@@ -1161,13 +1173,13 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
copy_part_bitmap(op, part); copy_part_bitmap(op, part);
op->version = op->parts.size() == 1 ? part->op.reply.rw.version : 0; op->version = op->parts.size() == 1 ? part->op.reply.rw.version : 0;
} }
} if (op->inflight_count == 0)
if (op->inflight_count == 0) {
{ if (op->opcode == OSD_OP_SYNC)
if (op->opcode == OSD_OP_SYNC) continue_sync(op);
continue_sync(op); else
else continue_rw(op);
continue_rw(op); }
} }
} }

View File

@@ -112,8 +112,8 @@ public:
osd_messenger_t msgr; osd_messenger_t msgr;
void init_msgr(); void init_msgr();
json11::Json config; json11::Json::object cli_config, file_config, etcd_global_config;
json11::Json::object merged_config; json11::Json::object config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config); cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
~cluster_client_t(); ~cluster_client_t();

View File

@@ -43,6 +43,7 @@ struct inode_list_t
inode_list_t* cluster_client_t::list_inode_start(inode_t inode, inode_list_t* cluster_client_t::list_inode_start(inode_t inode,
std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback) std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback)
{ {
init_msgr();
int skipped_pgs = 0; int skipped_pgs = 0;
pool_id_t pool_id = INODE_POOL(inode); pool_id_t pool_id = INODE_POOL(inode);
if (!pool_id || st_cli.pool_config.find(pool_id) == st_cli.pool_config.end()) if (!pool_id || st_cli.pool_config.find(pool_id) == st_cli.pool_config.end())

View File

@@ -281,7 +281,7 @@ void disk_tool_t::dump_journal_entry(int num, journal_entry *je, bool json)
if (je->big_write.size > sizeof(journal_entry_big_write)) if (je->big_write.size > sizeof(journal_entry_big_write))
{ {
printf(json ? ",\"bitmap\":\"" : " (bitmap: "); printf(json ? ",\"bitmap\":\"" : " (bitmap: ");
for (int i = sizeof(journal_entry_big_write); i < je->small_write.size; i++) for (int i = sizeof(journal_entry_big_write); i < je->big_write.size; i++)
{ {
printf("%02x", ((uint8_t*)je)[i]); printf("%02x", ((uint8_t*)je)[i]);
} }

View File

@@ -26,7 +26,7 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v1_t *)>
buf_size = dsk.meta_len; buf_size = dsk.meta_len;
void *data = memalign_or_die(MEM_ALIGNMENT, buf_size); void *data = memalign_or_die(MEM_ALIGNMENT, buf_size);
lseek64(dsk.meta_fd, dsk.meta_offset, 0); lseek64(dsk.meta_fd, dsk.meta_offset, 0);
read_blocking(dsk.meta_fd, data, buf_size); read_blocking(dsk.meta_fd, data, dsk.meta_block_size);
// Check superblock // Check superblock
blockstore_meta_header_v1_t *hdr = (blockstore_meta_header_v1_t *)data; blockstore_meta_header_v1_t *hdr = (blockstore_meta_header_v1_t *)data;
if (hdr->zero == 0 && if (hdr->zero == 0 &&
@@ -41,8 +41,11 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v1_t *)>
if (buf_size % dsk.meta_block_size) if (buf_size % dsk.meta_block_size)
{ {
buf_size = 8*dsk.meta_block_size; buf_size = 8*dsk.meta_block_size;
void *new_data = memalign_or_die(MEM_ALIGNMENT, buf_size);
memcpy(new_data, data, dsk.meta_block_size);
free(data); free(data);
data = memalign_or_die(MEM_ALIGNMENT, buf_size); data = new_data;
hdr = (blockstore_meta_header_v1_t *)data;
} }
} }
dsk.bitmap_granularity = hdr->bitmap_granularity; dsk.bitmap_granularity = hdr->bitmap_granularity;

View File

@@ -54,6 +54,13 @@ void epoll_manager_t::set_fd_handler(int fd, bool wr, std::function<void(int, in
ev.events = (wr ? EPOLLOUT : 0) | EPOLLIN | EPOLLRDHUP | EPOLLET; ev.events = (wr ? EPOLLOUT : 0) | EPOLLIN | EPOLLRDHUP | EPOLLET;
if (epoll_ctl(epoll_fd, exists ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev) < 0) if (epoll_ctl(epoll_fd, exists ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev) < 0)
{ {
if (errno == ENOENT)
{
// The FD is probably already closed
epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL);
epoll_handlers.erase(fd);
return;
}
throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno)); throw std::runtime_error(std::string("epoll_ctl: ") + strerror(errno));
} }
epoll_handlers[fd] = handler; epoll_handlers[fd] = handler;

View File

@@ -18,12 +18,8 @@ etcd_state_client_t::~etcd_state_client_t()
} }
watches.clear(); watches.clear();
etcd_watches_initialised = -1; etcd_watches_initialised = -1;
if (ws_keepalive_timer >= 0)
{
tfd->clear_timer(ws_keepalive_timer);
ws_keepalive_timer = -1;
}
#ifndef __MOCK__ #ifndef __MOCK__
stop_ws_keepalive();
if (etcd_watch_ws) if (etcd_watch_ws)
{ {
http_close(etcd_watch_ws); http_close(etcd_watch_ws);
@@ -245,6 +241,7 @@ void etcd_state_client_t::parse_config(const json11::Json & config)
if (this->etcd_keepalive_timeout < 30) if (this->etcd_keepalive_timeout < 30)
this->etcd_keepalive_timeout = 30; this->etcd_keepalive_timeout = 30;
} }
auto old_etcd_ws_keepalive_interval = this->etcd_ws_keepalive_interval;
this->etcd_ws_keepalive_interval = config["etcd_ws_keepalive_interval"].uint64_value(); this->etcd_ws_keepalive_interval = config["etcd_ws_keepalive_interval"].uint64_value();
if (this->etcd_ws_keepalive_interval <= 0) if (this->etcd_ws_keepalive_interval <= 0)
{ {
@@ -265,6 +262,13 @@ void etcd_state_client_t::parse_config(const json11::Json & config)
{ {
this->etcd_quick_timeout = 1000; this->etcd_quick_timeout = 1000;
} }
if (this->etcd_ws_keepalive_interval != old_etcd_ws_keepalive_interval && ws_keepalive_timer >= 0)
{
#ifndef __MOCK__
stop_ws_keepalive();
start_ws_keepalive();
#endif
}
} }
void etcd_state_client_t::pick_next_etcd() void etcd_state_client_t::pick_next_etcd()
@@ -478,6 +482,20 @@ void etcd_state_client_t::start_etcd_watcher()
{ {
on_start_watcher_hook(etcd_watch_ws); on_start_watcher_hook(etcd_watch_ws);
} }
start_ws_keepalive();
}
void etcd_state_client_t::stop_ws_keepalive()
{
if (ws_keepalive_timer >= 0)
{
tfd->clear_timer(ws_keepalive_timer);
ws_keepalive_timer = -1;
}
}
void etcd_state_client_t::start_ws_keepalive()
{
if (ws_keepalive_timer < 0) if (ws_keepalive_timer < 0)
{ {
ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int) ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int)

View File

@@ -132,6 +132,8 @@ public:
void etcd_txn(json11::Json txn, int timeout, int retries, int interval, std::function<void(std::string, json11::Json)> callback); void etcd_txn(json11::Json txn, int timeout, int retries, int interval, std::function<void(std::string, json11::Json)> callback);
void etcd_txn_slow(json11::Json txn, std::function<void(std::string, json11::Json)> callback); void etcd_txn_slow(json11::Json txn, std::function<void(std::string, json11::Json)> callback);
void start_etcd_watcher(); void start_etcd_watcher();
void stop_ws_keepalive();
void start_ws_keepalive();
void load_global_config(); void load_global_config();
void load_pgs(); void load_pgs();
void parse_state(const etcd_kv_t & kv); void parse_state(const etcd_kv_t & kv);

View File

@@ -157,10 +157,10 @@ void osd_messenger_t::parse_config(const json11::Json & config)
this->rdma_max_sge = 128; this->rdma_max_sge = 128;
this->rdma_max_send = config["rdma_max_send"].uint64_value(); this->rdma_max_send = config["rdma_max_send"].uint64_value();
if (!this->rdma_max_send) if (!this->rdma_max_send)
this->rdma_max_send = 1; this->rdma_max_send = 8;
this->rdma_max_recv = config["rdma_max_recv"].uint64_value(); this->rdma_max_recv = config["rdma_max_recv"].uint64_value();
if (!this->rdma_max_recv) if (!this->rdma_max_recv)
this->rdma_max_recv = 128; this->rdma_max_recv = 16;
this->rdma_max_msg = config["rdma_max_msg"].uint64_value(); this->rdma_max_msg = config["rdma_max_msg"].uint64_value();
if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024) if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024)
this->rdma_max_msg = 129*1024; this->rdma_max_msg = 129*1024;
@@ -534,8 +534,9 @@ bool osd_messenger_t::is_rdma_enabled()
} }
#endif #endif
json11::Json osd_messenger_t::read_config(const json11::Json & config) json11::Json::object osd_messenger_t::read_config(const json11::Json & config)
{ {
json11::Json::object file_config;
const char *config_path = config["config_path"].string_value() != "" const char *config_path = config["config_path"].string_value() != ""
? config["config_path"].string_value().c_str() : VITASTOR_CONFIG_PATH; ? config["config_path"].string_value().c_str() : VITASTOR_CONFIG_PATH;
int fd = open(config_path, O_RDONLY); int fd = open(config_path, O_RDONLY);
@@ -543,14 +544,14 @@ json11::Json osd_messenger_t::read_config(const json11::Json & config)
{ {
if (errno != ENOENT) if (errno != ENOENT)
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno)); fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
return config; return file_config;
} }
struct stat st; struct stat st;
if (fstat(fd, &st) != 0) if (fstat(fd, &st) != 0)
{ {
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno)); fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd); close(fd);
return config; return file_config;
} }
std::string buf; std::string buf;
buf.resize(st.st_size); buf.resize(st.st_size);
@@ -562,23 +563,125 @@ json11::Json osd_messenger_t::read_config(const json11::Json & config)
{ {
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno)); fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd); close(fd);
return config; return file_config;
} }
done += r; done += r;
} }
close(fd); close(fd);
std::string json_err; std::string json_err;
json11::Json::object file_config = json11::Json::parse(buf, json_err).object_items(); file_config = json11::Json::parse(buf, json_err).object_items();
if (json_err != "") if (json_err != "")
{ {
fprintf(stderr, "Invalid JSON in %s: %s\n", config_path, json_err.c_str()); fprintf(stderr, "Invalid JSON in %s: %s\n", config_path, json_err.c_str());
return config;
}
file_config.erase("config_path");
file_config.erase("osd_num");
for (auto kv: config.object_items())
{
file_config[kv.first] = kv.second;
} }
return file_config; return file_config;
} }
static const char* cli_only_params[] = {
// The list has to be sorted
"bitmap_granularity",
"block_size",
"data_device",
"data_offset",
"data_size",
"disable_data_fsync",
"disable_device_lock",
"disable_journal_fsync",
"disable_meta_fsync",
"disk_alignment",
"flush_journal",
"immediate_commit",
"inmemory_journal",
"inmemory_metadata",
"journal_block_size",
"journal_device",
"journal_no_same_sector_overwrites",
"journal_offset",
"journal_sector_buffer_count",
"journal_size",
"meta_block_size",
"meta_buf_size",
"meta_device",
"meta_offset",
"osd_num",
"readonly",
};
static const char **cli_only_end = cli_only_params + (sizeof(cli_only_params)/sizeof(cli_only_params[0]));
static const char* local_only_params[] = {
// The list has to be sorted
"config_path",
"rdma_device",
"rdma_gid_index",
"rdma_max_msg",
"rdma_max_recv",
"rdma_max_send",
"rdma_max_sge",
"rdma_mtu",
"rdma_port_num",
"tcp_header_buffer_size",
"use_rdma",
"use_sync_send_recv",
};
static const char **local_only_end = local_only_params + (sizeof(local_only_params)/sizeof(local_only_params[0]));
// Basically could be replaced by std::lower_bound()...
static int find_str_array(const char **start, const char **end, const std::string & s)
{
int min = 0, max = end-start;
while (max-min >= 2)
{
int mid = (min+max)/2;
int r = strcmp(s.c_str(), start[mid]);
if (r < 0)
max = mid;
else if (r > 0)
min = mid;
else
return mid;
}
if (min < end-start && !strcmp(s.c_str(), start[min]))
return min;
return -1;
}
json11::Json::object osd_messenger_t::merge_configs(const json11::Json::object & cli_config,
const json11::Json::object & file_config,
const json11::Json::object & etcd_global_config,
const json11::Json::object & etcd_osd_config)
{
// Priority: most important -> less important:
// etcd_osd_config -> cli_config -> etcd_global_config -> file_config
json11::Json::object res = file_config;
for (auto & kv: file_config)
{
int cli_only = find_str_array(cli_only_params, cli_only_end, kv.first);
if (cli_only < 0)
{
res[kv.first] = kv.second;
}
}
for (auto & kv: etcd_global_config)
{
int local_only = find_str_array(local_only_params, local_only_end, kv.first);
if (local_only < 0)
{
res[kv.first] = kv.second;
}
}
for (auto & kv: cli_config)
{
res[kv.first] = kv.second;
}
for (auto & kv: etcd_osd_config)
{
int local_only = find_str_array(local_only_params, local_only_end, kv.first);
if (local_only < 0)
{
res[kv.first] = kv.second;
}
}
return res;
}

View File

@@ -166,7 +166,11 @@ public:
void accept_connections(int listen_fd); void accept_connections(int listen_fd);
~osd_messenger_t(); ~osd_messenger_t();
static json11::Json read_config(const json11::Json & config); static json11::Json::object read_config(const json11::Json & config);
static json11::Json::object merge_configs(const json11::Json::object & cli_config,
const json11::Json::object & file_config,
const json11::Json::object & etcd_global_config,
const json11::Json::object & etcd_osd_config);
#ifdef WITH_RDMA #ifdef WITH_RDMA
bool is_rdma_enabled(); bool is_rdma_enabled();

View File

@@ -43,7 +43,15 @@ void osd_messenger_t::send_replies()
{ {
} }
json11::Json osd_messenger_t::read_config(const json11::Json & config) json11::Json::object osd_messenger_t::read_config(const json11::Json & config)
{ {
return config; return json11::Json::object();
}
json11::Json::object osd_messenger_t::merge_configs(const json11::Json::object & cli_config,
const json11::Json::object & file_config,
const json11::Json::object & etcd_global_config,
const json11::Json::object & etcd_osd_config)
{
return cli_config;
} }

View File

@@ -368,9 +368,8 @@ static void try_send_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
bool osd_messenger_t::try_send_rdma(osd_client_t *cl) bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
{ {
auto rc = cl->rdma_conn; auto rc = cl->rdma_conn;
if (!cl->send_list.size() || rc->cur_send > 0) if (!cl->send_list.size() || rc->cur_send >= rc->max_send)
{ {
// Only send one batch at a time
return true; return true;
} }
uint64_t op_size = 0, op_sge = 0; uint64_t op_size = 0, op_sge = 0;
@@ -380,6 +379,7 @@ bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
iovec & iov = cl->send_list[rc->send_pos]; iovec & iov = cl->send_list[rc->send_pos];
if (op_size >= rc->max_msg || op_sge >= rc->max_sge) if (op_size >= rc->max_msg || op_sge >= rc->max_sge)
{ {
rc->send_sizes.push_back(op_size);
try_send_rdma_wr(cl, sge, op_sge); try_send_rdma_wr(cl, sge, op_sge);
op_sge = 0; op_sge = 0;
op_size = 0; op_size = 0;
@@ -405,18 +405,24 @@ bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
} }
if (op_sge > 0) if (op_sge > 0)
{ {
rc->send_sizes.push_back(op_size);
try_send_rdma_wr(cl, sge, op_sge); try_send_rdma_wr(cl, sge, op_sge);
} }
return true; return true;
} }
static void try_recv_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge) static void try_recv_rdma_wr(osd_client_t *cl, void *buf)
{ {
ibv_sge sge = {
.addr = (uintptr_t)buf,
.length = (uint32_t)cl->rdma_conn->max_msg,
.lkey = cl->rdma_conn->ctx->mr->lkey,
};
ibv_recv_wr *bad_wr = NULL; ibv_recv_wr *bad_wr = NULL;
ibv_recv_wr wr = { ibv_recv_wr wr = {
.wr_id = (uint64_t)(cl->peer_fd*2), .wr_id = (uint64_t)(cl->peer_fd*2),
.sg_list = sge, .sg_list = &sge,
.num_sge = op_sge, .num_sge = 1,
}; };
int err = ibv_post_recv(cl->rdma_conn->qp, &wr, &bad_wr); int err = ibv_post_recv(cl->rdma_conn->qp, &wr, &bad_wr);
if (err || bad_wr) if (err || bad_wr)
@@ -434,12 +440,7 @@ bool osd_messenger_t::try_recv_rdma(osd_client_t *cl)
{ {
void *buf = malloc_or_die(rc->max_msg); void *buf = malloc_or_die(rc->max_msg);
rc->recv_buffers.push_back(buf); rc->recv_buffers.push_back(buf);
ibv_sge sge = { try_recv_rdma_wr(cl, buf);
.addr = (uintptr_t)buf,
.length = (uint32_t)rc->max_msg,
.lkey = rc->ctx->mr->lkey,
};
try_recv_rdma_wr(cl, &sge, 1);
} }
return true; return true;
} }
@@ -476,6 +477,7 @@ void osd_messenger_t::handle_rdma_events()
continue; continue;
} }
osd_client_t *cl = cl_it->second; osd_client_t *cl = cl_it->second;
auto rc = cl->rdma_conn;
if (wc[i].status != IBV_WC_SUCCESS) if (wc[i].status != IBV_WC_SUCCESS)
{ {
fprintf(stderr, "RDMA work request failed for client %d", client_id); fprintf(stderr, "RDMA work request failed for client %d", client_id);
@@ -489,44 +491,59 @@ void osd_messenger_t::handle_rdma_events()
} }
if (!is_send) if (!is_send)
{ {
cl->rdma_conn->cur_recv--; rc->cur_recv--;
if (!handle_read_buffer(cl, cl->rdma_conn->recv_buffers[0], wc[i].byte_len)) if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf], wc[i].byte_len))
{ {
// handle_read_buffer may stop the client // handle_read_buffer may stop the client
continue; continue;
} }
free(cl->rdma_conn->recv_buffers[0]); try_recv_rdma_wr(cl, rc->recv_buffers[rc->next_recv_buf]);
cl->rdma_conn->recv_buffers.erase(cl->rdma_conn->recv_buffers.begin(), cl->rdma_conn->recv_buffers.begin()+1); rc->next_recv_buf = (rc->next_recv_buf+1) % rc->recv_buffers.size();
try_recv_rdma(cl);
} }
else else
{ {
cl->rdma_conn->cur_send--; rc->cur_send--;
if (!cl->rdma_conn->cur_send) uint64_t sent_size = rc->send_sizes.at(0);
rc->send_sizes.erase(rc->send_sizes.begin(), rc->send_sizes.begin()+1);
int send_pos = 0, send_buf_pos = 0;
while (sent_size > 0)
{ {
// Wait for the whole batch if (sent_size >= cl->send_list.at(send_pos).iov_len)
for (int i = 0; i < cl->rdma_conn->send_pos; i++)
{ {
if (cl->outbox[i].flags & MSGR_SENDP_FREE) sent_size -= cl->send_list[send_pos].iov_len;
{ send_pos++;
// Reply fully sent
delete cl->outbox[i].op;
}
} }
if (cl->rdma_conn->send_pos > 0) else
{ {
cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+cl->rdma_conn->send_pos); send_buf_pos = sent_size;
cl->outbox.erase(cl->outbox.begin(), cl->outbox.begin()+cl->rdma_conn->send_pos); sent_size = 0;
cl->rdma_conn->send_pos = 0;
} }
if (cl->rdma_conn->send_buf_pos > 0)
{
cl->send_list[0].iov_base = (uint8_t*)cl->send_list[0].iov_base + cl->rdma_conn->send_buf_pos;
cl->send_list[0].iov_len -= cl->rdma_conn->send_buf_pos;
cl->rdma_conn->send_buf_pos = 0;
}
try_send_rdma(cl);
} }
assert(rc->send_pos >= send_pos);
if (rc->send_pos == send_pos)
{
rc->send_buf_pos -= send_buf_pos;
}
rc->send_pos -= send_pos;
for (int i = 0; i < send_pos; i++)
{
if (cl->outbox[i].flags & MSGR_SENDP_FREE)
{
// Reply fully sent
delete cl->outbox[i].op;
}
}
if (send_pos > 0)
{
cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+send_pos);
cl->outbox.erase(cl->outbox.begin(), cl->outbox.begin()+send_pos);
}
if (send_buf_pos > 0)
{
cl->send_list[0].iov_base = (uint8_t*)cl->send_list[0].iov_base + send_buf_pos;
cl->send_list[0].iov_len -= send_buf_pos;
}
try_send_rdma(cl);
} }
} }
} while (event_count > 0); } while (event_count > 0);

View File

@@ -49,8 +49,9 @@ struct msgr_rdma_connection_t
uint64_t max_msg = 0; uint64_t max_msg = 0;
int send_pos = 0, send_buf_pos = 0; int send_pos = 0, send_buf_pos = 0;
int recv_pos = 0, recv_buf_pos = 0; int next_recv_buf = 0;
std::vector<void*> recv_buffers; std::vector<void*> recv_buffers;
std::vector<uint64_t> send_sizes;
~msgr_rdma_connection_t(); ~msgr_rdma_connection_t();
static msgr_rdma_connection_t *create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge, uint32_t max_msg); static msgr_rdma_connection_t *create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge, uint32_t max_msg);

View File

@@ -313,17 +313,18 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
stop_client(cl->peer_fd); stop_client(cl->peer_fd);
return false; return false;
} }
if (op->reply.hdr.retval >= 0 && bmp_len > 0) if (bmp_len > 0)
{ {
assert(op->bitmap); assert(op->bitmap);
cl->recv_list.push_back(op->bitmap, bmp_len); cl->recv_list.push_back(op->bitmap, bmp_len);
cl->read_remaining += bmp_len;
} }
if (op->reply.hdr.retval > 0) if (op->reply.hdr.retval > 0)
{ {
assert(op->iov.count > 0); assert(op->iov.count > 0);
cl->recv_list.append(op->iov); cl->recv_list.append(op->iov);
cl->read_remaining += op->reply.hdr.retval;
} }
cl->read_remaining = op->reply.hdr.retval + bmp_len;
if (cl->read_remaining == 0) if (cl->read_remaining == 0)
{ {
goto reuse; goto reuse;

View File

@@ -39,6 +39,11 @@ struct __attribute__((__packed__)) obj_ver_id
uint64_t version; uint64_t version;
}; };
inline bool operator == (const obj_ver_id & a, const obj_ver_id & b)
{
return a.oid == b.oid && a.version == b.version;
}
inline bool operator < (const obj_ver_id & a, const obj_ver_id & b) inline bool operator < (const obj_ver_id & a, const obj_ver_id & b)
{ {
return a.oid < b.oid || a.oid == b.oid && a.version < b.version; return a.oid < b.oid || a.oid == b.oid && a.version < b.version;

View File

@@ -35,18 +35,18 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
this->ringloop = ringloop; this->ringloop = ringloop;
this->config = msgr.read_config(config).object_items(); this->cli_config = config.object_items();
if (this->config.find("log_level") == this->config.end()) this->file_config = msgr.read_config(this->cli_config);
this->config["log_level"] = 1; parse_config(true);
parse_config(this->config, true);
epmgr = new epoll_manager_t(ringloop); epmgr = new epoll_manager_t(ringloop);
// FIXME: Use timerfd_interval based directly on io_uring // FIXME: Use timerfd_interval based directly on io_uring
this->tfd = epmgr->tfd; this->tfd = epmgr->tfd;
auto bs_cfg = json_to_bs(this->config); if (!json_is_true(this->config["disable_blockstore"]))
this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
{ {
auto bs_cfg = json_to_bs(this->config);
this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
// Autosync based on the number of unstable writes to prevent stalls due to insufficient journal space // Autosync based on the number of unstable writes to prevent stalls due to insufficient journal space
uint64_t max_autosync = bs->get_journal_size() / bs->get_block_size() / 2; uint64_t max_autosync = bs->get_journal_size() / bs->get_block_size() / 2;
if (autosync_writes > max_autosync) if (autosync_writes > max_autosync)
@@ -67,11 +67,11 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
} }
} }
this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id) print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{ {
print_stats(); print_stats();
}); });
this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id) slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
{ {
print_slow(); print_slow();
}); });
@@ -91,18 +91,42 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
osd_t::~osd_t() osd_t::~osd_t()
{ {
if (slow_log_timer_id >= 0)
{
tfd->clear_timer(slow_log_timer_id);
slow_log_timer_id = -1;
}
if (print_stats_timer_id >= 0)
{
tfd->clear_timer(print_stats_timer_id);
print_stats_timer_id = -1;
}
if (autosync_timer_id >= 0)
{
tfd->clear_timer(autosync_timer_id);
autosync_timer_id = -1;
}
ringloop->unregister_consumer(&consumer); ringloop->unregister_consumer(&consumer);
delete epmgr; delete epmgr;
delete bs; if (bs)
delete bs;
close(listen_fd); close(listen_fd);
free(zero_buffer); free(zero_buffer);
} }
void osd_t::parse_config(const json11::Json & config, bool allow_disk_params) void osd_t::parse_config(bool init)
{ {
config = msgr.merge_configs(cli_config, file_config, etcd_global_config, etcd_osd_config);
if (config.find("log_level") == this->config.end())
config["log_level"] = 1;
if (bs)
{
auto bs_cfg = json_to_bs(config);
bs->parse_config(bs_cfg);
}
st_cli.parse_config(config); st_cli.parse_config(config);
msgr.parse_config(config); msgr.parse_config(config);
if (allow_disk_params) if (init)
{ {
// OSD number // OSD number
osd_num = config["osd_num"].uint64_value(); osd_num = config["osd_num"].uint64_value();
@@ -124,24 +148,27 @@ void osd_t::parse_config(const json11::Json & config, bool allow_disk_params)
immediate_commit = IMMEDIATE_SMALL; immediate_commit = IMMEDIATE_SMALL;
else else
immediate_commit = IMMEDIATE_NONE; immediate_commit = IMMEDIATE_NONE;
// Bind address
bind_address = config["bind_address"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = config["bind_port"].uint64_value();
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
// OSD configuration
etcd_report_interval = config["etcd_report_interval"].uint64_value();
if (etcd_report_interval <= 0)
etcd_report_interval = 5;
readonly = json_is_true(config["readonly"]);
run_primary = !json_is_false(config["run_primary"]);
allow_test_ops = json_is_true(config["allow_test_ops"]);
} }
// Bind address
bind_address = config["bind_address"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = config["bind_port"].uint64_value();
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
// OSD configuration
log_level = config["log_level"].uint64_value(); log_level = config["log_level"].uint64_value();
etcd_report_interval = config["etcd_report_interval"].uint64_value(); auto old_no_rebalance = no_rebalance;
if (etcd_report_interval <= 0)
etcd_report_interval = 5;
readonly = json_is_true(config["readonly"]);
run_primary = !json_is_false(config["run_primary"]);
no_rebalance = json_is_true(config["no_rebalance"]); no_rebalance = json_is_true(config["no_rebalance"]);
auto old_no_recovery = no_recovery;
no_recovery = json_is_true(config["no_recovery"]); no_recovery = json_is_true(config["no_recovery"]);
allow_test_ops = json_is_true(config["allow_test_ops"]); auto old_autosync_interval = autosync_interval;
if (!config["autosync_interval"].is_null()) if (!config["autosync_interval"].is_null())
{ {
// Allow to set it to 0 // Allow to set it to 0
@@ -169,15 +196,46 @@ void osd_t::parse_config(const json11::Json & config, bool allow_disk_params)
recovery_sync_batch = config["recovery_sync_batch"].uint64_value(); recovery_sync_batch = config["recovery_sync_batch"].uint64_value();
if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE) if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE)
recovery_sync_batch = DEFAULT_RECOVERY_BATCH; recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
auto old_print_stats_interval = print_stats_interval;
print_stats_interval = config["print_stats_interval"].uint64_value(); print_stats_interval = config["print_stats_interval"].uint64_value();
if (!print_stats_interval) if (!print_stats_interval)
print_stats_interval = 3; print_stats_interval = 3;
auto old_slow_log_interval = slow_log_interval;
slow_log_interval = config["slow_log_interval"].uint64_value(); slow_log_interval = config["slow_log_interval"].uint64_value();
if (!slow_log_interval) if (!slow_log_interval)
slow_log_interval = 10; slow_log_interval = 10;
inode_vanish_time = config["inode_vanish_time"].uint64_value(); inode_vanish_time = config["inode_vanish_time"].uint64_value();
if (!inode_vanish_time) if (!inode_vanish_time)
inode_vanish_time = 60; inode_vanish_time = 60;
if ((old_no_rebalance && !no_rebalance || old_no_recovery && !no_recovery) &&
!(peering_state & (OSD_RECOVERING | OSD_FLUSHING_PGS)))
{
peering_state = peering_state | OSD_RECOVERING;
}
if (old_autosync_interval != autosync_interval && autosync_timer_id >= 0)
{
this->tfd->clear_timer(autosync_timer_id);
autosync_timer_id = this->tfd->set_timer(autosync_interval*1000, true, [this](int timer_id)
{
autosync();
});
}
if (old_print_stats_interval != print_stats_interval && print_stats_timer_id >= 0)
{
tfd->clear_timer(print_stats_timer_id);
print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{
print_stats();
});
}
if (old_slow_log_interval != slow_log_interval && slow_log_timer_id >= 0)
{
tfd->clear_timer(slow_log_timer_id);
slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
{
print_slow();
});
}
} }
void osd_t::bind_socket() void osd_t::bind_socket()
@@ -402,7 +460,7 @@ void osd_t::print_slow()
int l = sizeof(alloc), n; int l = sizeof(alloc), n;
char *buf = alloc; char *buf = alloc;
#define bufprintf(s, ...) { n = snprintf(buf, l, s, __VA_ARGS__); n = n < 0 ? 0 : n; buf += n; l -= n; } #define bufprintf(s, ...) { n = snprintf(buf, l, s, __VA_ARGS__); n = n < 0 ? 0 : n; buf += n; l -= n; }
bufprintf("[OSD %lu] Slow op", osd_num); bufprintf("[OSD %lu] Slow op %lx", osd_num, (unsigned long)op);
if (kv.second->osd_num) if (kv.second->osd_num)
{ {
bufprintf(" from peer OSD %lu (client %d)", kv.second->osd_num, kv.second->peer_fd); bufprintf(" from peer OSD %lu (client %d)", kv.second->osd_num, kv.second->peer_fd);
@@ -475,7 +533,7 @@ void osd_t::print_slow()
} }
} }
} }
if (has_slow) if (has_slow && bs)
{ {
bs->dump_diagnostics(); bs->dump_diagnostics();
} }

View File

@@ -90,7 +90,7 @@ class osd_t
{ {
// config // config
json11::Json::object config; json11::Json::object cli_config, file_config, etcd_global_config, etcd_osd_config, config;
int etcd_report_interval = 5; int etcd_report_interval = 5;
bool readonly = false; bool readonly = false;
@@ -126,6 +126,7 @@ class osd_t
bool pg_config_applied = false; bool pg_config_applied = false;
bool etcd_reporting_pg_state = false; bool etcd_reporting_pg_state = false;
bool etcd_reporting_stats = false; bool etcd_reporting_stats = false;
int autosync_timer_id = -1, print_stats_timer_id = -1, slow_log_timer_id = -1;
// peers and PGs // peers and PGs
@@ -152,7 +153,7 @@ class osd_t
bool stopping = false; bool stopping = false;
int inflight_ops = 0; int inflight_ops = 0;
blockstore_t *bs; blockstore_t *bs = NULL;
void *zero_buffer = NULL; void *zero_buffer = NULL;
uint64_t zero_buffer_size = 0; uint64_t zero_buffer_size = 0;
uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size; uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size;
@@ -173,7 +174,7 @@ class osd_t
uint64_t recovery_stat_bytes[2][2] = {}; uint64_t recovery_stat_bytes[2][2] = {};
// cluster connection // cluster connection
void parse_config(const json11::Json & config, bool allow_disk_params); void parse_config(bool init);
void init_cluster(); void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd); void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num); void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);

View File

@@ -75,7 +75,7 @@ void osd_t::init_cluster()
} }
if (run_primary && autosync_interval > 0) if (run_primary && autosync_interval > 0)
{ {
this->tfd->set_timer(autosync_interval*1000, true, [this](int timer_id) autosync_timer_id = this->tfd->set_timer(autosync_interval*1000, true, [this](int timer_id)
{ {
autosync(); autosync();
}); });
@@ -182,10 +182,10 @@ json11::Json osd_t::get_statistics()
char time_str[50] = { 0 }; char time_str[50] = { 0 };
sprintf(time_str, "%ld.%03ld", ts.tv_sec, ts.tv_nsec/1000000); sprintf(time_str, "%ld.%03ld", ts.tv_sec, ts.tv_nsec/1000000);
st["time"] = time_str; st["time"] = time_str;
st["blockstore_ready"] = bs->is_started();
st["data_block_size"] = (uint64_t)bs->get_block_size();
if (bs) if (bs)
{ {
st["blockstore_ready"] = bs->is_started();
st["data_block_size"] = (uint64_t)bs->get_block_size();
st["size"] = bs->get_block_count() * bs->get_block_size(); st["size"] = bs->get_block_count() * bs->get_block_size();
st["free"] = bs->get_free_block_count() * bs->get_block_size(); st["free"] = bs->get_free_block_count() * bs->get_block_size();
} }
@@ -233,7 +233,8 @@ void osd_t::report_statistics()
json11::Json::object inode_space; json11::Json::object inode_space;
json11::Json::object last_stat; json11::Json::object last_stat;
pool_id_t last_pool = 0; pool_id_t last_pool = 0;
auto & bs_inode_space = bs->get_inode_space_stats(); std::map<uint64_t, uint64_t> bs_empty_space;
auto & bs_inode_space = bs ? bs->get_inode_space_stats() : bs_empty_space;
for (auto kv: bs_inode_space) for (auto kv: bs_inode_space)
{ {
pool_id_t pool_id = INODE_POOL(kv.first); pool_id_t pool_id = INODE_POOL(kv.first);
@@ -374,7 +375,11 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes) void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
{ {
// FIXME apply config changes in runtime (maybe, some) if (changes.find(st_cli.etcd_prefix+"/config/global") != changes.end())
{
etcd_global_config = changes[st_cli.etcd_prefix+"/config/global"].value.object_items();
parse_config(false);
}
if (run_primary) if (run_primary)
{ {
apply_pg_count(); apply_pg_count();
@@ -384,11 +389,8 @@ void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes
void osd_t::on_load_config_hook(json11::Json::object & global_config) void osd_t::on_load_config_hook(json11::Json::object & global_config)
{ {
json11::Json::object osd_config = this->config; etcd_global_config = global_config;
for (auto & kv: global_config) parse_config(true);
if (osd_config.find(kv.first) == osd_config.end())
osd_config[kv.first] = kv.second;
parse_config(osd_config, false);
bind_socket(); bind_socket();
acquire_lease(); acquire_lease();
} }
@@ -734,7 +736,7 @@ void osd_t::apply_pg_config()
.pg_cursize = 0, .pg_cursize = 0,
.pg_size = pool_item.second.pg_size, .pg_size = pool_item.second.pg_size,
.pg_minsize = pool_item.second.pg_minsize, .pg_minsize = pool_item.second.pg_minsize,
.pg_data_size = pg.scheme == POOL_SCHEME_REPLICATED .pg_data_size = pool_item.second.scheme == POOL_SCHEME_REPLICATED
? 1 : pool_item.second.pg_size - pool_item.second.parity_chunks, ? 1 : pool_item.second.pg_size - pool_item.second.parity_chunks,
.pool_id = pool_id, .pool_id = pool_id,
.pg_num = pg_num, .pg_num = pg_num,

View File

@@ -64,6 +64,11 @@ void osd_t::submit_pg_flush_ops(pg_t & pg)
void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval) void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, osd_num_t peer_osd, int retval)
{ {
if (log_level > 2)
{
printf("[PG %u/%u] flush batch %lx completed on OSD %lu with result %d\n",
pool_id, pg_num, (uint64_t)fb, peer_osd, retval);
}
pool_pg_num_t pg_id = { .pool_id = pool_id, .pg_num = pg_num }; pool_pg_num_t pg_id = { .pool_id = pool_id, .pg_num = pg_num };
if (pgs.find(pg_id) == pgs.end() || pgs[pg_id].flush_batch != fb) if (pgs.find(pg_id) == pgs.end() || pgs[pg_id].flush_batch != fb)
{ {
@@ -99,10 +104,9 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
std::vector<osd_op_t*> continue_ops; std::vector<osd_op_t*> continue_ops;
auto & pg = pgs.at(pg_id); auto & pg = pgs.at(pg_id);
auto it = pg.flush_actions.begin(), prev_it = it; auto it = pg.flush_actions.begin(), prev_it = it;
auto erase_start = it;
while (1) while (1)
{ {
if (it == pg.flush_actions.end() || if (it == pg.flush_actions.end() || !it->second.submitted ||
it->first.oid.inode != prev_it->first.oid.inode || it->first.oid.inode != prev_it->first.oid.inode ||
(it->first.oid.stripe & ~STRIPE_MASK) != (prev_it->first.oid.stripe & ~STRIPE_MASK)) (it->first.oid.stripe & ~STRIPE_MASK) != (prev_it->first.oid.stripe & ~STRIPE_MASK))
{ {
@@ -116,29 +120,23 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
}); });
if (wr_it != pg.write_queue.end()) if (wr_it != pg.write_queue.end())
{ {
if (log_level > 2)
{
printf("[PG %u/%u] continuing write %lx to object %lx:%lx after flush\n",
pool_id, pg_num, (uint64_t)wr_it->second, wr_it->first.inode, wr_it->first.stripe);
}
continue_ops.push_back(wr_it->second); continue_ops.push_back(wr_it->second);
pg.write_queue.erase(wr_it);
} }
} }
if ((it == pg.flush_actions.end() || !it->second.submitted) && if (it == pg.flush_actions.end() || !it->second.submitted)
erase_start != it)
{
pg.flush_actions.erase(erase_start, it);
}
if (it == pg.flush_actions.end())
{ {
if (it != pg.flush_actions.begin())
{
pg.flush_actions.erase(pg.flush_actions.begin(), it);
}
break; break;
} }
prev_it = it; prev_it = it++;
if (!it->second.submitted)
{
it++;
erase_start = it;
}
else
{
it++;
}
} }
delete fb; delete fb;
pg.flush_batch = NULL; pg.flush_batch = NULL;
@@ -168,6 +166,18 @@ bool osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
// Copy buffer so it gets freed along with the operation // Copy buffer so it gets freed along with the operation
op->buf = malloc_or_die(sizeof(obj_ver_id) * count); op->buf = malloc_or_die(sizeof(obj_ver_id) * count);
memcpy(op->buf, data, sizeof(obj_ver_id) * count); memcpy(op->buf, data, sizeof(obj_ver_id) * count);
if (log_level > 2)
{
printf(
"[PG %u/%u] flush batch %lx on OSD %lu: %s objects: ",
pool_id, pg_num, (uint64_t)fb, peer_osd, rollback ? "rollback" : "stabilize"
);
for (int i = 0; i < count; i++)
{
printf(i > 0 ? ", %lx:%lx v%lu" : "%lx:%lx v%lu", data[i].oid.inode, data[i].oid.stripe, data[i].version);
}
printf("\n");
}
if (peer_osd == this->osd_num) if (peer_osd == this->osd_num)
{ {
// local // local
@@ -304,9 +314,10 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
{ {
// PG is stopped or one of the OSDs is gone, error is harmless // PG is stopped or one of the OSDs is gone, error is harmless
printf( printf(
"Recovery operation failed with object %lx:%lx (PG %u/%u)\n", "[PG %u/%u] Recovery operation failed with object %lx:%lx\n",
op->oid.inode, op->oid.stripe, INODE_POOL(op->oid.inode), INODE_POOL(op->oid.inode),
map_to_pg(op->oid, st_cli.pool_config.at(INODE_POOL(op->oid.inode)).pg_stripe_size) map_to_pg(op->oid, st_cli.pool_config.at(INODE_POOL(op->oid.inode)).pg_stripe_size),
op->oid.inode, op->oid.stripe
); );
} }
else else

View File

@@ -191,7 +191,7 @@ struct __attribute__((__packed__)) osd_op_rw_t
uint64_t inode; uint64_t inode;
// offset // offset
uint64_t offset; uint64_t offset;
// length // length. 0 means to read all bitmaps of the specified range, but no data.
uint32_t len; uint32_t len;
// flags (for future) // flags (for future)
uint32_t flags; uint32_t flags;

View File

@@ -76,7 +76,7 @@ void osd_t::handle_peers()
peering_state = peering_state & ~OSD_FLUSHING_PGS | OSD_RECOVERING; peering_state = peering_state & ~OSD_FLUSHING_PGS | OSD_RECOVERING;
} }
} }
if ((peering_state & OSD_RECOVERING) && !readonly) if (!(peering_state & OSD_FLUSHING_PGS) && (peering_state & OSD_RECOVERING) && !readonly)
{ {
if (!continue_recovery()) if (!continue_recovery())
{ {
@@ -483,6 +483,10 @@ void osd_t::report_pg_state(pg_t & pg)
pg.all_peers = pg.target_set; pg.all_peers = pg.target_set;
std::sort(pg.all_peers.begin(), pg.all_peers.end()); std::sort(pg.all_peers.begin(), pg.all_peers.end());
pg.cur_peers = pg.target_set; pg.cur_peers = pg.target_set;
// Change pg_config at the same time, otherwise our PG reconciling loop may try to apply the old metadata
auto & pg_cfg = st_cli.pool_config[pg.pool_id].pg_config[pg.pg_num];
pg_cfg.target_history = pg.target_history;
pg_cfg.all_peers = pg.all_peers;
} }
else if (pg.state == (PG_ACTIVE|PG_LEFT_ON_DEAD)) else if (pg.state == (PG_ACTIVE|PG_LEFT_ON_DEAD))
{ {
@@ -522,6 +526,9 @@ void osd_t::report_pg_state(pg_t & pg)
pg.cur_peers.push_back(pg_osd); pg.cur_peers.push_back(pg_osd);
} }
} }
auto & pg_cfg = st_cli.pool_config[pg.pool_id].pg_config[pg.pg_num];
pg_cfg.target_history = pg.target_history;
pg_cfg.all_peers = pg.all_peers;
} }
if (pg.state == PG_OFFLINE && !this->pg_config_applied) if (pg.state == PG_OFFLINE && !this->pg_config_applied)
{ {

View File

@@ -91,7 +91,7 @@ void pg_obj_state_check_t::walk()
pg->state |= PG_DEGRADED; pg->state |= PG_DEGRADED;
} }
pg->state |= PG_ACTIVE; pg->state |= PG_ACTIVE;
if (pg->state == PG_ACTIVE && pg->cur_peers.size() < pg->all_peers.size()) if (pg->cur_peers.size() < pg->all_peers.size())
{ {
pg->state |= PG_LEFT_ON_DEAD; pg->state |= PG_LEFT_ON_DEAD;
} }

View File

@@ -186,10 +186,22 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
cur_op->reply.rw.bitmap_len = 0; cur_op->reply.rw.bitmap_len = 0;
{ {
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num }); auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
for (int role = 0; role < op_data->pg_data_size; role++) if (cur_op->req.rw.len == 0)
{ {
op_data->stripes[role].read_start = op_data->stripes[role].req_start; // len=0 => bitmap read
op_data->stripes[role].read_end = op_data->stripes[role].req_end; for (int role = 0; role < op_data->pg_data_size; role++)
{
op_data->stripes[role].read_start = 0;
op_data->stripes[role].read_end = UINT32_MAX;
}
}
else
{
for (int role = 0; role < op_data->pg_data_size; role++)
{
op_data->stripes[role].read_start = op_data->stripes[role].req_start;
op_data->stripes[role].read_end = op_data->stripes[role].req_end;
}
} }
// Determine version // Determine version
auto vo_it = pg.ver_override.find(op_data->oid); auto vo_it = pg.ver_override.find(op_data->oid);

View File

@@ -219,7 +219,7 @@ int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
op_data->n_subops++; op_data->n_subops++;
} }
} }
if (op_data->n_subops) if (op_data->n_subops > 0)
{ {
op_data->fact_ver = 0; op_data->fact_ver = 0;
op_data->done = op_data->errors = 0; op_data->done = op_data->errors = 0;

View File

@@ -53,7 +53,10 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
inode_stats[cur_op->req.rw.inode].op_count[inode_st_op]++; inode_stats[cur_op->req.rw.inode].op_count[inode_st_op]++;
inode_stats[cur_op->req.rw.inode].op_sum[inode_st_op] += usec; inode_stats[cur_op->req.rw.inode].op_sum[inode_st_op] += usec;
if (cur_op->req.hdr.opcode == OSD_OP_DELETE) if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->op_data->pg_data_size * bs_block_size; {
if (cur_op->op_data)
inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->op_data->pg_data_size * bs_block_size;
}
else else
inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->req.rw.len; inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->req.rw.len;
} }
@@ -148,6 +151,13 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
{ {
int stripe_num = rep ? 0 : role; int stripe_num = rep ? 0 : role;
osd_op_t *subop = op_data->subops + i; osd_op_t *subop = op_data->subops + i;
uint32_t subop_len = wr
? stripes[stripe_num].write_end - stripes[stripe_num].write_start
: stripes[stripe_num].read_end - stripes[stripe_num].read_start;
if (!wr && stripes[stripe_num].read_end == UINT32_MAX)
{
subop_len = 0;
}
if (role_osd_num == this->osd_num) if (role_osd_num == this->osd_num)
{ {
clock_gettime(CLOCK_REALTIME, &subop->tv_begin); clock_gettime(CLOCK_REALTIME, &subop->tv_begin);
@@ -166,7 +176,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
}, },
.version = op_version, .version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start, .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start, .len = subop_len,
.buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf, .buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
.bitmap = stripes[stripe_num].bmp_buf, .bitmap = stripes[stripe_num].bmp_buf,
}); });
@@ -196,7 +206,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
}, },
.version = op_version, .version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start, .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start, .len = subop_len,
.attr_len = wr ? clean_entry_bitmap_size : 0, .attr_len = wr ? clean_entry_bitmap_size : 0,
}; };
#ifdef OSD_DEBUG #ifdef OSD_DEBUG
@@ -215,9 +225,9 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
} }
else else
{ {
if (stripes[stripe_num].read_end > stripes[stripe_num].read_start) if (subop_len > 0)
{ {
subop->iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start); subop->iov.push_back(stripes[stripe_num].read_buf, subop_len);
} }
} }
subop->callback = [cur_op, this](osd_op_t *subop) subop->callback = [cur_op, this](osd_op_t *subop)
@@ -469,7 +479,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
osd_primary_op_data_t *op_data = cur_op->op_data; osd_primary_op_data_t *op_data = cur_op->op_data;
op_data->n_subops = chunks_to_delete_count; op_data->n_subops = chunks_to_delete_count;
op_data->done = op_data->errors = op_data->errcode = 0; op_data->done = op_data->errors = op_data->errcode = 0;
if (!op_data->n_subops) if (op_data->n_subops <= 0)
{ {
return; return;
} }

View File

@@ -166,7 +166,7 @@ resume_6:
for (int i = 0; i < unstable_osd.len; i++) for (int i = 0; i < unstable_osd.len; i++)
{ {
// Except those from peered PGs // Except those from peered PGs
auto & w = op_data->unstable_writes[i]; auto & w = op_data->unstable_writes[unstable_osd.start + i];
pool_pg_num_t wpg = { pool_pg_num_t wpg = {
.pool_id = INODE_POOL(w.oid.inode), .pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size), .pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),

View File

@@ -12,6 +12,7 @@ bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
.oid = op_data->oid, .oid = op_data->oid,
.osd_num = 0, .osd_num = 0,
}); });
op_data->st = 1;
if (act_it != pg.flush_actions.end() && if (act_it != pg.flush_actions.end() &&
act_it->first.oid.inode == op_data->oid.inode && act_it->first.oid.inode == op_data->oid.inode &&
(act_it->first.oid.stripe & ~STRIPE_MASK) == op_data->oid.stripe) (act_it->first.oid.stripe & ~STRIPE_MASK) == op_data->oid.stripe)
@@ -23,7 +24,6 @@ bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
auto vo_it = pg.write_queue.find(op_data->oid); auto vo_it = pg.write_queue.find(op_data->oid);
if (vo_it != pg.write_queue.end()) if (vo_it != pg.write_queue.end())
{ {
op_data->st = 1;
pg.write_queue.emplace(op_data->oid, cur_op); pg.write_queue.emplace(op_data->oid, cur_op);
return false; return false;
} }

View File

@@ -28,7 +28,9 @@ static inline void extend_read(uint32_t start, uint32_t end, osd_rmw_stripe_t &
} }
else else
{ {
if (stripe.read_end < end) if (stripe.read_end < end && end != UINT32_MAX ||
// UINT32_MAX means that stripe only needs bitmap, end != 0 => needs also data
stripe.read_end == UINT32_MAX && end != 0)
stripe.read_end = end; stripe.read_end = end;
if (stripe.read_start > start) if (stripe.read_start > start)
stripe.read_start = start; stripe.read_start = start;
@@ -105,24 +107,30 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bi
} }
else if (prev >= 0) else if (prev >= 0)
{ {
assert(stripes[role].read_start >= stripes[prev].read_start && if (stripes[role].read_end != UINT32_MAX)
stripes[role].read_start >= stripes[other].read_start); {
memxor( assert(stripes[role].read_start >= stripes[prev].read_start &&
(uint8_t*)stripes[prev].read_buf + (stripes[role].read_start - stripes[prev].read_start), stripes[role].read_start >= stripes[other].read_start);
(uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start), memxor(
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start (uint8_t*)stripes[prev].read_buf + (stripes[role].read_start - stripes[prev].read_start),
); (uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
);
}
memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size); memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
prev = -1; prev = -1;
} }
else else
{ {
assert(stripes[role].read_start >= stripes[other].read_start); if (stripes[role].read_end != UINT32_MAX)
memxor( {
stripes[role].read_buf, assert(stripes[role].read_start >= stripes[other].read_start);
(uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start), memxor(
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start stripes[role].read_buf,
); (uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
);
}
memxor(stripes[role].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size); memxor(stripes[role].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
} }
} }
@@ -142,11 +150,11 @@ inline bool operator < (const reed_sol_erased_t &a, const reed_sol_erased_t &b)
for (int i = 0; i < a.size && i < b.size; i++) for (int i = 0; i < a.size && i < b.size; i++)
{ {
if (a.data[i] < b.data[i]) if (a.data[i] < b.data[i])
return -1; return true;
else if (a.data[i] > b.data[i]) else if (a.data[i] > b.data[i])
return 1; return false;
} }
return 0; return false;
} }
struct reed_sol_matrix_t struct reed_sol_matrix_t
@@ -356,20 +364,23 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
uint64_t read_start = 0, read_end = 0; uint64_t read_start = 0, read_end = 0;
auto recover_seq = [&]() auto recover_seq = [&]()
{ {
int orig = 0; if (read_end != UINT32_MAX)
for (int other = 0; other < pg_size; other++)
{ {
if (stripes[other].read_end != 0 && !stripes[other].missing) int orig = 0;
for (int other = 0; other < pg_size && orig < pg_minsize; other++)
{ {
assert(stripes[other].read_start <= read_start); if (stripes[other].read_end != 0 && !stripes[other].missing)
assert(stripes[other].read_end >= read_end); {
data_ptrs[orig++] = (uint8_t*)stripes[other].read_buf + (read_start - stripes[other].read_start); assert(stripes[other].read_start <= read_start);
assert(stripes[other].read_end >= read_end);
data_ptrs[orig++] = (uint8_t*)stripes[other].read_buf + (read_start - stripes[other].read_start);
}
} }
ec_encode_data(
read_end-read_start, pg_minsize, wanted, dectable + wanted_base*32*pg_minsize,
data_ptrs, data_ptrs + pg_minsize
);
} }
ec_encode_data(
read_end-read_start, pg_minsize, wanted, dectable + wanted_base*32*pg_minsize,
data_ptrs, data_ptrs + pg_minsize
);
wanted_base += wanted; wanted_base += wanted;
wanted = 0; wanted = 0;
}; };
@@ -391,6 +402,32 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
{ {
recover_seq(); recover_seq();
} }
// Recover bitmaps
if (bitmap_size > 0)
{
for (int role = 0; role < pg_minsize; role++)
{
if (stripes[role].read_end != 0 && stripes[role].missing)
{
data_ptrs[pg_minsize + (wanted++)] = (uint8_t*)stripes[role].bmp_buf;
}
}
if (wanted > 0)
{
int orig = 0;
for (int other = 0; other < pg_size && orig < pg_minsize; other++)
{
if (stripes[other].read_end != 0 && !stripes[other].missing)
{
data_ptrs[orig++] = (uint8_t*)stripes[other].bmp_buf;
}
}
ec_encode_data(
bitmap_size, pg_minsize, wanted, dectable,
data_ptrs, data_ptrs + pg_minsize
);
}
}
} }
#else #else
void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size) void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
@@ -412,7 +449,8 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
if (stripes[role].read_end != 0 && stripes[role].missing) if (stripes[role].read_end != 0 && stripes[role].missing)
{ {
recovered = true; recovered = true;
if (stripes[role].read_end > stripes[role].read_start) if (stripes[role].read_end > stripes[role].read_start &&
stripes[role].read_end != UINT32_MAX)
{ {
for (int other = 0; other < pg_size; other++) for (int other = 0; other < pg_size; other++)
{ {
@@ -531,7 +569,8 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
uint64_t buf_size = add_size; uint64_t buf_size = add_size;
for (int role = 0; role < read_pg_size; role++) for (int role = 0; role < read_pg_size; role++)
{ {
if (stripes[role].read_end != 0) if (stripes[role].read_end != 0 &&
stripes[role].read_end != UINT32_MAX)
{ {
buf_size += stripes[role].read_end - stripes[role].read_start; buf_size += stripes[role].read_end - stripes[role].read_start;
} }
@@ -541,7 +580,8 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
uint64_t buf_pos = add_size; uint64_t buf_pos = add_size;
for (int role = 0; role < read_pg_size; role++) for (int role = 0; role < read_pg_size; role++)
{ {
if (stripes[role].read_end != 0) if (stripes[role].read_end != 0 &&
stripes[role].read_end != UINT32_MAX)
{ {
stripes[role].read_buf = (uint8_t*)buf + buf_pos; stripes[role].read_buf = (uint8_t*)buf + buf_pos;
buf_pos += stripes[role].read_end - stripes[role].read_start; buf_pos += stripes[role].read_end - stripes[role].read_start;
@@ -677,11 +717,11 @@ void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_
static void get_old_new_buffers(osd_rmw_stripe_t & stripe, uint32_t wr_start, uint32_t wr_end, buf_len_t *bufs, int & nbufs) static void get_old_new_buffers(osd_rmw_stripe_t & stripe, uint32_t wr_start, uint32_t wr_end, buf_len_t *bufs, int & nbufs)
{ {
uint32_t ns = 0, ne = 0, os = 0, oe = 0; uint32_t ns = 0, ne = 0, os = 0, oe = 0;
if (stripe.req_end > wr_start && if (stripe.write_end > wr_start &&
stripe.req_start < wr_end) stripe.write_start < wr_end)
{ {
ns = std::max(stripe.req_start, wr_start); ns = std::max(stripe.write_start, wr_start);
ne = std::min(stripe.req_end, wr_end); ne = std::min(stripe.write_end, wr_end);
} }
if (stripe.read_end > wr_start && if (stripe.read_end > wr_start &&
stripe.read_start < wr_end) stripe.read_start < wr_end)
@@ -692,7 +732,7 @@ static void get_old_new_buffers(osd_rmw_stripe_t & stripe, uint32_t wr_start, ui
if (ne && (!oe || ns <= os)) if (ne && (!oe || ns <= os))
{ {
// NEW or NEW->OLD // NEW or NEW->OLD
bufs[nbufs++] = { .buf = (uint8_t*)stripe.write_buf + ns - stripe.req_start, .len = ne-ns }; bufs[nbufs++] = { .buf = (uint8_t*)stripe.write_buf + ns - stripe.write_start, .len = ne-ns };
if (os < ne) if (os < ne)
os = ne; os = ne;
if (oe > os) if (oe > os)
@@ -708,7 +748,7 @@ static void get_old_new_buffers(osd_rmw_stripe_t & stripe, uint32_t wr_start, ui
{ {
// OLD->NEW or OLD->NEW->OLD // OLD->NEW or OLD->NEW->OLD
bufs[nbufs++] = { .buf = (uint8_t*)stripe.read_buf + os - stripe.read_start, .len = ns-os }; bufs[nbufs++] = { .buf = (uint8_t*)stripe.read_buf + os - stripe.read_start, .len = ns-os };
bufs[nbufs++] = { .buf = (uint8_t*)stripe.write_buf + ns - stripe.req_start, .len = ne-ns }; bufs[nbufs++] = { .buf = (uint8_t*)stripe.write_buf + ns - stripe.write_start, .len = ne-ns };
if (oe > ne) if (oe > ne)
{ {
// OLD->NEW->OLD // OLD->NEW->OLD
@@ -759,7 +799,18 @@ static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int
uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_granularity, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_granularity,
uint32_t &start, uint32_t &end) uint32_t &start, uint32_t &end)
{ {
if (write_osd_set[pg_minsize] != 0 || write_osd_set != read_osd_set) bool required = false;
for (int role = pg_minsize; role < pg_size; role++)
{
if (write_osd_set[role] != 0)
{
// Whole parity chunk is needed when we move the object
if (write_osd_set[role] != read_osd_set[role])
end = chunk_size;
required = true;
}
}
if (required && end != chunk_size)
{ {
// start & end are required for calc_rmw_parity // start & end are required for calc_rmw_parity
for (int role = 0; role < pg_minsize; role++) for (int role = 0; role < pg_minsize; role++)
@@ -770,14 +821,6 @@ static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int
end = std::max(stripes[role].req_end, end); end = std::max(stripes[role].req_end, end);
} }
} }
for (int role = pg_minsize; role < pg_size; role++)
{
if (write_osd_set[role] != 0 && write_osd_set[role] != read_osd_set[role])
{
start = 0;
end = chunk_size;
}
}
} }
// Set bitmap bits accordingly // Set bitmap bits accordingly
if (bitmap_granularity > 0) if (bitmap_granularity > 0)

View File

@@ -23,6 +23,7 @@ struct osd_rmw_stripe_t
void *read_buf, *write_buf; void *read_buf, *write_buf;
void *bmp_buf; void *bmp_buf;
uint32_t req_start, req_end; uint32_t req_start, req_end;
// read_end=UINT32_MAX means to only read bitmap, but not data
uint32_t read_start, read_end; uint32_t read_start, read_end;
uint32_t write_start, write_end; uint32_t write_start, write_end;
bool missing; bool missing;

View File

@@ -17,6 +17,7 @@ void test4();
void test5(); void test5();
void test6(); void test6();
void test7(); void test7();
void test_rmw_4k_degraded_into_lost_to_normal(bool ec);
void test8(); void test8();
void test9(); void test9();
void test10(); void test10();
@@ -24,8 +25,9 @@ void test11();
void test12(); void test12();
void test13(); void test13();
void test14(); void test14();
void test15(); void test15(bool second);
void test16(); void test16();
void test_recover_22_d2();
int main(int narg, char *args[]) int main(int narg, char *args[])
{ {
@@ -39,6 +41,8 @@ int main(int narg, char *args[])
test6(); test6();
// Test 7 // Test 7
test7(); test7();
test_rmw_4k_degraded_into_lost_to_normal(false);
test_rmw_4k_degraded_into_lost_to_normal(true);
// Test 8 // Test 8
test8(); test8();
// Test 9 // Test 9
@@ -54,9 +58,12 @@ int main(int narg, char *args[])
// Test 14 // Test 14
test14(); test14();
// Test 15 // Test 15
test15(); test15(false);
test15(true);
// Test 16 // Test 16
test16(); test16();
// Test 17
test_recover_22_d2();
// End // End
printf("all ok\n"); printf("all ok\n");
return 0; return 0;
@@ -315,6 +322,69 @@ void test7()
/*** /***
7/2. calc_rmw(offset=48K, len=4K, osd_set=[0,2,3], write_set=[1,2,3])
= {
read: [ [ 0, 128K ], [ 0, 128K ], [ 0, 128K ] ],
write: [ [ 48K, 52K ], [ 0, 0 ], [ 48K, 52K ] ],
input buffer: [ write0 ],
rmw buffer: [ write2, read0, read1, read2 ],
}
then, after calc_rmw_parity_xor/ec(): {
write: [ [ 0, 128K ], [ 0, 0 ], [ 48K, 52K ] ],
write0==read0,
}
+ check write0, write2 buffers
***/
void test_rmw_4k_degraded_into_lost_to_normal(bool ec)
{
osd_num_t osd_set[3] = { 0, 2, 3 };
osd_num_t write_osd_set[3] = { 1, 2, 3 };
osd_rmw_stripe_t stripes[3] = {};
// Subtest 1
split_stripes(2, 128*1024, 48*1024, 4096, stripes);
void *write_buf = malloc(4096);
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
assert(stripes[0].write_start == 48*1024 && stripes[0].write_end == 52*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 48*1024 && stripes[2].write_end == 52*1024);
assert(stripes[0].read_buf == (uint8_t*)rmw_buf+4*1024);
assert(stripes[1].read_buf == (uint8_t*)rmw_buf+4*1024+128*1024);
assert(stripes[2].read_buf == (uint8_t*)rmw_buf+4*1024+2*128*1024);
assert(stripes[0].write_buf == write_buf);
assert(stripes[1].write_buf == NULL);
assert(stripes[2].write_buf == rmw_buf);
// Subtest 2
set_pattern(write_buf, 4096, PATTERN2);
set_pattern(stripes[1].read_buf, 128*1024, PATTERN1);
set_pattern(stripes[2].read_buf, 128*1024, PATTERN0^PATTERN1);
if (!ec)
calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
else
{
use_ec(3, 2, true);
calc_rmw_parity_ec(stripes, 3, 2, osd_set, write_osd_set, 128*1024, 0);
use_ec(3, 2, false);
}
assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
assert(stripes[2].write_start == 48*1024 && stripes[2].write_end == 52*1024);
assert(stripes[0].write_buf == stripes[0].read_buf);
assert(stripes[1].write_buf == NULL);
assert(stripes[2].write_buf == rmw_buf);
check_pattern(stripes[0].write_buf, 4096, PATTERN0);
check_pattern(stripes[0].write_buf+48*1024, 4096, PATTERN2);
check_pattern(stripes[2].write_buf, 4096, PATTERN2^PATTERN1); // new parity
free(rmw_buf);
free(write_buf);
}
/***
8. calc_rmw(offset=0, len=128K+4K, osd_set=[0,2,3], write_set=[1,2,3]) 8. calc_rmw(offset=0, len=128K+4K, osd_set=[0,2,3], write_set=[1,2,3])
= { = {
read: [ [ 0, 0 ], [ 4K, 128K ], [ 0, 0 ] ], read: [ [ 0, 0 ], [ 4K, 128K ], [ 0, 0 ] ],
@@ -826,12 +896,11 @@ void test14()
***/ ***/
void test15() void test15(bool second)
{ {
const int bmp = 64*1024 / 4096 / 8; const int bmp = 64*1024 / 4096 / 8;
use_ec(4, 2, true); use_ec(4, 2, true);
osd_num_t osd_set[4] = { 1, 2, 3, 0 }; osd_num_t osd_set[4] = { 1, 2, (osd_num_t)(second ? 0 : 3), (osd_num_t)(second ? 4 : 0) };
osd_num_t write_osd_set[4] = { 1, 2, 3, 0 };
osd_rmw_stripe_t stripes[4] = {}; osd_rmw_stripe_t stripes[4] = {};
unsigned bitmaps[4] = { 0 }; unsigned bitmaps[4] = { 0 };
// Test 15.0 // Test 15.0
@@ -842,7 +911,7 @@ void test15()
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0); assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
assert(stripes[3].req_start == 0 && stripes[3].req_end == 0); assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
// Test 15.1 // Test 15.1
void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 3, write_osd_set, 64*1024, bmp); void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 3, osd_set, 64*1024, bmp);
for (int i = 0; i < 4; i++) for (int i = 0; i < 4; i++)
stripes[i].bmp_buf = bitmaps+i; stripes[i].bmp_buf = bitmaps+i;
assert(rmw_buf); assert(rmw_buf);
@@ -852,32 +921,34 @@ void test15()
assert(stripes[3].read_start == 0 && stripes[3].read_end == 0); assert(stripes[3].read_start == 0 && stripes[3].read_end == 0);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0); assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 28*1024 && stripes[1].write_end == 32*1024); assert(stripes[1].write_start == 28*1024 && stripes[1].write_end == 32*1024);
assert(stripes[2].write_start == 28*1024 && stripes[2].write_end == 32*1024); assert(stripes[2+second].write_start == 28*1024 && stripes[2+second].write_end == 32*1024);
assert(stripes[3].write_start == 0 && stripes[3].write_end == 0); assert(stripes[3-second].write_start == 0 && stripes[3-second].write_end == 0);
assert(stripes[0].read_buf == (uint8_t*)rmw_buf+4*1024); assert(stripes[0].read_buf == (uint8_t*)rmw_buf+4*1024);
assert(stripes[1].read_buf == NULL); assert(stripes[1].read_buf == NULL);
assert(stripes[2].read_buf == NULL); assert(stripes[2].read_buf == NULL);
assert(stripes[3].read_buf == NULL); assert(stripes[3].read_buf == NULL);
assert(stripes[0].write_buf == NULL); assert(stripes[0].write_buf == NULL);
assert(stripes[1].write_buf == (uint8_t*)write_buf); assert(stripes[1].write_buf == (uint8_t*)write_buf);
assert(stripes[2].write_buf == rmw_buf); assert(stripes[2+second].write_buf == rmw_buf);
assert(stripes[3].write_buf == NULL); assert(stripes[3-second].write_buf == NULL);
// Test 15.2 - encode // Test 15.2 - encode
set_pattern(write_buf, 4*1024, PATTERN1); set_pattern(write_buf, 4*1024, PATTERN1);
set_pattern(stripes[0].read_buf, 4*1024, PATTERN2); set_pattern(stripes[0].read_buf, 4*1024, PATTERN2);
memset(stripes[0].bmp_buf, 0, bmp); memset(stripes[0].bmp_buf, 0, bmp);
memset(stripes[1].bmp_buf, 0, bmp); memset(stripes[1].bmp_buf, 0, bmp);
calc_rmw_parity_ec(stripes, 4, 2, osd_set, write_osd_set, 64*1024, bmp); memset(stripes[2+second].write_buf, 0, 4096);
assert(*(uint32_t*)stripes[2].bmp_buf == 0x80); calc_rmw_parity_ec(stripes, 4, 2, osd_set, osd_set, 64*1024, bmp);
assert(second || *(uint32_t*)stripes[2].bmp_buf == 0x80);
assert(stripes[0].write_start == 0 && stripes[0].write_end == 0); assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
assert(stripes[1].write_start == 28*1024 && stripes[1].write_end == 32*1024); assert(stripes[1].write_start == 28*1024 && stripes[1].write_end == 32*1024);
assert(stripes[2].write_start == 28*1024 && stripes[2].write_end == 32*1024); assert(stripes[2+second].write_start == 28*1024 && stripes[2+second].write_end == 32*1024);
assert(stripes[3].write_start == 0 && stripes[3].write_end == 0); assert(stripes[3-second].write_start == 0 && stripes[3-second].write_end == 0);
assert(stripes[0].write_buf == NULL); assert(stripes[0].write_buf == NULL);
assert(stripes[1].write_buf == (uint8_t*)write_buf); assert(stripes[1].write_buf == (uint8_t*)write_buf);
assert(stripes[2].write_buf == rmw_buf); assert(stripes[2+second].write_buf == rmw_buf);
assert(stripes[3].write_buf == NULL); assert(stripes[3-second].write_buf == NULL);
check_pattern(stripes[2].write_buf, 4*1024, PATTERN1^PATTERN2); // first parity is always xor :) // first parity is always xor :), second isn't...
check_pattern(stripes[2+second].write_buf, 4*1024, second ? 0xb79a59a0ce8b9b81 : PATTERN1^PATTERN2);
// Done // Done
free(rmw_buf); free(rmw_buf);
free(write_buf); free(write_buf);
@@ -977,7 +1048,12 @@ void test16()
assert(stripes[3].read_buf == (uint8_t*)read_buf+2*128*1024); assert(stripes[3].read_buf == (uint8_t*)read_buf+2*128*1024);
set_pattern(stripes[1].read_buf, 128*1024, PATTERN2); set_pattern(stripes[1].read_buf, 128*1024, PATTERN2);
memcpy(stripes[3].read_buf, rmw_buf, 128*1024); memcpy(stripes[3].read_buf, rmw_buf, 128*1024);
memset(stripes[0].bmp_buf, 0xa8, bmp);
memset(stripes[2].bmp_buf, 0xb7, bmp);
assert(bitmaps[1] == 0xFFFFFFFF);
assert(bitmaps[3] == 0xF1F1F1F1);
reconstruct_stripes_ec(stripes, 4, 2, bmp); reconstruct_stripes_ec(stripes, 4, 2, bmp);
assert(*(uint32_t*)stripes[3].bmp_buf == 0xF1F1F1F1);
assert(bitmaps[0] == 0xFFFFFFFF); assert(bitmaps[0] == 0xFFFFFFFF);
check_pattern(stripes[0].read_buf, 128*1024, PATTERN1); check_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
free(read_buf); free(read_buf);
@@ -986,3 +1062,47 @@ void test16()
free(write_buf); free(write_buf);
use_ec(4, 2, false); use_ec(4, 2, false);
} }
/***
17. EC 2+2 recover second data block
***/
void test_recover_22_d2()
{
const int bmp = 128*1024 / 4096 / 8;
use_ec(4, 2, true);
osd_num_t osd_set[4] = { 1, 0, 3, 4 };
osd_rmw_stripe_t stripes[4] = {};
unsigned bitmaps[4] = { 0 };
// Read 0-256K
split_stripes(2, 128*1024, 0, 256*1024, stripes);
assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
assert(stripes[1].req_start == 0 && stripes[1].req_end == 128*1024);
assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
uint8_t *data_buf = (uint8_t*)malloc_or_die(128*1024*4);
for (int i = 0; i < 4; i++)
{
stripes[i].read_start = stripes[i].req_start;
stripes[i].read_end = stripes[i].req_end;
stripes[i].read_buf = data_buf + i*128*1024;
stripes[i].bmp_buf = bitmaps + i;
}
// Read using parity
assert(extend_missing_stripes(stripes, osd_set, 2, 4) == 0);
assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
assert(stripes[3].read_start == 0 && stripes[3].read_end == 0);
bitmaps[0] = 0xffffffff;
bitmaps[2] = 0;
set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
set_pattern(stripes[2].read_buf, 128*1024, PATTERN1^PATTERN2);
// Reconstruct
reconstruct_stripes_ec(stripes, 4, 2, bmp);
check_pattern(stripes[1].read_buf, 128*1024, PATTERN2);
assert(bitmaps[1] == 0xFFFFFFFF);
free(data_buf);
// Done
use_ec(4, 2, false);
}

View File

@@ -150,6 +150,7 @@ int connect_osd(const char *osd_address, int osd_port)
if (connect(connect_fd, (sockaddr*)&addr, sizeof(addr)) < 0) if (connect(connect_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
{ {
perror("connect"); perror("connect");
close(connect_fd);
return -1; return -1;
} }
int one = 1; int one = 1;

View File

@@ -9,6 +9,9 @@
#endif #endif
#include "qemu/osdep.h" #include "qemu/osdep.h"
#include "qemu/main-loop.h" #include "qemu/main-loop.h"
#if QEMU_VERSION_MAJOR >= 8
#include "block/block-io.h"
#endif
#include "block/block_int.h" #include "block/block_int.h"
#include "qapi/error.h" #include "qapi/error.h"
#include "qapi/qmp/qdict.h" #include "qapi/qmp/qdict.h"
@@ -268,7 +271,13 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
} }
else else
{ {
#if QEMU_VERSION_MAJOR >= 8
aio_co_enter(bdrv_get_aio_context(bs), qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task));
#elif QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 9 || QEMU_VERSION_MAJOR >= 3
bdrv_coroutine_enter(bs, qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task)); bdrv_coroutine_enter(bs, qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task));
#else
qemu_coroutine_enter(qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task));
#endif
BDRV_POLL_WHILE(bs, !task.complete); BDRV_POLL_WHILE(bs, !task.complete);
} }
client->image = image; client->image = image;
@@ -732,8 +741,13 @@ static BlockDriver bdrv_vitastor = {
.bdrv_parse_filename = vitastor_parse_filename, .bdrv_parse_filename = vitastor_parse_filename,
.bdrv_has_zero_init = bdrv_has_zero_init_1, .bdrv_has_zero_init = bdrv_has_zero_init_1,
#if QEMU_VERSION_MAJOR >= 8
.bdrv_co_get_info = vitastor_get_info,
.bdrv_co_getlength = vitastor_getlength,
#else
.bdrv_get_info = vitastor_get_info, .bdrv_get_info = vitastor_get_info,
.bdrv_getlength = vitastor_getlength, .bdrv_getlength = vitastor_getlength,
#endif
#if QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR > 2 #if QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR > 2
.bdrv_probe_blocksizes = vitastor_probe_blocksizes, .bdrv_probe_blocksizes = vitastor_probe_blocksizes,
#endif #endif

View File

@@ -15,7 +15,7 @@ int read_blocking(int fd, void *read_buf, size_t remaining)
size_t done = 0; size_t done = 0;
while (done < remaining) while (done < remaining)
{ {
size_t r = read(fd, read_buf, remaining-done); ssize_t r = read(fd, read_buf, remaining-done);
if (r <= 0) if (r <= 0)
{ {
if (!errno) if (!errno)
@@ -41,7 +41,7 @@ int write_blocking(int fd, void *write_buf, size_t remaining)
size_t done = 0; size_t done = 0;
while (done < remaining) while (done < remaining)
{ {
size_t r = write(fd, write_buf, remaining-done); ssize_t r = write(fd, write_buf, remaining-done);
if (r < 0) if (r < 0)
{ {
if (errno != EINTR && errno != EAGAIN && errno != EPIPE) if (errno != EINTR && errno != EAGAIN && errno != EPIPE)

View File

@@ -83,6 +83,7 @@ int connect_stub(const char *server_address, int server_port)
if (connect(connect_fd, (sockaddr*)&addr, sizeof(addr)) < 0) if (connect(connect_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
{ {
perror("connect"); perror("connect");
close(connect_fd);
return -1; return -1;
} }
int one = 1; int one = 1;

View File

@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor Name: Vitastor
Description: Vitastor client library Description: Vitastor client library
Version: 0.8.5 Version: 0.8.8
Libs: -L${libdir} -lvitastor_client Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir} Cflags: -I${includedir}

View File

@@ -36,12 +36,12 @@ for i in $(seq 2 $ETCD_COUNT); do
ETCD_URL="$ETCD_URL,http://$ETCD_IP:$((ETCD_PORT+2*i-2))" ETCD_URL="$ETCD_URL,http://$ETCD_IP:$((ETCD_PORT+2*i-2))"
ETCD_CLUSTER="$ETCD_CLUSTER,etcd$i=http://$ETCD_IP:$((ETCD_PORT+2*i-1))" ETCD_CLUSTER="$ETCD_CLUSTER,etcd$i=http://$ETCD_IP:$((ETCD_PORT+2*i-1))"
done done
ETCDCTL="${ETCD}ctl --endpoints=$ETCD_URL" ETCDCTL="${ETCD}ctl --endpoints=$ETCD_URL --dial-timeout=5s --command-timeout=10s"
start_etcd() start_etcd()
{ {
local i=$1 local i=$1
$ETCD -name etcd$i --data-dir ./testdata/etcd$i \ ionice -c2 -n0 $ETCD -name etcd$i --data-dir ./testdata/etcd$i \
--advertise-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) --listen-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) \ --advertise-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) --listen-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) \
--initial-advertise-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) --listen-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) \ --initial-advertise-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) --listen-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) \
--initial-cluster-token vitastor-tests-etcd --initial-cluster-state new \ --initial-cluster-token vitastor-tests-etcd --initial-cluster-state new \
@@ -53,8 +53,11 @@ start_etcd()
for i in $(seq 1 $ETCD_COUNT); do for i in $(seq 1 $ETCD_COUNT); do
start_etcd $i start_etcd $i
done done
if [ $ETCD_COUNT -gt 1 ]; then for i in {1..10}; do
sleep 1 ${ETCD}ctl --endpoints=$ETCD_URL --dial-timeout=1s --command-timeout=1s member list >/dev/null && break
done
if [[ $i = 10 ]]; then
format_error "Failed to start etcd"
fi fi
echo leak:fio >> testdata/lsan-suppress.txt echo leak:fio >> testdata/lsan-suppress.txt
@@ -64,4 +67,4 @@ echo leak:librbd >> testdata/lsan-suppress.txt
echo leak:_M_mutate >> testdata/lsan-suppress.txt echo leak:_M_mutate >> testdata/lsan-suppress.txt
echo leak:_M_assign >> testdata/lsan-suppress.txt echo leak:_M_assign >> testdata/lsan-suppress.txt
export LSAN_OPTIONS=report_objects=true:suppressions=`pwd`/testdata/lsan-suppress.txt export LSAN_OPTIONS=report_objects=true:suppressions=`pwd`/testdata/lsan-suppress.txt
export ASAN_OPTIONS=verify_asan_link_order=false export ASAN_OPTIONS=verify_asan_link_order=false:abort_on_error=1

View File

@@ -17,17 +17,17 @@ else
fi fi
if [ "$IMMEDIATE_COMMIT" != "" ]; then if [ "$IMMEDIATE_COMMIT" != "" ]; then
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 1" NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all"}' $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all"}'
else else
NO_SAME="--journal_sector_buffer_count 1024 --log_level 1" NO_SAME="--journal_sector_buffer_count 1024 --log_level 10"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1}' $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1}'
fi fi
start_osd() start_osd()
{ {
local i=$1 local i=$1
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $NO_SAME $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) &>./testdata/osd$i.log & build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $NO_SAME $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$! eval OSD${i}_PID=$!
} }
@@ -39,7 +39,7 @@ done
cd mon cd mon
npm install npm install
cd .. cd ..
node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log & node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 --restart_interval 5 &>./testdata/mon.log &
MON_PID=$! MON_PID=$!
if [ "$SCHEME" = "ec" ]; then if [ "$SCHEME" = "ec" ]; then
@@ -100,13 +100,13 @@ wait_finish_rebalance()
sec=$1 sec=$1
i=0 i=0
while [[ $i -lt $sec ]]; do while [[ $i -lt $sec ]]; do
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32') && \ ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT) && \
break break
if [ $i -eq 60 ]; then
format_error "Rebalance couldn't finish in $sec seconds"
fi
sleep 1 sleep 1
i=$((i+1)) i=$((i+1))
if [ $i -eq $sec ]; then
format_error "Rebalance couldn't finish in $sec seconds"
fi
done done
} }
@@ -117,3 +117,14 @@ check_qemu()
sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
fi fi
} }
check_nbd()
{
if [[ -d /sys/module/nbd && ! -e /dev/nbd0 ]]; then
max_part=$(cat /sys/module/nbd/parameters/max_part)
nbds_max=$(cat /sys/module/nbd/parameters/nbds_max)
for i in $(seq 1 $nbds_max); do
mknod /dev/nbd$((i-1)) b 43 $(((i-1)*(max_part+1)))
done
fi
}

View File

@@ -43,3 +43,6 @@ SCHEME=ec ./test_snapshot.sh
SCHEME=xor ./test_write.sh SCHEME=xor ./test_write.sh
./test_write_no_same.sh ./test_write_no_same.sh
./test_heal.sh
SCHEME=ec PG_MINSIZE=2 ./test_heal.sh

View File

@@ -15,10 +15,10 @@ done
sleep 2 sleep 2
for i in {1..10}; do for i in {1..30}; do
($ETCDCTL get /vitastor/config/pgs --print-value-only |\ ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4"])') && \ jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4"])') && \
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$PG_COUNT'') && \ ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$PG_COUNT) && \
break break
sleep 1 sleep 1
done done
@@ -28,9 +28,7 @@ if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
format_error "FAILED: OSD NOT ADDED INTO DISTRIBUTION" format_error "FAILED: OSD NOT ADDED INTO DISTRIBUTION"
fi fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$PG_COUNT''); then wait_finish_rebalance 20
format_error "FAILED: $PG_COUNT PGS NOT ACTIVE"
fi
sleep 1 sleep 1
kill -9 $OSD4_PID kill -9 $OSD4_PID
@@ -39,7 +37,7 @@ build/src/vitastor-cli --etcd_address $ETCD_URL rm-osd --force 4
sleep 2 sleep 2
for i in {1..10}; do for i in {1..30}; do
($ETCDCTL get /vitastor/config/pgs --print-value-only |\ ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3"])') && \ jq -s -e '([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3"])') && \
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT'') && \ ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT'') && \
@@ -52,8 +50,6 @@ if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
format_error "FAILED: OSD NOT REMOVED FROM DISTRIBUTION" format_error "FAILED: OSD NOT REMOVED FROM DISTRIBUTION"
fi fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"] or .state == ["active", "left_on_dead"]) ] | length) == '$PG_COUNT''); then wait_finish_rebalance 20
format_error "FAILED: $PG_COUNT PGS NOT ACTIVE"
fi
format_green OK format_green OK

View File

@@ -18,7 +18,7 @@ $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicate
cd mon cd mon
npm install npm install
cd .. cd ..
node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log & node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 --restart_interval 5 &>./testdata/mon.log &
MON_PID=$! MON_PID=$!
sleep 2 sleep 2

View File

@@ -43,7 +43,7 @@ kill_osds &
LD_PRELOAD="build/src/libfio_vitastor.so" \ LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=16 -fsync=256 -rw=randwrite \ fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=16 -fsync=256 -rw=randwrite \
-mirror_file=./testdata/mirror.bin -etcd=$ETCD_URL -image=testimg -loops=10 -runtime=120 2>/dev/null -mirror_file=./testdata/mirror.bin -etcd=$ETCD_URL -image=testimg -loops=10 -runtime=120
qemu-img convert -S 4096 -p \ qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \ -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \

View File

@@ -10,7 +10,7 @@ kill -INT $OSD2_PID
sleep 5 sleep 5
if ! ($ETCDCTL get /vitastor/pg/state/1/ --prefix --print-value-only | jq -s -e '[ .[] | select(.state == ["active", "degraded"]) ] | length == '$PG_COUNT); then if ! ($ETCDCTL get /vitastor/pg/state/1/ --prefix --print-value-only | jq -s -e '[ .[] | select(.state == ["active", "degraded", "left_on_dead"]) ] | length == '$PG_COUNT); then
format_error "FAILED: $PG_COUNT PG(s) NOT ACTIVE+DEGRADED" format_error "FAILED: $PG_COUNT PG(s) NOT ACTIVE+DEGRADED"
fi fi

View File

@@ -7,7 +7,7 @@ OSD_COUNT=5
OSD_ARGS= OSD_ARGS=
for i in $(seq 1 $OSD_COUNT); do for i in $(seq 1 $OSD_COUNT); do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1)) dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) &>./testdata/osd$i.log & build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$! eval OSD${i}_PID=$!
done done

View File

@@ -4,6 +4,8 @@ OSD_COUNT=7
PG_COUNT=32 PG_COUNT=32
. `dirname $0`/run_3osds.sh . `dirname $0`/run_3osds.sh
check_nbd
IMG_SIZE=256 IMG_SIZE=256
$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"testimg","size":'$((IMG_SIZE*1024*1024))'}' $ETCDCTL put /vitastor/config/inode/1/1 '{"name":"testimg","size":'$((IMG_SIZE*1024*1024))'}'

View File

@@ -53,7 +53,7 @@ for i in $(seq 1 $OSD_COUNT); do
--data_device ./testdata/test_osd$i.bin \ --data_device ./testdata/test_osd$i.bin \
--meta_offset 0 \ --meta_offset 0 \
--journal_offset $((1024*1024)) \ --journal_offset $((1024*1024)) \
--data_offset $((128*1024*1024)) &>./testdata/osd$i.log & --data_offset $((128*1024*1024)) >>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$! eval OSD${i}_PID=$!
done done

View File

@@ -30,7 +30,7 @@ qemu-img create -f qcow2 ./testdata/empty.qcow2 32M
qemu-img convert -p \ qemu-img convert -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024)):skip-parents=1" \ -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024)):skip-parents=1" \
-O qcow2 -o 'cluster_size=4k' -B empty.qcow2 ./testdata/layer1.qcow2 -O qcow2 -o 'cluster_size=4k,backing_fmt=qcow2' -B empty.qcow2 ./testdata/layer1.qcow2
qemu-img convert -S 4096 -p \ qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024))" \ -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024))" \
@@ -64,7 +64,7 @@ cmp ./testdata/merged.bin ./testdata/merged-by-tool.bin
# Test merge by qemu-img # Test merge by qemu-img
qemu-img rebase -u -b layer0.qcow2 ./testdata/layer1.qcow2 qemu-img rebase -u -b layer0.qcow2 -F qcow2 ./testdata/layer1.qcow2
qemu-img convert -S 4096 -f qcow2 ./testdata/layer1.qcow2 -O raw ./testdata/rebased.bin qemu-img convert -S 4096 -f qcow2 ./testdata/layer1.qcow2 -O raw ./testdata/rebased.bin

Some files were not shown because too many files have changed in this diff Show More