Compare commits


26 Commits

Author SHA1 Message Date
6261809e87 WIP Simplified NFS proxy 2022-04-17 20:19:45 +03:00
d7e64e6ea1 Prepare CLI functions for integration - return results and errors 2022-04-17 20:19:45 +03:00
98e3528a14 Add block PVC and pod examples 2022-04-17 15:43:37 +03:00
8e88f77101 Fix CSI driver volumeMode: Block support 2022-04-17 15:39:11 +03:00
caa2cc2e6c Fix 32bit build error 2022-04-16 01:48:24 +03:00
842ba8b831 Use (uint64_t)1 instead of 1l / 1ul 2022-04-16 01:48:14 +03:00
1493823f9e Note about starting monitors 2022-04-12 15:00:28 +03:00
c857272f44 Comment: epoch is uint64_t 2022-04-10 12:21:37 +03:00
340a4b4f27 Release 0.6.16
- Implement `vitastor-cli status` (print cluster status) command
- Add a new `make-osd-hybrid.js` script to quickly prepare a lot of hybrid (HDD+SSD) OSDs
- Implement snapshot deletion for Cinder driver (only works in a healthy cluster)
- Fix a huge :) bug causing reads to return all zeroes during rebalance. Add a test to prevent it in the future
- Disconnect NBD proxy correctly without leaving a zombie [vitastor-nbd] process in D state
- Fix a rare write hang appearing with small write throttling enabled
2022-04-09 01:16:52 +03:00
5118980315 Add a script to run all tests 2022-04-09 01:14:00 +03:00
d71cc174e3 Implement CLI status command 2022-04-09 00:25:51 +03:00
0eb929f1ba Fix change_pg_count test (statistic reporting may take some time) 2022-04-08 11:58:53 +03:00
83146fa3e2 Fix the same HUGE bug for regular reads during rebalance 2022-04-08 11:50:09 +03:00
15dcaf7903 Add the same "rebalance" test with regular reads 2022-04-08 11:48:31 +03:00
cd18ef7323 Disconnect NBD proxy correctly without leaving a zombie [vitastor-nbd] process in D state 2022-04-07 16:03:35 +03:00
39531ef1a6 Fix incorrect chained reads during rebalance (the bug detected by test_rebalance_verify.sh) 2022-04-07 15:56:58 +03:00
d334914948 Fix the test so it actually fails indicating a bug :-) 2022-04-07 15:56:26 +03:00
c373425562 Fix nbd log 2022-04-07 15:55:38 +03:00
3615e57879 Register standby monitors in etcd in /mon/member 2022-04-04 00:48:52 +03:00
0edc6fe5a6 Add notes about the new script 2022-04-03 13:04:34 +03:00
9c30df83e3 Fix a HUGE :) bug in NBD proxy
The bug could result in corrupted data on large writes
2022-04-03 10:42:06 +03:00
a420c77107 Add rebalance-verify test 2022-04-03 10:42:06 +03:00
4100d829c7 Allow to override log file for daemonized NBD proxy 2022-04-03 02:41:04 +03:00
79ebda933e Fix a write hang with throttling due to timer reenterability / triggerability 2022-03-28 01:42:06 +03:00
65d08e067e Add a script for preparing hybrid (HDD+SSD) OSDs 2022-03-28 01:11:26 +03:00
d289753df4 Implement snapshot deletion for Cinder driver 2022-03-08 11:41:29 +03:00
68 changed files with 3369 additions and 545 deletions

3
.gitmodules vendored

@@ -4,3 +4,6 @@
[submodule "json11"]
path = json11
url = ../json11.git
[submodule "libnfs"]
path = libnfs
url = ../libnfs.git


@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8)
project(vitastor)
set(VERSION "0.6.15")
set(VERSION "0.6.16")
add_subdirectory(src)


@@ -407,6 +407,7 @@ Vitastor with a single-threaded NBD proxy on the same testbed
- On the monitor hosts:
- Set the values you need in `/usr/lib/vitastor/mon/make-units.sh`
- Create systemd units for etcd and the monitors: `/usr/lib/vitastor/mon/make-units.sh`
- Start etcd and the monitors: `systemctl start etcd vitastor-mon`
- Put etcd_address and osd_network into `/etc/vitastor/vitastor.conf`. For example:
```
{
@@ -414,7 +415,14 @@ Vitastor with a single-threaded NBD proxy on the same testbed
"osd_network": "10.200.1.0/24"
}
```
- Create systemd units for the OSDs: `/usr/lib/vitastor/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- Initialize the OSDs:
- SSD: `/usr/lib/vitastor/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- Hybrid, HDD+SSD: `/usr/lib/vitastor/mon/make-osd-hybrid.js /dev/sda /dev/sdb ...` - pass all
your SSDs and HDDs to the script on the command line; it will automatically allocate journal
partitions on the SSDs and data partitions on the HDDs. The script skips HDDs that already have
partitions or any other data, so if your disks are not empty, wipe them first with
`wipefs -a`. SSDs with a partition table are not skipped, but because the script creates new
partitions for journals, the SSDs must have free unpartitioned space available.
- You can change OSD parameters in the systemd units or in `vitastor.conf`. The meaning of some parameters:
- `disable_data_fsync 1` - disables fsync; used with capacitor-backed SSDs.
- `immediate_commit all` - used with capacitor-backed SSDs.
@@ -430,7 +438,6 @@ Vitastor with a single-threaded NBD proxy on the same testbed
the drives used on one of the test setups - Intel D3-S4510 - really dislike such overwrites,
and this option was added for them. When this mode is enabled you also need to raise the
`journal_sector_buffer_count` value, because otherwise Vitastor will not have enough buffers for journal writes.
- Start all etcd instances: `systemctl start etcd`
- Create the global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives are server-grade with capacitors).
- Create pools: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.


@@ -360,6 +360,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
- On the monitor hosts:
- Edit variables at the top of `/usr/lib/vitastor/mon/make-units.sh` to desired values.
- Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
- Start etcd and monitors: `systemctl start etcd vitastor-mon`
- Put etcd_address and osd_network into `/etc/vitastor/vitastor.conf`. Example:
```
{
@@ -367,7 +368,13 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
"osd_network": "10.200.1.0/24"
}
```
- Create systemd units for your OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- Initialize OSDs:
- Simplest, SSD-only: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- Hybrid, HDD+SSD: `/usr/lib/vitastor/mon/make-osd-hybrid.js /dev/sda /dev/sdb ...` - pass all your
devices (HDD and SSD) to this script - it will partition disks and initialize journals on its own.
This script skips HDDs that are already partitioned, so if you want to use non-empty disks for
Vitastor you should first wipe them with `wipefs -a`. SSDs with a GPT partition table are not skipped,
but some free unpartitioned space must be available because the script creates new partitions for journals.
- You can change OSD configuration in units or in `vitastor.conf`. Notable configuration variables:
- `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
- `immediate_commit all` - use this if all your drives are server-grade.


@@ -1,4 +1,4 @@
VERSION ?= v0.6.15
VERSION ?= v0.6.16
all: build push


@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.6.15
image: vitalif/vitastor-csi:v0.6.16
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"


@@ -116,7 +116,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.6.15
image: vitalif/vitastor-csi:v0.6.16
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"


@@ -0,0 +1,13 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc-block
spec:
storageClassName: vitastor
volumeMode: Block
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi


@@ -0,0 +1,17 @@
apiVersion: v1
kind: Pod
metadata:
name: vitastor-test-block-pvc
namespace: default
spec:
containers:
- name: vitastor-test-block-pvc
image: nginx
volumeDevices:
- name: data
devicePath: /dev/xvda
volumes:
- name: data
persistentVolumeClaim:
claimName: test-vitastor-pvc-block
readOnly: false


@@ -0,0 +1,17 @@
apiVersion: v1
kind: Pod
metadata:
name: vitastor-test-nginx
namespace: default
spec:
containers:
- name: vitastor-test-nginx
image: nginx
volumeMounts:
- mountPath: /usr/share/nginx/html/s3
name: data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-vitastor-pvc
readOnly: false


@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.6.15"
vitastorCSIDriverVersion = "0.6.16"
)
// Config struct fills the parameters of request or user input


@@ -67,29 +67,44 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
isBlock := req.GetVolumeCapability().GetBlock() != nil
// Check that it's not already mounted
free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
_, error := mount.IsNotMountPoint(ns.mounter, targetPath)
if (error != nil)
{
if (os.IsNotExist(error))
{
error := os.MkdirAll(targetPath, 0777)
if (error != nil)
if (isBlock)
{
return nil, status.Error(codes.Internal, error.Error())
pathFile, err := os.OpenFile(targetPath, os.O_CREATE|os.O_RDWR, 0o600)
if (err != nil)
{
klog.Errorf("failed to create block device mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
err = pathFile.Close()
if (err != nil)
{
klog.Errorf("failed to close %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
}
else
{
err := os.MkdirAll(targetPath, 0777)
if (err != nil)
{
klog.Errorf("failed to create fs mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
}
free = true
}
else
{
return nil, status.Error(codes.Internal, error.Error())
}
}
if (!free)
{
return &csi.NodePublishVolumeResponse{}, nil
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
@@ -149,7 +164,6 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
// Format the device (ext4 or xfs)
fsType := req.GetVolumeCapability().GetMount().GetFsType()
isBlock := req.GetVolumeCapability().GetBlock() != nil
opt := req.GetVolumeCapability().GetMount().GetMountFlags()
opt = append(opt, "_netdev")
if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||

2
debian/changelog vendored

@@ -1,4 +1,4 @@
vitastor (0.6.15-1) unstable; urgency=medium
vitastor (0.6.16-1) unstable; urgency=medium
* RDMA support
* Bugfixes


@@ -33,8 +33,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.6.15; \
cd vitastor-0.6.15; \
cp -r /root/vitastor vitastor-0.6.16; \
cd vitastor-0.6.16; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -47,8 +47,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.15.orig.tar.xz vitastor-0.6.15; \
cd vitastor-0.6.15; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.16.orig.tar.xz vitastor-0.6.16; \
cd vitastor-0.6.16; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

1
libnfs Submodule

Submodule libnfs added at 5a991e1fcb

414
mon/make-osd-hybrid.js Executable file

@@ -0,0 +1,414 @@
#!/usr/bin/nodejs
// systemd unit generator for hybrid (HDD+SSD) vitastor OSDs
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1
// USAGE: nodejs make-osd-hybrid.js [--disable_ssd_cache 0] [--disable_hdd_cache 0] /dev/sda /dev/sdb /dev/sdc /dev/sdd ...
// I.e. - just pass all HDDs and SSDs mixed, the script will decide where
// to put journals on its own
const fs = require('fs');
const fsp = fs.promises;
const child_process = require('child_process');
const options = {
debug: 1,
journal_size: 1024*1024*1024,
min_meta_size: 1024*1024*1024,
object_size: 1024*1024,
bitmap_granularity: 4096,
device_block_size: 4096,
disable_ssd_cache: 1,
disable_hdd_cache: 1,
};
run().catch(console.error);
async function run()
{
const device_list = parse_options();
await system_or_die("mkdir -p /var/log/vitastor; chown vitastor /var/log/vitastor");
// Collect devices
const all_devices = await collect_devices(device_list);
const ssds = all_devices.filter(d => d.ssd);
const hdds = all_devices.filter(d => !d.ssd);
// Collect existing OSD units
const osd_units = await collect_osd_units();
// Count assigned HDD journals and unallocated space for each SSD
await check_journal_count(ssds, osd_units);
// Create new OSDs
await create_new_hybrid_osds(hdds, ssds, osd_units);
process.exit(0);
}
function parse_options()
{
const devices = [];
const opt = {};
for (let i = 2; i < process.argv.length; i++)
{
const arg = process.argv[i];
if (arg == '--help' || arg == '-h')
{
opt.help = true;
break;
}
else if (arg.substr(0, 2) == '--')
opt[arg.substr(2)] = process.argv[++i];
else
devices.push(arg);
}
if (opt.help || !devices.length)
{
console.log(
'Prepare hybrid (HDD+SSD) Vitastor OSDs\n'+
'(c) Vitaliy Filippov, 2019+, license: VNPL-1.1\n\n'+
'USAGE: nodejs make-osd-hybrid.js [OPTIONS] /dev/sda /dev/sdb /dev/sdc ...\n'+
'Just pass all your SSDs and HDDs in any order, the script will distribute OSDs for you.\n\n'+
'OPTIONS (with defaults):\n'+
Object.keys(options).map(k => ` --${k} ${options[k]}`).join('\n')
);
process.exit(0);
}
for (const k in opt)
options[k] = opt[k];
return devices;
}
// Collect devices
async function collect_devices(devices_to_check)
{
const devices = [];
for (const dev of devices_to_check)
{
if (dev.substr(0, 5) != '/dev/')
{
console.log(`${dev} does not start with /dev/, skipping`);
continue;
}
if (!await file_exists('/sys/block/'+dev.substr(5)))
{
console.log(`${dev} is a partition, skipping`);
continue;
}
// Check if the device is an SSD
const rot = '/sys/block/'+dev.substr(5)+'/queue/rotational';
if (!await file_exists(rot))
{
console.log(`${dev} does not have ${rot} to check whether it's an SSD, skipping`);
continue;
}
const ssd = !parseInt(await fsp.readFile(rot, { encoding: 'utf-8' }));
// Check if the device has partition table
let [ has_partition_table, parts ] = await system(`sfdisk --dump ${dev} --json`);
if (has_partition_table != 0)
{
// Check if the device has any data
const [ has_data, out ] = await system(`blkid ${dev}`);
if (has_data == 0)
{
console.log(`${dev} contains data, skipping:\n ${out.trim().replace(/\n/g, '\n ')}`);
continue;
}
}
parts = parts ? JSON.parse(parts).partitiontable : null;
if (parts && parts.label != 'gpt')
{
console.log(`${dev} contains "${parts.label}" partition table, only GPT is supported, skipping`);
continue;
}
devices.push({
path: dev,
ssd,
parts,
});
}
return devices;
}
// Collect existing OSD units
async function collect_osd_units()
{
const units = [];
for (const unit of (await system("ls /etc/systemd/system/vitastor-osd*.service"))[1].trim().split('\n'))
{
if (!unit)
{
continue;
}
let cmd = /^ExecStart\s*=\s*(([^\n]*\\\n)*[^\n]*)/.exec(await fsp.readFile(unit, { encoding: 'utf-8' }));
if (!cmd)
{
console.log('ExecStart= not found in '+unit+', skipping')
continue;
}
let kv = {}, key;
cmd = cmd[1].replace(/^bash\s+-c\s+'/, '')
.replace(/>>\s*\S+2>\s*&1\s*'$/, '')
.replace(/\s*\\\n\s*/g, ' ')
.replace(/([^\s']+)|'([^']+)'/g, (m, m1, m2) =>
{
m1 = m1||m2;
if (key == null)
{
if (m1.substr(0, 2) != '--')
{
console.log('Strange command line in '+unit+', stopping');
process.exit(1);
}
key = m1.substr(2);
}
else
{
kv[key] = m1;
key = null;
}
});
units.push(kv);
}
return units;
}
// Count assigned HDD journals and unallocated space for each SSD
async function check_journal_count(ssds, osd_units)
{
const units_by_journal = osd_units.reduce((a, c) =>
{
if (c.journal_device)
a[c.journal_device] = c;
return a;
}, {});
for (const dev of ssds)
{
dev.journals = 0;
if (dev.parts)
{
for (const part of dev.parts.partitions)
{
if (part.uuid && units_by_journal['/dev/disk/by-partuuid/'+part.uuid.toLowerCase()])
{
dev.journals++;
}
}
dev.free = free_from_parttable(dev.parts);
}
else
{
dev.free = parseInt(await system_or_die("blockdev --getsize64 "+dev.path));
}
}
}
async function create_new_hybrid_osds(hdds, ssds, osd_units)
{
const units_by_disk = osd_units.reduce((a, c) => { a[c.data_device] = c; return a; }, {});
for (const dev of hdds)
{
if (!dev.parts)
{
// HDD is not partitioned yet, create a single partition
// + is the "default value" for sfdisk
await system_or_die('sfdisk '+dev.path, 'label: gpt\n\n+ +\n');
dev.parts = JSON.parse(await system_or_die('sfdisk --dump '+dev.path+' --json')).partitiontable;
}
if (dev.parts.partitions.length != 1)
{
console.log(dev.path+' has more than 1 partition, skipping');
}
else if ((dev.parts.partitions[0].start + dev.parts.partitions[0].size) != (1 + dev.parts.lastlba))
{
console.log(dev.path+'1 is not a whole-disk partition, skipping');
}
else if (!dev.parts.partitions[0].uuid)
{
console.log(dev.parts.partitions[0].node+' does not have UUID. Please repartition '+dev.path+' with GPT');
}
else if (!units_by_disk['/dev/disk/by-partuuid/'+dev.parts.partitions[0].uuid.toLowerCase()])
{
await create_hybrid_osd(dev, ssds);
}
}
}
async function create_hybrid_osd(dev, ssds)
{
// Create a new OSD
// Calculate metadata size
const data_device = '/dev/disk/by-partuuid/'+dev.parts.partitions[0].uuid.toLowerCase();
const data_size = dev.parts.partitions[0].size * dev.parts.sectorsize;
const meta_entry_size = 24 + 2*options.object_size/options.bitmap_granularity/8;
const entries_per_block = Math.floor(options.device_block_size / meta_entry_size);
const object_count = Math.floor(data_size / options.object_size);
let meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
// Leave some extra space for future metadata formats and round metadata area size to multiples of 1 MB
meta_size = 2*meta_size;
meta_size = Math.ceil(meta_size/1024/1024) * 1024*1024;
if (meta_size < options.min_meta_size)
meta_size = options.min_meta_size;
let journal_size = Math.ceil(options.journal_size/1024/1024) * 1024*1024;
// Pick an SSD for journal, balancing the number of journals across SSDs
let selected_ssd;
for (const ssd of ssds)
if (ssd.free >= (meta_size+journal_size) && (!selected_ssd || selected_ssd.journals > ssd.journals))
selected_ssd = ssd;
if (!selected_ssd)
{
console.error('Could not find free space for SSD journal and metadata for '+dev.path);
process.exit(1);
}
// Allocate an OSD number
const osd_num = (await system_or_die("vitastor-cli alloc-osd")).trim();
if (!osd_num)
{
console.error('Failed to run vitastor-cli alloc-osd');
process.exit(1);
}
console.log('Creating OSD '+osd_num+' on '+dev.path+' (HDD) with journal and metadata on '+selected_ssd.path+' (SSD)');
// Add two partitions: journal and metadata
const new_parts = await add_partitions(selected_ssd, [ journal_size, meta_size ]);
selected_ssd.journals++;
const journal_device = '/dev/disk/by-partuuid/'+new_parts[0].uuid.toLowerCase();
const meta_device = '/dev/disk/by-partuuid/'+new_parts[1].uuid.toLowerCase();
// Wait until the device symlinks appear
while (!await file_exists(journal_device))
{
await new Promise(ok => setTimeout(ok, 100));
}
while (!await file_exists(meta_device))
{
await new Promise(ok => setTimeout(ok, 100));
}
// Zero out metadata and journal
await system_or_die("dd if=/dev/zero of="+journal_device+" bs=1M count="+(journal_size/1024/1024)+" oflag=direct");
await system_or_die("dd if=/dev/zero of="+meta_device+" bs=1M count="+(meta_size/1024/1024)+" oflag=direct");
// Create unit file for the OSD
const has_scsi_cache_type = options.disable_ssd_cache &&
(await system("ls /sys/block/"+selected_ssd.path.substr(5)+"/device/scsi_disk/*/cache_type"))[0] == 0;
const write_through = options.disable_ssd_cache && (
has_scsi_cache_type || selected_ssd.path.substr(5, 4) == 'nvme'
&& (await system_or_die("/sys/block/"+selected_ssd.path.substr(5)+"/queue/write_cache")).trim() == "write through");
await fsp.writeFile('/etc/systemd/system/vitastor-osd'+osd_num+'.service',
`[Unit]
Description=Vitastor object storage daemon osd.${osd_num}
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=bash -c '/usr/bin/vitastor-osd \\
--osd_num ${osd_num} ${write_through
? "--disable_meta_fsync 1 --disable_journal_fsync 1 --immediate_commit "+(options.disable_hdd_cache ? "all" : "small")
: ""} \\
--throttle_small_writes 1 \\
--disk_alignment ${options.device_block_size} \\
--journal_block_size ${options.device_block_size} \\
--meta_block_size ${options.device_block_size} \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
--block_size ${options.object_size} \\
--data_device ${data_device} \\
--journal_device ${journal_device} \\
--meta_device ${meta_device} >>/var/log/vitastor/osd${osd_num}.log 2>&1'
WorkingDirectory=/
ExecStartPre=+chown vitastor:vitastor ${data_device}
ExecStartPre=+chown vitastor:vitastor ${journal_device}
ExecStartPre=+chown vitastor:vitastor ${meta_device}${
has_scsi_cache_type
? "\nExecStartPre=+bash -c 'D=$$$(readlink "+journal_device+"); echo write through > $$$(dirname /sys/block/*/$$\${D##*/})/device/scsi_disk/*/cache_type'"
: ""}${
options.disable_hdd_cache
? "\nExecStartPre=+bash -c 'D=$$$(readlink "+data_device+"); echo write through > $$$(dirname /sys/block/*/$$\${D##*/})/device/scsi_disk/*/cache_type'"
: ""}
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=vitastor.target
`);
await system_or_die("systemctl enable vitastor-osd"+osd_num);
}
async function add_partitions(dev, sizes)
{
let script = 'label: gpt\n\n';
if (dev.parts)
{
// Old partitions
for (const part of dev.parts.partitions)
{
script += part.node+': '+Object.keys(part).map(k => k == 'node' ? '' : k+'='+part[k]).filter(k => k).join(', ')+'\n';
}
}
// New partitions
for (const size of sizes)
{
script += '+ '+Math.ceil(size/1024)+'KiB\n';
}
await system_or_die('sfdisk '+dev.path, script);
// Get new partition table and find the new partition
const newpt = JSON.parse(await system_or_die('sfdisk --dump '+dev.path+' --json')).partitiontable;
const old_nodes = dev.parts ? dev.parts.partitions.reduce((a, c) => { a[c.uuid] = true; return a; }, {}) : {};
const new_nodes = newpt.partitions.filter(part => !old_nodes[part.uuid]);
if (new_nodes.length != sizes.length)
{
console.error('Failed to partition '+dev.path+': new partitions not found in table');
process.exit(1);
}
dev.parts = newpt;
dev.free = free_from_parttable(newpt);
return new_nodes;
}
function free_from_parttable(pt)
{
let free = pt.lastlba + 1 - pt.firstlba;
for (const part of pt.partitions)
{
free -= part.size;
}
free *= pt.sectorsize;
return free;
}
async function system_or_die(cmd, input = '')
{
let [ exitcode, stdout, stderr ] = await system(cmd, input);
if (exitcode != 0)
{
console.error(cmd+' failed: '+stderr);
process.exit(1);
}
return stdout;
}
async function system(cmd, input = '')
{
if (options.debug)
{
process.stderr.write('+ '+cmd+(input ? " <<EOF\n"+input.replace(/\s*$/, '\n')+"EOF" : '')+'\n');
}
const cp = child_process.spawn(cmd, { shell: true });
let stdout = '', stderr = '', finish_cb;
cp.stdout.on('data', buf => stdout += buf.toString());
cp.stderr.on('data', buf => stderr += buf.toString());
cp.on('exit', () => finish_cb && finish_cb());
cp.stdin.write(input);
cp.stdin.end();
if (cp.exitCode == null)
{
await new Promise(ok => finish_cb = ok);
}
return [ cp.exitCode, stdout, stderr ];
}
async function file_exists(filename)
{
return new Promise((ok, no) => fs.access(filename, fs.constants.R_OK, err => ok(!err)));
}


@@ -25,6 +25,10 @@ OPT=$(vitastor-cli simple-offsets --format options $DEV | tr '\n' ' ')
META=$(vitastor-cli simple-offsets --format json $DEV | jq .data_offset)
dd if=/dev/zero of=$DEV bs=1048576 count=$(((META+1048575)/1048576)) oflag=direct
mkdir -p /var/log/vitastor
id vitastor &>/dev/null || useradd vitastor
chown vitastor /var/log/vitastor
cat >/etc/systemd/system/vitastor-osd$OSD_NUM.service <<EOF
[Unit]
Description=Vitastor object storage daemon osd.$OSD_NUM
@@ -36,14 +40,14 @@ PartOf=vitastor.target
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/usr/bin/vitastor-osd \\
ExecStart=bash -c '/usr/bin/vitastor-osd \\
--osd_num $OSD_NUM \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\
$OPT
$OPT >>/var/log/vitastor/osd$OSD_NUM.log 2>&1'
WorkingDirectory=/
ExecStartPre=+chown vitastor:vitastor $DEV
User=vitastor


@@ -31,6 +31,7 @@ const etcd_allow = new RegExp('^'+[
'osd/inodestats/[1-9]\\d*',
'osd/space/[1-9]\\d*',
'mon/master',
'mon/member/[a-f0-9]+',
'pg/state/[1-9]\\d*/[1-9]\\d*',
'pg/stats/[1-9]\\d*/[1-9]\\d*',
'pg/history/[1-9]\\d*/[1-9]\\d*',
@@ -237,7 +238,10 @@ const etcd_tree = {
},
mon: {
master: {
/* ip: [ string ], */
/* ip: [ string ], id: uint64_t */
},
standby: {
/* <uint64_t>: { ip: [ string ] }, */
},
},
pg: {
@@ -268,7 +272,7 @@ const etcd_tree = {
<pg_id>: {
osd_sets: osd_num_t[][],
all_peers: osd_num_t[],
epoch: uint32_t,
epoch: uint64_t,
},
}, */
},
@@ -673,11 +677,25 @@ class Mon
}, this.etcd_start_timeout, 0);
}
get_mon_state()
{
return { ip: this.local_ips(), hostname: os.hostname() };
}
async get_lease()
{
const max_ttl = this.config.etcd_mon_ttl + this.config.etcd_mon_timeout/1000*this.config.etcd_mon_retries;
const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, -1);
// Get lease
let res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, -1);
this.etcd_lease_id = res.ID;
// Register in /mon/member, just for the information
const state = this.get_mon_state();
res = await this.etcd_call('/kv/put', {
key: b64(this.etcd_prefix+'/mon/member/'+this.etcd_lease_id),
value: b64(JSON.stringify(state)),
lease: ''+this.etcd_lease_id
}, this.etcd_start_timeout, 0);
// Set refresh timer
this.lease_timer = setInterval(async () =>
{
const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
@@ -703,7 +721,7 @@ class Mon
async become_master()
{
const state = { ip: this.local_ips() };
const state = { ...this.get_mon_state(), id: ''+this.etcd_lease_id };
while (1)
{
const res = await this.etcd_call('/kv/txn', {


@@ -49,7 +49,8 @@ async function run()
}
options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size;
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
const meta_entry_size = 24 + 2*options.object_size/options.bitmap_granularity/8;
const entries_per_block = Math.floor(options.device_block_size / meta_entry_size);
const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size;
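
For intuition about the sizing formula factored out above: with the defaults used by these scripts (object_size = 1 MiB, bitmap_granularity = 4096, device_block_size = 4096), each metadata entry takes 88 bytes and 46 entries fit into one 4 KiB metadata block. The standalone C++ sketch below is not part of the patch and the 4 TiB data size is only an illustrative assumption; it just reproduces the same arithmetic:

```
// Standalone sketch of the metadata sizing arithmetic used by simple-offsets.js and
// make-osd-hybrid.js. Parameter values are the scripts' defaults; the data size is hypothetical.
#include <cstdint>
#include <cinttypes>
#include <cmath>
#include <cstdio>

int main()
{
    uint64_t object_size = 1 << 20;            // 1 MiB objects (block_size)
    uint64_t bitmap_granularity = 4096;
    uint64_t device_block_size = 4096;
    uint64_t data_size = (uint64_t)4 << 40;    // assumed 4 TiB data partition

    // 24 fixed bytes plus two bitmaps of (object_size / bitmap_granularity) bits each
    uint64_t meta_entry_size = 24 + 2*object_size/bitmap_granularity/8;   // = 88 bytes
    uint64_t entries_per_block = device_block_size / meta_entry_size;     // = 46
    uint64_t object_count = data_size / object_size;                      // = 4194304
    uint64_t meta_size = (uint64_t)ceil(1 + (double)object_count/entries_per_block) * device_block_size;
    printf("metadata area: %" PRIu64 " bytes (~%" PRIu64 " MiB)\n", meta_size, meta_size >> 20);  // ~356 MiB
    return 0;
}
```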


@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.6.15'
VERSION = '0.6.16'
LOG = logging.getLogger(__name__)
@@ -355,7 +355,25 @@ class VitastorDriver(driver.CloneableImageVD,
def revert_to_snapshot(self, context, volume, snapshot):
"""Revert a volume to a given snapshot."""
# FIXME Delete the image, then recreate it from the snapshot
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
# Delete the image and recreate it from the snapshot
args = [ 'vitastor-cli', 'rm', vol_name, *(self._vitastor_args()) ]
try:
self._execute(*args)
except processutils.ProcessExecutionError as exc:
LOG.error("Failed to delete image "+vol_name+": "+exc)
raise exception.VolumeBackendAPIException(data = exc.stderr)
args = [
'vitastor-cli', 'create', '--parent', vol_name+'@'+snap_name,
vol_name, *(self._vitastor_args())
]
try:
self._execute(*args)
except processutils.ProcessExecutionError as exc:
LOG.error("Failed to recreate image "+vol_name+" from "+vol_name+"@"+snap_name+": "+exc)
raise exception.VolumeBackendAPIException(data = exc.stderr)
def delete_snapshot(self, snapshot):
"""Deletes a snapshot."""
@@ -363,24 +381,15 @@ class VitastorDriver(driver.CloneableImageVD,
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
# Find the snapshot
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name+'@'+snap_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
parents = {}
parents[(pool_id << 48) | (inode_id & 0xffffffffffff)] = True
# Check if there are child volumes
children = self._child_count(parents)
if children > 0:
raise exception.SnapshotIsBusy(snapshot_name = snap_name)
# FIXME: We can't delete snapshots because we can't merge layers yet
raise exception.VolumeBackendAPIException(data = 'Snapshot delete (layer merge) is not implemented yet')
args = [
'vitastor-cli', 'rm', vol_name+'@'+snap_name,
*(self._vitastor_args())
]
try:
self._execute(*args)
except processutils.ProcessExecutionError as exc:
LOG.error("Failed to remove snapshot "+vol_name+'@'+snap_name+": "+exc)
raise exception.VolumeBackendAPIException(data = exc.stderr)
def _child_count(self, parents):
children = 0


@@ -25,4 +25,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.6.15/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.15$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.6.16/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.16$(rpm --eval '%dist').tar.gz *


@@ -34,7 +34,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.6.15.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.16.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.6.15
Version: 0.6.16
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.6.15.el7.tar.gz
Source0: vitastor-0.6.16.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel


@@ -33,7 +33,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.6.15.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.16.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.6.15
Version: 0.6.16
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.6.15.el8.tar.gz
Source0: vitastor-0.6.16.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel


@@ -15,7 +15,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6.15")
add_definitions(-DVERSION="0.6.16")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@@ -152,9 +152,70 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
# vitastor-nfs
add_executable(vitastor-nfs
nfs_proxy.cpp
nfs_conn.cpp
nfs_portmap.cpp
sha256.c
../libnfs/lib/init.c
../libnfs/lib/pdu.c
../libnfs/lib/libnfs-zdr.c
../libnfs/lib/socket.c
../libnfs/portmap/libnfs-raw-portmap.c
../libnfs/nfs/libnfs-raw-nfs.c
../libnfs/mount/libnfs-raw-mount.c
)
set_source_files_properties(
../libnfs/nfs/libnfs-raw-nfs.c
PROPERTIES
COMPILE_FLAGS "-Wno-unused-but-set-variable"
)
# Simplified static configuration
# The other option is to build patched libnfs packages until all distros get my fixes
target_compile_options(vitastor-nfs
PRIVATE
-DHAVE_ARPA_INET_H
-DHAVE_INTTYPES_H
-DHAVE_MEMORY_H
-DHAVE_NETDB_H
-DHAVE_NETINET_IN_H
-DHAVE_NETINET_TCP_H
-DHAVE_NET_IF_H
-DHAVE_POLL_H
-DHAVE_STDINT_H
-DHAVE_STDLIB_H
-DHAVE_STRINGS_H
-DHAVE_STRING_H
-DHAVE_SYS_IOCTL_H
-DHAVE_SYS_SOCKET_H
-DHAVE_SYS_STATVFS_H
-DHAVE_SYS_STAT_H
-DHAVE_SYS_SYSMACROS_H
-DHAVE_SYS_TIME_H
-DHAVE_SYS_TYPES_H
-DHAVE_SYS_VFS_H
-DHAVE_UNISTD_H
-DHAVE_UTIME_H
-DHAVE_SOCKADDR_STORAGE
-DHAVE_STRUCT_STAT_ST_MTIM_TV_NSEC
-D_U_=
)
target_include_directories(vitastor-nfs
PRIVATE
../libnfs/include
../libnfs/include/nfsc
../libnfs/portmap
../libnfs/nfs
../libnfs/mount
)
target_link_libraries(vitastor-nfs
vitastor_client
)
# vitastor-cli
add_executable(vitastor-cli
cli.cpp cli_alloc_osd.cpp cli_simple_offsets.cpp cli_df.cpp
cli.cpp cli_common.cpp cli_alloc_osd.cpp cli_simple_offsets.cpp cli_status.cpp cli_df.cpp
cli_ls.cpp cli_create.cpp cli_modify.cpp cli_flatten.cpp cli_merge.cpp cli_rm_data.cpp cli_rm.cpp
)
target_link_libraries(vitastor-cli


@@ -25,7 +25,7 @@ allocator::allocator(uint64_t blocks)
size = free = blocks;
last_one_mask = (blocks % 64) == 0
? UINT64_MAX
: ((1l << (blocks % 64)) - 1);
: (((uint64_t)1 << (blocks % 64)) - 1);
for (uint64_t i = 0; i < total; i++)
{
mask[i] = 0;
@@ -79,7 +79,7 @@ void allocator::set(uint64_t addr, bool value)
}
if (value)
{
mask[last] = mask[last] | (1l << bit);
mask[last] = mask[last] | ((uint64_t)1 << bit);
if (mask[last] != (!is_last || cur_addr/64 < size/64
? UINT64_MAX : last_one_mask))
{
@@ -88,7 +88,7 @@ void allocator::set(uint64_t addr, bool value)
}
else
{
mask[last] = mask[last] & ~(1l << bit);
mask[last] = mask[last] & ~((uint64_t)1 << bit);
}
is_last = false;
if (p2 > 1)
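
Some context for the `(uint64_t)1` changes above (and the matching ones in cli_create.cpp and cli_df.cpp further down): on 32-bit builds `long` is only 32 bits wide, so `1l << bit` with bit >= 32 is undefined behaviour and produces broken masks, whereas casting the literal to `uint64_t` makes the shift happen in 64-bit arithmetic. A minimal illustration of my own, not taken from the patch:

```
// Why the literal's type matters on 32-bit targets (ILP32), where `long` is 32 bits:
#include <cstdint>
#include <cstdio>

int main()
{
    int bit = 40;
    // uint64_t bad = 1l << bit;          // undefined on 32-bit: shift count exceeds the width of `long`
    uint64_t good = (uint64_t)1 << bit;   // the operand is widened to 64 bits before shifting
    printf("0x%llx\n", (unsigned long long)good);  // 0x10000000000
    return 0;
}
```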


@@ -544,12 +544,13 @@ resume_4:
if (ref_us > exec_us + throttle_threshold_us)
{
// Pause reply
PRIV(op)->op_state = 5;
// Remember that the timer can in theory be called right here
tfd->set_timer_us(ref_us-exec_us, false, [this, op](int timer_id)
{
PRIV(op)->op_state++;
ringloop->wakeup();
});
PRIV(op)->op_state = 5;
return 1;
}
}
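
The reordering above matters because, as the new comment says, the timer callback can in principle fire while the timer is being armed; if `op_state` were assigned only afterwards, it would overwrite the callback's increment and the paused reply would never resume. A schematic, self-contained sketch of that hazard follows (placeholder names, not the real blockstore or timerfd API):

```
// Schematic illustration of the ordering fix: if a "timer" may invoke its callback
// synchronously (worst case), the state must be set before the timer is armed.
#include <functional>
#include <cstdio>

static void set_timer_sync(std::function<void()> cb) { cb(); }  // worst case: fires immediately

int main()
{
    int op_state = 4;
    op_state = 5;                          // new order: record the "paused" state first
    set_timer_sync([&]{ op_state++; });    // an immediate callback now correctly yields 6
    printf("op_state = %d\n", op_state);   // 6
    // With the old order (arm the timer, then assign op_state = 5), an immediate
    // callback's increment would be clobbered and the operation would stall at 5.
    return 0;
}
```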


@@ -2,8 +2,7 @@
// License: VNPL-1.1 (see README.md for details)
/**
* CLI tool
* Currently can (a) remove inodes and (b) merge snapshot/clone layers
* CLI tool and also a library for administrative tasks
*/
#include <vector>
@@ -17,7 +16,9 @@
static const char *exe_name = NULL;
json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
static void help();
static json11::Json::object parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
json11::Json::array cmd;
@@ -79,13 +80,16 @@ json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
return cfg;
}
void cli_tool_t::help()
static void help()
{
printf(
"Vitastor command-line tool\n"
"(c) Vitaliy Filippov, 2019+ (VNPL-1.1)\n"
"\n"
"USAGE:\n"
"%s status\n"
" Show cluster status\n"
"\n"
"%s df\n"
" Show pool space statistics\n"
"\n"
@@ -155,223 +159,132 @@ void cli_tool_t::help()
" --no-color Disable colored output\n"
" --json JSON output\n"
,
exe_name, exe_name, exe_name, exe_name, exe_name, exe_name,
exe_name, exe_name, exe_name, exe_name, exe_name, exe_name, exe_name,
exe_name, exe_name, exe_name, exe_name, exe_name, exe_name
);
exit(0);
}
void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
{
auto cur_cfg_it = cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
exit(1);
}
inode_config_t new_cfg = cur_cfg_it->second;
std::string cur_name = new_cfg.name;
std::string cur_cfg_key = base64_encode(cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(cur))+
"/"+std::to_string(INODE_NO_POOL(cur)));
new_cfg.parent_id = new_parent;
json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
waiting++;
cli->st_cli.etcd_txn_slow(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", cur_cfg_key },
{ "result", "LESS" },
{ "mod_revision", new_cfg.mod_revision+1 },
},
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", cur_cfg_key },
{ "value", base64_encode(json11::Json(cur_cfg_json).dump()) },
} }
},
} },
}, [this, new_parent, cur, cur_name](std::string err, json11::Json res)
{
if (err != "")
{
fprintf(stderr, "Error changing parent of %s: %s\n", cur_name.c_str(), err.c_str());
exit(1);
}
if (!res["succeeded"].bool_value())
{
fprintf(stderr, "Inode %s was modified during snapshot deletion\n", cur_name.c_str());
exit(1);
}
if (new_parent)
{
auto new_parent_it = cli->st_cli.inode_config.find(new_parent);
std::string new_parent_name = new_parent_it != cli->st_cli.inode_config.end()
? new_parent_it->second.name : "<unknown>";
printf(
"Parent of layer %s (inode %lu in pool %u) changed to %s (inode %lu in pool %u)\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur),
new_parent_name.c_str(), INODE_NO_POOL(new_parent), INODE_POOL(new_parent)
);
}
else
{
printf(
"Parent of layer %s (inode %lu in pool %u) detached\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur)
);
}
waiting--;
ringloop->wakeup();
});
}
void cli_tool_t::etcd_txn(json11::Json txn)
{
waiting++;
cli->st_cli.etcd_txn_slow(txn, [this](std::string err, json11::Json res)
{
waiting--;
if (err != "")
{
fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
exit(1);
}
etcd_result = res;
ringloop->wakeup();
});
}
inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
{
for (auto & ic: cli->st_cli.inode_config)
{
if (ic.second.name == name)
{
return &ic.second;
}
}
fprintf(stderr, "Layer %s not found\n", name.c_str());
exit(1);
}
void cli_tool_t::run(json11::Json cfg)
static int run(cli_tool_t *p, json11::Json cfg)
{
p->parse_config(cfg);
json11::Json::array cmd = cfg["command"].array_items();
std::function<bool(cli_result_t &)> action_cb;
if (!cmd.size())
{
fprintf(stderr, "command is missing\n");
exit(1);
return EINVAL;
}
else if (cmd[0] == "status")
{
// Show cluster status
action_cb = p->start_status(cfg);
}
else if (cmd[0] == "df")
{
// Show pool space stats
action_cb = start_df(cfg);
action_cb = p->start_df(cfg);
}
else if (cmd[0] == "ls")
{
// List images
action_cb = start_ls(cfg);
action_cb = p->start_ls(cfg);
}
else if (cmd[0] == "create" || cmd[0] == "snap-create")
{
// Create image/snapshot
action_cb = start_create(cfg);
action_cb = p->start_create(cfg);
}
else if (cmd[0] == "modify")
{
// Modify image
action_cb = start_modify(cfg);
action_cb = p->start_modify(cfg);
}
else if (cmd[0] == "rm-data")
{
// Delete inode data
action_cb = start_rm(cfg);
action_cb = p->start_rm(cfg);
}
else if (cmd[0] == "merge-data")
{
// Merge layer data without affecting metadata
action_cb = start_merge(cfg);
action_cb = p->start_merge(cfg);
}
else if (cmd[0] == "flatten")
{
// Merge layer data without affecting metadata
action_cb = start_flatten(cfg);
action_cb = p->start_flatten(cfg);
}
else if (cmd[0] == "rm")
{
// Remove multiple snapshots and rebase their children
action_cb = start_snap_rm(cfg);
action_cb = p->start_snap_rm(cfg);
}
else if (cmd[0] == "alloc-osd")
{
// Allocate a new OSD number
action_cb = start_alloc_osd(cfg);
action_cb = p->start_alloc_osd(cfg);
}
else if (cmd[0] == "simple-offsets")
{
// Calculate offsets for simple & stupid OSD deployment without superblock
action_cb = simple_offsets(cfg);
action_cb = p->simple_offsets(cfg);
}
else
{
fprintf(stderr, "unknown command: %s\n", cmd[0].string_value().c_str());
exit(1);
return EINVAL;
}
if (action_cb == NULL)
{
return;
return 0;
}
color = !cfg["no-color"].bool_value();
json_output = cfg["json"].bool_value();
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
iodepth = 32;
parallel_osds = cfg["parallel_osds"].uint64_value();
if (!parallel_osds)
parallel_osds = 4;
log_level = cfg["log_level"].int64_value();
progress = cfg["progress"].uint64_value() ? true : false;
list_first = cfg["wait-list"].uint64_value() ? true : false;
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
cli->on_ready([this]()
p->ringloop = new ring_loop_t(512);
p->epmgr = new epoll_manager_t(p->ringloop);
p->cli = new cluster_client_t(p->ringloop, p->epmgr->tfd, cfg);
// Smaller timeout by default for more interactiveness
p->cli->st_cli.etcd_slow_timeout = p->cli->st_cli.etcd_quick_timeout;
ring_consumer_t consumer;
cli_result_t result;
p->cli->on_ready([&]()
{
// Initialize job
consumer.loop = [this]()
consumer.loop = [&]()
{
if (action_cb != NULL)
{
bool done = action_cb();
bool done = action_cb(result);
if (done)
{
action_cb = NULL;
}
}
ringloop->submit();
p->ringloop->submit();
};
ringloop->register_consumer(&consumer);
p->ringloop->register_consumer(&consumer);
consumer.loop();
});
// Loop until it completes
while (action_cb != NULL)
{
ringloop->loop();
p->ringloop->loop();
if (action_cb != NULL)
ringloop->wait();
p->ringloop->wait();
}
// Print result
if (result.text != "")
{
fprintf(stderr, "%s\n", result.text.c_str());
}
// Destroy the client
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
delete p->cli;
delete p->epmgr;
delete p->ringloop;
p->cli = NULL;
p->epmgr = NULL;
p->ringloop = NULL;
return result.err;
}
int main(int narg, const char *args[])
@@ -380,7 +293,7 @@ int main(int narg, const char *args[])
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
cli_tool_t *p = new cli_tool_t();
p->run(cli_tool_t::parse_args(narg, args));
int r = run(p, parse_args(narg, args));
delete p;
return 0;
return r;
}


@@ -19,6 +19,13 @@ class epoll_manager_t;
class cluster_client_t;
struct inode_config_t;
struct cli_result_t
{
int err;
std::string text;
json11::Json data;
};
class cli_tool_t
{
public:
@@ -35,32 +42,28 @@ public:
int waiting = 0;
json11::Json etcd_result;
ring_consumer_t consumer;
std::function<bool(void)> action_cb;
void run(json11::Json cfg);
void parse_config(json11::Json cfg);
void change_parent(inode_t cur, inode_t new_parent);
inode_config_t* get_inode_cfg(const std::string & name);
static json11::Json::object parse_args(int narg, const char *args[]);
static void help();
friend struct rm_inode_t;
friend struct snap_merger_t;
friend struct snap_flattener_t;
friend struct snap_remover_t;
std::function<bool(void)> start_df(json11::Json);
std::function<bool(void)> start_ls(json11::Json);
std::function<bool(void)> start_create(json11::Json);
std::function<bool(void)> start_modify(json11::Json);
std::function<bool(void)> start_rm(json11::Json);
std::function<bool(void)> start_merge(json11::Json);
std::function<bool(void)> start_flatten(json11::Json);
std::function<bool(void)> start_snap_rm(json11::Json);
std::function<bool(void)> start_alloc_osd(json11::Json cfg, uint64_t *out = NULL);
std::function<bool(void)> simple_offsets(json11::Json cfg);
std::function<bool(cli_result_t &)> start_status(json11::Json);
std::function<bool(cli_result_t &)> start_df(json11::Json);
std::function<bool(cli_result_t &)> start_ls(json11::Json);
std::function<bool(cli_result_t &)> start_create(json11::Json);
std::function<bool(cli_result_t &)> start_modify(json11::Json);
std::function<bool(cli_result_t &)> start_rm(json11::Json);
std::function<bool(cli_result_t &)> start_merge(json11::Json);
std::function<bool(cli_result_t &)> start_flatten(json11::Json);
std::function<bool(cli_result_t &)> start_snap_rm(json11::Json);
std::function<bool(cli_result_t &)> start_alloc_osd(json11::Json cfg);
std::function<bool(cli_result_t &)> simple_offsets(json11::Json cfg);
void etcd_txn(json11::Json txn);
};
@@ -69,7 +72,7 @@ uint64_t parse_size(std::string size_str);
std::string print_table(json11::Json items, json11::Json header, bool use_esc);
std::string format_size(uint64_t size);
std::string format_size(uint64_t size, bool nobytes = false);
std::string format_lat(uint64_t lat);
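
With this interface every subcommand becomes a pollable callback: a `start_*()` factory returns `std::function<bool(cli_result_t &)>`, the caller invokes it on each event-loop iteration until it reports completion, then reads `err`/`text`/`data` instead of relying on the old print-and-exit behaviour. A hypothetical consumer is sketched below; the loop body and the trivial action are placeholders, not real Vitastor APIs:

```
// Hypothetical sketch of driving one of the new start_*() callbacks from library code.
#include <functional>
#include <string>
#include <cstdio>

struct cli_result_t { int err; std::string text; /* json11::Json data; */ };

static int run_action(std::function<bool(cli_result_t &)> action_cb)
{
    cli_result_t result{};
    while (!action_cb(result))     // returns true once the command has finished
    {
        // placeholder for one event-loop iteration: in vitastor-cli this is the
        // ring_loop_t consumer driving etcd transactions and OSD requests
    }
    if (result.text != "")
        fprintf(stderr, "%s\n", result.text.c_str());
    return result.err;             // 0 on success, errno-style code otherwise
}

int main()
{
    // a trivial action that completes immediately, standing in for e.g. start_status(cfg)
    return run_action([](cli_result_t & r) { r = { 0, "OK" }; return true; });
}
```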


@@ -102,20 +102,20 @@ struct alloc_osd_t
}
};
std::function<bool(void)> cli_tool_t::start_alloc_osd(json11::Json cfg, uint64_t *out)
std::function<bool(cli_result_t &)> cli_tool_t::start_alloc_osd(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto alloc_osd = new alloc_osd_t();
alloc_osd->parent = this;
return [alloc_osd, out]()
return [alloc_osd](cli_result_t & result)
{
alloc_osd->loop();
if (alloc_osd->is_done())
{
if (out)
*out = alloc_osd->new_id;
else if (alloc_osd->new_id)
printf("%lu\n", alloc_osd->new_id);
result = (cli_result_t){
.text = std::to_string(alloc_osd->new_id),
.data = json11::Json(alloc_osd->new_id),
};
delete alloc_osd;
return true;
}

118
src/cli_common.cpp Normal file

@@ -0,0 +1,118 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "base64.h"
#include "cluster_client.h"
#include "cli.h"
void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
{
auto cur_cfg_it = cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
exit(1);
}
inode_config_t new_cfg = cur_cfg_it->second;
std::string cur_name = new_cfg.name;
std::string cur_cfg_key = base64_encode(cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(cur))+
"/"+std::to_string(INODE_NO_POOL(cur)));
new_cfg.parent_id = new_parent;
json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
waiting++;
cli->st_cli.etcd_txn_slow(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", cur_cfg_key },
{ "result", "LESS" },
{ "mod_revision", new_cfg.mod_revision+1 },
},
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", cur_cfg_key },
{ "value", base64_encode(json11::Json(cur_cfg_json).dump()) },
} }
},
} },
}, [this, new_parent, cur, cur_name](std::string err, json11::Json res)
{
if (err != "")
{
fprintf(stderr, "Error changing parent of %s: %s\n", cur_name.c_str(), err.c_str());
exit(1);
}
if (!res["succeeded"].bool_value())
{
fprintf(stderr, "Inode %s was modified during snapshot deletion\n", cur_name.c_str());
exit(1);
}
if (new_parent)
{
auto new_parent_it = cli->st_cli.inode_config.find(new_parent);
std::string new_parent_name = new_parent_it != cli->st_cli.inode_config.end()
? new_parent_it->second.name : "<unknown>";
printf(
"Parent of layer %s (inode %lu in pool %u) changed to %s (inode %lu in pool %u)\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur),
new_parent_name.c_str(), INODE_NO_POOL(new_parent), INODE_POOL(new_parent)
);
}
else
{
printf(
"Parent of layer %s (inode %lu in pool %u) detached\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur)
);
}
waiting--;
ringloop->wakeup();
});
}
void cli_tool_t::etcd_txn(json11::Json txn)
{
waiting++;
cli->st_cli.etcd_txn_slow(txn, [this](std::string err, json11::Json res)
{
waiting--;
if (err != "")
{
fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
exit(1);
}
etcd_result = res;
ringloop->wakeup();
});
}
inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
{
for (auto & ic: cli->st_cli.inode_config)
{
if (ic.second.name == name)
{
return &ic.second;
}
}
fprintf(stderr, "Layer %s not found\n", name.c_str());
exit(1);
}
void cli_tool_t::parse_config(json11::Json cfg)
{
color = !cfg["no-color"].bool_value();
json_output = cfg["json"].bool_value();
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
iodepth = 32;
parallel_osds = cfg["parallel_osds"].uint64_value();
if (!parallel_osds)
parallel_osds = 4;
log_level = cfg["log_level"].int64_value();
progress = cfg["progress"].uint64_value() ? true : false;
list_first = cfg["wait-list"].uint64_value() ? true : false;
}


@@ -33,6 +33,7 @@ struct image_creator_t
uint64_t max_id_mod_rev = 0, cfg_mod_rev = 0, idx_mod_rev = 0;
int state = 0;
cli_result_t result;
bool is_done()
{
@@ -43,13 +44,27 @@ struct image_creator_t
{
if (state >= 1)
goto resume_1;
if (image_name == "")
{
// FIXME: EINVAL -> specific codes for every error
result = (cli_result_t){ .err = EINVAL, .text = "Image name is missing" };
state = 100;
return;
}
if (image_name.find('@') != std::string::npos)
{
result = (cli_result_t){ .err = EINVAL, .text = "Image name can't contain @ character" };
state = 100;
return;
}
if (new_pool_id)
{
auto & pools = parent->cli->st_cli.pool_config;
if (pools.find(new_pool_id) == pools.end())
{
fprintf(stderr, "Pool %u does not exist\n", new_pool_id);
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(new_pool_id)+" does not exist" };
state = 100;
return;
}
}
else if (new_pool_name != "")
@@ -64,8 +79,9 @@ struct image_creator_t
}
if (!new_pool_id)
{
fprintf(stderr, "Pool %s does not exist\n", new_pool_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+new_pool_name+" does not exist" };
state = 100;
return;
}
}
else if (parent->cli->st_cli.pool_config.size() == 1)
@@ -91,8 +107,9 @@ struct image_creator_t
{
if (ic.second.name == image_name)
{
fprintf(stderr, "Image %s already exists\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Image "+image_name+" already exists" };
state = 100;
return;
}
if (ic.second.name == new_parent)
{
@@ -109,18 +126,21 @@ struct image_creator_t
}
if (new_parent != "" && !new_parent_id)
{
fprintf(stderr, "Parent image not found\n");
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Parent image "+new_parent+" not found" };
state = 100;
return;
}
if (!new_pool_id)
{
fprintf(stderr, "Pool name or ID is missing\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Pool name or ID is missing" };
state = 100;
return;
}
if (!size)
{
fprintf(stderr, "Image size is missing\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Image size is missing" };
state = 100;
return;
}
do
{
@@ -140,14 +160,12 @@ resume_3:
if (!parent->etcd_result["succeeded"].bool_value() &&
parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
{
fprintf(stderr, "Image %s already exists\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Image "+image_name+" already exists" };
state = 100;
return;
}
} while (!parent->etcd_result["succeeded"].bool_value());
if (parent->progress)
{
printf("Image %s created\n", image_name.c_str());
}
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" created" };
state = 100;
}
@@ -163,14 +181,16 @@ resume_3:
{
if (ic.second.name == image_name+"@"+new_snap)
{
fprintf(stderr, "Snapshot %s@%s already exists\n", image_name.c_str(), new_snap.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Snapshot "+image_name+"@"+new_snap+" already exists" };
state = 100;
return;
}
}
if (new_parent != "")
{
fprintf(stderr, "--parent can't be used with snapshots\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Parent can't be specified for snapshots" };
state = 100;
return;
}
do
{
@@ -182,8 +202,9 @@ resume_3:
return;
if (!old_id)
{
fprintf(stderr, "Image %s does not exist\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Image "+image_name+" does not exist" };
state = 100;
return;
}
if (!new_pool_id)
{
@@ -198,14 +219,12 @@ resume_4:
if (!parent->etcd_result["succeeded"].bool_value() &&
parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
{
fprintf(stderr, "Snapshot %s@%s already exists\n", image_name.c_str(), new_snap.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Snapshot "+image_name+"@"+new_snap+" already exists" };
state = 100;
return;
}
} while (!parent->etcd_result["succeeded"].bool_value());
if (parent->progress)
{
printf("Snapshot %s@%s created\n", image_name.c_str(), new_snap.c_str());
}
result = (cli_result_t){ .err = 0, .text = "Snapshot "+image_name+"@"+new_snap+" created" };
state = 100;
}
@@ -288,8 +307,9 @@ resume_2:
idx_mod_rev = kv.mod_revision;
if (!old_id || !old_pool_id || old_pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid pool or inode ID in etcd key %s\n", kv.key.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Invalid pool or inode ID in etcd key "+kv.key };
state = 100;
return;
}
}
parent->etcd_txn(json11::Json::object {
@@ -457,25 +477,24 @@ uint64_t parse_size(std::string size_str)
if (type_char == 'k' || type_char == 'm' || type_char == 'g' || type_char == 't')
{
if (type_char == 'k')
mul = 1l<<10;
mul = (uint64_t)1<<10;
else if (type_char == 'm')
mul = 1l<<20;
mul = (uint64_t)1<<20;
else if (type_char == 'g')
mul = 1l<<30;
mul = (uint64_t)1<<30;
else /*if (type_char == 't')*/
mul = 1l<<40;
mul = (uint64_t)1<<40;
size_str = size_str.substr(0, size_str.length()-1);
}
uint64_t size = json11::Json(size_str).uint64_value() * mul;
if (size == 0 && size_str != "0" && (size_str != "" || mul != 1))
{
fprintf(stderr, "Invalid syntax for size: %s\n", size_str.c_str());
exit(1);
return UINT64_MAX;
}
return size;
}
std::function<bool(void)> cli_tool_t::start_create(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_create(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto image_creator = new image_creator_t();
@@ -492,8 +511,12 @@ std::function<bool(void)> cli_tool_t::start_create(json11::Json cfg)
int p = image_creator->image_name.find('@');
if (p == std::string::npos || p == image_creator->image_name.length()-1)
{
fprintf(stderr, "Please specify new snapshot name after @\n");
exit(1);
delete image_creator;
return [](cli_result_t & result)
{
result = (cli_result_t){ .err = EINVAL, .text = "Please specify new snapshot name after @" };
return true;
};
}
image_creator->new_snap = image_creator->image_name.substr(p + 1);
image_creator->image_name = image_creator->image_name.substr(0, p);
@@ -502,32 +525,39 @@ std::function<bool(void)> cli_tool_t::start_create(json11::Json cfg)
if (cfg["size"].string_value() != "")
{
image_creator->size = parse_size(cfg["size"].string_value());
if (image_creator->size == UINT64_MAX)
{
return [size = cfg["size"].string_value()](cli_result_t & result)
{
result = (cli_result_t){ .err = EINVAL, .text = "Invalid syntax for size: "+size };
return true;
};
}
if (image_creator->size % 4096)
{
fprintf(stderr, "Size should be a multiple of 4096\n");
exit(1);
delete image_creator;
return [](cli_result_t & result)
{
result = (cli_result_t){ .err = EINVAL, .text = "Size should be a multiple of 4096" };
return true;
};
}
if (image_creator->new_snap != "")
{
fprintf(stderr, "--size can't be specified for snapshots\n");
exit(1);
delete image_creator;
return [](cli_result_t & result)
{
result = (cli_result_t){ .err = EINVAL, .text = "Size can't be specified for snapshots" };
return true;
};
}
}
if (image_creator->image_name == "")
{
fprintf(stderr, "Image name is missing\n");
exit(1);
}
if (image_creator->image_name.find('@') != std::string::npos)
{
fprintf(stderr, "Image name can't contain @ character\n");
exit(1);
}
return [image_creator]()
return [image_creator](cli_result_t & result)
{
image_creator->loop();
if (image_creator->is_done())
{
result = image_creator->result;
delete image_creator;
return true;
}


@@ -12,6 +12,7 @@ struct pool_lister_t
int state = 0;
json11::Json space_info;
cli_result_t result;
std::map<pool_id_t, json11::Json::object> pool_stats;
bool is_done()
@@ -124,8 +125,8 @@ resume_1:
{ "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
: "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * (1l<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * (1l<<40)) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "max_available", pool_avail },
{ "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
{ "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
@@ -150,10 +151,10 @@ resume_1:
get_stats();
if (parent->waiting > 0)
return;
result.data = to_list();
if (parent->json_output)
{
// JSON output
printf("%s\n", json11::Json(to_list()).dump().c_str());
state = 100;
return;
}
@@ -206,21 +207,22 @@ resume_1:
: 100)+"%";
kv.second["eff_fmt"] = format_q(kv.second["space_efficiency"].number_value()*100)+"%";
}
printf("%s", print_table(to_list(), cols, parent->color).c_str());
result.text = print_table(result.data, cols, parent->color);
state = 100;
}
};
std::function<bool(void)> cli_tool_t::start_df(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_df(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto lister = new pool_lister_t();
lister->parent = this;
return [lister]()
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}

View File

@@ -22,7 +22,8 @@ struct snap_flattener_t
std::string top_parent_name;
inode_t target_id = 0;
int state = 0;
std::function<bool(void)> merger_cb;
std::function<bool(cli_result_t &)> merger_cb;
cli_result_t result;
void get_merge_parents()
{
@@ -37,23 +38,34 @@ struct snap_flattener_t
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
result = (cli_result_t){
.err = ENOENT,
.text = "Parent inode of layer "+cur->name+" (id "+std::to_string(cur->parent_id)+") does not exist",
.data = json11::Json::object {
{ "error", "parent-not-found" },
{ "inode_id", cur->num },
{ "inode_name", cur->name },
{ "parent_id", cur->parent_id },
},
};
state = 100;
return;
}
cur = &it->second;
chain_list.push_back(cur->num);
}
if (cur->parent_id != 0)
{
fprintf(stderr, "Layer %s has a loop in parents\n", target_name.c_str());
exit(1);
result = (cli_result_t){ .err = EBADF, .text = "Layer "+target_name+" has a loop in parents" };
state = 100;
return;
}
top_parent_name = cur->name;
}
bool is_done()
{
return state == 5;
return state == 100;
}
void loop()
@@ -64,8 +76,16 @@ struct snap_flattener_t
goto resume_2;
else if (state == 3)
goto resume_3;
if (target_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Layer to flatten not specified" };
state = 100;
return;
}
// Get parent layers
get_merge_parents();
if (state == 100)
return;
// Start merger
merger_cb = parent->start_merge(json11::Json::object {
{ "command", json11::Json::array{ "merge-data", top_parent_name, target_name } },
@@ -76,12 +96,17 @@ struct snap_flattener_t
});
// Wait for it
resume_1:
while (!merger_cb())
while (!merger_cb(result))
{
state = 1;
return;
}
merger_cb = NULL;
if (result.err)
{
state = 100;
return;
}
// Change parent
parent->change_parent(target_id, 0);
// Wait for it to complete
@@ -92,31 +117,27 @@ resume_2:
state = 3;
resume_3:
// Done
return;
state = 100;
}
};
std::function<bool(void)> cli_tool_t::start_flatten(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_flatten(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto flattener = new snap_flattener_t();
flattener->parent = this;
flattener->target_name = cmd.size() > 1 ? cmd[1].string_value() : "";
if (flattener->target_name == "")
{
fprintf(stderr, "Layer to flatten argument is missing\n");
exit(1);
}
flattener->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!flattener->fsync_interval)
flattener->fsync_interval = 128;
if (!cfg["cas"].is_null())
flattener->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
return [flattener]()
return [flattener](cli_result_t & result)
{
flattener->loop();
if (flattener->is_done())
{
result = flattener->result;
delete flattener;
return true;
}

View File

@@ -24,6 +24,7 @@ struct image_lister_t
int state = 0;
std::map<inode_t, json11::Json::object> stats;
json11::Json space_info;
cli_result_t result;
bool is_done()
{
@@ -44,8 +45,9 @@ struct image_lister_t
}
if (!list_pool_id)
{
fprintf(stderr, "Pool %s does not exist\n", list_pool_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+list_pool_name+" does not exist" };
state = 100;
return;
}
}
for (auto & ic: parent->cli->st_cli.inode_config)
@@ -246,10 +248,10 @@ resume_1:
if (parent->waiting > 0)
return;
}
result.data = to_list();
if (parent->json_output)
{
// JSON output
printf("%s\n", json11::Json(to_list()).dump().c_str());
state = 100;
return;
}
@@ -359,7 +361,7 @@ resume_1:
kv.second["size_fmt"] = format_size(kv.second["size"].uint64_value());
kv.second["ro"] = kv.second["readonly"].bool_value() ? "RO" : "-";
}
printf("%s", print_table(to_list(), cols, parent->color).c_str());
result.text = print_table(to_list(), cols, parent->color);
state = 100;
}
};
@@ -436,23 +438,26 @@ std::string print_table(json11::Json items, json11::Json header, bool use_esc)
return str;
}
static uint64_t size_thresh[] = { 1024l*1024*1024*1024, 1024l*1024*1024, 1024l*1024, 1024, 0 };
static uint64_t size_thresh[] = { (uint64_t)1024*1024*1024*1024, (uint64_t)1024*1024*1024, (uint64_t)1024*1024, 1024, 0 };
static uint64_t size_thresh_d[] = { (uint64_t)1000000000000, (uint64_t)1000000000, (uint64_t)1000000, (uint64_t)1000, 0 };
static const int size_thresh_n = sizeof(size_thresh)/sizeof(size_thresh[0]);
static const char *size_unit = "TGMKB";
std::string format_size(uint64_t size)
std::string format_size(uint64_t size, bool nobytes)
{
uint64_t *thr = nobytes ? size_thresh_d : size_thresh;
char buf[256];
for (int i = 0; i < sizeof(size_thresh)/sizeof(size_thresh[0]); i++)
for (int i = 0; i < size_thresh_n; i++)
{
if (size >= size_thresh[i] || i >= sizeof(size_thresh)/sizeof(size_thresh[0])-1)
if (size >= thr[i] || i >= size_thresh_n-1)
{
double value = size_thresh[i] ? (double)size/size_thresh[i] : size;
double value = thr[i] ? (double)size/thr[i] : size;
int l = snprintf(buf, sizeof(buf), "%.1f", value);
assert(l < sizeof(buf)-2);
if (buf[l-1] == '0')
l -= 2;
buf[l] = ' ';
buf[l+1] = size_unit[i];
buf[l] = i == size_thresh_n-1 && nobytes ? 0 : ' ';
buf[l+1] = i == size_thresh_n-1 && nobytes ? 0 : size_unit[i];
buf[l+2] = 0;
break;
}
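Editor's note, not part of the patch: the new second parameter switches format_size() from binary thresholds with a unit suffix to decimal thresholds without the trailing 'B', which is what the status output uses for op/s counters. Expected results, worked out from the tables above (illustrative, assumes format_size() is declared as in this hunk):

assert(format_size(1536, false) == "1.5 K"); // binary: 1536/1024
assert(format_size(1536, true)  == "1.5 K"); // decimal: 1536/1000 = 1.536
assert(format_size(500, false)  == "500 B");
assert(format_size(500, true)   == "500");   // bare count, e.g. for op/s values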
@@ -543,7 +548,7 @@ back:
return true;
}
std::function<bool(void)> cli_tool_t::start_ls(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_ls(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto lister = new image_lister_t();
@@ -559,11 +564,12 @@ std::function<bool(void)> cli_tool_t::start_ls(json11::Json cfg)
{
lister->only_names.insert(cmd[i].string_value());
}
return [lister]()
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}

View File

@@ -12,6 +12,9 @@ struct snap_rw_op_t
cluster_op_t op;
int todo = 0;
uint32_t start = 0, end = 0;
int error_code = 0;
uint64_t error_offset = 0;
bool error_read = false;
};
// Layer merge is the base for multiple operations:
@@ -54,17 +57,27 @@ struct snap_merger_t
uint64_t last_written_offset = 0;
int deleted_unsynced = 0;
uint64_t processed = 0, to_process = 0;
std::string rwo_error;
cli_result_t result;
void start_merge()
{
if (from_name == "" || to_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Beginning or end of the merge sequence is missing" };
state = 100;
return;
}
check_delete_source = delete_source || check_delete_source;
inode_config_t *from_cfg = parent->get_inode_cfg(from_name);
inode_config_t *to_cfg = parent->get_inode_cfg(to_name);
inode_config_t *target_cfg = target_name == "" ? from_cfg : parent->get_inode_cfg(target_name);
if (to_cfg->num == from_cfg->num)
{
fprintf(stderr, "Only one layer specified, nothing to merge\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Only one layer specified, nothing to merge" };
state = 100;
return;
}
// Check that to_cfg is actually a child of from_cfg and target_cfg is somewhere between them
std::vector<inode_t> chain_list;
@@ -78,8 +91,18 @@ struct snap_merger_t
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
result = (cli_result_t){
.err = ENOENT,
.text = "Parent inode of layer "+cur->name+" (id "+std::to_string(cur->parent_id)+") does not exist",
.data = json11::Json::object {
{ "error", "parent-not-found" },
{ "inode_id", cur->num },
{ "inode_name", cur->name },
{ "parent_id", cur->parent_id },
},
};
state = 100;
return;
}
cur = &it->second;
chain_list.push_back(cur->num);
@@ -87,8 +110,9 @@ struct snap_merger_t
}
if (cur->parent_id != from_cfg->num)
{
fprintf(stderr, "Layer %s is not a child of %s\n", to_name.c_str(), from_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Layer "+to_name+" is not a child of "+from_name };
state = 100;
return;
}
chain_list.push_back(from_cfg->num);
layer_block_size[from_cfg->num] = get_block_size(from_cfg->num);
@@ -99,8 +123,9 @@ struct snap_merger_t
}
if (sources.find(target_cfg->num) == sources.end())
{
fprintf(stderr, "Layer %s is not between %s and %s\n", target_name.c_str(), to_name.c_str(), from_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Layer "+target_name+" is not between "+to_name+" and "+from_name };
state = 100;
return;
}
target = target_cfg->num;
target_rank = sources.at(target);
@@ -130,14 +155,15 @@ struct snap_merger_t
int parent_rank = it->second;
if (parent_rank < to_rank && (parent_rank >= target_rank || check_delete_source))
{
fprintf(
stderr, "Layers at or above %s, but below %s are not allowed"
" to have other children, but %s is a child of %s\n",
(check_delete_source ? from_name.c_str() : target_name.c_str()),
to_name.c_str(), ic.second.name.c_str(),
parent->cli->st_cli.inode_config.at(ic.second.parent_id).name.c_str()
);
exit(1);
result = (cli_result_t){
.err = EINVAL,
.text = "Layers at or above "+(check_delete_source ? from_name : target_name)+
", but below "+to_name+" are not allowed to have other children, but "+
ic.second.name+" is a child of "+
parent->cli->st_cli.inode_config.at(ic.second.parent_id).name,
};
state = 100;
return;
}
if (parent_rank >= to_rank)
{
@@ -152,11 +178,14 @@ struct snap_merger_t
use_cas = 0;
}
sources.erase(target);
printf(
"Merging %ld layer(s) into target %s%s (inode %lu in pool %u)\n",
sources.size(), target_cfg->name.c_str(),
use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
);
if (parent->progress)
{
printf(
"Merging %ld layer(s) into target %s%s (inode %lu in pool %u)\n",
sources.size(), target_cfg->name.c_str(),
use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
);
}
target_block_size = get_block_size(target);
}
@@ -253,7 +282,8 @@ struct snap_merger_t
oit = merge_offsets.begin();
resume_5:
// Now read, overwrite and optionally delete offsets one by one
while (in_flight < parent->iodepth*parent->parallel_osds && oit != merge_offsets.end())
while (in_flight < parent->iodepth*parent->parallel_osds &&
oit != merge_offsets.end() && !rwo_error.size())
{
in_flight++;
read_and_write(*oit);
@@ -264,6 +294,15 @@ struct snap_merger_t
printf("\rOverwriting blocks: %lu/%lu", processed, to_process);
}
}
if (in_flight == 0 && rwo_error.size())
{
result = (cli_result_t){
.err = EIO,
.text = rwo_error,
};
state = 100;
return;
}
if (in_flight > 0 || oit != merge_offsets.end())
{
// Wait until overwrites finish
@@ -396,8 +435,9 @@ struct snap_merger_t
{
if (op->retval != op->len)
{
fprintf(stderr, "error reading target at offset %lx: %s\n", op->offset, strerror(-op->retval));
exit(1);
rwo->error_code = -op->retval;
rwo->error_offset = op->offset;
rwo->error_read = true;
}
next_write(rwo);
};
@@ -410,7 +450,7 @@ struct snap_merger_t
// FIXME: Allow to use single write with "holes" (OSDs don't allow it yet)
uint32_t gran = parent->cli->get_bs_bitmap_granularity();
uint64_t bitmap_size = target_block_size / gran;
while (rwo->end < bitmap_size)
while (rwo->end < bitmap_size && !rwo->error_code)
{
auto bit = ((*((uint8_t*)rwo->op.bitmap_buf + (rwo->end >> 3))) & (1 << (rwo->end & 0x7)));
if (!bit)
@@ -434,7 +474,7 @@ struct snap_merger_t
rwo->end++;
}
}
if (rwo->end > rwo->start)
if (rwo->end > rwo->start && !rwo->error_code)
{
// write start->end
rwo->todo++;
@@ -473,8 +513,9 @@ struct snap_merger_t
delete subop;
return;
}
fprintf(stderr, "error writing target at offset %lx: %s\n", subop->offset, strerror(-subop->retval));
exit(1);
rwo->error_code = -subop->retval;
rwo->error_offset = subop->offset;
rwo->error_read = false;
}
// Increment CAS version
rwo->op.version++;
@@ -510,11 +551,12 @@ struct snap_merger_t
{
if (!rwo->todo)
{
if (last_written_offset < rwo->op.offset+target_block_size)
if (!rwo->error_code &&
last_written_offset < rwo->op.offset+target_block_size)
{
last_written_offset = rwo->op.offset+target_block_size;
}
if (delete_source)
if (!rwo->error_code && delete_source)
{
deleted_unsynced++;
if (deleted_unsynced >= fsync_interval)
@@ -545,13 +587,20 @@ struct snap_merger_t
}
            // Record the error before freeing rwo - it is still referenced here
            if (rwo->error_code)
            {
                char buf[1024];
                snprintf(buf, 1024, "Error %s target at offset %lx: %s",
                    rwo->error_read ? "reading" : "writing", rwo->error_offset, strerror(rwo->error_code));
                rwo_error = std::string(buf);
            }
            free(rwo->buf);
            delete rwo;
in_flight--;
continue_merge_reent();
}
}
};
std::function<bool(void)> cli_tool_t::start_merge(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_merge(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto merger = new snap_merger_t();
@@ -559,22 +608,18 @@ std::function<bool(void)> cli_tool_t::start_merge(json11::Json cfg)
merger->from_name = cmd.size() > 1 ? cmd[1].string_value() : "";
merger->to_name = cmd.size() > 2 ? cmd[2].string_value() : "";
merger->target_name = cfg["target"].string_value();
if (merger->from_name == "" || merger->to_name == "")
{
fprintf(stderr, "Beginning or end of the merge sequence is missing\n");
exit(1);
}
merger->delete_source = cfg["delete-source"].string_value() != "";
merger->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!merger->fsync_interval)
merger->fsync_interval = 128;
if (!cfg["cas"].is_null())
merger->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
return [merger]()
return [merger](cli_result_t & result)
{
merger->continue_merge_reent();
if (merger->is_done())
{
result = merger->result;
delete merger;
return true;
}

View File

@@ -23,7 +23,8 @@ struct image_changer_t
bool has_children = false;
int state = 0;
std::function<bool(void)> cb;
std::function<bool(cli_result_t &)> cb;
cli_result_t result;
bool is_done()
{
@@ -36,6 +37,18 @@ struct image_changer_t
goto resume_1;
else if (state == 2)
goto resume_2;
if (image_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Image name is missing" };
state = 100;
return;
}
if (new_size != 0 && (new_size % 4096))
{
result = (cli_result_t){ .err = EINVAL, .text = "Image size should be a multiple of 4096" };
state = 100;
return;
}
for (auto & ic: parent->cli->st_cli.inode_config)
{
if (ic.second.name == image_name)
@@ -46,14 +59,16 @@ struct image_changer_t
}
if (new_name != "" && ic.second.name == new_name)
{
fprintf(stderr, "Image %s already exists\n", new_name.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Image "+new_name+" already exists" };
state = 100;
return;
}
}
if (!inode_num)
{
fprintf(stderr, "Image %s does not exist\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Image "+image_name+" does not exist" };
state = 100;
return;
}
for (auto & ic: parent->cli->st_cli.inode_config)
{
@@ -68,7 +83,7 @@ struct image_changer_t
(!new_size || cfg.size == new_size) &&
(new_name == "" || new_name == image_name))
{
printf("No change\n");
result = (cli_result_t){ .text = "No change" };
state = 100;
return;
}
@@ -79,8 +94,9 @@ struct image_changer_t
// Check confirmation when trimming an image with children
if (has_children && !force)
{
fprintf(stderr, "Image %s has children. Refusing to shrink it without --force\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Image "+image_name+" has children. Refusing to shrink it without --force" };
state = 100;
return;
}
// Shrink the image first
cb = parent->start_rm(json11::Json::object {
@@ -90,12 +106,17 @@ struct image_changer_t
{ "min-offset", new_size },
});
resume_1:
while (!cb())
while (!cb(result))
{
state = 1;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
}
cfg.size = new_size;
}
@@ -109,8 +130,9 @@ resume_1:
// Check confirmation when making an image with children read-write
if (has_children && !force)
{
fprintf(stderr, "Image %s has children. Refusing to make it read-write without --force\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Image "+image_name+" has children. Refusing to make it read-write without --force" };
state = 100;
return;
}
}
if (new_name != "")
@@ -180,32 +202,23 @@ resume_2:
return;
if (!parent->etcd_result["succeeded"].bool_value())
{
fprintf(stderr, "Image %s was modified by someone else, please repeat your request\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EAGAIN, .text = "Image "+image_name+" was modified by someone else, please repeat your request" };
state = 100;
return;
}
printf("Image %s modified\n", image_name.c_str());
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" modified" };
state = 100;
}
};
std::function<bool(void)> cli_tool_t::start_modify(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_modify(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto changer = new image_changer_t();
changer->parent = this;
changer->image_name = cmd.size() > 1 ? cmd[1].string_value() : "";
if (changer->image_name == "")
{
fprintf(stderr, "Image name is missing\n");
exit(1);
}
changer->new_name = cfg["rename"].string_value();
changer->new_size = parse_size(cfg["resize"].string_value());
if (changer->new_size != 0 && (changer->new_size % 4096))
{
fprintf(stderr, "Image size should be a multiple of 4096\n");
exit(1);
}
changer->force = cfg["force"].bool_value();
changer->set_readonly = cfg["readonly"].bool_value();
changer->set_readwrite = cfg["readwrite"].bool_value();
@@ -213,11 +226,12 @@ std::function<bool(void)> cli_tool_t::start_modify(json11::Json cfg)
if (!changer->fsync_interval)
changer->fsync_interval = 128;
// FIXME Check that the image doesn't have children when shrinking
return [changer]()
return [changer](cli_result_t & result)
{
changer->loop();
if (changer->is_done())
{
result = changer->result;
delete changer;
return true;
}

View File

@@ -63,7 +63,9 @@ struct snap_remover_t
inode_t new_parent = 0;
int state = 0;
int current_child = 0;
std::function<bool(void)> cb;
std::function<bool(cli_result_t &)> cb;
cli_result_t result;
bool is_done()
{
@@ -90,11 +92,25 @@ struct snap_remover_t
goto resume_8;
else if (state == 9)
goto resume_9;
if (from_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Layer to remove argument is missing" };
state = 100;
return;
}
if (to_name == "")
{
to_name = from_name;
}
// Get children to merge
get_merge_children();
if (state == 100)
return;
// Try to select an inode for the "inverse" optimized scenario
// Read statistics from etcd to do it
read_stats();
if (state == 100)
return;
state = 1;
resume_1:
if (parent->waiting > 0)
@@ -106,13 +122,20 @@ resume_1:
if (merge_children[current_child] == inverse_child)
continue;
start_merge_child(merge_children[current_child], merge_children[current_child]);
if (state == 100)
return;
resume_2:
while (!cb())
while (!cb(result))
{
state = 2;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
parent->change_parent(merge_children[current_child], new_parent);
state = 3;
resume_3:
@@ -123,25 +146,41 @@ resume_3:
if (inverse_child != 0)
{
start_merge_child(inverse_child, inverse_parent);
if (state == 100)
return;
resume_4:
while (!cb())
while (!cb(result))
{
state = 4;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
// Delete "inverse" child data
start_delete_source(inverse_child);
if (state == 100)
return;
resume_5:
while (!cb())
while (!cb(result))
{
state = 5;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
// Delete "inverse" child metadata, rename parent over it,
// and also change parent links of the previous "inverse" child
rename_inverse_parent();
if (state == 100)
return;
state = 6;
resume_6:
if (parent->waiting > 0)
@@ -154,13 +193,20 @@ resume_6:
continue;
start_delete_source(chain_list[current_child]);
resume_7:
while (!cb())
while (!cb(result))
{
state = 7;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
delete_inode_config(chain_list[current_child]);
if (state == 100)
return;
state = 8;
resume_8:
if (parent->waiting > 0)
@@ -186,16 +232,19 @@ resume_9:
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
            char buf[1024];
            snprintf(buf, 1024, "Parent inode of layer %s (id 0x%lx) not found", cur->name.c_str(), cur->parent_id);
            result = (cli_result_t){ .err = ENOENT, .text = std::string(buf) };
            state = 100;
            return;
}
cur = &it->second;
chain_list.push_back(cur->num);
}
if (cur->num != from_cfg->num)
{
fprintf(stderr, "Layer %s is not a child of %s\n", to_name.c_str(), from_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Layer "+to_name+" is not a child of "+from_name };
state = 100;
return;
}
new_parent = from_cfg->parent_id;
// Calculate ranks
@@ -263,8 +312,9 @@ resume_9:
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error reading layer statistics from etcd: %s\n", err.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Error reading layer statistics from etcd: "+err };
state = 100;
return;
}
for (auto inode_result: data["responses"].array_items())
{
@@ -275,14 +325,16 @@ resume_9:
sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
if (!inode || null_byte != 0)
{
fprintf(stderr, "Bad key returned from etcd: %s\n", kv.key.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Bad key returned from etcd: "+kv.key };
state = 100;
return;
}
auto pool_cfg_it = parent->cli->st_cli.pool_config.find(pool_id);
if (pool_cfg_it == parent->cli->st_cli.pool_config.end())
{
fprintf(stderr, "Pool %u does not exist\n", pool_id);
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(pool_id)+" does not exist" };
state = 100;
return;
}
inode = INODE_WITH_POOL(pool_id, inode);
auto & pool_cfg = pool_cfg_it->second;
@@ -324,14 +376,20 @@ resume_9:
auto child_it = parent->cli->st_cli.inode_config.find(inverse_child);
if (child_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inverse_child);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inverse_child);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
auto target_it = parent->cli->st_cli.inode_config.find(inverse_parent);
if (target_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inverse_parent);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inverse_parent);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
inode_config_t *child_cfg = &child_it->second;
inode_config_t *target_cfg = &target_it->second;
@@ -422,16 +480,19 @@ resume_9:
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error renaming %s to %s: %s\n", target_name.c_str(), child_name.c_str(), err.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Error renaming "+target_name+" to "+child_name+": "+err };
state = 100;
return;
}
if (!res["succeeded"].bool_value())
{
fprintf(
stderr, "Parent (%s), child (%s), or one of its children"
" configuration was modified during rename\n", target_name.c_str(), child_name.c_str()
);
exit(1);
result = (cli_result_t){
.err = EIO,
.text = "Parent ("+target_name+"), child ("+child_name+"), or one of its children"
" configuration was modified during rename",
};
state = 100;
return;
}
printf("Layer %s renamed to %s\n", target_name.c_str(), child_name.c_str());
parent->ringloop->wakeup();
@@ -443,8 +504,11 @@ resume_9:
auto cur_cfg_it = parent->cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", cur);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
inode_config_t *cur_cfg = &cur_cfg_it->second;
std::string cur_name = cur_cfg->name;
@@ -480,13 +544,15 @@ resume_9:
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error deleting %s: %s\n", cur_name.c_str(), err.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Error deleting "+cur_name+": "+err };
state = 100;
return;
}
if (!res["succeeded"].bool_value())
{
fprintf(stderr, "Layer %s configuration was modified during deletion\n", cur_name.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Layer "+cur_name+" was modified during deletion" };
state = 100;
return;
}
printf("Layer %s deleted\n", cur_name.c_str());
parent->ringloop->wakeup();
@@ -498,14 +564,20 @@ resume_9:
auto child_it = parent->cli->st_cli.inode_config.find(child_inode);
if (child_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", child_inode);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", child_inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
auto target_it = parent->cli->st_cli.inode_config.find(target_inode);
if (target_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", target_inode);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", target_inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
cb = parent->start_merge(json11::Json::object {
{ "command", json11::Json::array{ "merge-data", from_name, child_it->second.name } },
@@ -521,8 +593,11 @@ resume_9:
auto source = parent->cli->st_cli.inode_config.find(inode);
if (source == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inode);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
cb = parent->start_rm(json11::Json::object {
{ "inode", inode },
@@ -532,22 +607,13 @@ resume_9:
}
};
std::function<bool(void)> cli_tool_t::start_snap_rm(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_snap_rm(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto snap_remover = new snap_remover_t();
snap_remover->parent = this;
snap_remover->from_name = cmd.size() > 1 ? cmd[1].string_value() : "";
snap_remover->to_name = cmd.size() > 2 ? cmd[2].string_value() : "";
if (snap_remover->from_name == "")
{
fprintf(stderr, "Layer to remove argument is missing\n");
exit(1);
}
if (snap_remover->to_name == "")
{
snap_remover->to_name = snap_remover->from_name;
}
snap_remover->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!snap_remover->fsync_interval)
snap_remover->fsync_interval = 128;
@@ -555,11 +621,12 @@ std::function<bool(void)> cli_tool_t::start_snap_rm(json11::Json cfg)
snap_remover->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
if (!cfg["writers_stopped"].is_null())
snap_remover->writers_stopped = true;
return [snap_remover]()
return [snap_remover](cli_result_t & result)
{
snap_remover->loop();
if (snap_remover->is_done())
{
result = snap_remover->result;
delete snap_remover;
return true;
}

View File

@@ -32,6 +32,9 @@ struct rm_inode_t
uint64_t pgs_to_list = 0;
bool lists_done = false;
int state = 0;
int error_count = 0;
cli_result_t result;
void start_delete()
{
@@ -74,8 +77,13 @@ struct rm_inode_t
});
if (!lister)
{
fprintf(stderr, "Failed to list inode %lu from pool %u objects\n", INODE_NO_POOL(inode), INODE_POOL(inode));
exit(1);
result = (cli_result_t){
.err = EIO,
.text = "Failed to list objects of inode "+std::to_string(INODE_NO_POOL(inode))+
" from pool "+std::to_string(INODE_POOL(inode)),
};
state = 100;
return;
}
pgs_to_list = parent->cli->list_pg_count(lister);
parent->cli->list_inode_next(lister, parent->parallel_osds);
@@ -118,6 +126,7 @@ struct rm_inode_t
fprintf(stderr, "Failed to remove object %lx:%lx from PG %u (OSD %lu) (retval=%ld)\n",
op->req.rw.inode, op->req.rw.offset,
cur_list->pg_num, cur_list->rm_osd_num, op->reply.hdr.retval);
error_count++;
}
delete op;
cur_list->obj_done++;
@@ -161,31 +170,41 @@ struct rm_inode_t
}
if (lists_done && !lists.size())
{
printf("Done, inode %lu in pool %u data removed\n", INODE_NO_POOL(inode), pool_id);
state = 2;
result = (cli_result_t){
.err = error_count > 0 ? EIO : 0,
.text = error_count > 0 ? "Some blocks were not removed" : (
"Done, inode "+std::to_string(INODE_NO_POOL(inode))+" from pool "+
std::to_string(pool_id)+" removed"),
};
state = 100;
}
}
bool loop()
bool is_done()
{
if (state == 0)
return state == 100;
}
void loop()
{
if (state == 1)
goto resume_1;
if (!pool_id)
{
start_delete();
state = 1;
result = (cli_result_t){ .err = EINVAL, .text = "Pool is not specified" };
state = 100;
return;
}
else if (state == 1)
{
continue_delete();
}
else if (state == 2)
{
return true;
}
return false;
start_delete();
if (state == 100)
return;
state = 1;
resume_1:
continue_delete();
}
};
std::function<bool(void)> cli_tool_t::start_rm(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_rm(json11::Json cfg)
{
auto remover = new rm_inode_t();
remover->parent = this;
@@ -193,19 +212,16 @@ std::function<bool(void)> cli_tool_t::start_rm(json11::Json cfg)
remover->pool_id = cfg["pool"].uint64_value();
if (remover->pool_id)
{
remover->inode = (remover->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (((uint64_t)remover->pool_id) << (64-POOL_ID_BITS));
remover->inode = (remover->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (((uint64_t)remover->pool_id) << (64-POOL_ID_BITS));
}
remover->pool_id = INODE_POOL(remover->inode);
if (!remover->pool_id)
{
fprintf(stderr, "pool is missing\n");
exit(1);
}
remover->min_offset = cfg["min-offset"].uint64_value();
return [remover]()
return [remover](cli_result_t & result)
{
if (remover->loop())
remover->loop();
if (remover->is_done())
{
result = remover->result;
delete remover;
return true;
}
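Editor's note on the inode arithmetic above, not part of the patch: a global inode number packs the pool id into its top POOL_ID_BITS bits and the per-pool inode number into the remaining bits, which is what the INODE_POOL()/INODE_NO_POOL() macros used elsewhere in this patch extract. A small sketch, assuming POOL_ID_BITS == 16 (check the actual value in the vitastor headers):

#include <stdint.h>
#include <assert.h>

#define POOL_ID_BITS 16 // assumption for this illustration

int main()
{
    uint64_t pool_id = 2, inode_in_pool = 5;
    uint64_t inode = (pool_id << (64-POOL_ID_BITS)) | inode_in_pool;
    assert((inode >> (64-POOL_ID_BITS)) == pool_id);                             // INODE_POOL()
    assert((inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) == inode_in_pool); // INODE_NO_POOL()
    return 0;
}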

View File

@@ -11,7 +11,7 @@
#include <sys/stat.h>
// Calculate offsets for a block device and print OSD command line parameters
std::function<bool(void)> cli_tool_t::simple_offsets(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::simple_offsets(json11::Json cfg)
{
std::string device = cfg["command"][1].string_value();
uint64_t object_size = parse_size(cfg["object_size"].string_value());

296
src/cli_status.cpp Normal file
View File

@@ -0,0 +1,296 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "cli.h"
#include "cluster_client.h"
#include "base64.h"
#include "pg_states.h"
// Print cluster status:
// etcd, mon, osd states
// raw/used space, object states, pool states, pg states
// client io, recovery io, rebalance io
struct status_printer_t
{
cli_tool_t *parent;
int state = 0;
json11::Json::array mon_members, osd_stats;
json11::Json agg_stats;
std::map<pool_id_t, json11::Json::object> pool_stats;
json11::Json::array etcd_states;
bool is_done()
{
return state == 100;
}
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
// etcd states
{
auto addrs = parent->cli->st_cli.get_addresses();
etcd_states.resize(addrs.size());
for (int i = 0; i < etcd_states.size(); i++)
{
parent->waiting++;
parent->cli->st_cli.etcd_call_oneshot(
addrs[i], "/maintenance/status", json11::Json::object(),
parent->cli->st_cli.etcd_quick_timeout, [this, i](std::string err, json11::Json res)
{
parent->waiting--;
etcd_states[i] = err != "" ? json11::Json::object{ { "error", err } } : res;
parent->ringloop->wakeup();
}
);
}
}
state = 1;
resume_1:
if (parent->waiting > 0)
return;
// Monitors, OSD states
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/mon/") },
{ "range_end", base64_encode(parent->cli->st_cli.etcd_prefix+"/mon0") },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats0"
) },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/stats") },
} },
},
} },
});
state = 2;
resume_2:
if (parent->waiting > 0)
return;
mon_members = parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items();
osd_stats = parent->etcd_result["responses"][1]["response_range"]["kvs"].array_items();
if (parent->etcd_result["responses"][2]["response_range"]["kvs"].array_items().size() > 0)
{
agg_stats = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][2]["response_range"]["kvs"][0]).value;
}
int etcd_alive = 0;
uint64_t etcd_db_size = 0;
std::string etcd_detail;
for (int i = 0; i < etcd_states.size(); i++)
{
if (etcd_states[i]["error"].is_null())
{
etcd_alive++;
etcd_db_size = etcd_states[i]["dbSizeInUse"].uint64_value();
}
}
int mon_count = 0;
std::string mon_master;
for (int i = 0; i < mon_members.size(); i++)
{
auto kv = parent->cli->st_cli.parse_etcd_kv(mon_members[i]);
kv.key = kv.key.substr(parent->cli->st_cli.etcd_prefix.size());
if (kv.key.substr(0, 12) == "/mon/member/")
mon_count++;
else if (kv.key == "/mon/master")
{
if (kv.value["hostname"].is_string())
mon_master = kv.value["hostname"].string_value();
else
mon_master = kv.value["ip"][0].string_value();
}
}
int osd_count = 0, osd_up = 0;
uint64_t total_raw = 0, free_raw = 0, free_down_raw = 0, down_raw = 0;
for (int i = 0; i < osd_stats.size(); i++)
{
auto kv = parent->cli->st_cli.parse_etcd_kv(osd_stats[i]);
osd_num_t stat_osd_num = 0;
char null_byte = 0;
sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.size(), "/osd/stats/%lu%c", &stat_osd_num, &null_byte);
if (!stat_osd_num || null_byte != 0)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
osd_count++;
total_raw += kv.value["size"].uint64_value();
free_raw += kv.value["free"].uint64_value();
auto peer_it = parent->cli->st_cli.peer_states.find(stat_osd_num);
if (peer_it != parent->cli->st_cli.peer_states.end())
{
osd_up++;
}
else
{
down_raw += kv.value["size"].uint64_value();
                free_down_raw += kv.value["free"].uint64_value();
}
}
int pool_count = 0, pools_active = 0;
std::map<std::string, int> pgs_by_state;
std::string pgs_by_state_str;
for (auto & pool_pair: parent->cli->st_cli.pool_config)
{
auto & pool_cfg = pool_pair.second;
bool active = true;
if (pool_cfg.pg_config.size() != pool_cfg.pg_count)
{
active = false;
pgs_by_state["offline"] += pool_cfg.pg_count-pool_cfg.pg_config.size();
}
pool_count++;
for (auto pg_it = pool_cfg.pg_config.begin(); pg_it != pool_cfg.pg_config.end(); pg_it++)
{
if (!(pg_it->second.cur_state & PG_ACTIVE))
{
active = false;
}
std::string pg_state_str;
for (int i = 0; i < pg_state_bit_count; i++)
{
if (pg_it->second.cur_state & pg_state_bits[i])
{
pg_state_str += "+";
pg_state_str += pg_state_names[i];
}
}
if (pg_state_str.size())
pgs_by_state[pg_state_str.substr(1)]++;
else
pgs_by_state["offline"]++;
}
if (active)
{
pools_active++;
}
}
for (auto & kv: pgs_by_state)
{
if (pgs_by_state_str.size())
{
pgs_by_state_str += "\n ";
}
pgs_by_state_str += std::to_string(kv.second)+" "+kv.first;
}
uint64_t object_size = parent->cli->get_bs_block_size();
std::string more_states;
uint64_t obj_n;
obj_n = agg_stats["object_counts"]["misplaced"].uint64_value();
if (obj_n > 0)
more_states += ", "+format_size(obj_n*object_size)+" misplaced";
obj_n = agg_stats["object_counts"]["degraded"].uint64_value();
if (obj_n > 0)
more_states += ", "+format_size(obj_n*object_size)+" degraded";
obj_n = agg_stats["object_counts"]["incomplete"].uint64_value();
if (obj_n > 0)
more_states += ", "+format_size(obj_n*object_size)+" incomplete";
std::string recovery_io;
{
uint64_t deg_bps = agg_stats["recovery_stats"]["degraded"]["bps"].uint64_value();
uint64_t deg_iops = agg_stats["recovery_stats"]["degraded"]["iops"].uint64_value();
uint64_t misp_bps = agg_stats["recovery_stats"]["misplaced"]["bps"].uint64_value();
uint64_t misp_iops = agg_stats["recovery_stats"]["misplaced"]["iops"].uint64_value();
if (deg_iops > 0 || deg_bps > 0)
recovery_io += " recovery: "+format_size(deg_bps)+"/s, "+format_size(deg_iops, true)+" op/s\n";
if (misp_iops > 0 || misp_bps > 0)
recovery_io += " rebalance: "+format_size(misp_bps)+"/s, "+format_size(misp_iops, true)+" op/s\n";
}
if (parent->json_output)
{
// JSON output
printf("%s\n", json11::Json(json11::Json::object {
{ "etcd_alive", etcd_alive },
{ "etcd_count", (uint64_t)etcd_states.size() },
{ "etcd_db_size", etcd_db_size },
{ "mon_count", mon_count },
{ "mon_master", mon_master },
{ "osd_up", osd_up },
{ "osd_count", osd_count },
{ "total_raw", total_raw },
{ "free_raw", free_raw },
{ "down_raw", down_raw },
{ "free_down_raw", free_down_raw },
{ "clean_data", agg_stats["object_counts"]["clean"].uint64_value() * object_size },
{ "misplaced_data", agg_stats["object_counts"]["misplaced"].uint64_value() * object_size },
{ "degraded_data", agg_stats["object_counts"]["degraded"].uint64_value() * object_size },
{ "incomplete_data", agg_stats["object_counts"]["incomplete"].uint64_value() * object_size },
{ "pool_count", pool_count },
{ "active_pool_count", pools_active },
{ "pg_states", pgs_by_state },
{ "op_stats", agg_stats["op_stats"] },
{ "recovery_stats", agg_stats["recovery_stats"] },
{ "object_counts", agg_stats["object_counts"] },
}).dump().c_str());
state = 100;
return;
}
printf(
" cluster:\n"
" etcd: %d / %ld up, %s database size\n"
" mon: %d up%s\n"
" osd: %d / %d up\n"
" \n"
" data:\n"
" raw: %s used, %s / %s available%s\n"
" state: %s clean%s\n"
" pools: %d / %d active\n"
" pgs: %s\n"
" \n"
" io:\n"
" client:%s %s/s rd, %s op/s rd, %s/s wr, %s op/s wr\n"
"%s",
etcd_alive, etcd_states.size(), format_size(etcd_db_size).c_str(),
mon_count, mon_master == "" ? "" : (", master "+mon_master).c_str(),
osd_up, osd_count,
format_size(total_raw-free_raw).c_str(),
format_size(free_raw-free_down_raw).c_str(),
format_size(total_raw-down_raw).c_str(),
(down_raw > 0 ? (", "+format_size(down_raw)+" down").c_str() : ""),
format_size(agg_stats["object_counts"]["clean"].uint64_value() * object_size).c_str(), more_states.c_str(),
pools_active, pool_count,
pgs_by_state_str.c_str(),
recovery_io.size() > 0 ? " " : "",
format_size(agg_stats["op_stats"]["primary_read"]["bps"].uint64_value()).c_str(),
format_size(agg_stats["op_stats"]["primary_read"]["iops"].uint64_value(), true).c_str(),
format_size(agg_stats["op_stats"]["primary_write"]["bps"].uint64_value()).c_str(),
format_size(agg_stats["op_stats"]["primary_write"]["iops"].uint64_value(), true).c_str(),
recovery_io.c_str()
);
state = 100;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_status(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto printer = new status_printer_t();
printer->parent = this;
return [printer](cli_result_t & result)
{
printer->loop();
if (printer->is_done())
{
result = { .err = 0 };
delete printer;
return true;
}
return false;
};
}

View File

@@ -64,6 +64,42 @@ void etcd_state_client_t::etcd_txn_slow(json11::Json txn, std::function<void(std
etcd_call("/kv/txn", txn, etcd_slow_timeout, max_etcd_attempts, 0, callback);
}
std::vector<std::string> etcd_state_client_t::get_addresses()
{
auto addrs = etcd_local;
addrs.insert(addrs.end(), etcd_addresses.begin(), etcd_addresses.end());
return addrs;
}
void etcd_state_client_t::etcd_call_oneshot(std::string etcd_address, std::string api, json11::Json payload,
int timeout, std::function<void(std::string, json11::Json)> callback)
{
std::string etcd_api_path;
int pos = etcd_address.find('/');
if (pos >= 0)
{
etcd_api_path = etcd_address.substr(pos);
etcd_address = etcd_address.substr(0, pos);
}
std::string req = payload.dump();
req = "POST "+etcd_api_path+api+" HTTP/1.1\r\n"
"Host: "+etcd_address+"\r\n"
"Content-Type: application/json\r\n"
"Content-Length: "+std::to_string(req.size())+"\r\n"
"Connection: close\r\n"
"\r\n"+req;
auto http_cli = http_init(tfd);
auto cb = [this, http_cli, callback](const http_response_t *response)
{
std::string err;
json11::Json data;
response->parse_json_response(err, data);
callback(err, data);
http_close(http_cli);
};
http_request(http_cli, etcd_address, req, { .timeout = timeout }, cb);
}
void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout,
int retries, int interval, std::function<void(std::string, json11::Json)> callback)
{
@@ -918,6 +954,10 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
if (!value.is_object())
{
if (on_inode_change_hook != NULL)
{
on_inode_change_hook(inode_num, true);
}
this->inode_config.erase(inode_num);
}
else
@@ -932,7 +972,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
fprintf(
stderr, "Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
inode_num >> (64-POOL_ID_BITS), inode_num & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)
);
parent_inode_num = 0;
}
@@ -959,6 +999,10 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
}
}
if (on_inode_change_hook != NULL)
{
on_inode_change_hook(inode_num, false);
}
}
}
}

View File

@@ -109,9 +109,12 @@ public:
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;
std::function<void()> on_reload_hook;
std::function<void(inode_t, bool)> on_inode_change_hook;
json11::Json::object serialize_inode_cfg(inode_config_t *cfg);
etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
std::vector<std::string> get_addresses();
void etcd_call_oneshot(std::string etcd_address, std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
void etcd_call(std::string api, json11::Json payload, int timeout, int retries, int interval, std::function<void(std::string, json11::Json)> callback);
void etcd_txn(json11::Json txn, int timeout, int retries, int interval, std::function<void(std::string, json11::Json)> callback);
void etcd_txn_slow(json11::Json txn, std::function<void(std::string, json11::Json)> callback);

View File

@@ -214,14 +214,14 @@ static int sec_setup(struct thread_data *td)
if (!o->image)
{
if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
if (!(o->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)))
{
td_verror(td, EINVAL, "inode number is missing");
return 1;
}
if (o->pool)
{
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
o->inode = (o->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
}
if (!(o->inode >> (64-POOL_ID_BITS)))
{

View File

@@ -342,6 +342,8 @@ void http_co_t::handle_events()
}
else if (epoll_events & (EPOLLRDHUP|EPOLLERR))
{
if (state == HTTP_CO_HEADERS_RECEIVED)
std::swap(parsed.body, response);
close_connection();
run_cb_and_clear();
break;
@@ -465,6 +467,8 @@ again:
{
// < 0 means error, 0 means EOF
epoll_events = epoll_events & ~EPOLLIN;
if (state == HTTP_CO_HEADERS_RECEIVED)
std::swap(parsed.body, response);
close_connection();
if (res < 0)
parsed = { .error = std::string("recvmsg: ")+strerror(-res) };

View File

@@ -54,6 +54,8 @@ protected:
msghdr read_msg = { 0 }, send_msg = { 0 };
iovec read_iov = { 0 };
std::string logfile = "/dev/null";
public:
~nbd_proxy()
{
@@ -187,7 +189,7 @@ public:
uint64_t pool = cfg["pool"].uint64_value();
if (pool)
{
inode = (inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
inode = (inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}
if (!(inode >> (64-POOL_ID_BITS)))
{
@@ -278,6 +280,10 @@ public:
}
}
}
if (cfg["logfile"].is_string())
{
logfile = cfg["logfile"].string_value();
}
if (bg)
{
daemonize();
@@ -363,14 +369,14 @@ public:
setsid();
if (fork())
exit(0);
if (chdir("/") != 0)
fprintf(stderr, "Warning: Failed to chdir into /\n");
close(0);
close(1);
close(2);
open("/dev/null", O_RDONLY);
open("/dev/null", O_WRONLY);
open("/dev/null", O_WRONLY);
open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
if (chdir("/") != 0)
fprintf(stderr, "Warning: Failed to chdir into /\n");
}
json11::Json::object list_mapped()
@@ -684,6 +690,7 @@ protected:
{
assert(result <= cur_left);
cur_left -= result;
cur_buf = (uint8_t*)cur_buf + result;
result = 0;
}
if (cur_left <= 0)
@@ -698,6 +705,12 @@ protected:
if (read_state == CL_READ_HDR)
{
int req_type = be32toh(cur_req.type);
if (be32toh(cur_req.magic) == NBD_REQUEST_MAGIC && req_type == NBD_CMD_DISC)
{
// Disconnect
close(nbd_fd);
exit(0);
}
if (be32toh(cur_req.magic) != NBD_REQUEST_MAGIC ||
req_type != NBD_CMD_READ && req_type != NBD_CMD_WRITE && req_type != NBD_CMD_FLUSH)
{

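Editor's note, not part of the patch: the cur_buf advance added earlier in this file's diff is the classic partial-read fix - when a read returns fewer bytes than requested, the next read must continue past the data already received, otherwise it overwrites the start of the buffer and corrupts large transfers. A simplified blocking analogue of the same pattern (the proxy itself uses non-blocking recvmsg with epoll):

#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

// Read exactly <want> bytes into <buf>, advancing past partial results;
// returns false on error or EOF.
static bool read_full(int fd, void *buf, size_t want)
{
    size_t done = 0;
    while (done < want)
    {
        ssize_t r = read(fd, (uint8_t*)buf + done, want - done);
        if (r <= 0)
            return false;
        done += (size_t)r;
    }
    return true;
}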
748
src/nfs_conn.cpp Normal file
View File

@@ -0,0 +1,748 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS connection handler for NFS proxy
#include <sys/time.h>
#include "libnfs-raw-mount.h"
#include "libnfs-raw-nfs.h"
#include "base64.h"
#include "nfs_proxy.h"
static unsigned len_pad4(unsigned len)
{
return len + (len&3 ? 4-(len&3) : 0);
}
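Editor's note, not part of the patch: len_pad4() rounds a length up to the next multiple of 4, as XDR requires for opaque/string payloads. An equivalent branch-free form, just for illustration:

// 0 -> 0, 5 -> 8, 8 -> 8: same results as len_pad4()
static unsigned len_pad4_alt(unsigned len)
{
    return (len + 3) & ~3u;
}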
static int nfs3_null_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
rpc_send_reply(rpc, call, NULL, (zdrproc_t)zdr_void, 0);
return 0;
}
static fattr3 get_dir_attributes(nfs_client_t *self, std::string dir)
{
return (fattr3){
.type = NF3DIR,
.mode = 0755,
.nlink = 1,
.uid = 0,
.gid = 0,
.size = 4096,
.used = 4096,
.rdev = (specdata3){ 0 },
.fsid = self->parent->fsid,
.fileid = dir == "" ? 1 : self->parent->dir_ids.at(dir),
//.atime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
//.mtime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
//.ctime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
};
}
static int nfs3_getattr_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
GETATTR3args *args = (GETATTR3args*)call->body.cbody.args;
GETATTR3res reply;
std::string dirhash = std::string(args->object.data.data_val, args->object.data.data_len);
bool is_dir = false;
std::string dir;
if (dirhash == "roothandle")
is_dir = true;
else
{
auto dir_it = self->parent->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end())
{
is_dir = true;
dir = dir_it->second;
}
}
if (is_dir)
{
// Directory info
reply.status = NFS3_OK;
reply.GETATTR3res_u.resok.obj_attributes = get_dir_attributes(self, dir);
}
else
{
        uint64_t inode_num = 0;
auto inode_num_it = self->parent->inode_by_hash.find(dirhash);
if (inode_num_it != self->parent->inode_by_hash.end())
inode_num = inode_num_it->second;
auto inode_it = self->parent->cli->st_cli.inode_config.find(inode_num);
if (inode_it != self->parent->cli->st_cli.inode_config.end())
{
// File info
auto & inode_cfg = inode_it->second;
reply.status = NFS3_OK;
reply.GETATTR3res_u.resok.obj_attributes = {
.type = NF3REG,
.mode = 0644,
.nlink = 1,
.uid = 0,
.gid = 0,
.size = inode_cfg.size,
.used = inode_cfg.size,
.rdev = (specdata3){ 0 },
.fsid = self->parent->fsid,
.fileid = inode_it->first,
//.atime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
//.mtime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
//.ctime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
};
}
else
{
            // File does not exist
reply.status = NFS3ERR_NOENT;
}
}
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_GETATTR3res, sizeof(GETATTR3res));
return 0;
}
static int nfs3_setattr_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
SETATTR3args *args = (SETATTR3args*)call->body.cbody.args;
SETATTR3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_SETATTR3res, sizeof(SETATTR3res));
return 0;
}
static int nfs3_lookup_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
LOOKUP3args *args = (LOOKUP3args*)call->body.cbody.args;
LOOKUP3res reply;
std::string dirhash = std::string(args->what.dir.data.data_val, args->what.dir.data.data_len);
std::string dir;
if (dirhash != "roothandle")
{
auto dir_it = self->parent->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end())
dir = dir_it->second;
}
std::string full_name = self->parent->name_prefix;
if (dir != "")
{
full_name += dir+"/";
}
full_name += std::string(args->what.name);
for (auto & ic: self->parent->cli->st_cli.inode_config)
{
if (ic.second.name == full_name)
{
std::string fh = "S"+base64_encode(sha256(full_name.substr(self->parent->name_prefix.size())));
reply.status = NFS3_OK;
reply.LOOKUP3res_u.resok.object.data.data_len = fh.size();
reply.LOOKUP3res_u.resok.object.data.data_val = (char*)fh.c_str();
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_LOOKUP3res, sizeof(LOOKUP3res));
return 0;
}
}
reply.status = NFS3ERR_NOENT;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_LOOKUP3res, sizeof(LOOKUP3res));
return 0;
}
static int nfs3_access_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
ACCESS3args *args = (ACCESS3args*)call->body.cbody.args;
ACCESS3res reply = {
.status = NFS3_OK,
.ACCESS3res_u = { .resok = {
.access = args->access,
} },
};
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_ACCESS3res, sizeof(ACCESS3res));
return 0;
}
static int nfs3_readlink_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
READLINK3args *args = (READLINK3args*)call->body.cbody.args;
READLINK3res reply = {};
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READLINK3res, sizeof(READLINK3res));
return 0;
}
#define MAX_REQUEST_SIZE 128*1024*1024
static int nfs3_read_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
READ3args *args = (READ3args*)call->body.cbody.args;
std::string handle = std::string(args->file.data.data_val, args->file.data.data_len);
auto ino_it = self->parent->inode_by_hash.find(handle);
if (ino_it == self->parent->inode_by_hash.end())
{
READ3res reply = { .status = NFS3ERR_NOENT };
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READ3res, sizeof(READ3res));
return 0;
}
if (args->count > MAX_REQUEST_SIZE)
{
READ3res reply = { .status = NFS3ERR_INVAL };
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READ3res, sizeof(READ3res));
return 0;
}
void *buf = malloc_or_die(args->count);
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_READ;
op->inode = ino_it->second;
op->offset = args->offset;
op->len = args->count;
op->iov.push_back(buf, args->count);
op->callback = [rpc, call](cluster_op_t *op)
{
void *buf = op->iov.buf[0].iov_base;
READ3res reply = {};
if (op->retval != op->len)
{
if (op->retval == -EINVAL)
reply.status = NFS3ERR_INVAL;
else if (op->retval == -ENOSPC)
reply.status = NFS3ERR_NOSPC;
else
reply.status = NFS3ERR_IO;
}
else
{
reply.status = NFS3_OK;
auto & reply_ok = reply.READ3res_u.resok;
reply_ok.count = op->retval;
reply_ok.eof = FALSE;
reply_ok.data.data_len = reply_ok.count;
reply_ok.data.data_val = (char*)buf;
}
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READ3res, sizeof(READ3res));
delete op;
free(buf);
};
self->parent->cli->execute(op);
return 0;
}
static int nfs3_write_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
WRITE3args *args = (WRITE3args*)call->body.cbody.args;
WRITE3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_WRITE3res, sizeof(WRITE3res));
return 0;
}
static int nfs3_create_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
CREATE3args *args = (CREATE3args*)call->body.cbody.args;
CREATE3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_CREATE3res, sizeof(CREATE3res));
return 0;
}
static int nfs3_mkdir_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
MKDIR3args *args = (MKDIR3args*)call->body.cbody.args;
MKDIR3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_MKDIR3res, sizeof(MKDIR3res));
return 0;
}
static int nfs3_symlink_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
SYMLINK3args *args = (SYMLINK3args*)call->body.cbody.args;
SYMLINK3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_SYMLINK3res, sizeof(SYMLINK3res));
return 0;
}
static int nfs3_mknod_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
MKNOD3args *args = (MKNOD3args*)call->body.cbody.args;
MKNOD3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_MKNOD3res, sizeof(MKNOD3res));
return 0;
}
static int nfs3_remove_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
REMOVE3args *args = (REMOVE3args*)call->body.cbody.args;
REMOVE3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_REMOVE3res, sizeof(REMOVE3res));
return 0;
}
static int nfs3_rmdir_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
RMDIR3args *args = (RMDIR3args*)call->body.cbody.args;
RMDIR3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_RMDIR3res, sizeof(RMDIR3res));
return 0;
}
static int nfs3_rename_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
RENAME3args *args = (RENAME3args*)call->body.cbody.args;
RENAME3res reply;
// Not supported yet
reply.status = NFS3ERR_NOTSUPP;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_RENAME3res, sizeof(RENAME3res));
return 0;
}
static int nfs3_link_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
LINK3args *args = (LINK3args*)call->body.cbody.args;
// We don't support hard links
LINK3res reply = { NFS3ERR_NOTSUPP };
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_LINK3res, sizeof(LINK3res));
return 0;
}
static void nfs3_readdir_common(struct rpc_context *rpc, struct rpc_msg *call, void *opaque, bool is_plus)
{
nfs_client_t *self = (nfs_client_t*)opaque;
READDIRPLUS3args plus_args;
READDIRPLUS3args *args = NULL;
if (is_plus)
args = ((READDIRPLUS3args*)call->body.cbody.args);
else
{
args = &plus_args;
READDIR3args *in_args = ((READDIR3args*)call->body.cbody.args);
args->dir = in_args->dir;
args->cookie = in_args->cookie;
*((uint64_t*)args->cookieverf) = *((uint64_t*)in_args->cookieverf);
args->dircount = 512;
args->maxcount = in_args->count;
}
std::string dirhash = std::string(args->dir.data.data_val, args->dir.data.data_len);
std::string dir;
if (dirhash != "roothandle")
{
auto dir_it = self->parent->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end())
dir = dir_it->second;
}
std::string prefix = self->parent->name_prefix;
if (dir != "")
{
prefix += dir+"/";
}
//struct timespec now;
//clock_gettime(CLOCK_REALTIME, &now);
std::map<std::string, struct entryplus3> entries;
std::vector<std::string> handles;
for (auto & ic: self->parent->cli->st_cli.inode_config)
{
auto & inode_cfg = ic.second;
if (prefix != "" && inode_cfg.name.substr(0, prefix.size()) != prefix)
continue;
std::string subname = inode_cfg.name.substr(prefix.size());
int p = 0;
while (p < subname.size() && subname[p] == '/')
p++;
if (p > 0)
subname = subname.substr(p);
if (subname.size() == 0)
continue;
p = 0;
while (p < subname.size() && subname[p] != '/')
p++;
if (p >= subname.size())
{
entries[subname] = (struct entryplus3){
// fileid will change when the user creates snapshots
// however, we hope that clients tolerate it well
// Linux does, even though it complains about "fileid changed" in dmesg
.fileid = ic.first,
};
if (is_plus)
{
handles.push_back("S"+base64_encode(sha256(inode_cfg.name)));
entries[subname].name_attributes = {
.attributes_follow = TRUE,
.post_op_attr_u = { .attributes = {
.type = NF3REG,
.mode = 0644,
.nlink = 1,
.uid = 0,
.gid = 0,
.size = inode_cfg.size,
.used = inode_cfg.size, // FIXME take from statistics
.rdev = (specdata3){ 0 },
.fsid = self->parent->fsid,
.fileid = ic.first,
//.atime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
//.mtime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
//.ctime = (nfstime3){ .seconds = now.tv_sec, .nseconds = now.tv_nsec },
} },
};
entries[subname].name_handle = {
.handle_follows = TRUE,
.post_op_fh3_u = { .handle = {
.data = {
// FIXME: I really want ZDR with std::string
.data_len = (unsigned)handles[handles.size()-1].size(),
.data_val = (char*)handles[handles.size()-1].c_str(),
},
} },
};
}
}
else
{
auto subdir = dir == "" ? subname.substr(0, p) : dir+"/"+subname.substr(0, p);
entries[subdir] = (struct entryplus3){
// for directories, fileid will change when the user restarts proxy
.fileid = self->parent->dir_ids.at(subdir),
};
if (is_plus)
{
handles.push_back("S"+base64_encode(sha256(subdir)));
entries[subdir].name_attributes = {
.attributes_follow = TRUE,
.post_op_attr_u = { .attributes = get_dir_attributes(self, subdir) },
};
entries[subdir].name_handle = {
.handle_follows = TRUE,
.post_op_fh3_u = { .handle = {
.data = {
// FIXME: I really want ZDR with std::string
.data_len = (unsigned)handles[handles.size()-1].size(),
.data_val = (char*)handles[handles.size()-1].c_str(),
},
} },
};
}
}
}
// Offset results by the continuation cookie (equal to index in the listing)
uint64_t idx = 1;
void *prev = NULL;
for (auto it = entries.begin(); it != entries.end(); it++)
{
entryplus3 *entry = &it->second;
// First fields of entry3 and entryplus3 are the same: fileid, name, cookie
entry->name = (char*)it->first.c_str();
entry->cookie = idx++;
if (prev)
{
if (is_plus)
((entryplus3*)prev)->nextentry = entry;
else
((entry3*)prev)->nextentry = (entry3*)entry;
}
prev = entry;
if (args->cookie > 0 && entry->cookie == args->cookie+1)
{
entries.erase(entries.begin(), it);
}
}
// Now limit results based on maximum reply size
// Sadly we have to calculate reply size by hand
// reply without entries is 4+4+(dir_attributes ? sizeof(fattr3) : 0)+8+4 bytes
int reply_size = 20;
if (reply_size > args->maxcount)
{
// Error, too small max reply size
if (is_plus)
{
READDIRPLUS3res reply = { .status = NFS3ERR_TOOSMALL };
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READDIRPLUS3res, sizeof(READDIRPLUS3res));
}
else
{
READDIR3res reply = { .status = NFS3ERR_TOOSMALL };
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READDIR3res, sizeof(READDIR3res));
}
return;
}
// 1 entry3 is (8+4+(filename_len+3)/4*4+8) bytes
// 1 entryplus3 is (8+4+(filename_len+3)/4*4+8
// + 4+(name_attributes ? (sizeof(fattr3) = 84) : 0)
// + 4+(name_handle ? 4+(handle_len+3)/4*4 : 0)) bytes
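// Example: a 7-character name ("testimg") pads to 8 bytes and its 45-byte
// "S"+base64(sha256) handle pads to 48, so one entryplus3 costs about
// 20+8 + 8+84+48 = 168 bytes of the maxcount budget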
bool eof = true;
for (auto it = entries.begin(); it != entries.end(); it++)
{
reply_size += 20+len_pad4(it->first.size())+(is_plus
? 8+84+len_pad4(it->second.name_handle.post_op_fh3_u.handle.data.data_len) : 0);
if (reply_size > args->maxcount)
{
// Stop
entries.erase(it, entries.end());
eof = false;
break;
}
}
if (entries.end() != entries.begin())
{
auto last_it = entries.end();
last_it--;
if (is_plus)
((entryplus3*)&last_it->second)->nextentry = NULL;
else
((entry3*)&last_it->second)->nextentry = NULL;
}
// Send reply
if (is_plus)
{
READDIRPLUS3res reply = { .status = NFS3_OK };
*(uint64_t*)(reply.READDIRPLUS3res_u.resok.cookieverf) = self->parent->dir_mod_rev.at(dir);
reply.READDIRPLUS3res_u.resok.reply.entries = entries.size() ? &entries.begin()->second : NULL;
reply.READDIRPLUS3res_u.resok.reply.eof = eof;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READDIRPLUS3res, sizeof(READDIRPLUS3res));
}
else
{
READDIR3res reply = { .status = NFS3_OK };
*(uint64_t*)(reply.READDIR3res_u.resok.cookieverf) = self->parent->dir_mod_rev.at(dir);
reply.READDIR3res_u.resok.reply.entries = entries.size() ? (entry3*)&entries.begin()->second : NULL;
reply.READDIR3res_u.resok.reply.eof = eof;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_READDIR3res, sizeof(READDIR3res));
}
}
static int nfs3_readdir_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs3_readdir_common(rpc, call, opaque, false);
return 0;
}
static int nfs3_readdirplus_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs3_readdir_common(rpc, call, opaque, true);
return 0;
}
// Get file system statistics
static int nfs3_fsstat_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
FSSTAT3args *args = (FSSTAT3args*)call->body.cbody.args;
FSSTAT3res reply;
reply.status = NFS3_OK;
reply.FSSTAT3res_u.resok.obj_attributes.attributes_follow = TRUE;
reply.FSSTAT3res_u.resok.obj_attributes.post_op_attr_u.attributes = get_dir_attributes(self, "");
reply.FSSTAT3res_u.resok.tbytes = 4096; // total bytes
reply.FSSTAT3res_u.resok.fbytes = 4096; // free bytes
reply.FSSTAT3res_u.resok.abytes = 4096; // available bytes
reply.FSSTAT3res_u.resok.tfiles = (uint64_t)1 << 31; // total files
reply.FSSTAT3res_u.resok.ffiles = (uint64_t)1 << 31; // free files
reply.FSSTAT3res_u.resok.afiles = (uint64_t)1 << 31; // available files
reply.FSSTAT3res_u.resok.invarsec = 0;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_FSSTAT3res, sizeof(FSSTAT3res));
return 0;
}
static int nfs3_fsinfo_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
FSINFO3args *args = (FSINFO3args*)call->body.cbody.args;
FSINFO3res reply;
if (args->fsroot.data.data_len != 10)
{
// Example error
reply.status = NFS3ERR_INVAL;
}
else
{
// Fill info
reply.status = NFS3_OK;
reply.FSINFO3res_u.resok.obj_attributes.attributes_follow = TRUE;
reply.FSINFO3res_u.resok.obj_attributes.post_op_attr_u.attributes = get_dir_attributes(self, "");
reply.FSINFO3res_u.resok.rtmax = 128*1024*1024;
reply.FSINFO3res_u.resok.rtpref = 128*1024*1024;
reply.FSINFO3res_u.resok.rtmult = 4096;
reply.FSINFO3res_u.resok.wtmax = 128*1024*1024;
reply.FSINFO3res_u.resok.wtpref = 128*1024*1024;
reply.FSINFO3res_u.resok.wtmult = 4096;
reply.FSINFO3res_u.resok.dtpref = 128;
reply.FSINFO3res_u.resok.maxfilesize = 0x7fffffffffffffff;
reply.FSINFO3res_u.resok.time_delta.seconds = 1;
reply.FSINFO3res_u.resok.time_delta.nseconds = 0;
reply.FSINFO3res_u.resok.properties = FSF3_SYMLINK | FSF3_HOMOGENEOUS;
}
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_FSINFO3res, sizeof(FSINFO3res));
return 0;
}
static int nfs3_pathconf_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
PATHCONF3args *args = (PATHCONF3args*)call->body.cbody.args;
PATHCONF3res reply;
if (args->object.data.data_len != 10)
{
// Example error
reply.status = NFS3ERR_INVAL;
}
else
{
// Fill info
reply.status = NFS3_OK;
reply.PATHCONF3res_u.resok.obj_attributes.attributes_follow = FALSE;
reply.PATHCONF3res_u.resok.linkmax = 0;
reply.PATHCONF3res_u.resok.name_max = 255;
reply.PATHCONF3res_u.resok.no_trunc = TRUE;
reply.PATHCONF3res_u.resok.chown_restricted = FALSE;
reply.PATHCONF3res_u.resok.case_insensitive = FALSE;
reply.PATHCONF3res_u.resok.case_preserving = TRUE;
}
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_PATHCONF3res, sizeof(PATHCONF3res));
return 0;
}
static int nfs3_commit_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
COMMIT3args *args = (COMMIT3args*)call->body.cbody.args;
COMMIT3res reply = {};
// Just pretend we did fsync :-)
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_COMMIT3res, sizeof(COMMIT3res));
return 0;
}
static int mount3_mnt_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
dirpath *arg = (dirpath*)call->body.cbody.args;
int flavor = AUTH_NONE;
mountres3 reply;
reply.fhs_status = MNT3_OK;
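// The root export handle is the literal string "roothandle" (10 bytes);
// fsinfo/pathconf above recognize the filesystem root by this 10-byte length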
reply.mountres3_u.mountinfo.fhandle.fhandle3_len = 10;
reply.mountres3_u.mountinfo.fhandle.fhandle3_val = (char*)"roothandle";
reply.mountres3_u.mountinfo.auth_flavors.auth_flavors_len = 1;
reply.mountres3_u.mountinfo.auth_flavors.auth_flavors_val = &flavor;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_mountres3, sizeof(mountres3));
return 0;
}
static int mount3_dump_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
mountlist reply;
reply = (struct mountbody*)malloc(sizeof(struct mountbody));
reply->ml_hostname = (dirpath)"127.0.0.1";
reply->ml_directory = (dirpath)"/test";
reply->ml_next = NULL;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_mountlist, sizeof(mountlist));
free(reply);
return 0;
}
static int mount3_umnt_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
dirpath *arg = (dirpath*)call->body.cbody.args;
// do nothing
rpc_send_reply(rpc, call, NULL, (zdrproc_t)zdr_void, 0);
return 0;
}
static int mount3_umntall_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
// do nothing
rpc_send_reply(rpc, call, NULL, (zdrproc_t)zdr_void, 0);
return 0;
}
static int mount3_export_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
nfs_client_t *self = (nfs_client_t*)opaque;
exports reply;
reply = (struct exportnode*)malloc(sizeof(struct exportnode) + sizeof(struct groupnode));
reply->ex_dir = (dirpath)"/test";
reply->ex_groups = (struct groupnode*)(reply+1);
reply->ex_groups->gr_name = (dirpath)"127.0.0.1";
reply->ex_groups->gr_next = NULL;
reply->ex_next = NULL;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_exports, sizeof(exports));
free(reply);
return 0;
}
nfs_client_t::nfs_client_t()
{
struct service_proc nfs3_pt_a[22] = {
{NFS3_NULL, nfs3_null_proc, (zdrproc_t)zdr_void, 0, this},
{NFS3_GETATTR, nfs3_getattr_proc, (zdrproc_t)zdr_GETATTR3args, sizeof(GETATTR3args), this},
{NFS3_SETATTR, nfs3_setattr_proc, (zdrproc_t)zdr_SETATTR3args, sizeof(SETATTR3args), this},
{NFS3_LOOKUP, nfs3_lookup_proc, (zdrproc_t)zdr_LOOKUP3args, sizeof(LOOKUP3args), this},
{NFS3_ACCESS, nfs3_access_proc, (zdrproc_t)zdr_ACCESS3args, sizeof(ACCESS3args), this},
{NFS3_READLINK, nfs3_readlink_proc, (zdrproc_t)zdr_READLINK3args, sizeof(READLINK3args), this},
{NFS3_READ, nfs3_read_proc, (zdrproc_t)zdr_READ3args, sizeof(READ3args), this},
{NFS3_WRITE, nfs3_write_proc, (zdrproc_t)zdr_WRITE3args, sizeof(WRITE3args), this},
{NFS3_CREATE, nfs3_create_proc, (zdrproc_t)zdr_CREATE3args, sizeof(CREATE3args), this},
{NFS3_MKDIR, nfs3_mkdir_proc, (zdrproc_t)zdr_MKDIR3args, sizeof(MKDIR3args), this},
{NFS3_SYMLINK, nfs3_symlink_proc, (zdrproc_t)zdr_SYMLINK3args, sizeof(SYMLINK3args), this},
{NFS3_MKNOD, nfs3_mknod_proc, (zdrproc_t)zdr_MKNOD3args, sizeof(MKNOD3args), this},
{NFS3_REMOVE, nfs3_remove_proc, (zdrproc_t)zdr_REMOVE3args, sizeof(REMOVE3args), this},
{NFS3_RMDIR, nfs3_rmdir_proc, (zdrproc_t)zdr_RMDIR3args, sizeof(RMDIR3args), this},
{NFS3_RENAME, nfs3_rename_proc, (zdrproc_t)zdr_RENAME3args, sizeof(RENAME3args), this},
{NFS3_LINK, nfs3_link_proc, (zdrproc_t)zdr_LINK3args, sizeof(LINK3args), this},
{NFS3_READDIR, nfs3_readdir_proc, (zdrproc_t)zdr_READDIR3args, sizeof(READDIR3args), this},
{NFS3_READDIRPLUS, nfs3_readdirplus_proc, (zdrproc_t)zdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), this},
{NFS3_FSSTAT, nfs3_fsstat_proc, (zdrproc_t)zdr_FSSTAT3args, sizeof(FSSTAT3args), this},
{NFS3_FSINFO, nfs3_fsinfo_proc, (zdrproc_t)zdr_FSINFO3args, sizeof(FSINFO3args), this},
{NFS3_PATHCONF, nfs3_pathconf_proc, (zdrproc_t)zdr_PATHCONF3args, sizeof(PATHCONF3args), this},
{NFS3_COMMIT, nfs3_commit_proc, (zdrproc_t)zdr_COMMIT3args, sizeof(COMMIT3args), this},
};
for (int i = 0; i < sizeof(nfs3_pt_a)/sizeof(service_proc); i++)
{
nfs3_pt.push_back(nfs3_pt_a[i]);
}
struct service_proc nfs3_mount_pt_a[6] = {
{MOUNT3_NULL, nfs3_null_proc, (zdrproc_t)zdr_void, 0, this},
{MOUNT3_MNT, mount3_mnt_proc, (zdrproc_t)zdr_dirpath, sizeof(dirpath), this},
{MOUNT3_DUMP, mount3_dump_proc, (zdrproc_t)zdr_void, 0, this},
{MOUNT3_UMNT, mount3_umnt_proc, (zdrproc_t)zdr_dirpath, sizeof(dirpath), this},
{MOUNT3_UMNTALL, mount3_umntall_proc, (zdrproc_t)zdr_void, 0, this},
{MOUNT3_EXPORT, mount3_export_proc, (zdrproc_t)zdr_void, 0, this},
};
for (int i = 0; i < sizeof(nfs3_mount_pt_a)/sizeof(service_proc); i++)
{
nfs3_mount_pt.push_back(nfs3_mount_pt_a[i]);
}
}
nfs_client_t::~nfs_client_t()
{
if (rpc)
{
rpc_disconnect(rpc, NULL);
rpc_destroy_context(rpc);
}
}

172
src/nfs_portmap.cpp Normal file
View File

@@ -0,0 +1,172 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Portmap service for NFS proxy
#include <netinet/in.h>
#include <string.h>
#include "nfs_portmap.h"
#include "libnfs-raw-portmap.h"
#include "sha256.h"
#include "base64.h"
/*
* The NULL procedure. All protocols/versions must provide a NULL procedure
* as index 0.
* It is used by clients, and rpcinfo, to "ping" a service and verify that
* the service is available and that it does support the indicated version.
*/
static int pmap2_null_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
rpc_send_reply(rpc, call, NULL, (zdrproc_t)zdr_void, 0);
return 0;
}
/*
* v2 GETPORT.
* This is the lookup function for portmapper version 2.
* A client provides program, version and protocol (tcp or udp)
* and portmapper returns which port that service is available on,
* (or 0 if no such program is registered.)
*/
static int pmap2_getport_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
portmap_service_t *self = (portmap_service_t *)opaque;
PMAP2GETPORTargs *args = (PMAP2GETPORTargs *)call->body.cbody.args;
uint32_t port = 0;
auto it = self->reg_ports.lower_bound((portmap_id_t){
.prog = args->prog,
.vers = args->vers,
.udp = args->prot == IPPROTO_UDP,
.ipv6 = false,
});
if (it != self->reg_ports.end() &&
it->prog == args->prog && it->vers == args->vers &&
it->udp == (args->prot == IPPROTO_UDP))
{
port = it->port;
}
rpc_send_reply(rpc, call, &port, (zdrproc_t)zdr_uint32_t, sizeof(uint32_t));
return 0;
}
/*
* v2 DUMP.
* This RPC returns a list of all endpoints that are registered with
* portmapper.
*/
static int pmap2_dump_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
portmap_service_t *self = (portmap_service_t *)opaque;
pmap2_mapping_list *list = new pmap2_mapping_list[self->reg_ports.size()];
int i = 0;
for (auto it = self->reg_ports.begin(); it != self->reg_ports.end(); it++)
{
if (it->ipv6)
continue;
list[i] = {
.map = {
.prog = it->prog,
.vers = it->vers,
.prot = it->udp ? IPPROTO_UDP : IPPROTO_TCP,
.port = it->port,
},
.next = list+i+1,
};
i++;
}
list[i-1].next = NULL;
// Send reply
PMAP2DUMPres reply;
reply.list = list;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_PMAP2DUMPres, sizeof(PMAP2DUMPres));
reply.list = NULL;
delete[] list;
return 0;
}
/*
* v3 GETADDR.
* This is the lookup function for portmapper version 3.
*/
static int pmap3_getaddr_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
portmap_service_t *self = (portmap_service_t *)opaque;
PMAP3GETADDRargs *args = (PMAP3GETADDRargs *)call->body.cbody.args;
portmap_id_t ref = (portmap_id_t){
.prog = args->prog,
.vers = args->vers,
.udp = !strcmp(args->netid, "udp") || !strcmp(args->netid, "udp6"),
.ipv6 = !strcmp(args->netid, "tcp6") || !strcmp(args->netid, "udp6"),
};
auto it = self->reg_ports.lower_bound(ref);
PMAP3GETADDRres reply;
// Return an empty universal address when the requested service is not registered
reply.addr = (char*)"";
if (it != self->reg_ports.end() &&
it->prog == ref.prog && it->vers == ref.vers &&
it->udp == ref.udp && it->ipv6 == ref.ipv6)
{
reply.addr = (char*)it->addr.c_str();
}
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_PMAP3GETADDRres, sizeof(PMAP3GETADDRres));
return 0;
}
/*
* v3 DUMP.
* This RPC returns a list of all endpoints that are registered with
* portmapper.
*/
static int pmap3_dump_proc(struct rpc_context *rpc, struct rpc_msg *call, void *opaque)
{
portmap_service_t *self = (portmap_service_t *)opaque;
pmap3_mapping_list *list = new pmap3_mapping_list[self->reg_ports.size()];
int i = 0;
for (auto it = self->reg_ports.begin(); it != self->reg_ports.end(); it++)
{
list[i] = (pmap3_mapping_list){
.map = {
.prog = it->prog,
.vers = it->vers,
.netid = (char*)(it->ipv6
? (it->udp ? "udp6" : "tcp6")
: (it->udp ? "udp" : "tcp")),
.addr = (char*)it->addr.c_str(), // 0.0.0.0.port
.owner = (char*)it->owner.c_str(),
},
.next = list+i+1,
};
i++;
}
list[i-1].next = NULL;
// Send reply
PMAP3DUMPres reply;
reply.list = list;
rpc_send_reply(rpc, call, &reply, (zdrproc_t)zdr_PMAP3DUMPres, sizeof(PMAP3DUMPres));
reply.list = NULL;
delete[] list;
return 0;
}
portmap_service_t::portmap_service_t()
{
pmap2_pt.push_back((service_proc){PMAP2_NULL, pmap2_null_proc, (zdrproc_t)zdr_void, 0, this});
pmap2_pt.push_back((service_proc){PMAP2_GETPORT, pmap2_getport_proc, (zdrproc_t)zdr_PMAP2GETPORTargs, sizeof(PMAP2GETPORTargs), this});
pmap2_pt.push_back((service_proc){PMAP2_DUMP, pmap2_dump_proc, (zdrproc_t)zdr_void, 0, this});
pmap3_pt.push_back((service_proc){PMAP3_NULL, pmap2_null_proc, (zdrproc_t)zdr_void, 0, this});
pmap3_pt.push_back((service_proc){PMAP3_GETADDR, pmap3_getaddr_proc, (zdrproc_t)zdr_PMAP3GETADDRargs, sizeof(PMAP3GETADDRargs), this});
pmap3_pt.push_back((service_proc){PMAP3_DUMP, pmap3_dump_proc, (zdrproc_t)zdr_void, 0, this});
}
std::string sha256(const std::string & str)
{
std::string hash;
hash.resize(32);
SHA256_CTX ctx;
sha256_init(&ctx);
sha256_update(&ctx, (uint8_t*)str.data(), str.size());
sha256_final(&ctx, (uint8_t*)hash.data());
return hash;
}
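// Note: this hash is what the stateless NFS file handles elsewhere in the proxy
// are built from: "S" + base64(sha256(name)) is 1+44 = 45 bytes, comfortably
// within the 64-byte NFS3 handle size limit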

41
src/nfs_portmap.h Normal file
View File

@@ -0,0 +1,41 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Portmap service for NFS proxy
#pragma once
#include <string>
#include <set>
#include <vector>
#include "nfsc/libnfs.h"
#include "nfsc/libnfs-raw.h"
struct portmap_id_t
{
unsigned prog, vers;
bool udp;
bool ipv6;
unsigned port;
std::string owner;
std::string addr;
};
class portmap_service_t
{
public:
std::set<portmap_id_t> reg_ports;
std::vector<service_proc> pmap2_pt;
std::vector<service_proc> pmap3_pt;
portmap_service_t();
};
inline bool operator < (const portmap_id_t &a, const portmap_id_t &b)
{
return a.prog < b.prog || (a.prog == b.prog && a.vers < b.vers) ||
(a.prog == b.prog && a.vers == b.vers && a.udp < b.udp) ||
(a.prog == b.prog && a.vers == b.vers && a.udp == b.udp && a.ipv6 < b.ipv6);
}
std::string sha256(const std::string & str);

301
src/nfs_proxy.cpp Normal file
View File

@@ -0,0 +1,301 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Simplified NFS proxy
// Presents all images as files, stores small files directly in etcd
// Keeps image list in memory and thus is unsuitable for a lot of files
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <sys/poll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "libnfs-raw-mount.h"
#include "libnfs-raw-nfs.h"
#include "libnfs-raw-portmap.h"
#include "addr_util.h"
#include "base64.h"
#include "nfs_proxy.h"
const char *exe_name = NULL;
nfs_proxy_t::~nfs_proxy_t()
{
if (cli)
delete cli;
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
json11::Json::object nfs_proxy_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor NFS 3.0 proxy\n"
"(c) Vitaliy Filippov, 2021-2022 (VNPL-1.1)\n\n"
"USAGE:\n"
" %s [--etcd_address ADDR] [OTHER OPTIONS]\n",
exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
}
return cfg;
}
void nfs_proxy_t::run(json11::Json cfg)
{
bind_address = cfg["bind_address"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
// We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
dir_mod_rev[""] = 0;
dir_ids[""] = 1;
assert(cli->st_cli.on_inode_change_hook == NULL);
cli->st_cli.on_inode_change_hook = [this](inode_t changed_inode, bool removed)
{
if (removed)
{
auto ino_it = hash_by_inode.find(changed_inode);
if (ino_it != hash_by_inode.end())
{
inode_by_hash.erase(ino_it->second);
hash_by_inode.erase(ino_it);
}
// FIXME also calculate dir_mod_rev
}
else
{
auto & inode_cfg = cli->st_cli.inode_config.at(changed_inode);
std::string name = inode_cfg.name;
if (name_prefix != "")
{
if (name.substr(0, name_prefix.size()) != name_prefix)
return;
name = name.substr(name_prefix.size());
}
dir_mod_rev[""] = dir_mod_rev[""] < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_mod_rev[""];
std::string hash = "S"+base64_encode(sha256(name));
int pos = name.find('/');
while (pos >= 0)
{
std::string dir = name.substr(0, pos);
if (dir_ids.find(dir) == dir_ids.end())
dir_ids[dir] = next_dir_id++;
dir_mod_rev[dir] = dir_mod_rev[dir] < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_mod_rev[dir];
dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
int next = name.substr(pos+1).find('/');
pos = next < 0 ? -1 : pos+1+next;
}
auto hbi_it = hash_by_inode.find(changed_inode);
if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
{
// inode had a different name, remove old hash=>inode pointer
inode_by_hash.erase(hbi_it->second);
}
inode_by_hash[hash] = changed_inode;
hash_by_inode[changed_inode] = hash;
}
};
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Create portmap socket
int portmap_socket = create_and_bind_socket(bind_address, 111, 128, NULL);
fcntl(portmap_socket, F_SETFL, fcntl(portmap_socket, F_GETFL, 0) | O_NONBLOCK);
// Create NFS socket
int nfs_socket = create_and_bind_socket(bind_address, 2049, 128, NULL);
fcntl(nfs_socket, F_SETFL, fcntl(nfs_socket, F_GETFL, 0) | O_NONBLOCK);
// Self-register portmap and NFS
pmap.reg_ports.insert((portmap_id_t){
.prog = PMAP_PROGRAM,
.vers = PMAP_V2,
.port = 111,
.owner = "portmapper-service",
.addr = "0.0.0.0.0.111",
});
pmap.reg_ports.insert((portmap_id_t){
.prog = PMAP_PROGRAM,
.vers = PMAP_V3,
.port = 111,
.owner = "portmapper-service",
.addr = "0.0.0.0.0.111",
});
pmap.reg_ports.insert((portmap_id_t){
.prog = NFS_PROGRAM,
.vers = NFS_V3,
.port = 2049,
.owner = "nfs-server",
.addr = "0.0.0.0.0.2049",
});
pmap.reg_ports.insert((portmap_id_t){
.prog = MOUNT_PROGRAM,
.vers = MOUNT_V3,
.port = 2049,
.owner = "rpc.mountd",
.addr = "0.0.0.0.0.2049",
});
// Add FDs to epoll
epmgr->tfd->set_fd_handler(portmap_socket, false, [this](int portmap_socket, int epoll_events)
{
if (epoll_events & EPOLLRDHUP)
{
fprintf(stderr, "Listening portmap socket disconnected, exiting\n");
exit(1);
}
else
{
do_accept(portmap_socket);
}
});
epmgr->tfd->set_fd_handler(nfs_socket, false, [this](int nfs_socket, int epoll_events)
{
if (epoll_events & EPOLLRDHUP)
{
fprintf(stderr, "Listening portmap socket disconnected, exiting\n");
exit(1);
}
else
{
do_accept(nfs_socket);
}
});
if (cfg["foreground"].is_null())
{
daemonize();
}
while (true)
{
ringloop->loop();
ringloop->wait();
}
/*// Sync at the end
cluster_op_t *close_sync = new cluster_op_t;
close_sync->opcode = OSD_OP_SYNC;
close_sync->callback = [&stop](cluster_op_t *op)
{
stop = true;
delete op;
};
cli->execute(close_sync);*/
// Destroy the client
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
void nfs_proxy_t::do_accept(int listen_fd)
{
struct sockaddr_storage addr;
socklen_t addr_size = sizeof(addr);
int nfs_fd = 0;
while ((nfs_fd = accept(listen_fd, (struct sockaddr *)&addr, &addr_size)) >= 0)
{
fprintf(stderr, "New client %d: connection from %s\n", nfs_fd, addr_to_string(addr).c_str());
fcntl(nfs_fd, F_SETFL, fcntl(nfs_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(nfs_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
auto cli = new nfs_client_t();
cli->parent = this;
cli->nfs_fd = nfs_fd;
cli->rpc = rpc_init_server_context(nfs_fd);
if (!cli->rpc)
{
delete cli;
close(nfs_fd);
fprintf(stderr, "Failed to init libnfs server context\n");
exit(1);
}
// Use both portmap and NFS everywhere
rpc_register_service(cli->rpc, PMAP_PROGRAM, PMAP_V2, pmap.pmap2_pt.data(), pmap.pmap2_pt.size());
rpc_register_service(cli->rpc, PMAP_PROGRAM, PMAP_V3, pmap.pmap3_pt.data(), pmap.pmap3_pt.size());
rpc_register_service(cli->rpc, NFS_PROGRAM, NFS_V3, cli->nfs3_pt.data(), cli->nfs3_pt.size());
rpc_register_service(cli->rpc, MOUNT_PROGRAM, MOUNT_V3, cli->nfs3_mount_pt.data(), cli->nfs3_mount_pt.size());
epmgr->tfd->set_fd_handler(nfs_fd, true, [this, cli](int nfs_fd, int epoll_events)
{
// Handle incoming event
if (epoll_events & EPOLLRDHUP)
{
fprintf(stderr, "Client %d disconnected\n", nfs_fd);
epmgr->tfd->set_fd_handler(cli->nfs_fd, true, NULL);
delete cli;
close(nfs_fd);
return;
}
int revents = 0;
if (epoll_events & EPOLLIN)
revents |= POLLIN;
if (epoll_events & EPOLLOUT)
revents |= POLLOUT;
// Let libnfs process the event
if (rpc_service(cli->rpc, revents) < 0)
{
fprintf(stderr, "libnfs error: %s, disconnecting client %d\n", rpc_get_error(cli->rpc), nfs_fd);
epmgr->tfd->set_fd_handler(cli->nfs_fd, true, NULL);
delete cli;
close(nfs_fd);
return;
}
// FIXME Add/remove events based on rpc_which_events(rpc) ?
});
}
if (nfs_fd < 0 && errno != EAGAIN)
{
fprintf(stderr, "Failed to accept connection: %s\n", strerror(errno));
exit(1);
}
}
void nfs_proxy_t::daemonize()
{
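// Classic double fork: detach from the controlling terminal via setsid(), then
// fork again so the daemon can never reacquire one; stdio is redirected to /dev/null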
if (fork())
exit(0);
setsid();
if (fork())
exit(0);
if (chdir("/") != 0)
fprintf(stderr, "Warning: Failed to chdir into /\n");
close(0);
close(1);
close(2);
open("/dev/null", O_RDONLY);
open("/dev/null", O_WRONLY);
open("/dev/null", O_WRONLY);
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
nfs_proxy_t *p = new nfs_proxy_t();
p->run(nfs_proxy_t::parse_args(narg, args));
delete p;
return 0;
}

47
src/nfs_proxy.h Normal file
View File

@@ -0,0 +1,47 @@
#pragma once
#include "cluster_client.h"
#include "epoll_manager.h"
#include "nfs_portmap.h"
#include "nfsc/libnfs-raw.h"
class nfs_proxy_t
{
public:
std::string bind_address;
std::string name_prefix;
int fsid = 1;
portmap_service_t pmap;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
uint64_t next_dir_id = 2;
std::map<std::string, std::string> dir_by_hash;
std::map<std::string, uint64_t> dir_ids;
std::map<std::string, uint64_t> dir_mod_rev;
std::map<inode_t, std::string> hash_by_inode;
std::map<std::string, inode_t> inode_by_hash;
~nfs_proxy_t();
static json11::Json::object parse_args(int narg, const char *args[]);
void run(json11::Json cfg);
void do_accept(int listen_fd);
void daemonize();
};
class nfs_client_t
{
public:
nfs_proxy_t *parent = NULL;
int nfs_fd;
struct rpc_context *rpc = NULL;
std::vector<service_proc> nfs3_pt;
std::vector<service_proc> nfs3_mount_pt;
nfs_client_t();
~nfs_client_t();
};

View File

@@ -189,7 +189,7 @@ void osd_t::report_statistics()
for (auto kv: bs->get_inode_space_stats())
{
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
uint64_t only_inode_num = (kv.first & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
@@ -207,7 +207,7 @@ void osd_t::report_statistics()
for (auto kv: inode_stats)
{
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
uint64_t only_inode_num = (kv.first & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)

View File

@@ -9,7 +9,7 @@
#define POOL_ID_MAX 0x10000
#define POOL_ID_BITS 16
#define INODE_POOL(inode) (pool_id_t)((inode) >> (64 - POOL_ID_BITS))
#define INODE_NO_POOL(inode) (inode_t)(inode & ((1l << (64-POOL_ID_BITS)) - 1))
#define INODE_NO_POOL(inode) (inode_t)(inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1))
#define INODE_WITH_POOL(pool_id, inode) (((inode_t)(pool_id) << (64-POOL_ID_BITS)) | INODE_NO_POOL(inode))
// Pool ID is 16 bits long
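Several of the hunks above and below replace `1l` / `1ul` with `(uint64_t)1`. On 32-bit targets `long` and `unsigned long` are 32 bits wide, so a shift like `1l << (64-POOL_ID_BITS)` (i.e. by 48) overflows the type and the resulting mask is wrong, while the `(uint64_t)1` form is evaluated in 64-bit arithmetic everywhere. A minimal illustration (a hypothetical standalone program, not part of this changeset):
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    // On an ILP32 build, unsigned long is 32 bits, so shifting it by 48 is
    // undefined behaviour and does not produce the intended 48-bit mask.
    uint64_t wrong = (1ul << 48) - 1;
    // The cast makes the shift happen in 64-bit arithmetic on every platform.
    uint64_t right = ((uint64_t)1 << 48) - 1; // 0x0000ffffffffffff
    printf("%016llx %016llx\n", (unsigned long long)wrong, (unsigned long long)right);
    return 0;
}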

View File

@@ -437,7 +437,7 @@ void pg_t::calc_object_states(int log_level)
st.walk();
if (this->state & (PG_DEGRADED|PG_LEFT_ON_DEAD))
{
assert(epoch != ((1ul << PG_EPOCH_BITS)-1));
assert(epoch != (((uint64_t)1 << PG_EPOCH_BITS)-1));
epoch++;
}
}

View File

@@ -194,18 +194,22 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
// Determine version
auto vo_it = pg.ver_override.find(op_data->oid);
op_data->target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
op_data->prev_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE)
{
// PG may be degraded or have misplaced objects
op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
}
if (pg.state == PG_ACTIVE || op_data->scheme == POOL_SCHEME_REPLICATED)
{
// Fast happy-path
cur_op->buf = alloc_read_buffer(op_data->stripes, op_data->pg_data_size, 0);
submit_primary_subops(SUBMIT_RMW_READ, op_data->target_ver, pg.cur_set.data(), cur_op);
submit_primary_subops(SUBMIT_RMW_READ, op_data->target_ver, op_data->prev_set, cur_op);
op_data->st = 1;
}
else
{
// PG may be degraded or have misplaced objects
uint64_t* cur_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
if (extend_missing_stripes(op_data->stripes, cur_set, op_data->pg_data_size, pg.pg_size) < 0)
if (extend_missing_stripes(op_data->stripes, op_data->prev_set, op_data->pg_data_size, pg.pg_size) < 0)
{
finish_op(cur_op, -EIO);
return;
@@ -215,7 +219,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
op_data->scheme = pg.scheme;
op_data->degraded = 1;
cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_size, 0);
submit_primary_subops(SUBMIT_RMW_READ, op_data->target_ver, cur_set, cur_op);
submit_primary_subops(SUBMIT_RMW_READ, op_data->target_ver, op_data->prev_set, cur_op);
op_data->st = 1;
}
}

View File

@@ -400,18 +400,21 @@ int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
stripes[role].read_end = stripes[role].req_end;
}
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
if (pg.state != PG_ACTIVE)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
if (extend_missing_stripes(stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
if (op_data->scheme != POOL_SCHEME_REPLICATED)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, -EIO);
return -1;
if (extend_missing_stripes(stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, -EIO);
return -1;
}
op_data->degraded = 1;
}
op_data->degraded = 1;
}
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
@@ -465,7 +468,7 @@ int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
auto vo_it = pg.ver_override.find(cur_oid);
uint64_t target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
if (pg.state != PG_ACTIVE)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);

View File

@@ -144,9 +144,9 @@ resume_3:
}
else
{
if ((op_data->fact_ver & (1ul<<(64-PG_EPOCH_BITS) - 1)) == (1ul<<(64-PG_EPOCH_BITS) - 1))
if ((op_data->fact_ver & ((uint64_t)1 << (64-PG_EPOCH_BITS) - 1)) == ((uint64_t)1 << (64-PG_EPOCH_BITS) - 1))
{
assert(pg.epoch != ((1ul << PG_EPOCH_BITS)-1));
assert(pg.epoch != (((uint64_t)1 << PG_EPOCH_BITS)-1));
pg.epoch++;
}
op_data->target_ver = op_data->fact_ver + 1;

View File

@@ -262,7 +262,7 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
client->pool = qdict_get_try_int(options, "pool", 0);
if (client->pool)
{
client->inode = (client->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
client->inode = (client->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
}
client->size = qdict_get_try_int(options, "size", 0);
}

158
src/sha256.c Normal file
View File

@@ -0,0 +1,158 @@
/*********************************************************************
* Filename: sha256.c
* Author: Brad Conte (brad AT bradconte.com)
* Copyright:
* Disclaimer: This code is presented "as is" without any guarantees.
* Details: Implementation of the SHA-256 hashing algorithm.
SHA-256 is one of the three algorithms in the SHA2
specification. The others, SHA-384 and SHA-512, are not
offered in this implementation.
Algorithm specification can be found here:
* http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf
This implementation uses little endian byte order.
*********************************************************************/
/*************************** HEADER FILES ***************************/
#include <stdlib.h>
#include <memory.h>
#include "sha256.h"
/****************************** MACROS ******************************/
#define ROTLEFT(a,b) (((a) << (b)) | ((a) >> (32-(b))))
#define ROTRIGHT(a,b) (((a) >> (b)) | ((a) << (32-(b))))
#define CH(x,y,z) (((x) & (y)) ^ (~(x) & (z)))
#define MAJ(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define EP0(x) (ROTRIGHT(x,2) ^ ROTRIGHT(x,13) ^ ROTRIGHT(x,22))
#define EP1(x) (ROTRIGHT(x,6) ^ ROTRIGHT(x,11) ^ ROTRIGHT(x,25))
#define SIG0(x) (ROTRIGHT(x,7) ^ ROTRIGHT(x,18) ^ ((x) >> 3))
#define SIG1(x) (ROTRIGHT(x,17) ^ ROTRIGHT(x,19) ^ ((x) >> 10))
/**************************** VARIABLES *****************************/
static const WORD k[64] = {
0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};
/*********************** FUNCTION DEFINITIONS ***********************/
void sha256_transform(SHA256_CTX *ctx, const BYTE data[])
{
WORD a, b, c, d, e, f, g, h, i, j, t1, t2, m[64];
for (i = 0, j = 0; i < 16; ++i, j += 4)
m[i] = (data[j] << 24) | (data[j + 1] << 16) | (data[j + 2] << 8) | (data[j + 3]);
for ( ; i < 64; ++i)
m[i] = SIG1(m[i - 2]) + m[i - 7] + SIG0(m[i - 15]) + m[i - 16];
a = ctx->state[0];
b = ctx->state[1];
c = ctx->state[2];
d = ctx->state[3];
e = ctx->state[4];
f = ctx->state[5];
g = ctx->state[6];
h = ctx->state[7];
for (i = 0; i < 64; ++i) {
t1 = h + EP1(e) + CH(e,f,g) + k[i] + m[i];
t2 = EP0(a) + MAJ(a,b,c);
h = g;
g = f;
f = e;
e = d + t1;
d = c;
c = b;
b = a;
a = t1 + t2;
}
ctx->state[0] += a;
ctx->state[1] += b;
ctx->state[2] += c;
ctx->state[3] += d;
ctx->state[4] += e;
ctx->state[5] += f;
ctx->state[6] += g;
ctx->state[7] += h;
}
void sha256_init(SHA256_CTX *ctx)
{
ctx->datalen = 0;
ctx->bitlen = 0;
ctx->state[0] = 0x6a09e667;
ctx->state[1] = 0xbb67ae85;
ctx->state[2] = 0x3c6ef372;
ctx->state[3] = 0xa54ff53a;
ctx->state[4] = 0x510e527f;
ctx->state[5] = 0x9b05688c;
ctx->state[6] = 0x1f83d9ab;
ctx->state[7] = 0x5be0cd19;
}
void sha256_update(SHA256_CTX *ctx, const BYTE data[], size_t len)
{
WORD i;
for (i = 0; i < len; ++i) {
ctx->data[ctx->datalen] = data[i];
ctx->datalen++;
if (ctx->datalen == 64) {
sha256_transform(ctx, ctx->data);
ctx->bitlen += 512;
ctx->datalen = 0;
}
}
}
void sha256_final(SHA256_CTX *ctx, BYTE hash[])
{
WORD i;
i = ctx->datalen;
// Pad whatever data is left in the buffer.
if (ctx->datalen < 56) {
ctx->data[i++] = 0x80;
while (i < 56)
ctx->data[i++] = 0x00;
}
else {
ctx->data[i++] = 0x80;
while (i < 64)
ctx->data[i++] = 0x00;
sha256_transform(ctx, ctx->data);
memset(ctx->data, 0, 56);
}
// Append to the padding the total message's length in bits and transform.
ctx->bitlen += ctx->datalen * 8;
ctx->data[63] = ctx->bitlen;
ctx->data[62] = ctx->bitlen >> 8;
ctx->data[61] = ctx->bitlen >> 16;
ctx->data[60] = ctx->bitlen >> 24;
ctx->data[59] = ctx->bitlen >> 32;
ctx->data[58] = ctx->bitlen >> 40;
ctx->data[57] = ctx->bitlen >> 48;
ctx->data[56] = ctx->bitlen >> 56;
sha256_transform(ctx, ctx->data);
// Since this implementation uses little endian byte ordering and SHA uses big endian,
// reverse all the bytes when copying the final state to the output hash.
for (i = 0; i < 4; ++i) {
hash[i] = (ctx->state[0] >> (24 - i * 8)) & 0x000000ff;
hash[i + 4] = (ctx->state[1] >> (24 - i * 8)) & 0x000000ff;
hash[i + 8] = (ctx->state[2] >> (24 - i * 8)) & 0x000000ff;
hash[i + 12] = (ctx->state[3] >> (24 - i * 8)) & 0x000000ff;
hash[i + 16] = (ctx->state[4] >> (24 - i * 8)) & 0x000000ff;
hash[i + 20] = (ctx->state[5] >> (24 - i * 8)) & 0x000000ff;
hash[i + 24] = (ctx->state[6] >> (24 - i * 8)) & 0x000000ff;
hash[i + 28] = (ctx->state[7] >> (24 - i * 8)) & 0x000000ff;
}
}

41
src/sha256.h Normal file
View File

@@ -0,0 +1,41 @@
/*********************************************************************
* Filename: sha256.h
* Author: Brad Conte (brad AT bradconte.com)
* Copyright:
* Disclaimer: This code is presented "as is" without any guarantees.
* Details: Defines the API for the corresponding SHA-256 implementation.
*********************************************************************/
#ifndef SHA256_H
#define SHA256_H
/*************************** HEADER FILES ***************************/
#include <stddef.h>
/****************************** MACROS ******************************/
#define SHA256_BLOCK_SIZE 32 // SHA256 outputs a 32 byte digest
#ifdef __cplusplus
extern "C" {
#endif
/**************************** DATA TYPES ****************************/
typedef unsigned char BYTE; // 8-bit byte
typedef unsigned int WORD; // 32-bit word, change to "long" for 16-bit machines
typedef struct {
BYTE data[64];
WORD datalen;
unsigned long long bitlen;
WORD state[8];
} SHA256_CTX;
/*********************** FUNCTION DECLARATIONS **********************/
void sha256_init(SHA256_CTX *ctx);
void sha256_update(SHA256_CTX *ctx, const BYTE data[], size_t len);
void sha256_final(SHA256_CTX *ctx, BYTE hash[]);
#ifdef __cplusplus
};
#endif
#endif // SHA256_H

View File

@@ -406,7 +406,7 @@ uint64_t crush(uint64_t key, int count, uint64_t *weights)
seed = (key + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (j + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
seed = -log(((double)seed) / (1ul << 32) / (1ul << 32)) * weights[j];
seed = -log(((double)seed) / ((uint64_t)1 << 32) / ((uint64_t)1 << 32)) * weights[j];
if (seed > max)
{
max = seed;
@@ -439,8 +439,8 @@ void crush3(uint64_t key, int count, uint64_t *weights, uint64_t *r, uint64_t to
seed ^= (k2 + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (k3 + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
//seed = ((double)seed) / (1ul << 32) / (1ul << 32) * (weights[k1] + weights[k2] + weights[k3]);
seed = ((double)seed) / (1ul << 32) / (1ul << 32) * (1 -
//seed = ((double)seed) / ((uint64_t)1 << 32) / ((uint64_t)1 << 32) * (weights[k1] + weights[k2] + weights[k3]);
seed = ((double)seed) / ((uint64_t)1 << 32) / ((uint64_t)1 << 32) * (1 -
(1 - 1.0*weights[k1]/total_weight)*
(1 - 1.0*weights[k2]/total_weight)*
(1 - 1.0*weights[k3]/total_weight)

View File

@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor
Description: Vitastor client library
Version: 0.6.15
Version: 0.6.16
Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir}

68
tests/run_7osds.sh Normal file
View File

@@ -0,0 +1,68 @@
#!/bin/bash
. `dirname $0`/common.sh
if [ "$IMMEDIATE_COMMIT" != "" ]; then
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 1"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":5,"immediate_commit":"all"}'
else
NO_SAME="--journal_sector_buffer_count 1024 --log_level 1"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":5}'
fi
OSD_SIZE=1024
OSD_COUNT=7
OSD_ARGS=
for i in $(seq 1 $OSD_COUNT); do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $NO_SAME $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-cli simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) &>./testdata/osd$i.log &
eval OSD${i}_PID=$!
done
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log &
MON_PID=$!
if [ "$EC" != "" ]; then
POOLCFG='"scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1'
PG_SIZE=3
else
POOLCFG='"scheme":"replicated","pg_size":2,"pg_minsize":2'
PG_SIZE=2
fi
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool",'$POOLCFG',"pg_count":32,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(.[0].items["1"] | map((.osd_set | select(. > 0)) | length == '$PG_SIZE') | length) == 32'); then
format_error "FAILED: 32 PGS NOT CONFIGURED"
fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32'); then
format_error "FAILED: 32 PGS NOT UP"
fi
try_reweight()
{
osd=$1
w=$2
$ETCDCTL put /vitastor/config/osd/$osd '{"reweight":'$w'}'
sleep 3
}
wait_finish_rebalance()
{
sec=$1
i=0
while [[ $i -lt $sec ]]; do
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32') && \
break
sleep 1
i=$((i+1))
if [ $i -eq $sec ]; then
format_error "Rebalance couldn't finish in $sec seconds"
fi
done
}

43
tests/run_tests.sh Executable file
View File

@@ -0,0 +1,43 @@
#!/bin/bash -ex
# Run all possible tests
cd $(dirname $0)
./test_add_osd.sh
./test_cas.sh
./test_change_pg_count.sh
EC=1 ./test_change_pg_count.sh
./test_change_pg_size.sh
./test_etcd_fail.sh
./test_failure_domain.sh
./test_interrupted_rebalance.sh
IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
EC=1 ./test_interrupted_rebalance.sh
EC=1 IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
./test_minsize_1.sh
./test_move_reappear.sh
./test_rebalance_verify.sh
IMMEDIATE_COMMIT=1 ./test_rebalance_verify.sh
EC=1 ./test_rebalance_verify.sh
EC=1 IMMEDIATE_COMMIT=1 ./test_rebalance_verify.sh
./test_rm.sh
./test_snapshot.sh
SCHEME=replicated ./test_snapshot.sh
./test_splitbrain.sh
./test_write.sh
SCHEME=replicated ./test_write.sh
./test_write_no_same.sh

View File

@@ -78,7 +78,16 @@ try_change()
fi
# Check that no objects are lost !
nobj=`$ETCDCTL get --prefix '/vitastor/pg/stats' --print-value-only | jq -s '[ .[].object_count ] | reduce .[] as $num (0; .+$num)'`
# But note that reporting this information may take up to <etcd_report_interval+1> seconds
nobj=0
waittime=0
while [[ $nobj -ne $NOBJ && $waittime -lt 7 ]]; do
nobj=`$ETCDCTL get --prefix '/vitastor/pg/stats' --print-value-only | jq -s '[ .[].object_count ] | reduce .[] as $num (0; .+$num)'`
if [[ $nobj -ne $NOBJ ]]; then
waittime=$((waittime+1))
sleep 1
fi
done
if [ "$nobj" -ne $NOBJ ]; then
format_error "Data lost after changing PG count to $n: $NOBJ objects expected, but got $nobj"
fi

View File

@@ -1,41 +1,6 @@
#!/bin/bash -ex
. `dirname $0`/common.sh
if [ "$IMMEDIATE_COMMIT" != "" ]; then
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 1"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":5,"immediate_commit":"all"}'
else
NO_SAME="--journal_sector_buffer_count 1024 --log_level 1"
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":5}'
fi
OSD_SIZE=1024
OSD_COUNT=7
OSD_ARGS=
for i in $(seq 1 $OSD_COUNT); do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-cli simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) &>./testdata/osd$i.log &
eval OSD${i}_PID=$!
done
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":32,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(.[0].items["1"] | map((.osd_set | select(. > 0)) | length == 2) | length) == 32'); then
format_error "FAILED: 32 PGS NOT CONFIGURED"
fi
if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32'); then
format_error "FAILED: 32 PGS NOT UP"
fi
. `dirname $0`/run_7osds.sh
IMG_SIZE=960
@@ -43,14 +8,6 @@ LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=16 -fsync=16 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=${IMG_SIZE}M -cluster_log_level=10
try_reweight()
{
osd=$1
w=$2
$ETCDCTL put /vitastor/config/osd/$osd '{"reweight":'$w'}'
sleep 3
}
try_reweight 1 0
try_reweight 2 0
@@ -72,14 +29,7 @@ try_reweight 4 1
try_reweight 5 1
# Wait for the rebalance to finish
for i in {1..60}; do
($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == 32') && \
break
if [ $i -eq 60 ]; then
format_error "Rebalance couldn't finish in 60 seconds"
fi
sleep 1
done
wait_finish_rebalance 60
# Check that PGs never had degraded objects !
if grep has_degraded ./testdata/mon.log; then

57
tests/test_rebalance_verify.sh Executable file
View File

@@ -0,0 +1,57 @@
#!/bin/bash -ex
. `dirname $0`/run_7osds.sh
IMG_SIZE=256
$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"testimg","size":'$((IMG_SIZE*1024*1024))'}'
NBD_DEV=$(sudo build/src/vitastor-nbd map --etcd_address $ETCD_URL --image testimg --logfile ./testdata/nbd.log &)
trap "sudo build/src/vitastor-nbd unmap $NBD_DEV"'; kill -9 $(jobs -p)' EXIT
sudo chown $(id -u) $NBD_DEV
dd if=/dev/urandom of=./testdata/img1.bin bs=1M count=$IMG_SIZE
dd if=./testdata/img1.bin of=$NBD_DEV bs=1M count=$IMG_SIZE oflag=direct
verify() {
echo "Verifying before rebalance"
dd if=$NBD_DEV of=./testdata/img2.bin bs=1M count=$IMG_SIZE iflag=direct
diff ./testdata/img1.bin ./testdata/img2.bin
$ETCDCTL put /vitastor/config/osd/1 '{"reweight":'$1'}'
$ETCDCTL put /vitastor/config/osd/2 '{"reweight":'$1'}'
$ETCDCTL put /vitastor/config/osd/3 '{"reweight":'$1'}'
sleep 1
for i in {1..10000}; do
O=$(((RANDOM*RANDOM) % (IMG_SIZE*128)))
dd if=$NBD_DEV of=./testdata/img2.bin bs=4k seek=$O skip=$O count=1 iflag=direct conv=notrunc
done
echo "Verifying during rebalance"
diff ./testdata/img1.bin ./testdata/img2.bin
# Wait for the rebalance to finish
wait_finish_rebalance 60
echo "Verifying after rebalance"
dd if=$NBD_DEV of=./testdata/img2.bin bs=1M count=$IMG_SIZE iflag=direct
diff ./testdata/img1.bin ./testdata/img2.bin
}
# Verify with regular reads
verify 0
# Same with chained reads
$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"testimg0","size":'$((IMG_SIZE*1024*1024))'}'
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg","size":'$((IMG_SIZE*1024*1024))',"parent_id":1}'
sleep 1
verify 1
format_green OK