Compare commits

...

19 Commits

Author SHA1 Message Date
385aca9d44 Experiment: zero-copy TCP send 2022-05-11 18:08:45 +03:00
2697aae909 Fix free_down_raw in cli status 2022-05-11 18:08:45 +03:00
6b69db73ac Remove getrandom() usage 2022-05-11 11:25:20 +03:00
d48a824846 Fix some warnings 2022-05-10 12:42:58 +03:00
40985282ff Fix build under GCC 8 2022-05-10 12:26:47 +03:00
acf403e886 Add install target for NFS proxy 2022-05-10 10:43:17 +03:00
cf03b9c84d Implement "primary affinity tags" 2022-05-09 22:37:23 +03:00
7c2379d458 Simplified NFS proxy based on own NFS/XDR implementation 2022-05-07 01:01:20 +03:00
a2189100dd Make CLI functions usable in library form
Return results and errors in a variable instead of just printing them,
separate vitastor-cli main() from cli_tool_t, move positional argument
parsing to CLI main from command implementations.
2022-05-06 02:18:32 +03:00
bb84379db6 Release 0.6.17
- Fix incorrect reading of extra metadata block leading to extra unknown objects in stats
- Fix CSI driver volumeMode: Block support
- Add block PVC and pod examples
- Fix build under 32 bit architectures
- Fix slow connection ramp-up caused by up_wait_retry_interval
2022-05-06 02:18:01 +03:00
714dda8151 Fix slow connection ramp-up caused by up_wait_retry_interval pausing operations on first connection attempt 2022-05-06 02:12:08 +03:00
834554c523 LD_PRELOAD=libasan.so.5 fio in tests fails when vitastor is built with ASan 2022-05-05 02:11:34 +03:00
e718116f54 Fix incorrect reading of extra metadata block 2022-04-21 02:52:21 +03:00
98e3528a14 Add block PVC and pod examples 2022-04-17 15:43:37 +03:00
8e88f77101 Fix CSI driver volumeMode: Block support 2022-04-17 15:39:11 +03:00
caa2cc2e6c Fix 32bit build error 2022-04-16 01:48:24 +03:00
842ba8b831 Use (uint64_t)1 instead of 1l / 1ul 2022-04-16 01:48:14 +03:00
1493823f9e Note about starting monitors 2022-04-12 15:00:28 +03:00
c857272f44 Comment: epoch is uint64_t 2022-04-10 12:21:37 +03:00
87 changed files with 12193 additions and 605 deletions
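
Commit a2189100dd moves the CLI commands from "print and exit" to "return the outcome in a variable". A minimal C++ sketch of that pattern follows, using the `snap-create` argument check from the cli.cpp diff as the example. It is an illustration under assumptions: `cli_result_t` is reduced to the two fields visible in the diff (`err`, `text`), and `parse_snap_create` is a hypothetical helper, not a function from the repository.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

// Reduced stand-in for the result type seen in the diff (assumption:
// only the err and text fields that appear there).
struct cli_result_t
{
    int err;
    std::string text;
};

// Hypothetical helper: split "image@snapshot" into cfg["image"] and
// cfg["snapshot"]. Problems are reported through the returned result
// instead of fprintf(stderr, ...) followed by exit(1).
inline cli_result_t parse_snap_create(const std::string & name,
    std::map<std::string, std::string> & cfg)
{
    size_t pos = name.find('@');
    if (pos == std::string::npos || pos == name.length()-1)
    {
        return cli_result_t{ EINVAL, "Please specify new snapshot name after @" };
    }
    cfg["image"] = name.substr(0, pos);
    cfg["snapshot"] = name.substr(pos + 1);
    return cli_result_t{ 0, "" };
}
```

A caller that links the code as a library can inspect `err` and `text` and decide itself whether to print, retry or abort, which is not possible when the command calls `exit(1)` internally.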

View File

@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8)
project(vitastor)
set(VERSION "0.6.16")
set(VERSION "0.6.17")
add_subdirectory(src)

View File

@@ -52,6 +52,7 @@ Vitastor на данный момент находится в статусе п
- Слияние снапшотов (vitastor-cli {snap-rm,flatten,merge})
- Консольный интерфейс для управления образами (vitastor-cli {ls,create,modify})
- Плагин для Proxmox
- Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)
## Планы развития
@@ -59,7 +60,6 @@ Vitastor на данный момент находится в статусе п
- Другие инструменты администрирования
- Плагины для OpenNebula и других облачных систем
- iSCSI-прокси
- Упрощённый NFS прокси
- Более быстрое переключение при отказах
- Фоновая проверка целостности без контрольных сумм (сверка реплик)
- Контрольные суммы
@@ -407,6 +407,7 @@ Vitastor с однопоточной NBD прокси на том же стен
- На хостах мониторов:
- Пропишите нужные вам значения в файле `/usr/lib/vitastor/mon/make-units.sh`
- Создайте юниты systemd для etcd и мониторов: `/usr/lib/vitastor/mon/make-units.sh`
- Запустите etcd и мониторы: `systemctl start etcd vitastor-mon`
- Пропишите etcd_address и osd_network в `/etc/vitastor/vitastor.conf`. Например:
```
{
@@ -437,7 +438,6 @@ Vitastor с однопоточной NBD прокси на том же стен
диски, используемые на одном из тестовых стендов - Intel D3-S4510 - очень сильно не любят такую
перезапись, и для них была добавлена эта опция. Когда данный режим включён, также нужно поднимать
значение `journal_sector_buffer_count`, так как иначе Vitastor не хватит буферов для записи в журнал.
- Запустите все etcd: `systemctl start etcd`
- Создайте глобальную конфигурацию в etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(если все ваши диски - серверные с конденсаторами).
- Создайте пулы: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
@@ -530,9 +530,48 @@ vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
Для обращения по номеру инода, аналогично другим командам, можно использовать опции
`--pool <POOL> --inode <INODE> --size <SIZE>` вместо `--image testimg`.
### NFS
В Vitastor реализована упрощённая NFS 3.0 прокси для эмуляции файлового доступа к образам.
Это не полноценная файловая система, т.к. метаданные всех файлов (образов) сохраняются
в etcd и всё время хранятся в оперативной памяти - то есть, положить туда много файлов
не получится.
Однако в качестве способа доступа к образам виртуальных машин NFS прокси прекрасно подходит
и позволяет подключить Vitastor, например, к VMWare.
При этом, если вы используете режим immediate_commit=all (для SSD с конденсаторами или HDD
с отключённым кэшем), то NFS-сервер не имеет состояния и вы можете свободно поднять
его в нескольких экземплярах и использовать поверх них сетевой балансировщик нагрузки или
схему с отказоустойчивостью.
Использование vitastor-nfs:
```
vitastor-nfs [--etcd_address ADDR] [ДРУГИЕ ОПЦИИ]
--subdir <DIR> экспортировать "поддиректорию" - образы с префиксом имени <DIR>/ (по умолчанию пусто - экспортировать все образы)
--portmap 0 отключить сервис portmap/rpcbind на порту 111 (по умолчанию включён и требует root привилегий)
--bind <IP> принимать соединения по адресу <IP> (по умолчанию 0.0.0.0 - на всех)
--nfspath <PATH> установить путь NFS-экспорта в <PATH> (по умолчанию /)
--port <PORT> использовать порт <PORT> для NFS-сервисов (по умолчанию 2049)
--pool <POOL> использовать пул <POOL> для новых образов (обязательно, если пул в кластере не один)
--foreground 1 не уходить в фон после запуска
```
Пример монтирования Vitastor через NFS:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
```
```
mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
### Kubernetes
У Vitastor есть CSI-плагин для Kubernetes, поддерживающий RWO-тома.
У Vitastor есть CSI-плагин для Kubernetes, поддерживающий RWO, а также блочные RWX, тома.
Для установки возьмите манифесты из директории [csi/deploy/](csi/deploy/), поместите
вашу конфигурацию подключения к Vitastor в [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),

View File

@@ -46,6 +46,7 @@ breaking changes in the future. However, the following is implemented:
- Snapshot merge tool (vitastor-cli {snap-rm,flatten,merge})
- Image management CLI (vitastor-cli {ls,create,modify})
- Proxmox storage plugin
- Simplified NFS proxy for file-based image access emulation (suitable for VMWare)
## Roadmap
@@ -53,7 +54,6 @@ breaking changes in the future. However, the following is implemented:
- Other administrative tools
- Plugins for OpenNebula and other cloud systems
- iSCSI proxy
- Simplified NFS proxy
- Faster failover
- Scrubbing without checksums (verification of replicas)
- Checksums
@@ -360,6 +360,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
- On the monitor hosts:
- Edit variables at the top of `/usr/lib/vitastor/mon/make-units.sh` to desired values.
- Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
- Start etcd and monitors: `systemctl start etcd vitastor-mon`
- Put etcd_address and osd_network into `/etc/vitastor/vitastor.conf`. Example:
```
{
@@ -478,9 +479,49 @@ It will output the device name, like /dev/nbd0 which you can then format and mou
Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.
### NFS
Vitastor has a simplified NFS 3.0 proxy for file-based image access emulation. It's not
suitable as a full-featured file system, at least because all file/image metadata is stored
in etcd and kept in memory all the time - thus you can't put a lot of files in it.
However, the NFS proxy works well as a method to provide VM image access and lets you
plug Vitastor into, for example, VMWare. Note that for VMWare it's a much
better access method than iSCSI, because with iSCSI we'd have to put all VM images into one
Vitastor image exported as a LUN to VMWare and formatted with VMFS. VMWare doesn't use VMFS
over NFS.
NFS proxy is stateless if you use immediate_commit=all mode (for SSD with capacitors or
HDDs with disabled cache), so you can run multiple NFS proxies and use a network load
balancer or any failover method you want to in that case.
vitastor-nfs usage:
```
vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
--subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)
--portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)
--bind <IP> bind service to <IP> address (default 0.0.0.0)
--nfspath <PATH> set NFS export path to <PATH> (default is /)
--port <PORT> use port <PORT> for NFS services (default is 2049)
--pool <POOL> use <POOL> as default pool for new files (images)
--foreground 1 stay in foreground, do not daemonize
```
Example start and mount commands:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
```
```
mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
### Kubernetes
Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
Vitastor has a CSI plugin for Kubernetes which supports RWO (and block RWX) volumes.
To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),

View File

@@ -1,4 +1,4 @@
VERSION ?= v0.6.16
VERSION ?= v0.6.17
all: build push

View File

@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.6.16
image: vitalif/vitastor-csi:v0.6.17
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -116,7 +116,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.6.16
image: vitalif/vitastor-csi:v0.6.17
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -0,0 +1,13 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc-block
spec:
storageClassName: vitastor
volumeMode: Block
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi

View File

@@ -0,0 +1,17 @@
apiVersion: v1
kind: Pod
metadata:
name: vitastor-test-block-pvc
namespace: default
spec:
containers:
- name: vitastor-test-block-pvc
image: nginx
volumeDevices:
- name: data
devicePath: /dev/xvda
volumes:
- name: data
persistentVolumeClaim:
claimName: test-vitastor-pvc-block
readOnly: false

View File

@@ -0,0 +1,17 @@
apiVersion: v1
kind: Pod
metadata:
name: vitastor-test-nginx
namespace: default
spec:
containers:
- name: vitastor-test-nginx
image: nginx
volumeMounts:
- mountPath: /usr/share/nginx/html/s3
name: data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-vitastor-pvc
readOnly: false

View File

@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.6.16"
vitastorCSIDriverVersion = "0.6.17"
)
// Config struct fills the parameters of request or user input

View File

@@ -67,29 +67,44 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
isBlock := req.GetVolumeCapability().GetBlock() != nil
// Check that it's not already mounted
free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
_, error := mount.IsNotMountPoint(ns.mounter, targetPath)
if (error != nil)
{
if (os.IsNotExist(error))
{
error := os.MkdirAll(targetPath, 0777)
if (error != nil)
if (isBlock)
{
return nil, status.Error(codes.Internal, error.Error())
pathFile, err := os.OpenFile(targetPath, os.O_CREATE|os.O_RDWR, 0o600)
if (err != nil)
{
klog.Errorf("failed to create block device mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
err = pathFile.Close()
if (err != nil)
{
klog.Errorf("failed to close %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
}
else
{
err := os.MkdirAll(targetPath, 0777)
if (err != nil)
{
klog.Errorf("failed to create fs mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
}
}
free = true
}
else
{
return nil, status.Error(codes.Internal, error.Error())
}
}
if (!free)
{
return &csi.NodePublishVolumeResponse{}, nil
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
@@ -149,7 +164,6 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
// Format the device (ext4 or xfs)
fsType := req.GetVolumeCapability().GetMount().GetFsType()
isBlock := req.GetVolumeCapability().GetBlock() != nil
opt := req.GetVolumeCapability().GetMount().GetMountFlags()
opt = append(opt, "_netdev")
if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||

debian/changelog
View File

@@ -1,4 +1,4 @@
vitastor (0.6.16-1) unstable; urgency=medium
vitastor (0.6.17-1) unstable; urgency=medium
* RDMA support
* Bugfixes

View File

@@ -2,5 +2,6 @@ usr/bin/vita
usr/bin/vitastor-cli
usr/bin/vitastor-rm
usr/bin/vitastor-nbd
usr/bin/vitastor-nfs
usr/lib/*/libvitastor*.so*
mon/make-osd.sh /usr/lib/vitastor

View File

@@ -33,8 +33,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.6.16; \
cd vitastor-0.6.16; \
cp -r /root/vitastor vitastor-0.6.17; \
cd vitastor-0.6.17; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -47,8 +47,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.16.orig.tar.xz vitastor-0.6.16; \
cd vitastor-0.6.16; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.17.orig.tar.xz vitastor-0.6.17; \
cd vitastor-0.6.17; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -30,6 +30,18 @@
будут использоваться обычные синхронные системные вызовы send/recv. Для OSD
это бессмысленно, так как OSD в любом случае нуждается в io_uring, но, в
принципе, это может применяться для клиентов со старыми версиями ядра.
- name: use_zerocopy_send
type: bool
default: false
info: |
If true, OSDs and clients will attempt to use TCP zero-copy send
(MSG_ZEROCOPY) for big buffers. It's recommended to raise net.ipv4.tcp_wmem
and net.core.wmem_max sysctls when using this mode.
info_ru: |
Если установлено в true, то OSD и клиенты будут стараться использовать
TCP-отправку без копирования (MSG_ZEROCOPY) для больших буферов данных.
Рекомендуется поднять значения sysctl net.ipv4.tcp_wmem и net.core.wmem_max
при использовании этого режима.
- name: use_rdma
type: bool
default: true
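
For context on `use_zerocopy_send`, here is a small C++ sketch of how a Linux program typically opts into `MSG_ZEROCOPY`. It is not taken from the Vitastor sources: the helper names and the size threshold are hypothetical, and the fallback macro values are the asm-generic Linux ones.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Fallbacks for older libc headers (asm-generic Linux UAPI values).
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

// Hypothetical threshold: for small buffers a plain copy is assumed to be
// cheaper than page pinning plus completion handling.
static const size_t ZEROCOPY_MIN_SIZE = 32*1024;

// Ask the kernel to allow MSG_ZEROCOPY on this socket.
// Returns false when the option is unsupported or fd is not a valid socket.
inline bool enable_zerocopy(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
}

// Choose send()/sendmsg() flags for a buffer of the given length.
inline int send_flags(bool zerocopy_enabled, size_t len)
{
    int flags = MSG_NOSIGNAL;
    if (zerocopy_enabled && len >= ZEROCOPY_MIN_SIZE)
        flags |= MSG_ZEROCOPY;
    return flags;
}
```

With `MSG_ZEROCOPY` the kernel references the user buffer instead of copying it, so the sender must leave the buffer untouched until the completion notification arrives on the socket error queue (`recvmsg()` with `MSG_ERRQUEUE`).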

View File

@@ -64,6 +64,7 @@ const etcd_tree = {
// client and osd
tcp_header_buffer_size: 65536,
use_sync_send_recv: false,
use_zerocopy_send: false,
use_rdma: true,
rdma_device: null, // for example, "rocep5s0f0"
rdma_port_num: 1,
@@ -160,6 +161,8 @@ const etcd_tree = {
root_node?: 'rack1',
// restrict pool to OSDs having all of these tags
osd_tags?: 'nvme' | [ 'nvme', ... ],
// prefer to put primary on OSD with these tags
primary_affinity_tags?: 'nvme' | [ 'nvme', ... ],
},
...
}, */
@@ -224,15 +227,19 @@ const etcd_tree = {
}, */
},
inodestats: {
/* <inode_t>: {
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
/* <pool_id>: {
<inode_t>: {
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
},
}, */
},
space: {
/* <osd_num_t>: {
<inode_t>: uint64_t, // bytes
<pool_id>: {
<inode_t>: uint64_t, // bytes
},
}, */
},
},
@@ -272,7 +279,7 @@ const etcd_tree = {
<pg_id>: {
osd_sets: osd_num_t[][],
all_peers: osd_num_t[],
epoch: uint32_t,
epoch: uint64_t,
},
}, */
},
@@ -899,27 +906,39 @@ class Mon
return this.seed + 2147483648;
}
pick_primary(pool_id, osd_set, up_osds)
pick_primary(pool_id, osd_set, up_osds, aff_osds)
{
let alive_set;
if (this.state.config.pools[pool_id].scheme === 'replicated')
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
{
// Prefer "affinity" OSDs
alive_set = osd_set.filter(osd_num => osd_num && aff_osds[osd_num]);
if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
const pg_data_size = (this.state.config.pools[pool_id].pg_size||0) -
(this.state.config.pools[pool_id].parity_chunks||0);
alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && up_osds[osd_num]);
alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && aff_osds[osd_num]);
if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
alive_set = osd_set.filter(osd_num => osd_num && aff_osds[osd_num]);
if (!alive_set.length)
{
alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
}
if (!alive_set.length)
return 0;
return alive_set[this.rng() % alive_set.length];
}
save_new_pgs_txn(request, pool_id, up_osds, prev_pgs, new_pgs, pg_history)
save_new_pgs_txn(request, pool_id, up_osds, osd_tree, prev_pgs, new_pgs, pg_history)
{
const aff_osds = this.get_affinity_osds(this.state.config.pools[pool_id], up_osds, osd_tree);
const pg_items = {};
this.reset_rng();
new_pgs.map((osd_set, i) =>
@@ -927,7 +946,7 @@ class Mon
osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
pg_items[i+1] = {
osd_set,
primary: this.pick_primary(pool_id, osd_set, up_osds),
primary: this.pick_primary(pool_id, osd_set, up_osds, aff_osds),
};
if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' ') &&
prev_pgs[i].filter(osd_num => osd_num).length > 0)
@@ -1058,6 +1077,13 @@ class Mon
console.log('Pool '+pool_id+' has invalid osd_tags (must be a string or array of strings)');
return false;
}
if (pool_cfg.primary_affinity_tags && typeof(pool_cfg.primary_affinity_tags) != 'string' &&
(!(pool_cfg.primary_affinity_tags instanceof Array) || pool_cfg.primary_affinity_tags.filter(t => typeof t != 'string').length > 0))
{
if (warn)
console.log('Pool '+pool_id+' has invalid primary_affinity_tags (must be a string or array of strings)');
return false;
}
return true;
}
@@ -1087,6 +1113,17 @@ class Mon
}
}
get_affinity_osds(pool_cfg, up_osds, osd_tree)
{
let aff_osds = up_osds;
if (pool_cfg.primary_affinity_tags)
{
aff_osds = { ...up_osds };
this.filter_osds_by_tags(osd_tree, { x: aff_osds }, pool_cfg.primary_affinity_tags);
}
return aff_osds;
}
async recheck_pgs()
{
// Take configuration and state, check it against the stored configuration hash
@@ -1117,7 +1154,7 @@ class Mon
{
prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
}
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, prev_pgs, [], []);
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
}
}
for (const pool_id in this.state.config.pools)
@@ -1224,7 +1261,7 @@ class Mon
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history);
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
}
this.state.config.pgs.hash = tree_hash;
await this.save_pg_config(etcd_request);
@@ -1241,13 +1278,14 @@ class Mon
continue;
}
const replicated = pool_cfg.scheme === 'replicated';
const aff_osds = this.get_affinity_osds(pool_cfg, up_osds, osd_tree);
this.reset_rng();
for (let pg_num = 1; pg_num <= pool_cfg.pg_count; pg_num++)
{
const pg_cfg = this.state.config.pgs.items[pool_id][pg_num];
if (pg_cfg)
{
const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds);
const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds, aff_osds);
if (pg_cfg.primary != new_primary)
{
console.log(
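
The `pick_primary` change in this file can be read as a fallback chain. The C++ sketch below mirrors that ordering as an illustration only: the monitor itself is JavaScript, `primary_candidates` and `filter_osds` are hypothetical names, and the OSD maps are modelled as plain sets. The final step, where the monitor picks one entry from the list with its seeded RNG, is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

typedef uint64_t osd_num_t;

// Keep the non-zero OSDs from positions [0, limit) of osd_set that are in `allowed`.
static std::vector<osd_num_t> filter_osds(const std::vector<osd_num_t> & osd_set,
    size_t limit, const std::set<osd_num_t> & allowed)
{
    std::vector<osd_num_t> r;
    for (size_t i = 0; i < osd_set.size() && i < limit; i++)
        if (osd_set[i] && allowed.count(osd_set[i]))
            r.push_back(osd_set[i]);
    return r;
}

// Candidate primaries in preference order:
// replicated: tagged ("affinity") OSDs, then any up OSD;
// EC: tagged data OSDs, tagged OSDs, up data OSDs, any up OSD.
inline std::vector<osd_num_t> primary_candidates(const std::vector<osd_num_t> & osd_set,
    const std::set<osd_num_t> & up_osds, const std::set<osd_num_t> & aff_osds,
    bool replicated, size_t pg_data_size)
{
    const size_t all = osd_set.size();
    std::vector<osd_num_t> alive;
    if (replicated)
    {
        alive = filter_osds(osd_set, all, aff_osds);
        if (alive.empty())
            alive = filter_osds(osd_set, all, up_osds);
        return alive;
    }
    alive = filter_osds(osd_set, pg_data_size, aff_osds);
    if (alive.empty())
        alive = filter_osds(osd_set, all, aff_osds);
    if (alive.empty())
        alive = filter_osds(osd_set, pg_data_size, up_osds);
    if (alive.empty())
        alive = filter_osds(osd_set, all, up_osds);
    return alive;
}
```

An empty result corresponds to the `return 0` branch in the diff: no OSD of the PG is up, so no primary is assigned.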

View File

@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.6.16'
VERSION = '0.6.17'
LOG = logging.getLogger(__name__)

View File

@@ -25,4 +25,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.6.16/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.16$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.6.17/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.17$(rpm --eval '%dist').tar.gz *

View File

@@ -34,7 +34,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.6.16.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.17.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.6.16
Version: 0.6.17
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.6.16.el7.tar.gz
Source0: vitastor-0.6.17.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -119,6 +119,7 @@ cp -r mon %buildroot/usr/lib/vitastor
%files -n vitastor-client
%_bindir/vitastor-nbd
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vita

View File

@@ -33,7 +33,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.6.16.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.17.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.6.16
Version: 0.6.17
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.6.16.el8.tar.gz
Source0: vitastor-0.6.17.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -116,6 +116,7 @@ cp -r mon %buildroot/usr/lib/vitastor
%files -n vitastor-client
%_bindir/vitastor-nbd
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vita

View File

@@ -15,7 +15,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6.16")
add_definitions(-DVERSION="0.6.17")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@@ -124,6 +124,18 @@ add_library(vitastor_client SHARED
cluster_client.cpp
cluster_client_list.cpp
vitastor_c.cpp
cli_common.cpp
cli_alloc_osd.cpp
cli_simple_offsets.cpp
cli_status.cpp
cli_df.cpp
cli_ls.cpp
cli_create.cpp
cli_modify.cpp
cli_flatten.cpp
cli_merge.cpp
cli_rm_data.cpp
cli_rm.cpp
)
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client
@@ -152,10 +164,24 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
# vitastor-nfs
add_executable(vitastor-nfs
nfs_proxy.cpp
nfs_conn.cpp
nfs_portmap.cpp
sha256.c
nfs/xdr_impl.cpp
nfs/rpc_xdr.cpp
nfs/portmap_xdr.cpp
nfs/nfs_xdr.cpp
)
target_link_libraries(vitastor-nfs
vitastor_client
)
# vitastor-cli
add_executable(vitastor-cli
cli.cpp cli_alloc_osd.cpp cli_simple_offsets.cpp cli_status.cpp cli_df.cpp
cli_ls.cpp cli_create.cpp cli_modify.cpp cli_flatten.cpp cli_merge.cpp cli_rm_data.cpp cli_rm.cpp
cli.cpp
)
target_link_libraries(vitastor-cli
vitastor_client
@@ -244,7 +270,7 @@ target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mo
### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-nfs vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-rm)
install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vita)
install(

View File

@@ -25,7 +25,7 @@ allocator::allocator(uint64_t blocks)
size = free = blocks;
last_one_mask = (blocks % 64) == 0
? UINT64_MAX
: ((1l << (blocks % 64)) - 1);
: (((uint64_t)1 << (blocks % 64)) - 1);
for (uint64_t i = 0; i < total; i++)
{
mask[i] = 0;
@@ -79,7 +79,7 @@ void allocator::set(uint64_t addr, bool value)
}
if (value)
{
mask[last] = mask[last] | (1l << bit);
mask[last] = mask[last] | ((uint64_t)1 << bit);
if (mask[last] != (!is_last || cur_addr/64 < size/64
? UINT64_MAX : last_one_mask))
{
@@ -88,7 +88,7 @@ void allocator::set(uint64_t addr, bool value)
}
else
{
mask[last] = mask[last] & ~(1l << bit);
mask[last] = mask[last] & ~((uint64_t)1 << bit);
}
is_last = false;
if (p2 > 1)
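
Why the cast matters: on 32-bit targets `long` is typically 32 bits wide, so `1l << n` is undefined for `n >= 32`, while `(uint64_t)1 << n` is well defined for every `n` below 64. A minimal sketch of the last-word mask computation from the constructor, with `last_word_mask` as a hypothetical free-function name:

```cpp
#include <cassert>
#include <cstdint>

// Mask of the valid bits in the last 64-bit word of a bitmap tracking `blocks` blocks.
// The shift operand is forced to 64 bits so the result does not depend on sizeof(long).
inline uint64_t last_word_mask(uint64_t blocks)
{
    return (blocks % 64) == 0
        ? UINT64_MAX
        : (((uint64_t)1 << (blocks % 64)) - 1);
}
```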

View File

@@ -131,6 +131,7 @@ resume_1:
}
// Skip superblock
bs->meta_offset += bs->meta_block_size;
bs->meta_len -= bs->meta_block_size;
prev_done = 0;
done_len = 0;
done_pos = 0;
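
This one-line change appears to correspond to the "incorrect reading of extra metadata block" fix from the commit list: once the start offset moves past the superblock, the remaining length has to shrink by the same amount, otherwise the reader goes one block past the end of the area. A reduced C++ sketch, where `meta_area_t` and both helpers are hypothetical names rather than the blockstore types:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced view of the metadata area being read.
struct meta_area_t
{
    uint64_t offset; // where the next read starts
    uint64_t len;    // bytes left to read from offset
};

// Skip the superblock: advance the start and shrink the remaining length together.
inline void skip_superblock(meta_area_t & m, uint64_t block_size)
{
    m.offset += block_size;
    m.len -= block_size;
}

// Number of whole metadata blocks still to be read.
inline uint64_t blocks_to_read(const meta_area_t & m, uint64_t block_size)
{
    return m.len / block_size;
}
```

With a 4-block area, skipping the superblock leaves 3 blocks to read and the end position (`offset + len`) stays where it was. Without the length adjustment the end position would move one block further.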

View File

@@ -2,8 +2,7 @@
// License: VNPL-1.1 (see README.md for details)
/**
* CLI tool
* Currently can (a) remove inodes and (b) merge snapshot/clone layers
* CLI tool and also a library for administrative tasks
*/
#include <vector>
@@ -17,7 +16,9 @@
static const char *exe_name = NULL;
json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
static void help();
static json11::Json::object parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
json11::Json::array cmd;
@@ -79,7 +80,7 @@ json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
return cfg;
}
void cli_tool_t::help()
static void help()
{
printf(
"Vitastor command-line tool\n"
@@ -164,224 +165,171 @@ void cli_tool_t::help()
exit(0);
}
void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
{
auto cur_cfg_it = cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
exit(1);
}
inode_config_t new_cfg = cur_cfg_it->second;
std::string cur_name = new_cfg.name;
std::string cur_cfg_key = base64_encode(cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(cur))+
"/"+std::to_string(INODE_NO_POOL(cur)));
new_cfg.parent_id = new_parent;
json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
waiting++;
cli->st_cli.etcd_txn_slow(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", cur_cfg_key },
{ "result", "LESS" },
{ "mod_revision", new_cfg.mod_revision+1 },
},
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", cur_cfg_key },
{ "value", base64_encode(json11::Json(cur_cfg_json).dump()) },
} }
},
} },
}, [this, new_parent, cur, cur_name](std::string err, json11::Json res)
{
if (err != "")
{
fprintf(stderr, "Error changing parent of %s: %s\n", cur_name.c_str(), err.c_str());
exit(1);
}
if (!res["succeeded"].bool_value())
{
fprintf(stderr, "Inode %s was modified during snapshot deletion\n", cur_name.c_str());
exit(1);
}
if (new_parent)
{
auto new_parent_it = cli->st_cli.inode_config.find(new_parent);
std::string new_parent_name = new_parent_it != cli->st_cli.inode_config.end()
? new_parent_it->second.name : "<unknown>";
printf(
"Parent of layer %s (inode %lu in pool %u) changed to %s (inode %lu in pool %u)\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur),
new_parent_name.c_str(), INODE_NO_POOL(new_parent), INODE_POOL(new_parent)
);
}
else
{
printf(
"Parent of layer %s (inode %lu in pool %u) detached\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur)
);
}
waiting--;
ringloop->wakeup();
});
}
void cli_tool_t::etcd_txn(json11::Json txn)
{
waiting++;
cli->st_cli.etcd_txn_slow(txn, [this](std::string err, json11::Json res)
{
waiting--;
if (err != "")
{
fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
exit(1);
}
etcd_result = res;
ringloop->wakeup();
});
}
inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
{
for (auto & ic: cli->st_cli.inode_config)
{
if (ic.second.name == name)
{
return &ic.second;
}
}
fprintf(stderr, "Layer %s not found\n", name.c_str());
exit(1);
}
void cli_tool_t::run(json11::Json cfg)
static int run(cli_tool_t *p, json11::Json::object cfg)
{
cli_result_t result;
p->parse_config(cfg);
json11::Json::array cmd = cfg["command"].array_items();
cfg.erase("command");
std::function<bool(cli_result_t &)> action_cb;
if (!cmd.size())
{
fprintf(stderr, "command is missing\n");
exit(1);
result = { .err = EINVAL, .text = "command is missing" };
}
else if (cmd[0] == "status")
{
// Show cluster status
action_cb = start_status(cfg);
action_cb = p->start_status(cfg);
}
else if (cmd[0] == "df")
{
// Show pool space stats
action_cb = start_df(cfg);
action_cb = p->start_df(cfg);
}
else if (cmd[0] == "ls")
{
// List images
action_cb = start_ls(cfg);
if (cmd.size() > 1)
{
cmd.erase(cmd.begin(), cmd.begin()+1);
cfg["names"] = cmd;
}
action_cb = p->start_ls(cfg);
}
else if (cmd[0] == "create" || cmd[0] == "snap-create")
else if (cmd[0] == "snap-create")
{
// Create snapshot
std::string name = cmd.size() > 1 ? cmd[1].string_value() : "";
int pos = name.find('@');
if (pos == std::string::npos || pos == name.length()-1)
{
result = (cli_result_t){ .err = EINVAL, .text = "Please specify new snapshot name after @" };
}
else
{
cfg["image"] = name.substr(0, pos);
cfg["snapshot"] = name.substr(pos + 1);
action_cb = p->start_create(cfg);
}
}
else if (cmd[0] == "create")
{
// Create image/snapshot
action_cb = start_create(cfg);
if (cmd.size() > 1)
{
cfg["image"] = cmd[1];
}
action_cb = p->start_create(cfg);
}
else if (cmd[0] == "modify")
{
// Modify image
action_cb = start_modify(cfg);
if (cmd.size() > 1)
{
cfg["image"] = cmd[1];
}
action_cb = p->start_modify(cfg);
}
else if (cmd[0] == "rm-data")
{
// Delete inode data
action_cb = start_rm(cfg);
action_cb = p->start_rm_data(cfg);
}
else if (cmd[0] == "merge-data")
{
// Merge layer data without affecting metadata
action_cb = start_merge(cfg);
if (cmd.size() > 1)
{
cfg["from"] = cmd[1];
if (cmd.size() > 2)
cfg["to"] = cmd[2];
}
action_cb = p->start_merge(cfg);
}
else if (cmd[0] == "flatten")
{
// Flatten a layer, i.e. merge data and detach it from parents
action_cb = start_flatten(cfg);
if (cmd.size() > 1)
{
cfg["image"] = cmd[1];
}
action_cb = p->start_flatten(cfg);
}
else if (cmd[0] == "rm")
{
// Remove multiple snapshots and rebase their children
action_cb = start_snap_rm(cfg);
if (cmd.size() > 1)
{
cfg["from"] = cmd[1];
if (cmd.size() > 2)
cfg["to"] = cmd[2];
}
action_cb = p->start_rm(cfg);
}
else if (cmd[0] == "alloc-osd")
{
// Allocate a new OSD number
action_cb = start_alloc_osd(cfg);
action_cb = p->start_alloc_osd(cfg);
}
else if (cmd[0] == "simple-offsets")
{
// Calculate offsets for simple & stupid OSD deployment without superblock
action_cb = simple_offsets(cfg);
if (cmd.size() > 1)
{
cfg["device"] = cmd[1];
}
action_cb = p->simple_offsets(cfg);
}
else
{
fprintf(stderr, "unknown command: %s\n", cmd[0].string_value().c_str());
exit(1);
result = { .err = EINVAL, .text = "unknown command: "+cmd[0].string_value() };
}
if (action_cb == NULL)
if (action_cb != NULL)
{
return;
}
color = !cfg["no-color"].bool_value();
json_output = cfg["json"].bool_value();
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
iodepth = 32;
parallel_osds = cfg["parallel_osds"].uint64_value();
if (!parallel_osds)
parallel_osds = 4;
log_level = cfg["log_level"].int64_value();
progress = cfg["progress"].uint64_value() ? true : false;
list_first = cfg["wait-list"].uint64_value() ? true : false;
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
// Smaller timeout by default for more interactiveness
cli->st_cli.etcd_slow_timeout = cli->st_cli.etcd_quick_timeout;
cli->on_ready([this]()
{
// Initialize job
consumer.loop = [this]()
// Create client
json11::Json cfg_j = cfg;
p->ringloop = new ring_loop_t(512);
p->epmgr = new epoll_manager_t(p->ringloop);
p->cli = new cluster_client_t(p->ringloop, p->epmgr->tfd, cfg_j);
// Smaller timeout by default for more interactiveness
p->cli->st_cli.etcd_slow_timeout = p->cli->st_cli.etcd_quick_timeout;
p->loop_and_wait(action_cb, [&](const cli_result_t & r)
{
result = r;
action_cb = NULL;
});
// Loop until it completes
while (action_cb != NULL)
{
p->ringloop->loop();
if (action_cb != NULL)
{
bool done = action_cb();
if (done)
{
action_cb = NULL;
}
}
ringloop->submit();
};
ringloop->register_consumer(&consumer);
consumer.loop();
});
// Loop until it completes
while (action_cb != NULL)
{
ringloop->loop();
if (action_cb != NULL)
ringloop->wait();
p->ringloop->wait();
}
// Destroy the client
delete p->cli;
delete p->epmgr;
delete p->ringloop;
p->cli = NULL;
p->epmgr = NULL;
p->ringloop = NULL;
}
// Destroy the client
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
// Print result
if (p->json_output && !result.data.is_null())
{
printf("%s\n", result.data.dump().c_str());
}
else if (p->json_output && result.err)
{
printf("%s\n", json11::Json(json11::Json::object {
{ "error_code", result.err },
{ "error_text", result.text },
}).dump().c_str());
}
else if (result.text != "")
{
fprintf(result.err ? stderr : stdout, result.text[result.text.size()-1] == '\n' ? "%s" : "%s\n", result.text.c_str());
}
return result.err;
}
int main(int narg, const char *args[])
@@ -390,7 +338,7 @@ int main(int narg, const char *args[])
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
cli_tool_t *p = new cli_tool_t();
p->run(cli_tool_t::parse_args(narg, args));
int r = run(p, parse_args(narg, args));
delete p;
return 0;
return r;
}
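
The `snap-create` branch above splits its positional argument into an image name and a snapshot name at `@`. A minimal sketch of that split, assuming a hypothetical `split_result_t` type and `split_snapshot_name()` helper (neither exists in the diff), and using `std::size_t` for the position so that the comparison with `std::string::npos` is exact:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not the cli_result_t from the diff)
struct split_result_t
{
    int err = 0;
    std::string image, snapshot;
};

// Split "image@snapshot" at the first '@'.
// Reports EINVAL when '@' is absent or nothing follows it.
split_result_t split_snapshot_name(const std::string & name)
{
    split_result_t r;
    std::size_t pos = name.find('@');
    if (pos == std::string::npos || pos + 1 == name.size())
    {
        r.err = EINVAL;
        return r;
    }
    r.image = name.substr(0, pos);
    r.snapshot = name.substr(pos + 1);
    return r;
}
```

As in the diff, an empty image part (for example `@snap`) is not rejected here; the "Image name is missing" check in `image_creator_t` would catch it later.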


@@ -19,11 +19,18 @@ class epoll_manager_t;
class cluster_client_t;
struct inode_config_t;
struct cli_result_t
{
int err;
std::string text;
json11::Json data;
};
class cli_tool_t
{
public:
uint64_t iodepth = 0, parallel_osds = 0;
bool progress = true;
uint64_t iodepth = 4, parallel_osds = 32;
bool progress = false;
bool list_first = false;
bool json_output = false;
int log_level = 0;
@@ -34,34 +41,33 @@ public:
cluster_client_t *cli = NULL;
int waiting = 0;
cli_result_t etcd_err;
json11::Json etcd_result;
ring_consumer_t consumer;
std::function<bool(void)> action_cb;
void run(json11::Json cfg);
void parse_config(json11::Json cfg);
void change_parent(inode_t cur, inode_t new_parent);
void change_parent(inode_t cur, inode_t new_parent, cli_result_t *result);
inode_config_t* get_inode_cfg(const std::string & name);
static json11::Json::object parse_args(int narg, const char *args[]);
static void help();
friend struct rm_inode_t;
friend struct snap_merger_t;
friend struct snap_flattener_t;
friend struct snap_remover_t;
std::function<bool(void)> start_status(json11::Json cfg);
std::function<bool(void)> start_df(json11::Json);
std::function<bool(void)> start_ls(json11::Json);
std::function<bool(void)> start_create(json11::Json);
std::function<bool(void)> start_modify(json11::Json);
std::function<bool(void)> start_rm(json11::Json);
std::function<bool(void)> start_merge(json11::Json);
std::function<bool(void)> start_flatten(json11::Json);
std::function<bool(void)> start_snap_rm(json11::Json);
std::function<bool(void)> start_alloc_osd(json11::Json cfg, uint64_t *out = NULL);
std::function<bool(void)> simple_offsets(json11::Json cfg);
std::function<bool(cli_result_t &)> start_status(json11::Json);
std::function<bool(cli_result_t &)> start_df(json11::Json);
std::function<bool(cli_result_t &)> start_ls(json11::Json);
std::function<bool(cli_result_t &)> start_create(json11::Json);
std::function<bool(cli_result_t &)> start_modify(json11::Json);
std::function<bool(cli_result_t &)> start_rm_data(json11::Json);
std::function<bool(cli_result_t &)> start_merge(json11::Json);
std::function<bool(cli_result_t &)> start_flatten(json11::Json);
std::function<bool(cli_result_t &)> start_rm(json11::Json);
std::function<bool(cli_result_t &)> start_alloc_osd(json11::Json cfg);
std::function<bool(cli_result_t &)> simple_offsets(json11::Json cfg);
// Should be called like loop_and_wait(start_status(), <completion callback>)
void loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std::function<void(const cli_result_t &)> complete_cb);
void etcd_txn(json11::Json txn);
};
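
The new `start_*` signatures return a step function of type `std::function<bool(cli_result_t &)>`: the caller invokes it repeatedly, it returns `true` once the command has finished, and by then it has filled in the result. A simplified sketch of that contract, assuming a cut-down `result_sketch_t` (no `json11` data field) and two hypothetical helpers, `start_countdown()` and `run_to_completion()`, which are not part of the diff:

```cpp
#include <cassert>
#include <functional>
#include <string>

// Simplified stand-in for cli_result_t (the json11 "data" field is omitted)
struct result_sketch_t
{
    int err = 0;
    std::string text;
};

using step_cb_t = std::function<bool(result_sketch_t &)>;

// Hypothetical command that needs `ticks` extra calls before it finishes
step_cb_t start_countdown(int ticks)
{
    return [ticks](result_sketch_t & result) mutable
    {
        if (ticks > 0)
        {
            ticks--;
            return false; // not done yet, call again later
        }
        result.err = 0;
        result.text = "done";
        return true; // done, result is filled in
    };
}

// Call the step function until it reports completion; returns the number of calls
int run_to_completion(step_cb_t cb, result_sketch_t & result)
{
    int calls = 0;
    do
    {
        calls++;
    } while (!cb(result));
    return calls;
}
```

In the real tool the step function is not called in a tight loop like this: `loop_and_wait()` registers it as a ring loop consumer, so it only runs when the event loop wakes up.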


@@ -16,6 +16,7 @@ struct alloc_osd_t
uint64_t new_id = 1;
int state = 0;
cli_result_t result;
bool is_done()
{
@@ -62,6 +63,12 @@ struct alloc_osd_t
state = 1;
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
if (!parent->etcd_result["succeeded"].bool_value())
{
std::vector<osd_num_t> used;
@@ -99,23 +106,23 @@ struct alloc_osd_t
}
} while (!parent->etcd_result["succeeded"].bool_value());
state = 100;
result = (cli_result_t){
.text = std::to_string(new_id),
.data = json11::Json(new_id),
};
}
};
std::function<bool(void)> cli_tool_t::start_alloc_osd(json11::Json cfg, uint64_t *out)
std::function<bool(cli_result_t &)> cli_tool_t::start_alloc_osd(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto alloc_osd = new alloc_osd_t();
alloc_osd->parent = this;
return [alloc_osd, out]()
return [alloc_osd](cli_result_t & result)
{
alloc_osd->loop();
if (alloc_osd->is_done())
{
if (out)
*out = alloc_osd->new_id;
else if (alloc_osd->new_id)
printf("%lu\n", alloc_osd->new_id);
result = alloc_osd->result;
delete alloc_osd;
return true;
}

149
src/cli_common.cpp Normal file

@@ -0,0 +1,149 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "base64.h"
#include "cluster_client.h"
#include "cli.h"
void cli_tool_t::change_parent(inode_t cur, inode_t new_parent, cli_result_t *result)
{
auto cur_cfg_it = cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == cli->st_cli.inode_config.end())
{
char buf[128];
snprintf(buf, 128, "Inode 0x%lx disappeared", cur);
*result = (cli_result_t){ .err = EIO, .text = buf };
return;
}
inode_config_t new_cfg = cur_cfg_it->second;
std::string cur_name = new_cfg.name;
std::string cur_cfg_key = base64_encode(cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(cur))+
"/"+std::to_string(INODE_NO_POOL(cur)));
new_cfg.parent_id = new_parent;
json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
waiting++;
cli->st_cli.etcd_txn_slow(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", cur_cfg_key },
{ "result", "LESS" },
{ "mod_revision", new_cfg.mod_revision+1 },
},
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", cur_cfg_key },
{ "value", base64_encode(json11::Json(cur_cfg_json).dump()) },
} }
},
} },
}, [this, result, new_parent, cur, cur_name](std::string err, json11::Json res)
{
if (err != "")
{
*result = (cli_result_t){ .err = EIO, .text = "Error changing parent of "+cur_name+": "+err };
}
else if (!res["succeeded"].bool_value())
{
*result = (cli_result_t){ .err = EAGAIN, .text = "Image "+cur_name+" was modified during change" };
}
else if (new_parent)
{
auto new_parent_it = cli->st_cli.inode_config.find(new_parent);
std::string new_parent_name = new_parent_it != cli->st_cli.inode_config.end()
? new_parent_it->second.name : "<unknown>";
*result = (cli_result_t){
.text = "Parent of layer "+cur_name+" (inode "+std::to_string(INODE_NO_POOL(cur))+
" in pool "+std::to_string(INODE_POOL(cur))+") changed to "+new_parent_name+
" (inode "+std::to_string(INODE_NO_POOL(new_parent))+" in pool "+std::to_string(INODE_POOL(new_parent))+")",
};
}
else
{
*result = (cli_result_t){
.text = "Parent of layer "+cur_name+" (inode "+std::to_string(INODE_NO_POOL(cur))+
" in pool "+std::to_string(INODE_POOL(cur))+") detached",
};
}
waiting--;
ringloop->wakeup();
});
}
void cli_tool_t::etcd_txn(json11::Json txn)
{
waiting++;
cli->st_cli.etcd_txn_slow(txn, [this](std::string err, json11::Json res)
{
waiting--;
if (err != "")
etcd_err = (cli_result_t){ .err = EIO, .text = "Error communicating with etcd: "+err };
else
etcd_err = (cli_result_t){ .err = 0 };
etcd_result = res;
ringloop->wakeup();
});
}
inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
{
for (auto & ic: cli->st_cli.inode_config)
{
if (ic.second.name == name)
{
return &ic.second;
}
}
return NULL;
}
void cli_tool_t::parse_config(json11::Json cfg)
{
color = !cfg["no-color"].bool_value();
json_output = cfg["json"].bool_value();
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
iodepth = 32;
parallel_osds = cfg["parallel_osds"].uint64_value();
if (!parallel_osds)
parallel_osds = 4;
log_level = cfg["log_level"].int64_value();
progress = cfg["progress"].uint64_value() ? true : false;
list_first = cfg["wait-list"].uint64_value() ? true : false;
}
struct cli_result_looper_t
{
ring_consumer_t consumer;
cli_result_t result;
std::function<bool(cli_result_t &)> loop_cb;
std::function<void(const cli_result_t &)> complete_cb;
};
void cli_tool_t::loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std::function<void(const cli_result_t &)> complete_cb)
{
auto *looper = new cli_result_looper_t();
looper->loop_cb = loop_cb;
looper->complete_cb = complete_cb;
looper->consumer.loop = [this, looper]()
{
bool done = looper->loop_cb(looper->result);
if (done)
{
ringloop->unregister_consumer(&looper->consumer);
looper->loop_cb = NULL;
looper->complete_cb(looper->result);
delete looper;
return;
}
ringloop->submit();
};
cli->on_ready([this, looper]()
{
ringloop->register_consumer(&looper->consumer);
ringloop->wakeup();
});
}
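
`change_parent()` above guards its write with a compare on `mod_revision` (`"result": "LESS"` against `new_cfg.mod_revision+1`), so the put only happens if nobody has modified the key since it was read. The same optimistic check can be sketched with an in-memory store. `kv_sketch_t` and its methods are hypothetical and only imitate the idea; this is not the etcd API:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// In-memory stand-in for a revisioned key-value store (NOT the etcd API)
struct kv_sketch_t
{
    struct entry_t
    {
        std::string value;
        uint64_t mod_revision;
    };
    std::map<std::string, entry_t> data;
    uint64_t revision = 0;

    // Unconditional write; returns the revision assigned to it
    uint64_t put(const std::string & key, const std::string & value)
    {
        revision++;
        entry_t & e = data[key];
        e.value = value;
        e.mod_revision = revision;
        return revision;
    }

    // Write only if the key was not modified after known_revision,
    // i.e. if its mod_revision is LESS than known_revision+1
    bool put_if_unmodified(const std::string & key, uint64_t known_revision, const std::string & value)
    {
        auto it = data.find(key);
        uint64_t cur = (it == data.end() ? 0 : it->second.mod_revision);
        if (!(cur < known_revision + 1))
            return false;
        put(key, value);
        return true;
    }
};
```

A failed check corresponds to the `!res["succeeded"]` branch in the diff, which reports `EAGAIN` ("was modified during change") instead of overwriting the newer value.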


@@ -25,14 +25,18 @@ struct image_creator_t
pool_id_t new_pool_id = 0;
std::string new_pool_name;
std::string image_name, new_snap, new_parent;
json11::Json new_meta;
uint64_t size;
bool force_size = false;
pool_id_t old_pool_id = 0;
inode_t new_parent_id = 0;
inode_t new_id = 0, old_id = 0;
uint64_t max_id_mod_rev = 0, cfg_mod_rev = 0, idx_mod_rev = 0;
inode_config_t new_cfg;
int state = 0;
cli_result_t result;
bool is_done()
{
@@ -43,13 +47,27 @@ struct image_creator_t
{
if (state >= 1)
goto resume_1;
if (image_name == "")
{
// FIXME: EINVAL -> specific codes for every error
result = (cli_result_t){ .err = EINVAL, .text = "Image name is missing" };
state = 100;
return;
}
if (image_name.find('@') != std::string::npos)
{
result = (cli_result_t){ .err = EINVAL, .text = "Image name can't contain @ character" };
state = 100;
return;
}
if (new_pool_id)
{
auto & pools = parent->cli->st_cli.pool_config;
if (pools.find(new_pool_id) == pools.end())
{
fprintf(stderr, "Pool %u does not exist\n", new_pool_id);
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(new_pool_id)+" does not exist" };
state = 100;
return;
}
}
else if (new_pool_name != "")
@@ -64,8 +82,9 @@ struct image_creator_t
}
if (!new_pool_id)
{
fprintf(stderr, "Pool %s does not exist\n", new_pool_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+new_pool_name+" does not exist" };
state = 100;
return;
}
}
else if (parent->cli->st_cli.pool_config.size() == 1)
@@ -91,8 +110,9 @@ struct image_creator_t
{
if (ic.second.name == image_name)
{
fprintf(stderr, "Image %s already exists\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Image "+image_name+" already exists" };
state = 100;
return;
}
if (ic.second.name == new_parent)
{
@@ -109,18 +129,21 @@ struct image_creator_t
}
if (new_parent != "" && !new_parent_id)
{
fprintf(stderr, "Parent image not found\n");
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Parent image "+new_parent+" not found" };
state = 100;
return;
}
if (!new_pool_id)
{
fprintf(stderr, "Pool name or ID is missing\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Pool name or ID is missing" };
state = 100;
return;
}
if (!size)
if (!size && !force_size)
{
fprintf(stderr, "Image size is missing\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Image size is missing" };
state = 100;
return;
}
do
{
@@ -131,23 +154,36 @@ struct image_creator_t
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
extract_next_id(parent->etcd_result["responses"][0]);
attempt_create();
state = 3;
resume_3:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
if (!parent->etcd_result["succeeded"].bool_value() &&
parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
{
fprintf(stderr, "Image %s already exists\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Image "+image_name+" already exists" };
state = 100;
return;
}
} while (!parent->etcd_result["succeeded"].bool_value());
if (parent->progress)
{
printf("Image %s created\n", image_name.c_str());
}
// Save into inode_config for library users to be able to take it from there immediately
new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
parent->cli->st_cli.insert_inode_config(new_cfg);
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" created" };
state = 100;
}
@@ -163,14 +199,16 @@ resume_3:
{
if (ic.second.name == image_name+"@"+new_snap)
{
fprintf(stderr, "Snapshot %s@%s already exists\n", image_name.c_str(), new_snap.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Snapshot "+image_name+"@"+new_snap+" already exists" };
state = 100;
return;
}
}
if (new_parent != "")
{
fprintf(stderr, "--parent can't be used with snapshots\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Parent can't be specified for snapshots" };
state = 100;
return;
}
do
{
@@ -182,8 +220,9 @@ resume_3:
return;
if (!old_id)
{
fprintf(stderr, "Image %s does not exist\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Image "+image_name+" does not exist" };
state = 100;
return;
}
if (!new_pool_id)
{
@@ -195,17 +234,24 @@ resume_3:
resume_4:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
if (!parent->etcd_result["succeeded"].bool_value() &&
parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
{
fprintf(stderr, "Snapshot %s@%s already exists\n", image_name.c_str(), new_snap.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Snapshot "+image_name+"@"+new_snap+" already exists" };
state = 100;
return;
}
} while (!parent->etcd_result["succeeded"].bool_value());
if (parent->progress)
{
printf("Snapshot %s@%s created\n", image_name.c_str(), new_snap.c_str());
}
// Save into inode_config for library users to be able to take it from there immediately
new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
parent->cli->st_cli.insert_inode_config(new_cfg);
result = (cli_result_t){ .err = 0, .text = "Snapshot "+image_name+"@"+new_snap+" created" };
state = 100;
}
@@ -259,6 +305,12 @@ resume_4:
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
extract_next_id(parent->etcd_result["responses"][0]);
old_id = 0;
old_pool_id = 0;
@@ -288,8 +340,9 @@ resume_2:
idx_mod_rev = kv.mod_revision;
if (!old_id || !old_pool_id || old_pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid pool or inode ID in etcd key %s\n", kv.key.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Invalid pool or inode ID in etcd key "+kv.key };
state = 100;
return;
}
}
parent->etcd_txn(json11::Json::object {
@@ -308,6 +361,12 @@ resume_2:
resume_3:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
{
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
size = kv.value["size"].uint64_value();
@@ -324,12 +383,13 @@ resume_3:
void attempt_create()
{
inode_config_t new_cfg = {
new_cfg = {
.num = INODE_WITH_POOL(new_pool_id, new_id),
.name = image_name,
.size = size,
.parent_id = (new_snap != "" ? INODE_WITH_POOL(old_pool_id, old_id) : new_parent_id),
.readonly = false,
.meta = new_meta,
};
json11::Json::array checks = json11::Json::array {
json11::Json::object {
@@ -457,77 +517,76 @@ uint64_t parse_size(std::string size_str)
if (type_char == 'k' || type_char == 'm' || type_char == 'g' || type_char == 't')
{
if (type_char == 'k')
mul = 1l<<10;
mul = (uint64_t)1<<10;
else if (type_char == 'm')
mul = 1l<<20;
mul = (uint64_t)1<<20;
else if (type_char == 'g')
mul = 1l<<30;
mul = (uint64_t)1<<30;
else /*if (type_char == 't')*/
mul = 1l<<40;
mul = (uint64_t)1<<40;
size_str = size_str.substr(0, size_str.length()-1);
}
uint64_t size = json11::Json(size_str).uint64_value() * mul;
if (size == 0 && size_str != "0" && (size_str != "" || mul != 1))
{
fprintf(stderr, "Invalid syntax for size: %s\n", size_str.c_str());
exit(1);
return UINT64_MAX;
}
return size;
}
std::function<bool(void)> cli_tool_t::start_create(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_create(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto image_creator = new image_creator_t();
image_creator->parent = this;
image_creator->image_name = cmd.size() > 1 ? cmd[1].string_value() : "";
image_creator->image_name = cfg["image"].string_value();
image_creator->new_pool_id = cfg["pool"].uint64_value();
image_creator->new_pool_name = cfg["pool"].string_value();
image_creator->force_size = cfg["force_size"].bool_value();
if (cfg["image_meta"].is_object())
{
image_creator->new_meta = cfg["image_meta"];
}
if (cfg["snapshot"].string_value() != "")
{
image_creator->new_snap = cfg["snapshot"].string_value();
}
else if (cmd[0] == "snap-create")
{
int p = image_creator->image_name.find('@');
if (p == std::string::npos || p == image_creator->image_name.length()-1)
{
fprintf(stderr, "Please specify new snapshot name after @\n");
exit(1);
}
image_creator->new_snap = image_creator->image_name.substr(p + 1);
image_creator->image_name = image_creator->image_name.substr(0, p);
}
image_creator->new_parent = cfg["parent"].string_value();
if (cfg["size"].string_value() != "")
{
image_creator->size = parse_size(cfg["size"].string_value());
if (image_creator->size % 4096)
if (image_creator->size == UINT64_MAX)
{
fprintf(stderr, "Size should be a multiple of 4096\n");
exit(1);
delete image_creator;
return [size = cfg["size"].string_value()](cli_result_t & result)
{
result = (cli_result_t){ .err = EINVAL, .text = "Invalid syntax for size: "+size };
return true;
};
}
if ((image_creator->size % 4096) && !cfg["force_size"].bool_value())
{
delete image_creator;
return [](cli_result_t & result)
{
result = (cli_result_t){ .err = EINVAL, .text = "Size should be a multiple of 4096" };
return true;
};
}
if (image_creator->new_snap != "")
{
fprintf(stderr, "--size can't be specified for snapshots\n");
exit(1);
delete image_creator;
return [](cli_result_t & result)
{
result = (cli_result_t){ .err = EINVAL, .text = "Size can't be specified for snapshots" };
return true;
};
}
}
if (image_creator->image_name == "")
{
fprintf(stderr, "Image name is missing\n");
exit(1);
}
if (image_creator->image_name.find('@') != std::string::npos)
{
fprintf(stderr, "Image name can't contain @ character\n");
exit(1);
}
return [image_creator]()
return [image_creator](cli_result_t & result)
{
image_creator->loop();
if (image_creator->is_done())
{
result = image_creator->result;
delete image_creator;
return true;
}
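
With this change `parse_size()` no longer exits on bad input and returns `UINT64_MAX` as an error marker that the caller turns into an `EINVAL` result. A self-contained sketch of that convention follows. `parse_size_sketch()` is a hypothetical reimplementation, not the function from the diff: it parses digits by hand instead of going through `json11`, assumes the suffix is case-insensitive, treats an empty string as an error (the original accepts it as 0), and does not check for overflow.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Parse "<digits>[k|m|g|t]" into bytes; UINT64_MAX signals a syntax error
uint64_t parse_size_sketch(std::string s)
{
    uint64_t mul = 1;
    if (!s.empty())
    {
        char c = (char)std::tolower((unsigned char)s.back());
        if (c == 'k')
            mul = (uint64_t)1 << 10;
        else if (c == 'm')
            mul = (uint64_t)1 << 20;
        else if (c == 'g')
            mul = (uint64_t)1 << 30;
        else if (c == 't')
            mul = (uint64_t)1 << 40;
        if (mul != 1)
            s.pop_back(); // drop the suffix
    }
    if (s.empty())
        return UINT64_MAX;
    uint64_t n = 0;
    for (char ch: s)
    {
        if (ch < '0' || ch > '9')
            return UINT64_MAX;
        n = n * 10 + (uint64_t)(ch - '0');
    }
    return n * mul;
}
```

A caller would compare the return value with `UINT64_MAX` first and only then apply further checks such as the 4096-byte alignment test.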


@@ -12,6 +12,7 @@ struct pool_lister_t
int state = 0;
json11::Json space_info;
cli_result_t result;
std::map<pool_id_t, json11::Json::object> pool_stats;
bool is_done()
@@ -52,6 +53,12 @@ struct pool_lister_t
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
space_info = parent->etcd_result;
std::map<pool_id_t, uint64_t> osd_free;
for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
@@ -124,8 +131,8 @@ resume_1:
{ "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
: "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * (1l<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * (1l<<40)) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "max_available", pool_avail },
{ "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
{ "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
@@ -150,10 +157,12 @@ resume_1:
get_stats();
if (parent->waiting > 0)
return;
if (state == 100)
return;
if (parent->json_output)
{
// JSON output
printf("%s\n", json11::Json(to_list()).dump().c_str());
result.data = to_list();
state = 100;
return;
}
@@ -206,21 +215,22 @@ resume_1:
: 100)+"%";
kv.second["eff_fmt"] = format_q(kv.second["space_efficiency"].number_value()*100)+"%";
}
printf("%s", print_table(to_list(), cols, parent->color).c_str());
result.data = to_list();
result.text = print_table(result.data, cols, parent->color);
state = 100;
}
};
std::function<bool(void)> cli_tool_t::start_df(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_df(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto lister = new pool_lister_t();
lister->parent = this;
return [lister]()
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}
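
The `1l<<40` to `(uint64_t)1<<40` replacement in this file matters on 32-bit targets, where `long` is commonly 32 bits wide and a shift by 40 is undefined. A small sketch of the terabyte conversion with a 64-bit constant; `TIB` and `tb_to_bytes()` are illustrative names, not identifiers from the diff:

```cpp
#include <cassert>
#include <cstdint>

// 1 TiB as a 64-bit constant. The cast happens before the shift,
// so the shift is always performed on a 64-bit operand.
static const uint64_t TIB = (uint64_t)1 << 40;

// Convert a fractional number of TiB (as stored in pool stats) to bytes
uint64_t tb_to_bytes(double tb)
{
    return (uint64_t)(tb * TIB);
}
```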


@@ -22,12 +22,19 @@ struct snap_flattener_t
std::string top_parent_name;
inode_t target_id = 0;
int state = 0;
std::function<bool(void)> merger_cb;
std::function<bool(cli_result_t &)> merger_cb;
cli_result_t result;
void get_merge_parents()
{
// Get all parents of target
inode_config_t *target_cfg = parent->get_inode_cfg(target_name);
if (!target_cfg)
{
result = (cli_result_t){ .err = ENOENT, .text = "Layer "+target_name+" not found" };
state = 100;
return;
}
target_id = target_cfg->num;
std::vector<inode_t> chain_list;
inode_config_t *cur = target_cfg;
@@ -37,23 +44,34 @@ struct snap_flattener_t
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
result = (cli_result_t){
.err = ENOENT,
.text = "Parent inode of layer "+cur->name+" (id "+std::to_string(cur->parent_id)+") does not exist",
.data = json11::Json::object {
{ "error", "parent-not-found" },
{ "inode_id", cur->num },
{ "inode_name", cur->name },
{ "parent_id", cur->parent_id },
},
};
state = 100;
return;
}
cur = &it->second;
chain_list.push_back(cur->num);
}
if (cur->parent_id != 0)
{
fprintf(stderr, "Layer %s has a loop in parents\n", target_name.c_str());
exit(1);
result = (cli_result_t){ .err = EBADF, .text = "Layer "+target_name+" has a loop in parents" };
state = 100;
return;
}
top_parent_name = cur->name;
}
bool is_done()
{
return state == 5;
return state == 100;
}
void loop()
@@ -64,11 +82,20 @@ struct snap_flattener_t
goto resume_2;
else if (state == 3)
goto resume_3;
if (target_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Layer to flatten not specified" };
state = 100;
return;
}
// Get parent layers
get_merge_parents();
if (state == 100)
return;
// Start merger
merger_cb = parent->start_merge(json11::Json::object {
{ "command", json11::Json::array{ "merge-data", top_parent_name, target_name } },
{ "from", top_parent_name },
{ "to", target_name },
{ "target", target_name },
{ "delete-source", false },
{ "cas", use_cas },
@@ -76,14 +103,19 @@ struct snap_flattener_t
});
// Wait for it
resume_1:
while (!merger_cb())
while (!merger_cb(result))
{
state = 1;
return;
}
merger_cb = NULL;
if (result.err)
{
state = 100;
return;
}
// Change parent
parent->change_parent(target_id, 0);
parent->change_parent(target_id, 0, &result);
// Wait for it to complete
state = 2;
resume_2:
@@ -92,31 +124,26 @@ resume_2:
state = 3;
resume_3:
// Done
return;
state = 100;
}
};
std::function<bool(void)> cli_tool_t::start_flatten(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_flatten(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto flattener = new snap_flattener_t();
flattener->parent = this;
flattener->target_name = cmd.size() > 1 ? cmd[1].string_value() : "";
if (flattener->target_name == "")
{
fprintf(stderr, "Layer to flatten argument is missing\n");
exit(1);
}
flattener->target_name = cfg["image"].string_value();
flattener->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!flattener->fsync_interval)
flattener->fsync_interval = 128;
if (!cfg["cas"].is_null())
flattener->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
return [flattener]()
return [flattener](cli_result_t & result)
{
flattener->loop();
if (flattener->is_done())
{
result = flattener->result;
delete flattener;
return true;
}
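
`get_merge_parents()` walks the parent chain of the target layer and now reports `ENOENT` for a missing parent and `EBADF` for a loop instead of exiting. The loop condition itself is not visible in this hunk, so the sketch below assumes the walk is bounded by the number of known inodes. `find_top_parent()` and the plain `child -> parent` map are hypothetical simplifications of `inode_config`:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <map>

// parent_of maps an inode to its parent inode (0 = no parent).
// Finds the topmost ancestor of `start`. Returns 0 on success,
// ENOENT if an inode in the chain is unknown, EBADF if the chain loops.
int find_top_parent(const std::map<uint64_t, uint64_t> & parent_of, uint64_t start, uint64_t & top)
{
    auto it = parent_of.find(start);
    if (it == parent_of.end())
        return ENOENT;
    std::size_t steps = 0;
    // A valid chain has fewer links than there are inodes (assumed bound)
    while (it->second != 0 && steps < parent_of.size())
    {
        auto next = parent_of.find(it->second);
        if (next == parent_of.end())
            return ENOENT;
        it = next;
        steps++;
    }
    if (it->second != 0)
        return EBADF; // bound exhausted: the parents form a loop
    top = it->first;
    return 0;
}
```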


@@ -24,6 +24,7 @@ struct image_lister_t
int state = 0;
std::map<inode_t, json11::Json::object> stats;
json11::Json space_info;
cli_result_t result;
bool is_done()
{
@@ -44,8 +45,9 @@ struct image_lister_t
}
if (!list_pool_id)
{
fprintf(stderr, "Pool %s does not exist\n", list_pool_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+list_pool_name+" does not exist" };
state = 100;
return;
}
}
for (auto & ic: parent->cli->st_cli.inode_config)
@@ -116,6 +118,12 @@ struct image_lister_t
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
space_info = parent->etcd_result;
std::map<pool_id_t, uint64_t> pool_pg_real_size;
for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
@@ -245,11 +253,13 @@ resume_1:
get_stats();
if (parent->waiting > 0)
return;
if (state == 100)
return;
}
result.data = to_list();
if (parent->json_output)
{
// JSON output
printf("%s\n", json11::Json(to_list()).dump().c_str());
state = 100;
return;
}
@@ -359,7 +369,7 @@ resume_1:
kv.second["size_fmt"] = format_size(kv.second["size"].uint64_value());
kv.second["ro"] = kv.second["readonly"].bool_value() ? "RO" : "-";
}
printf("%s", print_table(to_list(), cols, parent->color).c_str());
result.text = print_table(to_list(), cols, parent->color);
state = 100;
}
};
@@ -436,8 +446,8 @@ std::string print_table(json11::Json items, json11::Json header, bool use_esc)
return str;
}
static uint64_t size_thresh[] = { 1024l*1024*1024*1024, 1024l*1024*1024, 1024l*1024, 1024, 0 };
static uint64_t size_thresh_d[] = { 1000000000000l, 1000000000l, 1000000l, 1000l, 0 };
static uint64_t size_thresh[] = { (uint64_t)1024*1024*1024*1024, (uint64_t)1024*1024*1024, (uint64_t)1024*1024, 1024, 0 };
static uint64_t size_thresh_d[] = { (uint64_t)1000000000000, (uint64_t)1000000000, (uint64_t)1000000, (uint64_t)1000, 0 };
static const int size_thresh_n = sizeof(size_thresh)/sizeof(size_thresh[0]);
static const char *size_unit = "TGMKB";
@@ -546,9 +556,8 @@ back:
return true;
}
std::function<bool(void)> cli_tool_t::start_ls(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_ls(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto lister = new image_lister_t();
lister->parent = this;
lister->list_pool_id = cfg["pool"].uint64_value();
@@ -558,15 +567,16 @@ std::function<bool(void)> cli_tool_t::start_ls(json11::Json cfg)
lister->sort_field = cfg["sort"].string_value();
lister->reverse = cfg["reverse"].bool_value();
lister->max_count = cfg["count"].uint64_value();
for (int i = 1; i < cmd.size(); i++)
for (auto & item: cfg["names"].array_items())
{
lister->only_names.insert(cmd[i].string_value());
lister->only_names.insert(item.string_value());
}
return [lister]()
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}


@@ -12,6 +12,9 @@ struct snap_rw_op_t
cluster_op_t op;
int todo = 0;
uint32_t start = 0, end = 0;
int error_code = 0;
uint64_t error_offset = 0;
bool error_read = false;
};
// Layer merge is the base for multiple operations:
@@ -54,17 +57,45 @@ struct snap_merger_t
uint64_t last_written_offset = 0;
int deleted_unsynced = 0;
uint64_t processed = 0, to_process = 0;
std::string rwo_error;
cli_result_t result;
void start_merge()
{
if (from_name == "" || to_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Beginning or end of the merge sequence is missing" };
state = 100;
return;
}
check_delete_source = delete_source || check_delete_source;
inode_config_t *from_cfg = parent->get_inode_cfg(from_name);
if (!from_cfg)
{
result = (cli_result_t){ .err = ENOENT, .text = "Layer "+from_name+" not found" };
state = 100;
return;
}
inode_config_t *to_cfg = parent->get_inode_cfg(to_name);
if (!to_cfg)
{
result = (cli_result_t){ .err = ENOENT, .text = "Layer "+to_name+" not found" };
state = 100;
return;
}
inode_config_t *target_cfg = target_name == "" ? from_cfg : parent->get_inode_cfg(target_name);
if (!target_cfg)
{
result = (cli_result_t){ .err = ENOENT, .text = "Layer "+target_name+" not found" };
state = 100;
return;
}
if (to_cfg->num == from_cfg->num)
{
fprintf(stderr, "Only one layer specified, nothing to merge\n");
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Only one layer specified, nothing to merge" };
state = 100;
return;
}
// Check that to_cfg is actually a child of from_cfg and target_cfg is somewhere between them
std::vector<inode_t> chain_list;
@@ -78,8 +109,18 @@ struct snap_merger_t
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
result = (cli_result_t){
.err = ENOENT,
.text = "Parent inode of layer "+cur->name+" (id "+std::to_string(cur->parent_id)+") does not exist",
.data = json11::Json::object {
{ "error", "parent-not-found" },
{ "inode_id", cur->num },
{ "inode_name", cur->name },
{ "parent_id", cur->parent_id },
},
};
state = 100;
return;
}
cur = &it->second;
chain_list.push_back(cur->num);
@@ -87,8 +128,9 @@ struct snap_merger_t
}
if (cur->parent_id != from_cfg->num)
{
fprintf(stderr, "Layer %s is not a child of %s\n", to_name.c_str(), from_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Layer "+to_name+" is not a child of "+from_name };
state = 100;
return;
}
chain_list.push_back(from_cfg->num);
layer_block_size[from_cfg->num] = get_block_size(from_cfg->num);
@@ -99,8 +141,9 @@ struct snap_merger_t
}
if (sources.find(target_cfg->num) == sources.end())
{
fprintf(stderr, "Layer %s is not between %s and %s\n", target_name.c_str(), to_name.c_str(), from_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Layer "+target_name+" is not between "+to_name+" and "+from_name };
state = 100;
return;
}
target = target_cfg->num;
target_rank = sources.at(target);
@@ -130,14 +173,15 @@ struct snap_merger_t
int parent_rank = it->second;
if (parent_rank < to_rank && (parent_rank >= target_rank || check_delete_source))
{
fprintf(
stderr, "Layers at or above %s, but below %s are not allowed"
" to have other children, but %s is a child of %s\n",
(check_delete_source ? from_name.c_str() : target_name.c_str()),
to_name.c_str(), ic.second.name.c_str(),
parent->cli->st_cli.inode_config.at(ic.second.parent_id).name.c_str()
);
exit(1);
result = (cli_result_t){
.err = EINVAL,
.text = "Layers at or above "+(check_delete_source ? from_name : target_name)+
", but below "+to_name+" are not allowed to have other children, but "+
ic.second.name+" is a child of "+
parent->cli->st_cli.inode_config.at(ic.second.parent_id).name,
};
state = 100;
return;
}
if (parent_rank >= to_rank)
{
@@ -152,11 +196,14 @@ struct snap_merger_t
use_cas = 0;
}
sources.erase(target);
printf(
"Merging %ld layer(s) into target %s%s (inode %lu in pool %u)\n",
sources.size(), target_cfg->name.c_str(),
use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
);
if (parent->progress)
{
printf(
"Merging %ld layer(s) into target %s%s (inode %lu in pool %u)\n",
sources.size(), target_cfg->name.c_str(),
use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
);
}
target_block_size = get_block_size(target);
}
@@ -179,7 +226,7 @@ struct snap_merger_t
bool is_done()
{
return state == 6;
return state == 100;
}
void continue_merge()
@@ -194,8 +241,8 @@ struct snap_merger_t
goto resume_4;
else if (state == 5)
goto resume_5;
else if (state == 6)
goto resume_6;
else if (state == 100)
goto resume_100;
// Get parents and so on
start_merge();
// First list lower layers
@@ -253,7 +300,8 @@ struct snap_merger_t
oit = merge_offsets.begin();
resume_5:
// Now read, overwrite and optionally delete offsets one by one
while (in_flight < parent->iodepth*parent->parallel_osds && oit != merge_offsets.end())
while (in_flight < parent->iodepth*parent->parallel_osds &&
oit != merge_offsets.end() && !rwo_error.size())
{
in_flight++;
read_and_write(*oit);
@@ -264,6 +312,15 @@ struct snap_merger_t
printf("\rOverwriting blocks: %lu/%lu", processed, to_process);
}
}
if (in_flight == 0 && rwo_error.size())
{
result = (cli_result_t){
.err = EIO,
.text = rwo_error,
};
state = 100;
return;
}
if (in_flight > 0 || oit != merge_offsets.end())
{
// Wait until overwrites finish
@@ -274,9 +331,9 @@ struct snap_merger_t
printf("\rOverwriting blocks: %lu/%lu\n", to_process, to_process);
}
// Done
printf("Done, layers from %s to %s merged into %s\n", from_name.c_str(), to_name.c_str(), target_name.c_str());
state = 6;
resume_6:
result = (cli_result_t){ .text = "Done, layers from "+from_name+" to "+to_name+" merged into "+target_name };
state = 100;
resume_100:
return;
}
@@ -314,7 +371,10 @@ struct snap_merger_t
if (status & INODE_LIST_DONE)
{
auto & name = parent->cli->st_cli.inode_config.at(src).name;
printf("Got listing of layer %s (inode %lu in pool %u)\n", name.c_str(), INODE_NO_POOL(src), INODE_POOL(src));
if (parent->progress)
{
printf("Got listing of layer %s (inode %lu in pool %u)\n", name.c_str(), INODE_NO_POOL(src), INODE_POOL(src));
}
if (delete_source)
{
// Sort the inode listing
@@ -396,8 +456,9 @@ struct snap_merger_t
{
if (op->retval != op->len)
{
fprintf(stderr, "error reading target at offset %lx: %s\n", op->offset, strerror(-op->retval));
exit(1);
rwo->error_code = -op->retval;
rwo->error_offset = op->offset;
rwo->error_read = true;
}
next_write(rwo);
};
@@ -410,7 +471,7 @@ struct snap_merger_t
// FIXME: Allow to use single write with "holes" (OSDs don't allow it yet)
uint32_t gran = parent->cli->get_bs_bitmap_granularity();
uint64_t bitmap_size = target_block_size / gran;
while (rwo->end < bitmap_size)
while (rwo->end < bitmap_size && !rwo->error_code)
{
auto bit = ((*((uint8_t*)rwo->op.bitmap_buf + (rwo->end >> 3))) & (1 << (rwo->end & 0x7)));
if (!bit)
@@ -434,7 +495,7 @@ struct snap_merger_t
rwo->end++;
}
}
if (rwo->end > rwo->start)
if (rwo->end > rwo->start && !rwo->error_code)
{
// write start->end
rwo->todo++;
@@ -473,8 +534,9 @@ struct snap_merger_t
delete subop;
return;
}
fprintf(stderr, "error writing target at offset %lx: %s\n", subop->offset, strerror(-subop->retval));
exit(1);
rwo->error_code = -subop->retval;
rwo->error_offset = subop->offset;
rwo->error_read = false;
}
// Increment CAS version
rwo->op.version++;
@@ -510,11 +572,12 @@ struct snap_merger_t
{
if (!rwo->todo)
{
if (last_written_offset < rwo->op.offset+target_block_size)
if (!rwo->error_code &&
last_written_offset < rwo->op.offset+target_block_size)
{
last_written_offset = rwo->op.offset+target_block_size;
}
if (delete_source)
if (!rwo->error_code && delete_source)
{
deleted_unsynced++;
if (deleted_unsynced >= fsync_interval)
@@ -544,6 +607,13 @@ struct snap_merger_t
}
}
free(rwo->buf);
if (rwo->error_code)
{
char buf[1024];
snprintf(buf, 1024, "Error %s target at offset %lx: %s",
rwo->error_read ? "reading" : "writing", rwo->error_offset, strerror(rwo->error_code));
rwo_error = std::string(buf);
}
delete rwo;
in_flight--;
continue_merge_reent();
@@ -551,30 +621,25 @@ struct snap_merger_t
}
};
std::function<bool(void)> cli_tool_t::start_merge(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_merge(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto merger = new snap_merger_t();
merger->parent = this;
merger->from_name = cmd.size() > 1 ? cmd[1].string_value() : "";
merger->to_name = cmd.size() > 2 ? cmd[2].string_value() : "";
merger->from_name = cfg["from"].string_value();
merger->to_name = cfg["to"].string_value();
merger->target_name = cfg["target"].string_value();
if (merger->from_name == "" || merger->to_name == "")
{
fprintf(stderr, "Beginning or end of the merge sequence is missing\n");
exit(1);
}
merger->delete_source = cfg["delete-source"].string_value() != "";
merger->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!merger->fsync_interval)
merger->fsync_interval = 128;
if (!cfg["cas"].is_null())
merger->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
return [merger]()
return [merger](cli_result_t & result)
{
merger->continue_merge_reent();
if (merger->is_done())
{
result = merger->result;
delete merger;
return true;
}
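
The hunks above replace `exit(1)` with a stored result plus a terminal `state = 100`, and the returned callback copies that result out once `is_done()` is true. A minimal sketch of that pattern follows. `demo_result_t`, `demo_task_t`, `start_demo` and `run_to_completion` are hypothetical names, not part of the codebase, and the sketch uses `std::shared_ptr` where the diff uses `new`/`delete`.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for cli_result_t: error code plus message
struct demo_result_t
{
    int err = 0;
    std::string text;
};

// Hypothetical task: performs `steps` units of work, one per loop() call
struct demo_task_t
{
    int steps = 0;
    int state = 0;
    demo_result_t result;

    bool is_done()
    {
        return state == 100;
    }

    void loop()
    {
        if (state == 1)
            goto resume_1;
        else if (state == 100)
            return;
        if (steps < 0)
        {
            // Report the error to the caller instead of calling exit(1)
            result.err = EINVAL;
            result.text = "Step count is negative";
            state = 100;
            return;
        }
        state = 1;
    resume_1:
        if (steps > 0)
        {
            // One unit of work is "in progress": give control back to the caller
            steps--;
            return;
        }
        result.err = 0;
        result.text = "Done";
        state = 100;
    }
};

// Returns a callback that yields true once the task has finished
std::function<bool(demo_result_t &)> start_demo(int steps)
{
    auto task = std::make_shared<demo_task_t>();
    task->steps = steps;
    return [task](demo_result_t & result)
    {
        task->loop();
        if (task->is_done())
        {
            result = task->result;
            return true;
        }
        return false;
    };
}

// Drives a callback to completion and returns the number of calls made
int run_to_completion(std::function<bool(demo_result_t &)> cb, demo_result_t & result)
{
    assert(cb);
    int calls = 1;
    while (!cb(result))
        calls++;
    return calls;
}
```

A caller that embeds this as a library can inspect `result.err` after the callback returns true, rather than having the process terminated on the first failure.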

@@ -13,6 +13,7 @@ struct image_changer_t
std::string image_name;
std::string new_name;
uint64_t new_size = 0;
bool force_size = false;
bool set_readonly = false, set_readwrite = false, force = false;
// interval between fsyncs
int fsync_interval = 128;
@@ -23,7 +24,8 @@ struct image_changer_t
bool has_children = false;
int state = 0;
std::function<bool(void)> cb;
std::function<bool(cli_result_t &)> cb;
cli_result_t result;
bool is_done()
{
@@ -36,6 +38,18 @@ struct image_changer_t
goto resume_1;
else if (state == 2)
goto resume_2;
if (image_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Image name is missing" };
state = 100;
return;
}
if (new_size != 0 && (new_size % 4096) && !force_size)
{
result = (cli_result_t){ .err = EINVAL, .text = "Image size should be a multiple of 4096" };
state = 100;
return;
}
for (auto & ic: parent->cli->st_cli.inode_config)
{
if (ic.second.name == image_name)
@@ -46,14 +60,16 @@ struct image_changer_t
}
if (new_name != "" && ic.second.name == new_name)
{
fprintf(stderr, "Image %s already exists\n", new_name.c_str());
exit(1);
result = (cli_result_t){ .err = EEXIST, .text = "Image "+new_name+" already exists" };
state = 100;
return;
}
}
if (!inode_num)
{
fprintf(stderr, "Image %s does not exist\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Image "+image_name+" does not exist" };
state = 100;
return;
}
for (auto & ic: parent->cli->st_cli.inode_config)
{
@@ -65,37 +81,43 @@ struct image_changer_t
}
if ((!set_readwrite || !cfg.readonly) &&
(!set_readonly || cfg.readonly) &&
(!new_size || cfg.size == new_size) &&
((!new_size && !force_size) || cfg.size == new_size) &&
(new_name == "" || new_name == image_name))
{
printf("No change\n");
result = (cli_result_t){ .text = "No change" };
state = 100;
return;
}
if (new_size != 0)
if (new_size != 0 || force_size)
{
if (cfg.size >= new_size)
{
// Check confirmation when trimming an image with children
if (has_children && !force)
{
fprintf(stderr, "Image %s has children. Refusing to shrink it without --force\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Image "+image_name+" has children. Refusing to shrink it without --force" };
state = 100;
return;
}
// Shrink the image first
cb = parent->start_rm(json11::Json::object {
cb = parent->start_rm_data(json11::Json::object {
{ "inode", INODE_NO_POOL(inode_num) },
{ "pool", (uint64_t)INODE_POOL(inode_num) },
{ "fsync-interval", fsync_interval },
{ "min-offset", new_size },
{ "min-offset", ((new_size+4095)/4096)*4096 },
});
resume_1:
while (!cb())
while (!cb(result))
{
state = 1;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
}
cfg.size = new_size;
}
@@ -109,8 +131,9 @@ resume_1:
// Check confirmation when making an image with children read-write
if (has_children && !force)
{
fprintf(stderr, "Image %s has children. Refusing to make it read-write without --force\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Image "+image_name+" has children. Refusing to make it read-write without --force" };
state = 100;
return;
}
}
if (new_name != "")
@@ -178,34 +201,38 @@ resume_1:
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
if (!parent->etcd_result["succeeded"].bool_value())
{
fprintf(stderr, "Image %s was modified by someone else, please repeat your request\n", image_name.c_str());
exit(1);
result = (cli_result_t){ .err = EAGAIN, .text = "Image "+image_name+" was modified by someone else, please repeat your request" };
state = 100;
return;
}
printf("Image %s modified\n", image_name.c_str());
// Save into inode_config for library users to be able to take it from there immediately
cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
if (new_name != "")
{
parent->cli->st_cli.inode_by_name.erase(image_name);
}
parent->cli->st_cli.insert_inode_config(cfg);
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" modified" };
state = 100;
}
};
std::function<bool(void)> cli_tool_t::start_modify(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_modify(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto changer = new image_changer_t();
changer->parent = this;
changer->image_name = cmd.size() > 1 ? cmd[1].string_value() : "";
if (changer->image_name == "")
{
fprintf(stderr, "Image name is missing\n");
exit(1);
}
changer->image_name = cfg["image"].string_value();
changer->new_name = cfg["rename"].string_value();
changer->new_size = parse_size(cfg["resize"].string_value());
if (changer->new_size != 0 && (changer->new_size % 4096))
{
fprintf(stderr, "Image size should be a multiple of 4096\n");
exit(1);
}
changer->new_size = parse_size(cfg["resize"].as_string());
changer->force_size = cfg["force_size"].bool_value();
changer->force = cfg["force"].bool_value();
changer->set_readonly = cfg["readonly"].bool_value();
changer->set_readwrite = cfg["readwrite"].bool_value();
@@ -213,11 +240,12 @@ std::function<bool(void)> cli_tool_t::start_modify(json11::Json cfg)
if (!changer->fsync_interval)
changer->fsync_interval = 128;
// FIXME Check that the image doesn't have children when shrinking
return [changer]()
return [changer](cli_result_t & result)
{
changer->loop();
if (changer->is_done())
{
result = changer->result;
delete changer;
return true;
}
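
The shrink path above now passes `((new_size+4095)/4096)*4096` as `min-offset`. With `force_size` the new size may not be a multiple of 4096, so the apparent intent is to start removing data at the next 4096-byte boundary. A small sketch of that arithmetic, with `demo_round_up_4k` as a hypothetical helper name:

```cpp
#include <cstdint>

// Round a size up to the next multiple of 4096 (sizes already aligned are unchanged)
uint64_t demo_round_up_4k(uint64_t size)
{
    return ((size + 4095) / 4096) * 4096;
}
```

For example, a requested size of 4097 bytes gives a removal offset of 8192, while 4096 stays at 4096.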

@@ -63,11 +63,13 @@ struct snap_remover_t
inode_t new_parent = 0;
int state = 0;
int current_child = 0;
std::function<bool(void)> cb;
std::function<bool(cli_result_t &)> cb;
cli_result_t result;
bool is_done()
{
return state == 9;
return state == 100;
}
void loop()
@@ -88,13 +90,28 @@ struct snap_remover_t
goto resume_7;
else if (state == 8)
goto resume_8;
else if (state == 9)
goto resume_9;
else if (state == 100)
goto resume_100;
assert(!state);
if (from_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Layer to remove argument is missing" };
state = 100;
return;
}
if (to_name == "")
{
to_name = from_name;
}
// Get children to merge
get_merge_children();
if (state == 100)
return;
// Try to select an inode for the "inverse" optimized scenario
// Read statistics from etcd to do it
read_stats();
if (state == 100)
return;
state = 1;
resume_1:
if (parent->waiting > 0)
@@ -106,42 +123,72 @@ resume_1:
if (merge_children[current_child] == inverse_child)
continue;
start_merge_child(merge_children[current_child], merge_children[current_child]);
if (state == 100)
return;
resume_2:
while (!cb())
while (!cb(result))
{
state = 2;
return;
}
cb = NULL;
parent->change_parent(merge_children[current_child], new_parent);
if (result.err)
{
state = 100;
return;
}
parent->change_parent(merge_children[current_child], new_parent, &result);
state = 3;
resume_3:
if (parent->waiting > 0)
return;
if (result.err)
{
state = 100;
return;
}
else if (parent->progress)
printf("%s\n", result.text.c_str());
}
// Merge our "inverse" child into our "inverse" parent
if (inverse_child != 0)
{
start_merge_child(inverse_child, inverse_parent);
if (state == 100)
return;
resume_4:
while (!cb())
while (!cb(result))
{
state = 4;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
// Delete "inverse" child data
start_delete_source(inverse_child);
if (state == 100)
return;
resume_5:
while (!cb())
while (!cb(result))
{
state = 5;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
// Delete "inverse" child metadata, rename parent over it,
// and also change parent links of the previous "inverse" child
rename_inverse_parent();
if (state == 100)
return;
state = 6;
resume_6:
if (parent->waiting > 0)
@@ -154,20 +201,27 @@ resume_6:
continue;
start_delete_source(chain_list[current_child]);
resume_7:
while (!cb())
while (!cb(result))
{
state = 7;
return;
}
cb = NULL;
if (result.err)
{
state = 100;
return;
}
delete_inode_config(chain_list[current_child]);
if (state == 100)
return;
state = 8;
resume_8:
if (parent->waiting > 0)
return;
}
state = 9;
resume_9:
state = 100;
resume_100:
// Done
return;
}
@@ -176,7 +230,19 @@ resume_9:
{
// Get all children of from..to
inode_config_t *from_cfg = parent->get_inode_cfg(from_name);
if (!from_cfg)
{
result = (cli_result_t){ .err = ENOENT, .text = "Layer "+from_name+" not found" };
state = 100;
return;
}
inode_config_t *to_cfg = parent->get_inode_cfg(to_name);
if (!to_cfg)
{
result = (cli_result_t){ .err = ENOENT, .text = "Layer "+to_name+" not found" };
state = 100;
return;
}
// Check that to_cfg is actually a child of from_cfg
// FIXME de-copypaste the following piece of code with snap_merger_t
inode_config_t *cur = to_cfg;
@@ -186,16 +252,19 @@ resume_9:
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
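char buf[1024];
snprintf(buf, 1024, "Parent inode of layer %s (id 0x%lx) not found", cur->name.c_str(), cur->parent_id);
// Store the formatted message, otherwise the caller sees an empty result
result = (cli_result_t){ .err = ENOENT, .text = std::string(buf) };
state = 100;
return;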
}
cur = &it->second;
chain_list.push_back(cur->num);
}
if (cur->num != from_cfg->num)
{
fprintf(stderr, "Layer %s is not a child of %s\n", to_name.c_str(), from_name.c_str());
exit(1);
result = (cli_result_t){ .err = EINVAL, .text = "Layer "+to_name+" is not a child of "+from_name };
state = 100;
return;
}
new_parent = from_cfg->parent_id;
// Calculate ranks
@@ -263,8 +332,9 @@ resume_9:
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error reading layer statistics from etcd: %s\n", err.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Error reading layer statistics from etcd: "+err };
state = 100;
return;
}
for (auto inode_result: data["responses"].array_items())
{
@@ -275,14 +345,16 @@ resume_9:
sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
if (!inode || null_byte != 0)
{
fprintf(stderr, "Bad key returned from etcd: %s\n", kv.key.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Bad key returned from etcd: "+kv.key };
state = 100;
return;
}
auto pool_cfg_it = parent->cli->st_cli.pool_config.find(pool_id);
if (pool_cfg_it == parent->cli->st_cli.pool_config.end())
{
fprintf(stderr, "Pool %u does not exist\n", pool_id);
exit(1);
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(pool_id)+" does not exist" };
state = 100;
return;
}
inode = INODE_WITH_POOL(pool_id, inode);
auto & pool_cfg = pool_cfg_it->second;
@@ -324,14 +396,20 @@ resume_9:
auto child_it = parent->cli->st_cli.inode_config.find(inverse_child);
if (child_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inverse_child);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inverse_child);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
auto target_it = parent->cli->st_cli.inode_config.find(inverse_parent);
if (target_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inverse_parent);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inverse_parent);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
inode_config_t *child_cfg = &child_it->second;
inode_config_t *target_cfg = &target_it->second;
@@ -422,18 +500,22 @@ resume_9:
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error renaming %s to %s: %s\n", target_name.c_str(), child_name.c_str(), err.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Error renaming "+target_name+" to "+child_name+": "+err };
state = 100;
return;
}
if (!res["succeeded"].bool_value())
{
fprintf(
stderr, "Parent (%s), child (%s), or one of its children"
" configuration was modified during rename\n", target_name.c_str(), child_name.c_str()
);
exit(1);
result = (cli_result_t){
.err = EAGAIN,
.text = "Parent ("+target_name+"), child ("+child_name+"), or one of its children"
" configuration was modified during rename",
};
state = 100;
return;
}
printf("Layer %s renamed to %s\n", target_name.c_str(), child_name.c_str());
if (parent->progress)
printf("Layer %s renamed to %s\n", target_name.c_str(), child_name.c_str());
parent->ringloop->wakeup();
});
}
@@ -443,8 +525,11 @@ resume_9:
auto cur_cfg_it = parent->cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", cur);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
inode_config_t *cur_cfg = &cur_cfg_it->second;
std::string cur_name = cur_cfg->name;
@@ -475,20 +560,26 @@ resume_9:
} },
},
} },
}, [this, cur_name](std::string err, json11::Json res)
}, [this, cur, cur_name](std::string err, json11::Json res)
{
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error deleting %s: %s\n", cur_name.c_str(), err.c_str());
exit(1);
result = (cli_result_t){ .err = EIO, .text = "Error deleting "+cur_name+": "+err };
state = 100;
return;
}
if (!res["succeeded"].bool_value())
{
fprintf(stderr, "Layer %s configuration was modified during deletion\n", cur_name.c_str());
exit(1);
result = (cli_result_t){ .err = EAGAIN, .text = "Layer "+cur_name+" was modified during deletion" };
state = 100;
return;
}
printf("Layer %s deleted\n", cur_name.c_str());
// Modify inode_config for library users to be able to take it from there immediately
parent->cli->st_cli.inode_by_name.erase(cur_name);
parent->cli->st_cli.inode_config.erase(cur);
if (parent->progress)
printf("Layer %s deleted\n", cur_name.c_str());
parent->ringloop->wakeup();
});
}
@@ -498,17 +589,24 @@ resume_9:
auto child_it = parent->cli->st_cli.inode_config.find(child_inode);
if (child_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", child_inode);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", child_inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
auto target_it = parent->cli->st_cli.inode_config.find(target_inode);
if (target_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", target_inode);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", target_inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
cb = parent->start_merge(json11::Json::object {
{ "command", json11::Json::array{ "merge-data", from_name, child_it->second.name } },
{ "from", from_name },
{ "to", child_it->second.name },
{ "target", target_it->second.name },
{ "delete-source", false },
{ "cas", use_cas },
@@ -521,10 +619,13 @@ resume_9:
auto source = parent->cli->st_cli.inode_config.find(inode);
if (source == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inode);
exit(1);
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
}
cb = parent->start_rm(json11::Json::object {
cb = parent->start_rm_data(json11::Json::object {
{ "inode", inode },
{ "pool", (uint64_t)INODE_POOL(inode) },
{ "fsync-interval", fsync_interval },
@@ -532,22 +633,12 @@ resume_9:
}
};
std::function<bool(void)> cli_tool_t::start_snap_rm(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_rm(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto snap_remover = new snap_remover_t();
snap_remover->parent = this;
snap_remover->from_name = cmd.size() > 1 ? cmd[1].string_value() : "";
snap_remover->to_name = cmd.size() > 2 ? cmd[2].string_value() : "";
if (snap_remover->from_name == "")
{
fprintf(stderr, "Layer to remove argument is missing\n");
exit(1);
}
if (snap_remover->to_name == "")
{
snap_remover->to_name = snap_remover->from_name;
}
snap_remover->from_name = cfg["from"].string_value();
snap_remover->to_name = cfg["to"].string_value();
snap_remover->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!snap_remover->fsync_interval)
snap_remover->fsync_interval = 128;
@@ -555,11 +646,12 @@ std::function<bool(void)> cli_tool_t::start_snap_rm(json11::Json cfg)
snap_remover->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
if (!cfg["writers_stopped"].is_null())
snap_remover->writers_stopped = true;
return [snap_remover]()
return [snap_remover](cli_result_t & result)
{
snap_remover->loop();
if (snap_remover->is_done())
{
result = snap_remover->result;
delete snap_remover;
return true;
}

@@ -32,6 +32,9 @@ struct rm_inode_t
uint64_t pgs_to_list = 0;
bool lists_done = false;
int state = 0;
int error_count = 0;
cli_result_t result;
void start_delete()
{
@@ -74,8 +77,13 @@ struct rm_inode_t
});
if (!lister)
{
fprintf(stderr, "Failed to list inode %lu from pool %u objects\n", INODE_NO_POOL(inode), INODE_POOL(inode));
exit(1);
result = (cli_result_t){
.err = EIO,
.text = "Failed to list objects of inode "+std::to_string(INODE_NO_POOL(inode))+
" from pool "+std::to_string(INODE_POOL(inode)),
};
state = 100;
return;
}
pgs_to_list = parent->cli->list_pg_count(lister);
parent->cli->list_inode_next(lister, parent->parallel_osds);
@@ -118,6 +126,7 @@ struct rm_inode_t
fprintf(stderr, "Failed to remove object %lx:%lx from PG %u (OSD %lu) (retval=%ld)\n",
op->req.rw.inode, op->req.rw.offset,
cur_list->pg_num, cur_list->rm_osd_num, op->reply.hdr.retval);
error_count++;
}
delete op;
cur_list->obj_done++;
@@ -161,31 +170,43 @@ struct rm_inode_t
}
if (lists_done && !lists.size())
{
printf("Done, inode %lu in pool %u data removed\n", INODE_NO_POOL(inode), pool_id);
state = 2;
result = (cli_result_t){
.err = error_count > 0 ? EIO : 0,
.text = error_count > 0 ? "Some blocks were not removed" : (
"Done, inode "+std::to_string(INODE_NO_POOL(inode))+" from pool "+
std::to_string(pool_id)+" removed"),
};
state = 100;
}
}
bool loop()
bool is_done()
{
if (state == 0)
return state == 100;
}
void loop()
{
if (state == 1)
goto resume_1;
if (state == 100)
return;
if (!pool_id)
{
start_delete();
state = 1;
result = (cli_result_t){ .err = EINVAL, .text = "Pool is not specified" };
state = 100;
return;
}
else if (state == 1)
{
continue_delete();
}
else if (state == 2)
{
return true;
}
return false;
start_delete();
if (state == 100)
return;
state = 1;
resume_1:
continue_delete();
}
};
std::function<bool(void)> cli_tool_t::start_rm(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_rm_data(json11::Json cfg)
{
auto remover = new rm_inode_t();
remover->parent = this;
@@ -193,19 +214,16 @@ std::function<bool(void)> cli_tool_t::start_rm(json11::Json cfg)
remover->pool_id = cfg["pool"].uint64_value();
if (remover->pool_id)
{
remover->inode = (remover->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (((uint64_t)remover->pool_id) << (64-POOL_ID_BITS));
remover->inode = (remover->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (((uint64_t)remover->pool_id) << (64-POOL_ID_BITS));
}
remover->pool_id = INODE_POOL(remover->inode);
if (!remover->pool_id)
{
fprintf(stderr, "pool is missing\n");
exit(1);
}
remover->min_offset = cfg["min-offset"].uint64_value();
return [remover]()
return [remover](cli_result_t & result)
{
if (remover->loop())
remover->loop();
if (remover->is_done())
{
result = remover->result;
delete remover;
return true;
}
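
The mask in the hunk above changes from `1l` to `(uint64_t)1`. Where `long` is 32 bits wide, shifting `1l` left by `64-POOL_ID_BITS` exceeds the width of the type, so the mask has to be built from a 64-bit value. The sketch below illustrates the packing. The `demo_` names are hypothetical, and the 16-bit pool ID width is an assumption made for the example.

```cpp
#include <cstdint>

// Assumption: the top 16 bits of the 64-bit inode number hold the pool ID
constexpr unsigned DEMO_POOL_ID_BITS = 16;

constexpr uint64_t demo_inode_mask()
{
    // (uint64_t)1 keeps the shift 64 bits wide even where long is 32 bits
    return ((uint64_t)1 << (64 - DEMO_POOL_ID_BITS)) - 1;
}

// Combine a pool ID and a pool-local inode number into one 64-bit value
constexpr uint64_t demo_inode_with_pool(uint32_t pool_id, uint64_t inode)
{
    return (inode & demo_inode_mask()) | ((uint64_t)pool_id << (64 - DEMO_POOL_ID_BITS));
}

// Extract the pool ID from the upper bits
constexpr uint32_t demo_inode_pool(uint64_t inode)
{
    return (uint32_t)(inode >> (64 - DEMO_POOL_ID_BITS));
}

// Extract the pool-local inode number from the lower bits
constexpr uint64_t demo_inode_no_pool(uint64_t inode)
{
    return inode & demo_inode_mask();
}
```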

@@ -11,9 +11,9 @@
#include <sys/stat.h>
// Calculate offsets for a block device and print OSD command line parameters
std::function<bool(void)> cli_tool_t::simple_offsets(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::simple_offsets(json11::Json cfg)
{
std::string device = cfg["command"][1].string_value();
std::string device = cfg["device"].string_value();
uint64_t object_size = parse_size(cfg["object_size"].string_value());
uint64_t bitmap_granularity = parse_size(cfg["bitmap_granularity"].string_value());
uint64_t journal_size = parse_size(cfg["journal_size"].string_value());

@@ -83,6 +83,12 @@ resume_1:
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
fprintf(stderr, "%s\n", parent->etcd_err.text.c_str());
state = 100;
return;
}
mon_members = parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items();
osd_stats = parent->etcd_result["responses"][1]["response_range"]["kvs"].array_items();
if (parent->etcd_result["responses"][2]["response_range"]["kvs"].array_items().size() > 0)
@@ -140,7 +146,7 @@ resume_2:
else
{
down_raw += kv.value["size"].uint64_value();
free_down_raw += kv.value["size"].uint64_value();
free_down_raw += kv.value["free"].uint64_value();
}
}
int pool_count = 0, pools_active = 0;
@@ -217,7 +223,7 @@ resume_2:
// JSON output
printf("%s\n", json11::Json(json11::Json::object {
{ "etcd_alive", etcd_alive },
{ "etcd_count", etcd_states.size() },
{ "etcd_count", (uint64_t)etcd_states.size() },
{ "etcd_db_size", etcd_db_size },
{ "mon_count", mon_count },
{ "mon_master", mon_master },
@@ -277,16 +283,16 @@ resume_2:
}
};
std::function<bool(void)> cli_tool_t::start_status(json11::Json cfg)
std::function<bool(cli_result_t &)> cli_tool_t::start_status(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto printer = new status_printer_t();
printer->parent = this;
return [printer]()
return [printer](cli_result_t & result)
{
printer->loop();
if (printer->is_done())
{
result = { .err = 0 };
delete printer;
return true;
}

@@ -9,6 +9,7 @@
#define PART_SENT 1
#define PART_DONE 2
#define PART_ERROR 4
#define PART_RETRY 8
#define CACHE_DIRTY 1
#define CACHE_FLUSHING 2
#define CACHE_REPEATING 3
@@ -373,6 +374,11 @@ void cluster_client_t::on_change_hook(std::map<std::string, etcd_kv_t> & changes
continue_ops();
}
bool cluster_client_t::get_immediate_commit()
{
return immediate_commit;
}
void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
{
if (msgr.wanted_peers.find(peer_osd) != msgr.wanted_peers.end())
@@ -670,14 +676,17 @@ resume_2:
if (!try_send(op, i))
{
// We'll need to retry again
op->up_wait = true;
if (!retry_timeout_id)
if (op->parts[i].flags & PART_RETRY)
{
retry_timeout_id = tfd->set_timer(up_wait_retry_interval, false, [this](int)
op->up_wait = true;
if (!retry_timeout_id)
{
retry_timeout_id = 0;
continue_ops(true);
});
retry_timeout_id = tfd->set_timer(up_wait_retry_interval, false, [this](int)
{
retry_timeout_id = 0;
continue_ops(true);
});
}
}
op->state = 2;
}
@@ -746,7 +755,7 @@ resume_3:
{
for (int i = 0; i < op->parts.size(); i++)
{
op->parts[i].flags = 0;
op->parts[i].flags = PART_RETRY;
}
goto resume_2;
}

@@ -118,6 +118,8 @@ public:
bool is_ready();
void on_ready(std::function<void(void)> fn);
bool get_immediate_commit();
static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
void continue_ops(bool up_retry = false);
inode_list_t *list_inode_start(inode_t inode,

@@ -89,7 +89,7 @@ void etcd_state_client_t::etcd_call_oneshot(std::string etcd_address, std::strin
"Connection: close\r\n"
"\r\n"+req;
auto http_cli = http_init(tfd);
auto cb = [this, http_cli, callback](const http_response_t *response)
auto cb = [http_cli, callback](const http_response_t *response)
{
std::string err;
json11::Json data;
@@ -338,9 +338,14 @@ void etcd_state_client_t::start_etcd_watcher()
{
if (data["result"]["created"].bool_value())
{
if (etcd_watches_initialised == 3 && this->log_level > 0)
uint64_t watch_id = data["result"]["watch_id"].uint64_value();
if (watch_id == ETCD_CONFIG_WATCH_ID ||
watch_id == ETCD_PG_STATE_WATCH_ID ||
watch_id == ETCD_PG_HISTORY_WATCH_ID ||
watch_id == ETCD_OSD_STATE_WATCH_ID)
etcd_watches_initialised++;
if (etcd_watches_initialised == 4 && this->log_level > 0)
fprintf(stderr, "Successfully subscribed to etcd at %s\n", selected_etcd_address.c_str());
etcd_watches_initialised++;
}
if (data["result"]["canceled"].bool_value())
{
@@ -469,6 +474,10 @@ void etcd_state_client_t::start_etcd_watcher()
{ "progress_notify", true },
} }
}).dump());
if (on_start_watcher_hook)
{
on_start_watcher_hook(etcd_watch_ws);
}
if (ws_keepalive_timer < 0)
{
ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int)
@@ -954,6 +963,10 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
if (!value.is_object())
{
if (on_inode_change_hook != NULL)
{
on_inode_change_hook(inode_num, true);
}
this->inode_config.erase(inode_num);
}
else
@@ -968,38 +981,47 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
fprintf(
stderr, "Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
inode_num >> (64-POOL_ID_BITS), inode_num & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)
);
parent_inode_num = 0;
}
else
parent_inode_num |= parent_pool_id << (64-POOL_ID_BITS);
}
inode_config_t cfg = (inode_config_t){
insert_inode_config((inode_config_t){
.num = inode_num,
.name = value["name"].string_value(),
.size = value["size"].uint64_value(),
.parent_id = parent_inode_num,
.readonly = value["readonly"].bool_value(),
.meta = value["meta"],
.mod_revision = kv.mod_revision,
};
this->inode_config[inode_num] = cfg;
if (cfg.name != "")
{
this->inode_by_name[cfg.name] = inode_num;
for (auto w: watches)
{
if (w->name == value["name"].string_value())
{
w->cfg = cfg;
}
}
}
});
}
}
}
}
void etcd_state_client_t::insert_inode_config(const inode_config_t & cfg)
{
this->inode_config[cfg.num] = cfg;
if (cfg.name != "")
{
this->inode_by_name[cfg.name] = cfg.num;
for (auto w: watches)
{
if (w->name == cfg.name)
{
w->cfg = cfg;
}
}
}
if (on_inode_change_hook != NULL)
{
on_inode_change_hook(cfg.num, false);
}
}
inode_watch_t* etcd_state_client_t::watch_inode(std::string name)
{
inode_watch_t *watch = new inode_watch_t;
@@ -1042,6 +1064,10 @@ json11::Json::object etcd_state_client_t::serialize_inode_cfg(inode_config_t *cf
{
new_cfg["readonly"] = true;
}
if (cfg->meta.is_object())
{
new_cfg["meta"] = cfg->meta;
}
return new_cfg;
}


@@ -56,6 +56,8 @@ struct inode_config_t
uint64_t size;
inode_t parent_id;
bool readonly;
// Arbitrary metadata
json11::Json meta;
// Change revision of the metadata in etcd
uint64_t mod_revision;
};
@@ -109,6 +111,8 @@ public:
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;
std::function<void()> on_reload_hook;
std::function<void(inode_t, bool)> on_inode_change_hook;
std::function<void(http_co_t *)> on_start_watcher_hook;
json11::Json::object serialize_inode_cfg(inode_config_t *cfg);
etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
@@ -122,6 +126,7 @@ public:
void load_pgs();
void parse_state(const etcd_kv_t & kv);
void parse_config(const json11::Json & config);
void insert_inode_config(const inode_config_t & cfg);
inode_watch_t* watch_inode(std::string name);
void close_watch(inode_watch_t* watch);
int address_count();


@@ -214,14 +214,14 @@ static int sec_setup(struct thread_data *td)
if (!o->image)
{
if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
if (!(o->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)))
{
td_verror(td, EINVAL, "inode number is missing");
return 1;
}
if (o->pool)
{
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
o->inode = (o->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
}
if (!(o->inode >> (64-POOL_ID_BITS)))
{


@@ -39,6 +39,12 @@ void osd_messenger_t::init()
handle_rdma_events();
}
}
#endif
#ifndef SO_ZEROCOPY
if (log_level > 0)
{
fprintf(stderr, "Zero-copy TCP send is not supported in this build, ignoring\n");
}
#endif
keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
{
@@ -162,6 +168,8 @@ void osd_messenger_t::parse_config(const json11::Json & config)
this->receive_buffer_size = 65536;
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
this->use_zerocopy_send = config["use_zerocopy_send"].bool_value() ||
config["use_zerocopy_send"].uint64_value();
this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!this->peer_connect_interval)
this->peer_connect_interval = 5;
@@ -288,8 +296,7 @@ void osd_messenger_t::handle_connect_epoll(int peer_fd)
on_connect_peer(peer_osd, -result);
return;
}
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
set_socket_options(cl);
cl->peer_state = PEER_CONNECTED;
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{
@@ -299,6 +306,23 @@ void osd_messenger_t::handle_connect_epoll(int peer_fd)
check_peer_config(cl);
}
void osd_messenger_t::set_socket_options(osd_client_t *cl)
{
int one = 1;
setsockopt(cl->peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
#ifdef SO_ZEROCOPY
if (!use_zerocopy_send)
cl->zerocopy_send = false;
else if (setsockopt(cl->peer_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) != 0)
{
if (log_level > 0)
fprintf(stderr, "[OSD %lu] Failed to enable zero-copy send for client %d: %s\n", this->osd_num, cl->peer_fd, strerror(errno));
}
else
cl->zerocopy_send = true;
#endif
}
void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
{
// Mark client as ready (i.e. some data is available)
@@ -493,14 +517,13 @@ void osd_messenger_t::accept_connections(int listen_fd)
fprintf(stderr, "[OSD %lu] new client %d: connection from %s\n", this->osd_num, peer_fd,
addr_to_string(addr).c_str());
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
clients[peer_fd] = new osd_client_t();
clients[peer_fd]->peer_addr = addr;
clients[peer_fd]->peer_port = ntohs(((sockaddr_in*)&addr)->sin_port);
clients[peer_fd]->peer_fd = peer_fd;
clients[peer_fd]->peer_state = PEER_CONNECTED;
clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
auto cl = clients[peer_fd] = new osd_client_t();
cl->peer_addr = addr;
cl->peer_port = ntohs(((sockaddr_in*)&addr)->sin_port);
cl->peer_fd = peer_fd;
cl->peer_state = PEER_CONNECTED;
cl->in_buf = malloc_or_die(receive_buffer_size);
set_socket_options(cl);
// Add FD to epoll
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{


@@ -45,6 +45,12 @@ struct msgr_sendp_t
int flags;
};
struct msgr_zc_not_t
{
osd_op_t *op;
uint32_t nsend;
};
struct osd_client_t
{
int refs = 0;
@@ -57,6 +63,7 @@ struct osd_client_t
int ping_time_remaining = 0;
int idle_time_remaining = 0;
osd_num_t osd_num = 0;
bool zerocopy_send = false;
void *in_buf = NULL;
@@ -87,6 +94,12 @@ struct osd_client_t
int write_state = 0;
std::vector<iovec> send_list, next_send_list;
std::vector<msgr_sendp_t> outbox, next_outbox;
std::vector<msgr_zc_not_t> zerocopy_sent;
uint64_t outbox_size = 0, next_outbox_size = 0;
uint32_t zerocopy_notification_idx = 0;
uint32_t zerocopy_notification_prev = 0;
uint8_t zerocopy_notification_buf[256];
struct msghdr zerocopy_notification_msg;
~osd_client_t()
{
@@ -127,6 +140,7 @@ protected:
int osd_ping_timeout = 0;
int log_level = 0;
bool use_sync_send_recv = false;
bool use_zerocopy_send = false;
#ifdef WITH_RDMA
bool use_rdma = true;
@@ -181,10 +195,12 @@ protected:
void check_peer_config(osd_client_t *cl);
void cancel_osd_ops(osd_client_t *cl);
void cancel_op(osd_op_t *op);
void set_socket_options(osd_client_t *cl);
bool try_send(osd_client_t *cl);
void measure_exec(osd_op_t *cur_op);
void handle_send(int result, osd_client_t *cl);
void handle_zerocopy_notification(osd_client_t *cl, int res);
bool handle_read(int result, osd_client_t *cl);
bool handle_read_buffer(osd_client_t *cl, void *curbuf, int remain);


@@ -6,6 +6,12 @@
#include "messenger.h"
#include <linux/errqueue.h>
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0
#endif
void osd_messenger_t::outbox_push(osd_op_t *cur_op)
{
assert(cur_op->peer_fd);
@@ -36,6 +42,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
}
auto & to_send_list = cl->write_msg.msg_iovlen ? cl->next_send_list : cl->send_list;
auto & to_outbox = cl->write_msg.msg_iovlen ? cl->next_outbox : cl->outbox;
auto & to_size = cl->write_msg.msg_iovlen ? cl->next_outbox_size : cl->outbox_size;
if (cur_op->op_type == OSD_OP_IN)
{
measure_exec(cur_op);
@@ -46,6 +53,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
to_send_list.push_back((iovec){ .iov_base = cur_op->req.buf, .iov_len = OSD_PACKET_SIZE });
cl->sent_ops[cur_op->req.hdr.id] = cur_op;
}
to_size += OSD_PACKET_SIZE;
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = MSGR_SENDP_HDR });
// Bitmap
if (cur_op->op_type == OSD_OP_IN &&
@@ -57,6 +65,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
.iov_len = cur_op->reply.sec_rw.attr_len,
});
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
to_size += cur_op->reply.sec_rw.attr_len;
}
else if (cur_op->op_type == OSD_OP_OUT &&
(cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
@@ -67,6 +76,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
.iov_len = cur_op->req.sec_rw.attr_len,
});
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
to_size += cur_op->req.sec_rw.attr_len;
}
// Operation data
if ((cur_op->op_type == OSD_OP_IN
@@ -86,14 +96,21 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
assert(cur_op->iov.buf[i].iov_base);
to_send_list.push_back(cur_op->iov.buf[i]);
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
to_size += cur_op->iov.buf[i].iov_len;
}
}
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
if (cur_op->op_type == OSD_OP_IN && cur_op->reply.hdr.retval > 0)
{
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->reply.hdr.retval });
to_size += cur_op->reply.hdr.retval;
}
else if (cur_op->op_type == OSD_OP_OUT && cur_op->req.sec_read_bmp.len > 0)
{
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->req.sec_read_bmp.len });
to_size += cur_op->req.sec_read_bmp.len;
}
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
}
if (cur_op->op_type == OSD_OP_IN)
@@ -177,17 +194,19 @@ bool osd_messenger_t::try_send(osd_client_t *cl)
}
cl->write_msg.msg_iov = cl->send_list.data();
cl->write_msg.msg_iovlen = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
cl->write_msg.msg_flags = (cl->zerocopy_send && (cl->outbox_size/cl->send_list.size()) >= 4096 ? MSG_ZEROCOPY : 0);
cl->refs++;
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, cl](ring_data_t *data) { handle_send(data->res, cl); };
my_uring_prep_sendmsg(sqe, peer_fd, &cl->write_msg, 0);
my_uring_prep_sendmsg(sqe, peer_fd, &cl->write_msg, cl->write_msg.msg_flags);
}
else
{
cl->write_msg.msg_iov = cl->send_list.data();
cl->write_msg.msg_iovlen = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
cl->write_msg.msg_flags = (cl->zerocopy_send && (cl->outbox_size/cl->send_list.size()) >= 4096 ? MSG_ZEROCOPY : 0);
cl->refs++;
int result = sendmsg(peer_fd, &cl->write_msg, MSG_NOSIGNAL);
int result = sendmsg(peer_fd, &cl->write_msg, MSG_NOSIGNAL | cl->write_msg.msg_flags);
if (result < 0)
{
result = -errno;
@@ -197,6 +216,62 @@ bool osd_messenger_t::try_send(osd_client_t *cl)
return true;
}
void osd_messenger_t::handle_zerocopy_notification(osd_client_t *cl, int res)
{
cl->refs--;
if (cl->peer_state == PEER_STOPPED)
{
if (cl->refs <= 0)
{
delete cl;
}
return;
}
if (res != 0)
{
return;
}
if (cl->zerocopy_notification_msg.msg_flags & MSG_CTRUNC)
{
fprintf(stderr, "zero-copy send notification truncated on client socket %d\n", cl->peer_fd);
return;
}
for (struct cmsghdr *cm = CMSG_FIRSTHDR(&cl->zerocopy_notification_msg); cm; cm = CMSG_NXTHDR(&cl->zerocopy_notification_msg, cm))
{
if (cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR)
{
struct sock_extended_err *serr = (struct sock_extended_err*)CMSG_DATA(cm);
if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
{
// completed sends numbered serr->ee_info .. serr->ee_data
int start = 0;
while (start < cl->zerocopy_sent.size() && cl->zerocopy_sent[start].nsend < serr->ee_info)
start++;
int end = start;
if (serr->ee_data < serr->ee_info)
{
// counter has wrapped around
while (end < cl->zerocopy_sent.size() && cl->zerocopy_sent[end].nsend >= cl->zerocopy_sent[start].nsend)
end++;
}
while (end < cl->zerocopy_sent.size() && cl->zerocopy_sent[end].nsend <= serr->ee_data)
end++;
if (end > start)
{
for (int i = start; i < end; i++)
{
delete cl->zerocopy_sent[i].op;
}
cl->zerocopy_sent.erase(
cl->zerocopy_sent.begin() + start,
cl->zerocopy_sent.begin() + end
);
}
}
}
}
}
void osd_messenger_t::send_replies()
{
for (int i = 0; i < write_ready_clients.size(); i++)
@@ -224,16 +299,19 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
}
return;
}
if (result < 0 && result != -EAGAIN && result != -EINTR)
if (result < 0 && result != -EAGAIN && result != -EINTR && result != -ENOBUFS)
{
// this is a client socket, so don't panic. just disconnect it
fprintf(stderr, "Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
stop_client(cl->peer_fd);
return;
}
bool used_zerocopy = false;
if (result >= 0)
{
used_zerocopy = (cl->write_msg.msg_flags & MSG_ZEROCOPY) ? true : false;
int done = 0;
int bytes_written = result;
while (result > 0 && done < cl->send_list.size())
{
iovec & iov = cl->send_list[done];
@@ -242,7 +320,19 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
if (cl->outbox[done].flags & MSGR_SENDP_FREE)
{
// Reply fully sent
delete cl->outbox[done].op;
if (!used_zerocopy)
{
delete cl->outbox[done].op;
}
else
{
// With zero-copy send the difference is that we must keep the buffer (i.e. the operation)
// allocated until we get send notification from MSG_ERRQUEUE
cl->zerocopy_sent.push_back((msgr_zc_not_t){
.op = cl->outbox[done].op,
.nsend = cl->zerocopy_notification_idx,
});
}
}
result -= iov.iov_len;
done++;
@@ -254,6 +344,11 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
break;
}
}
if (used_zerocopy)
{
cl->zerocopy_notification_idx++;
}
cl->outbox_size -= bytes_written;
if (done > 0)
{
cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+done);
@@ -263,8 +358,10 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
{
cl->send_list.insert(cl->send_list.end(), cl->next_send_list.begin(), cl->next_send_list.end());
cl->outbox.insert(cl->outbox.end(), cl->next_outbox.begin(), cl->next_outbox.end());
cl->outbox_size += cl->next_outbox_size;
cl->next_send_list.clear();
cl->next_outbox.clear();
cl->next_outbox_size = 0;
}
cl->write_state = cl->outbox.size() > 0 ? CL_WRITE_READY : 0;
#ifdef WITH_RDMA
@@ -287,4 +384,34 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
{
write_ready_clients.push_back(cl->peer_fd);
}
if (used_zerocopy && (cl->zerocopy_notification_idx-cl->zerocopy_notification_prev) >= 16 &&
cl->zerocopy_sent.size() > 0)
{
cl->zerocopy_notification_prev = cl->zerocopy_notification_idx;
cl->zerocopy_notification_msg = {
.msg_control = cl->zerocopy_notification_buf,
.msg_controllen = sizeof(cl->zerocopy_notification_buf),
};
cl->refs++;
io_uring_sqe* sqe = NULL;
if (ringloop && !use_sync_send_recv)
{
sqe = ringloop->get_sqe();
}
if (!sqe)
{
int res = recvmsg(cl->peer_fd, &cl->zerocopy_notification_msg, MSG_ERRQUEUE|MSG_DONTWAIT);
if (res < 0)
{
res = -errno;
}
handle_zerocopy_notification(cl, res);
}
else
{
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, cl](ring_data_t *data) { handle_zerocopy_notification(cl, data->res); };
my_uring_prep_recvmsg(sqe, cl->peer_fd, &cl->zerocopy_notification_msg, MSG_ERRQUEUE);
}
}
}
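
The notification handler in this file walks `zerocopy_sent` to find the entries whose `nsend` falls between `ee_info` and `ee_data`. The Linux kernel reports that range as inclusive and counts sends with an unsigned 32-bit counter that wraps around. As a minimal sketch (not the code used in this commit), the membership test can be written with modular arithmetic so that the wrapped and non-wrapped cases share a single comparison. The name `zc_in_range` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: true if send number n lies in the inclusive range
// [lo, hi] of a 32-bit counter that may have wrapped around (hi < lo).
// Unsigned subtraction is modulo 2^32, so the distance from lo to n
// can be compared directly with the distance from lo to hi.
constexpr bool zc_in_range(uint32_t n, uint32_t lo, uint32_t hi)
{
    return (uint32_t)(n - lo) <= (uint32_t)(hi - lo);
}

// Ordinary range 3..7
static_assert(zc_in_range(5, 3, 7), "inside the range");
static_assert(!zc_in_range(8, 3, 7), "above the range");
// Wrapped range 0xFFFFFFFE..1 covers 0xFFFFFFFE, 0xFFFFFFFF, 0 and 1
static_assert(zc_in_range(0xFFFFFFFFu, 0xFFFFFFFEu, 1), "before the wrap");
static_assert(zc_in_range(0, 0xFFFFFFFEu, 1), "after the wrap");
static_assert(!zc_in_range(2, 0xFFFFFFFEu, 1), "past the end");
```

The checks run at compile time. The trade-off compared to the index scan in the diff is that this tests one entry at a time and does not rely on `zerocopy_sent` being ordered.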

View File

@@ -189,7 +189,7 @@ public:
uint64_t pool = cfg["pool"].uint64_value();
if (pool)
{
inode = (inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
inode = (inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}
if (!(inode >> (64-POOL_ID_BITS)))
{
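
The `1l` to `(uint64_t)1` replacements in this comparison matter on 32-bit targets, where `long` is 32 bits wide and shifting it by `64-POOL_ID_BITS` bits is undefined behaviour. The following is a small illustrative sketch of the same mask arithmetic; the value `POOL_ID_BITS = 16` and the helper names are assumptions made for the example and are not taken from the diff.

```cpp
#include <cstdint>

// Assumption for illustration only: the top 16 bits of an inode number
// hold the pool ID. The real value comes from the project headers.
constexpr int POOL_ID_BITS = 16;

// Mask selecting the inode number within its pool. The cast makes the
// shift operate on a 64-bit value regardless of sizeof(long).
constexpr uint64_t inode_mask()
{
    return ((uint64_t)1 << (64-POOL_ID_BITS)) - 1;
}

// Hypothetical helper: combine a pool ID and an in-pool inode number
constexpr uint64_t with_pool(uint64_t inode, uint64_t pool)
{
    return (inode & inode_mask()) | (pool << (64-POOL_ID_BITS));
}

// Hypothetical helper: extract the pool ID back
constexpr uint64_t pool_of(uint64_t inode)
{
    return inode >> (64-POOL_ID_BITS);
}

static_assert(inode_mask() == 0x0000FFFFFFFFFFFFull, "48 low bits set");
static_assert(pool_of(with_pool(5, 3)) == 3, "pool ID round-trips");
static_assert((with_pool(5, 3) & inode_mask()) == 5, "inode number round-trips");
```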

1690
src/nfs/nfs.h Normal file

File diff suppressed because it is too large

1380
src/nfs/nfs.x Normal file

File diff suppressed because it is too large

2954
src/nfs/nfs_xdr.cpp Normal file

File diff suppressed because it is too large

190
src/nfs/portmap.h Normal file

@@ -0,0 +1,190 @@
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#ifndef _PORTMAP_H_RPCGEN
#define _PORTMAP_H_RPCGEN
#include "xdr_impl.h"
#ifdef __cplusplus
extern "C" {
#endif
#define PMAP_PORT 111
struct pmap2_mapping {
u_int prog;
u_int vers;
u_int prot;
u_int port;
};
typedef struct pmap2_mapping pmap2_mapping;
struct pmap2_call_args {
u_int prog;
u_int vers;
u_int proc;
xdr_string_t args;
};
typedef struct pmap2_call_args pmap2_call_args;
struct pmap2_call_result {
u_int port;
xdr_string_t res;
};
typedef struct pmap2_call_result pmap2_call_result;
struct pmap2_mapping_list {
pmap2_mapping map;
struct pmap2_mapping_list *next;
};
typedef struct pmap2_mapping_list pmap2_mapping_list;
struct pmap2_dump_result {
struct pmap2_mapping_list *list;
};
typedef struct pmap2_dump_result pmap2_dump_result;
struct pmap3_string_result {
xdr_string_t addr;
};
typedef struct pmap3_string_result pmap3_string_result;
struct pmap3_mapping {
u_int prog;
u_int vers;
xdr_string_t netid;
xdr_string_t addr;
xdr_string_t owner;
};
typedef struct pmap3_mapping pmap3_mapping;
struct pmap3_mapping_list {
pmap3_mapping map;
struct pmap3_mapping_list *next;
};
typedef struct pmap3_mapping_list pmap3_mapping_list;
struct pmap3_dump_result {
struct pmap3_mapping_list *list;
};
typedef struct pmap3_dump_result pmap3_dump_result;
struct pmap3_call_args {
u_int prog;
u_int vers;
u_int proc;
xdr_string_t args;
};
typedef struct pmap3_call_args pmap3_call_args;
struct pmap3_call_result {
u_int port;
xdr_string_t res;
};
typedef struct pmap3_call_result pmap3_call_result;
struct pmap3_netbuf {
u_int maxlen;
xdr_string_t buf;
};
typedef struct pmap3_netbuf pmap3_netbuf;
typedef pmap2_mapping PMAP2SETargs;
typedef pmap2_mapping PMAP2UNSETargs;
typedef pmap2_mapping PMAP2GETPORTargs;
typedef pmap2_call_args PMAP2CALLITargs;
typedef pmap2_call_result PMAP2CALLITres;
typedef pmap2_dump_result PMAP2DUMPres;
typedef pmap3_mapping PMAP3SETargs;
typedef pmap3_mapping PMAP3UNSETargs;
typedef pmap3_mapping PMAP3GETADDRargs;
typedef pmap3_string_result PMAP3GETADDRres;
typedef pmap3_dump_result PMAP3DUMPres;
typedef pmap3_call_result PMAP3CALLITargs;
typedef pmap3_call_result PMAP3CALLITres;
typedef pmap3_netbuf PMAP3UADDR2TADDRres;
typedef pmap3_netbuf PMAP3TADDR2UADDRargs;
typedef pmap3_string_result PMAP3TADDR2UADDRres;
#define PMAP_PROGRAM 100000
#define PMAP_V2 2
#define PMAP2_NULL 0
#define PMAP2_SET 1
#define PMAP2_UNSET 2
#define PMAP2_GETPORT 3
#define PMAP2_DUMP 4
#define PMAP2_CALLIT 5
#define PMAP_V3 3
#define PMAP3_NULL 0
#define PMAP3_SET 1
#define PMAP3_UNSET 2
#define PMAP3_GETADDR 3
#define PMAP3_DUMP 4
#define PMAP3_CALLIT 5
#define PMAP3_GETTIME 6
#define PMAP3_UADDR2TADDR 7
#define PMAP3_TADDR2UADDR 8
/* the xdr functions */
extern bool_t xdr_pmap2_mapping (XDR *, pmap2_mapping*);
extern bool_t xdr_pmap2_call_args (XDR *, pmap2_call_args*);
extern bool_t xdr_pmap2_call_result (XDR *, pmap2_call_result*);
extern bool_t xdr_pmap2_mapping_list (XDR *, pmap2_mapping_list*);
extern bool_t xdr_pmap2_dump_result (XDR *, pmap2_dump_result*);
extern bool_t xdr_pmap3_string_result (XDR *, pmap3_string_result*);
extern bool_t xdr_pmap3_mapping (XDR *, pmap3_mapping*);
extern bool_t xdr_pmap3_mapping_list (XDR *, pmap3_mapping_list*);
extern bool_t xdr_pmap3_dump_result (XDR *, pmap3_dump_result*);
extern bool_t xdr_pmap3_call_args (XDR *, pmap3_call_args*);
extern bool_t xdr_pmap3_call_result (XDR *, pmap3_call_result*);
extern bool_t xdr_pmap3_netbuf (XDR *, pmap3_netbuf*);
extern bool_t xdr_PMAP2SETargs (XDR *, PMAP2SETargs*);
extern bool_t xdr_PMAP2UNSETargs (XDR *, PMAP2UNSETargs*);
extern bool_t xdr_PMAP2GETPORTargs (XDR *, PMAP2GETPORTargs*);
extern bool_t xdr_PMAP2CALLITargs (XDR *, PMAP2CALLITargs*);
extern bool_t xdr_PMAP2CALLITres (XDR *, PMAP2CALLITres*);
extern bool_t xdr_PMAP2DUMPres (XDR *, PMAP2DUMPres*);
extern bool_t xdr_PMAP3SETargs (XDR *, PMAP3SETargs*);
extern bool_t xdr_PMAP3UNSETargs (XDR *, PMAP3UNSETargs*);
extern bool_t xdr_PMAP3GETADDRargs (XDR *, PMAP3GETADDRargs*);
extern bool_t xdr_PMAP3GETADDRres (XDR *, PMAP3GETADDRres*);
extern bool_t xdr_PMAP3DUMPres (XDR *, PMAP3DUMPres*);
extern bool_t xdr_PMAP3CALLITargs (XDR *, PMAP3CALLITargs*);
extern bool_t xdr_PMAP3CALLITres (XDR *, PMAP3CALLITres*);
extern bool_t xdr_PMAP3UADDR2TADDRres (XDR *, PMAP3UADDR2TADDRres*);
extern bool_t xdr_PMAP3TADDR2UADDRargs (XDR *, PMAP3TADDR2UADDRargs*);
extern bool_t xdr_PMAP3TADDR2UADDRres (XDR *, PMAP3TADDR2UADDRres*);
#ifdef __cplusplus
}
#endif
#endif /* !_PORTMAP_H_RPCGEN */

168
src/nfs/portmap.x Normal file

@@ -0,0 +1,168 @@
/*
Copyright (c) 2014, Ronnie Sahlberg
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies,
either expressed or implied, of the FreeBSD Project.
*/
const PMAP_PORT = 111; /* portmapper port number */
struct pmap2_mapping {
unsigned int prog;
unsigned int vers;
unsigned int prot;
unsigned int port;
};
struct pmap2_call_args {
unsigned int prog;
unsigned int vers;
unsigned int proc;
opaque args<>;
};
struct pmap2_call_result {
unsigned int port;
opaque res<>;
};
struct pmap2_mapping_list {
pmap2_mapping map;
pmap2_mapping_list *next;
};
struct pmap2_dump_result {
struct pmap2_mapping_list *list;
};
struct pmap3_string_result {
string addr<>;
};
struct pmap3_mapping {
unsigned int prog;
unsigned int vers;
string netid<>;
string addr<>;
string owner<>;
};
struct pmap3_mapping_list {
pmap3_mapping map;
pmap3_mapping_list *next;
};
struct pmap3_dump_result {
struct pmap3_mapping_list *list;
};
struct pmap3_call_args {
unsigned int prog;
unsigned int vers;
unsigned int proc;
opaque args<>;
};
struct pmap3_call_result {
unsigned int port;
opaque res<>;
};
struct pmap3_netbuf {
unsigned int maxlen;
/* This pretty much contains a sockaddr_storage.
* Beware differences in endianness for ss_family
* and whether or not ss_len exists.
*/
opaque buf<>;
};
typedef pmap2_mapping PMAP2SETargs;
typedef pmap2_mapping PMAP2UNSETargs;
typedef pmap2_mapping PMAP2GETPORTargs;
typedef pmap2_call_args PMAP2CALLITargs;
typedef pmap2_call_result PMAP2CALLITres;
typedef pmap2_dump_result PMAP2DUMPres;
typedef pmap3_mapping PMAP3SETargs;
typedef pmap3_mapping PMAP3UNSETargs;
typedef pmap3_mapping PMAP3GETADDRargs;
typedef pmap3_string_result PMAP3GETADDRres;
typedef pmap3_dump_result PMAP3DUMPres;
typedef pmap3_call_result PMAP3CALLITargs;
typedef pmap3_call_result PMAP3CALLITres;
typedef pmap3_netbuf PMAP3UADDR2TADDRres;
typedef pmap3_netbuf PMAP3TADDR2UADDRargs;
typedef pmap3_string_result PMAP3TADDR2UADDRres;
program PMAP_PROGRAM {
version PMAP_V2 {
void
PMAP2_NULL(void) = 0;
uint32_t
PMAP2_SET(PMAP2SETargs) = 1;
uint32_t
PMAP2_UNSET(PMAP2UNSETargs) = 2;
uint32_t
PMAP2_GETPORT(PMAP2GETPORTargs) = 3;
PMAP2DUMPres
PMAP2_DUMP(void) = 4;
PMAP2CALLITres
PMAP2_CALLIT(PMAP2CALLITargs) = 5;
} = 2;
version PMAP_V3 {
void
PMAP3_NULL(void) = 0;
uint32_t
PMAP3_SET(PMAP3SETargs) = 1;
uint32_t
PMAP3_UNSET(PMAP3UNSETargs) = 2;
PMAP3GETADDRres
PMAP3_GETADDR(PMAP3GETADDRargs) = 3;
PMAP3DUMPres
PMAP3_DUMP(void) = 4;
PMAP3CALLITres
PMAP3_CALLIT(PMAP3CALLITargs) = 5;
uint32_t
PMAP3_GETTIME(void) = 6;
PMAP3UADDR2TADDRres
PMAP3_UADDR2TADDR(string) = 7;
PMAP3TADDR2UADDRres
PMAP3_TADDR2UADDR(PMAP3TADDR2UADDRargs) = 8;
} = 3;
} = 100000;

406
src/nfs/portmap_xdr.cpp Normal file

@@ -0,0 +1,406 @@
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#include "portmap.h"
#include "xdr_impl_inline.h"
bool_t
xdr_pmap2_mapping (XDR *xdrs, pmap2_mapping *objp)
{
if (xdrs->x_op == XDR_ENCODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->prot))
return FALSE;
if (!xdr_u_int (xdrs, &objp->port))
return FALSE;
} else {
IXDR_PUT_U_LONG(buf, objp->prog);
IXDR_PUT_U_LONG(buf, objp->vers);
IXDR_PUT_U_LONG(buf, objp->prot);
IXDR_PUT_U_LONG(buf, objp->port);
}
return TRUE;
} else if (xdrs->x_op == XDR_DECODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->prot))
return FALSE;
if (!xdr_u_int (xdrs, &objp->port))
return FALSE;
} else {
objp->prog = IXDR_GET_U_LONG(buf);
objp->vers = IXDR_GET_U_LONG(buf);
objp->prot = IXDR_GET_U_LONG(buf);
objp->port = IXDR_GET_U_LONG(buf);
}
return TRUE;
}
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->prot))
return FALSE;
if (!xdr_u_int (xdrs, &objp->port))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap2_call_args (XDR *xdrs, pmap2_call_args *objp)
{
if (xdrs->x_op == XDR_ENCODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
} else {
IXDR_PUT_U_LONG(buf, objp->prog);
IXDR_PUT_U_LONG(buf, objp->vers);
IXDR_PUT_U_LONG(buf, objp->proc);
}
if (!xdr_bytes(xdrs, &objp->args, ~0))
return FALSE;
return TRUE;
} else if (xdrs->x_op == XDR_DECODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
} else {
objp->prog = IXDR_GET_U_LONG(buf);
objp->vers = IXDR_GET_U_LONG(buf);
objp->proc = IXDR_GET_U_LONG(buf);
}
if (!xdr_bytes(xdrs, &objp->args, ~0))
return FALSE;
return TRUE;
}
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
if (!xdr_bytes(xdrs, &objp->args, ~0))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap2_call_result (XDR *xdrs, pmap2_call_result *objp)
{
if (!xdr_u_int (xdrs, &objp->port))
return FALSE;
if (!xdr_bytes(xdrs, &objp->res, ~0))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap2_mapping_list (XDR *xdrs, pmap2_mapping_list *objp)
{
if (!xdr_pmap2_mapping (xdrs, &objp->map))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->next, sizeof (pmap2_mapping_list), (xdrproc_t) xdr_pmap2_mapping_list))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap2_dump_result (XDR *xdrs, pmap2_dump_result *objp)
{
if (!xdr_pointer (xdrs, (char **)&objp->list, sizeof (pmap2_mapping_list), (xdrproc_t) xdr_pmap2_mapping_list))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap3_string_result (XDR *xdrs, pmap3_string_result *objp)
{
if (!xdr_string (xdrs, &objp->addr, ~0))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap3_mapping (XDR *xdrs, pmap3_mapping *objp)
{
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_string (xdrs, &objp->netid, ~0))
return FALSE;
if (!xdr_string (xdrs, &objp->addr, ~0))
return FALSE;
if (!xdr_string (xdrs, &objp->owner, ~0))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap3_mapping_list (XDR *xdrs, pmap3_mapping_list *objp)
{
if (!xdr_pmap3_mapping (xdrs, &objp->map))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->next, sizeof (pmap3_mapping_list), (xdrproc_t) xdr_pmap3_mapping_list))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap3_dump_result (XDR *xdrs, pmap3_dump_result *objp)
{
if (!xdr_pointer (xdrs, (char **)&objp->list, sizeof (pmap3_mapping_list), (xdrproc_t) xdr_pmap3_mapping_list))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap3_call_args (XDR *xdrs, pmap3_call_args *objp)
{
if (xdrs->x_op == XDR_ENCODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
} else {
IXDR_PUT_U_LONG(buf, objp->prog);
IXDR_PUT_U_LONG(buf, objp->vers);
IXDR_PUT_U_LONG(buf, objp->proc);
}
if (!xdr_bytes(xdrs, &objp->args, ~0))
return FALSE;
return TRUE;
} else if (xdrs->x_op == XDR_DECODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
} else {
objp->prog = IXDR_GET_U_LONG(buf);
objp->vers = IXDR_GET_U_LONG(buf);
objp->proc = IXDR_GET_U_LONG(buf);
}
if (!xdr_bytes(xdrs, &objp->args, ~0))
return FALSE;
return TRUE;
}
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
if (!xdr_bytes(xdrs, &objp->args, ~0))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap3_call_result (XDR *xdrs, pmap3_call_result *objp)
{
if (!xdr_u_int (xdrs, &objp->port))
return FALSE;
if (!xdr_bytes(xdrs, &objp->res, ~0))
return FALSE;
return TRUE;
}
bool_t
xdr_pmap3_netbuf (XDR *xdrs, pmap3_netbuf *objp)
{
if (!xdr_u_int (xdrs, &objp->maxlen))
return FALSE;
if (!xdr_bytes(xdrs, &objp->buf, ~0))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP2SETargs (XDR *xdrs, PMAP2SETargs *objp)
{
if (!xdr_pmap2_mapping (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP2UNSETargs (XDR *xdrs, PMAP2UNSETargs *objp)
{
if (!xdr_pmap2_mapping (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP2GETPORTargs (XDR *xdrs, PMAP2GETPORTargs *objp)
{
if (!xdr_pmap2_mapping (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP2CALLITargs (XDR *xdrs, PMAP2CALLITargs *objp)
{
if (!xdr_pmap2_call_args (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP2CALLITres (XDR *xdrs, PMAP2CALLITres *objp)
{
if (!xdr_pmap2_call_result (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP2DUMPres (XDR *xdrs, PMAP2DUMPres *objp)
{
if (!xdr_pmap2_dump_result (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3SETargs (XDR *xdrs, PMAP3SETargs *objp)
{
if (!xdr_pmap3_mapping (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3UNSETargs (XDR *xdrs, PMAP3UNSETargs *objp)
{
if (!xdr_pmap3_mapping (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3GETADDRargs (XDR *xdrs, PMAP3GETADDRargs *objp)
{
if (!xdr_pmap3_mapping (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3GETADDRres (XDR *xdrs, PMAP3GETADDRres *objp)
{
if (!xdr_pmap3_string_result (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3DUMPres (XDR *xdrs, PMAP3DUMPres *objp)
{
if (!xdr_pmap3_dump_result (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3CALLITargs (XDR *xdrs, PMAP3CALLITargs *objp)
{
if (!xdr_pmap3_call_result (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3CALLITres (XDR *xdrs, PMAP3CALLITres *objp)
{
if (!xdr_pmap3_call_result (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3UADDR2TADDRres (XDR *xdrs, PMAP3UADDR2TADDRres *objp)
{
if (!xdr_pmap3_netbuf (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3TADDR2UADDRargs (XDR *xdrs, PMAP3TADDR2UADDRargs *objp)
{
if (!xdr_pmap3_netbuf (xdrs, objp))
return FALSE;
return TRUE;
}
bool_t
xdr_PMAP3TADDR2UADDRres (XDR *xdrs, PMAP3TADDR2UADDRres *objp)
{
if (!xdr_pmap3_string_result (xdrs, objp))
return FALSE;
return TRUE;
}
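
For orientation, `xdr_pmap2_mapping` in this file serializes four unsigned integers in sequence. In XDR each unsigned integer occupies 4 bytes in big-endian order, so a `pmap2_mapping` takes 16 bytes on the wire. The sketch below shows that layout by hand. It is independent of the `xdr_impl.h` implementation in this repository, and `mapping_t`, `put_u32`, `get_u32`, `encode_mapping`, `decode_mapping` and `self_test` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for pmap2_mapping
struct mapping_t
{
    uint32_t prog, vers, prot, port;
};

// Store a 32-bit value in big-endian (network) byte order
static void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

// Load a 32-bit big-endian value
static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
        ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

// Encode the four fields back to back; returns the number of bytes written
static size_t encode_mapping(const mapping_t & m, uint8_t *out)
{
    put_u32(out, m.prog);
    put_u32(out+4, m.vers);
    put_u32(out+8, m.prot);
    put_u32(out+12, m.port);
    return 16;
}

static mapping_t decode_mapping(const uint8_t *in)
{
    return (mapping_t){ get_u32(in), get_u32(in+4), get_u32(in+8), get_u32(in+12) };
}

// Round-trip check using NFS v3 over TCP: program 100003, protocol 6, port 2049
static void self_test()
{
    uint8_t buf[16];
    mapping_t m = { 100003, 3, 6, 2049 };
    assert(encode_mapping(m, buf) == 16);
    // 100003 == 0x000186A3
    assert(buf[0] == 0x00 && buf[1] == 0x01 && buf[2] == 0x86 && buf[3] == 0xA3);
    mapping_t d = decode_mapping(buf);
    assert(d.prog == m.prog && d.vers == m.vers && d.prot == m.prot && d.port == m.port);
}
```

Calling `self_test()` from a test harness exercises both directions. Variable-length fields such as `opaque args<>` additionally carry a length prefix and padding, which this sketch does not cover.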

160
src/nfs/rpc.h Normal file

@@ -0,0 +1,160 @@
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#ifndef _RPC_H_RPCGEN
#define _RPC_H_RPCGEN
#include "xdr_impl.h"
#ifdef __cplusplus
extern "C" {
#endif
#define RPC_MSG_VERSION 2
enum rpc_auth_flavor {
RPC_AUTH_NONE = 0,
RPC_AUTH_SYS = 1,
RPC_AUTH_SHORT = 2,
RPC_AUTH_DH = 3,
RPC_RPCSEC_GSS = 6,
};
typedef enum rpc_auth_flavor rpc_auth_flavor;
enum rpc_msg_type {
RPC_CALL = 0,
RPC_REPLY = 1,
};
typedef enum rpc_msg_type rpc_msg_type;
enum rpc_reply_stat {
RPC_MSG_ACCEPTED = 0,
RPC_MSG_DENIED = 1,
};
typedef enum rpc_reply_stat rpc_reply_stat;
enum rpc_accept_stat {
RPC_SUCCESS = 0,
RPC_PROG_UNAVAIL = 1,
RPC_PROG_MISMATCH = 2,
RPC_PROC_UNAVAIL = 3,
RPC_GARBAGE_ARGS = 4,
RPC_SYSTEM_ERR = 5,
};
typedef enum rpc_accept_stat rpc_accept_stat;
enum rpc_reject_stat {
RPC_MISMATCH = 0,
RPC_AUTH_ERROR = 1,
};
typedef enum rpc_reject_stat rpc_reject_stat;
enum rpc_auth_stat {
RPC_AUTH_OK = 0,
RPC_AUTH_BADCRED = 1,
RPC_AUTH_REJECTEDCRED = 2,
RPC_AUTH_BADVERF = 3,
RPC_AUTH_REJECTEDVERF = 4,
RPC_AUTH_TOOWEAK = 5,
RPC_AUTH_INVALIDRESP = 6,
RPC_AUTH_FAILED = 7,
};
typedef enum rpc_auth_stat rpc_auth_stat;
struct rpc_opaque_auth {
rpc_auth_flavor flavor;
xdr_string_t body;
};
typedef struct rpc_opaque_auth rpc_opaque_auth;
struct rpc_call_body {
u_int rpcvers;
u_int prog;
u_int vers;
u_int proc;
rpc_opaque_auth cred;
rpc_opaque_auth verf;
};
typedef struct rpc_call_body rpc_call_body;
struct rpc_mismatch_info {
u_int min_version;
u_int max_version;
};
typedef struct rpc_mismatch_info rpc_mismatch_info;
struct rpc_accepted_reply_body {
rpc_accept_stat stat;
union {
rpc_mismatch_info mismatch_info;
};
};
typedef struct rpc_accepted_reply_body rpc_accepted_reply_body;
struct rpc_accepted_reply {
rpc_opaque_auth verf;
rpc_accepted_reply_body reply_data;
};
typedef struct rpc_accepted_reply rpc_accepted_reply;
struct rpc_rejected_reply {
rpc_reject_stat stat;
union {
rpc_mismatch_info mismatch_info;
rpc_auth_stat auth_stat;
};
};
typedef struct rpc_rejected_reply rpc_rejected_reply;
struct rpc_reply_body {
rpc_reply_stat stat;
union {
rpc_accepted_reply areply;
rpc_rejected_reply rreply;
};
};
typedef struct rpc_reply_body rpc_reply_body;
struct rpc_msg_body {
rpc_msg_type dir;
union {
rpc_call_body cbody;
rpc_reply_body rbody;
};
};
typedef struct rpc_msg_body rpc_msg_body;
struct rpc_msg {
u_int xid;
rpc_msg_body body;
};
typedef struct rpc_msg rpc_msg;
/* the xdr functions */
extern bool_t xdr_rpc_auth_flavor (XDR *, rpc_auth_flavor*);
extern bool_t xdr_rpc_msg_type (XDR *, rpc_msg_type*);
extern bool_t xdr_rpc_reply_stat (XDR *, rpc_reply_stat*);
extern bool_t xdr_rpc_accept_stat (XDR *, rpc_accept_stat*);
extern bool_t xdr_rpc_reject_stat (XDR *, rpc_reject_stat*);
extern bool_t xdr_rpc_auth_stat (XDR *, rpc_auth_stat*);
extern bool_t xdr_rpc_opaque_auth (XDR *, rpc_opaque_auth*);
extern bool_t xdr_rpc_call_body (XDR *, rpc_call_body*);
extern bool_t xdr_rpc_mismatch_info (XDR *, rpc_mismatch_info*);
extern bool_t xdr_rpc_accepted_reply_body (XDR *, rpc_accepted_reply_body*);
extern bool_t xdr_rpc_accepted_reply (XDR *, rpc_accepted_reply*);
extern bool_t xdr_rpc_rejected_reply (XDR *, rpc_rejected_reply*);
extern bool_t xdr_rpc_reply_body (XDR *, rpc_reply_body*);
extern bool_t xdr_rpc_msg_body (XDR *, rpc_msg_body*);
extern bool_t xdr_rpc_msg (XDR *, rpc_msg*);
#ifdef __cplusplus
}
#endif
#endif /* !_RPC_H_RPCGEN */

113
src/nfs/rpc.x Normal file

@@ -0,0 +1,113 @@
/* Based on RFC 5531 - RPC: Remote Procedure Call Protocol Specification Version 2 */
const RPC_MSG_VERSION = 2;
enum rpc_auth_flavor {
RPC_AUTH_NONE = 0,
RPC_AUTH_SYS = 1,
RPC_AUTH_SHORT = 2,
RPC_AUTH_DH = 3,
RPC_RPCSEC_GSS = 6
};
enum rpc_msg_type {
RPC_CALL = 0,
RPC_REPLY = 1
};
enum rpc_reply_stat {
RPC_MSG_ACCEPTED = 0,
RPC_MSG_DENIED = 1
};
enum rpc_accept_stat {
RPC_SUCCESS = 0,
RPC_PROG_UNAVAIL = 1,
RPC_PROG_MISMATCH = 2,
RPC_PROC_UNAVAIL = 3,
RPC_GARBAGE_ARGS = 4,
RPC_SYSTEM_ERR = 5
};
enum rpc_reject_stat {
RPC_MISMATCH = 0,
RPC_AUTH_ERROR = 1
};
enum rpc_auth_stat {
RPC_AUTH_OK = 0,
/*
* failed at remote end
*/
RPC_AUTH_BADCRED = 1, /* bogus credentials (seal broken) */
RPC_AUTH_REJECTEDCRED = 2, /* client should begin new session */
RPC_AUTH_BADVERF = 3, /* bogus verifier (seal broken) */
RPC_AUTH_REJECTEDVERF = 4, /* verifier expired or was replayed */
RPC_AUTH_TOOWEAK = 5, /* rejected due to security reasons */
/*
* failed locally
*/
RPC_AUTH_INVALIDRESP = 6, /* bogus response verifier */
RPC_AUTH_FAILED = 7 /* some unknown reason */
};
struct rpc_opaque_auth {
rpc_auth_flavor flavor;
opaque body<400>;
};
struct rpc_call_body {
u_int rpcvers;
u_int prog;
u_int vers;
u_int proc;
rpc_opaque_auth cred;
rpc_opaque_auth verf;
/* procedure-specific parameters start here */
};
struct rpc_mismatch_info {
unsigned int min_version;
unsigned int max_version;
};
union rpc_accepted_reply_body switch (rpc_accept_stat stat) {
case RPC_SUCCESS:
void;
/* procedure-specific results start here */
case RPC_PROG_MISMATCH:
rpc_mismatch_info mismatch_info;
default:
void;
};
struct rpc_accepted_reply {
rpc_opaque_auth verf;
rpc_accepted_reply_body reply_data;
};
union rpc_rejected_reply switch (rpc_reject_stat stat) {
case RPC_MISMATCH:
rpc_mismatch_info mismatch_info;
case RPC_AUTH_ERROR:
rpc_auth_stat auth_stat;
};
union rpc_reply_body switch (rpc_reply_stat stat) {
case RPC_MSG_ACCEPTED:
rpc_accepted_reply areply;
case RPC_MSG_DENIED:
rpc_rejected_reply rreply;
};
union rpc_msg_body switch (rpc_msg_type dir) {
case RPC_CALL:
rpc_call_body cbody;
case RPC_REPLY:
rpc_reply_body rbody;
};
struct rpc_msg {
u_int xid;
rpc_msg_body body;
};

43
src/nfs/rpc_impl.h Normal file

@@ -0,0 +1,43 @@
#pragma once
#include "rpc.h"
struct rpc_op_t;
// Handler should return 1 if the request is processed asynchronously
// and requires the incoming message to not be freed until processing ends,
// 0 otherwise.
typedef int (*rpc_handler_t)(void *opaque, rpc_op_t *rop);
struct rpc_service_proc_t
{
uint32_t prog;
uint32_t vers;
uint32_t proc;
rpc_handler_t handler_fn;
xdrproc_t req_fn;
uint32_t req_size;
xdrproc_t resp_fn;
uint32_t resp_size;
void *opaque;
};
inline bool operator < (const rpc_service_proc_t & a, const rpc_service_proc_t & b)
{
return a.prog < b.prog || a.prog == b.prog && (a.vers < b.vers || a.vers == b.vers && a.proc < b.proc);
}
struct rpc_op_t
{
void *client;
uint8_t *buffer;
XDR *xdrs;
rpc_msg in_msg, out_msg;
void *request;
void *reply;
xdrproc_t reply_fn;
uint32_t reply_marker;
bool referenced;
};
void rpc_queue_reply(rpc_op_t *rop);
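
Note on `reply_marker`: for RPC over TCP, RFC 5531 record marking prefixes every fragment with a 4-byte big-endian header whose top bit flags the last fragment and whose low 31 bits hold the fragment length. A minimal sketch of that header follows, assuming `reply_marker` holds such a value (this part of the diff does not show how it is filled). The helper names are hypothetical and not part of the patch.

```cpp
#include <cstdint>

// Hypothetical helpers, not part of the patch.
// Build a record marker in host byte order: top bit = last fragment,
// low 31 bits = fragment length. Convert with htobe32() before sending.
static uint32_t make_record_marker(uint32_t frag_len, bool last)
{
    return (frag_len & 0x7fffffffu) | (last ? 0x80000000u : 0u);
}

// True if the marker (host byte order) flags the last fragment of a record
static bool marker_is_last(uint32_t marker)
{
    return (marker & 0x80000000u) != 0;
}

// Fragment length stored in the marker (host byte order)
static uint32_t marker_length(uint32_t marker)
{
    return marker & 0x7fffffffu;
}
```

A reply sent as a single fragment would use `make_record_marker(total_len, true)`.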

253
src/nfs/rpc_xdr.cpp Normal file

@@ -0,0 +1,253 @@
/*
* Please do not edit this file.
* It was generated using rpcgen.
*/
#include "rpc.h"
#include "xdr_impl_inline.h"
bool_t
xdr_rpc_auth_flavor (XDR *xdrs, rpc_auth_flavor *objp)
{
if (!xdr_enum (xdrs, (enum_t *) objp))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_msg_type (XDR *xdrs, rpc_msg_type *objp)
{
if (!xdr_enum (xdrs, (enum_t *) objp))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_reply_stat (XDR *xdrs, rpc_reply_stat *objp)
{
if (!xdr_enum (xdrs, (enum_t *) objp))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_accept_stat (XDR *xdrs, rpc_accept_stat *objp)
{
if (!xdr_enum (xdrs, (enum_t *) objp))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_reject_stat (XDR *xdrs, rpc_reject_stat *objp)
{
if (!xdr_enum (xdrs, (enum_t *) objp))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_auth_stat (XDR *xdrs, rpc_auth_stat *objp)
{
if (!xdr_enum (xdrs, (enum_t *) objp))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_opaque_auth (XDR *xdrs, rpc_opaque_auth *objp)
{
if (!xdr_rpc_auth_flavor (xdrs, &objp->flavor))
return FALSE;
if (!xdr_bytes(xdrs, &objp->body, 400))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_call_body (XDR *xdrs, rpc_call_body *objp)
{
if (xdrs->x_op == XDR_ENCODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->rpcvers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
} else {
IXDR_PUT_U_LONG(buf, objp->rpcvers);
IXDR_PUT_U_LONG(buf, objp->prog);
IXDR_PUT_U_LONG(buf, objp->vers);
IXDR_PUT_U_LONG(buf, objp->proc);
}
if (!xdr_rpc_opaque_auth (xdrs, &objp->cred))
return FALSE;
if (!xdr_rpc_opaque_auth (xdrs, &objp->verf))
return FALSE;
return TRUE;
} else if (xdrs->x_op == XDR_DECODE) {
if (1) {
if (!xdr_u_int (xdrs, &objp->rpcvers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
} else {
objp->rpcvers = IXDR_GET_U_LONG(buf);
objp->prog = IXDR_GET_U_LONG(buf);
objp->vers = IXDR_GET_U_LONG(buf);
objp->proc = IXDR_GET_U_LONG(buf);
}
if (!xdr_rpc_opaque_auth (xdrs, &objp->cred))
return FALSE;
if (!xdr_rpc_opaque_auth (xdrs, &objp->verf))
return FALSE;
return TRUE;
}
if (!xdr_u_int (xdrs, &objp->rpcvers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->prog))
return FALSE;
if (!xdr_u_int (xdrs, &objp->vers))
return FALSE;
if (!xdr_u_int (xdrs, &objp->proc))
return FALSE;
if (!xdr_rpc_opaque_auth (xdrs, &objp->cred))
return FALSE;
if (!xdr_rpc_opaque_auth (xdrs, &objp->verf))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_mismatch_info (XDR *xdrs, rpc_mismatch_info *objp)
{
if (!xdr_u_int (xdrs, &objp->min_version))
return FALSE;
if (!xdr_u_int (xdrs, &objp->max_version))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_accepted_reply_body (XDR *xdrs, rpc_accepted_reply_body *objp)
{
if (!xdr_rpc_accept_stat (xdrs, &objp->stat))
return FALSE;
switch (objp->stat) {
case RPC_SUCCESS:
break;
case RPC_PROG_MISMATCH:
if (!xdr_rpc_mismatch_info (xdrs, &objp->mismatch_info))
return FALSE;
break;
default:
break;
}
return TRUE;
}
bool_t
xdr_rpc_accepted_reply (XDR *xdrs, rpc_accepted_reply *objp)
{
if (!xdr_rpc_opaque_auth (xdrs, &objp->verf))
return FALSE;
if (!xdr_rpc_accepted_reply_body (xdrs, &objp->reply_data))
return FALSE;
return TRUE;
}
bool_t
xdr_rpc_rejected_reply (XDR *xdrs, rpc_rejected_reply *objp)
{
if (!xdr_rpc_reject_stat (xdrs, &objp->stat))
return FALSE;
switch (objp->stat) {
case RPC_MISMATCH:
if (!xdr_rpc_mismatch_info (xdrs, &objp->mismatch_info))
return FALSE;
break;
case RPC_AUTH_ERROR:
if (!xdr_rpc_auth_stat (xdrs, &objp->auth_stat))
return FALSE;
break;
default:
return FALSE;
}
return TRUE;
}
bool_t
xdr_rpc_reply_body (XDR *xdrs, rpc_reply_body *objp)
{
if (!xdr_rpc_reply_stat (xdrs, &objp->stat))
return FALSE;
switch (objp->stat) {
case RPC_MSG_ACCEPTED:
if (!xdr_rpc_accepted_reply (xdrs, &objp->areply))
return FALSE;
break;
case RPC_MSG_DENIED:
if (!xdr_rpc_rejected_reply (xdrs, &objp->rreply))
return FALSE;
break;
default:
return FALSE;
}
return TRUE;
}
bool_t
xdr_rpc_msg_body (XDR *xdrs, rpc_msg_body *objp)
{
if (!xdr_rpc_msg_type (xdrs, &objp->dir))
return FALSE;
switch (objp->dir) {
case RPC_CALL:
if (!xdr_rpc_call_body (xdrs, &objp->cbody))
return FALSE;
break;
case RPC_REPLY:
if (!xdr_rpc_reply_body (xdrs, &objp->rbody))
return FALSE;
break;
default:
return FALSE;
}
return TRUE;
}
bool_t
xdr_rpc_msg (XDR *xdrs, rpc_msg *objp)
{
if (!xdr_u_int (xdrs, &objp->xid))
return FALSE;
if (!xdr_rpc_msg_body (xdrs, &objp->body))
return FALSE;
return TRUE;
}

48
src/nfs/run-rpcgen.sh Executable file

@@ -0,0 +1,48 @@
#!/bin/bash
set -e
# 1) remove all extern non-xdr functions (service, client)
# 2) use xdr_string_t for strings instead of char*
# 3) remove K&R #ifdefs
# 4) remove register int32_t* buf
# 5) remove union names
# 6) use xdr_string_t for opaques instead of u_int + char*
# 7) TODO: generate normal procedure stubs
run_rpcgen() {
rpcgen -h $1.x | \
perl -e '
{ local $/ = undef; $_ = <>; }
s/^extern(?!.*"C"|.*bool_t xdr.*XDR).*\n//gm;
s/#include <rpc\/rpc.h>/#include "xdr_impl.h"/;
s/^typedef char \*/typedef xdr_string_t /gm;
s/^(\s*)char \*(?!.*_val)/$1xdr_string_t /gm;
# remove union names
s/ \w+_u;/;/gs;
# use xdr_string_t for opaques
s/struct\s*\{\s*u_int\s+\w+_len;\s*char\s+\*\w+_val;\s*\}\s*/xdr_string_t /gs;
# remove stdc/k&r
s/^#if.*__STDC__.*//gm;
s/\n#else[^\n]*K&R.*?\n#endif[^\n]*K&R[^\n]*//gs;
print;' > $1.h
rpcgen -c $1.x | \
perl -pe '
s/register int32_t \*buf;\s*//g;
s/\bbuf\s*=[^;]+;\s*//g;
s/\bbuf\s*==\s*NULL/1/g;
# remove union names
s/(\.|->)\w+_u\./$1/g;
# use xdr_string_t for opaques
# xdr_bytes(xdrs, (char**)&objp->data.data_val, (u_int*)&objp->data.data_len, 400)
# -> xdr_bytes(xdrs, &objp->data, 400)
# xdr_bytes(xdrs, (char**)&objp->data_val, (u_int*)&objp->data_len, 400)
# -> xdr_bytes(xdrs, objp, 400)
s/xdr_bytes\s*\(\s*xdrs,\s*\(\s*char\s*\*\*\s*\)\s*([^()]+?)\.\w+_val\s*,\s*\(\s*u_int\s*\*\s*\)\s*\1\.\w+_len,/xdr_bytes(xdrs, $1,/gs;
s/xdr_bytes\s*\(\s*xdrs,\s*\(\s*char\s*\*\*\s*\)\s*&\s*([^()]+?)->\w+_val\s*,\s*\(\s*u_int\s*\*\s*\)\s*&\s*\1->\w+_len,/xdr_bytes(xdrs, $1,/gs;
# add include
if (/#include/) { $_ .= "#include \"xdr_impl_inline.h\"\n"; }' > ${1}_xdr.cpp
}
run_rpcgen nfs
run_rpcgen rpc
run_rpcgen portmap

107
src/nfs/xdr_impl.cpp Normal file

@@ -0,0 +1,107 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Efficient XDR implementation almost compatible with rpcgen (see run-rpcgen.sh)
#include "xdr_impl_inline.h"
XDR* xdr_create()
{
return new XDR;
}
void xdr_destroy(XDR* xdrs)
{
xdr_reset(xdrs);
delete xdrs;
}
void xdr_reset(XDR *xdrs)
{
for (auto buf: xdrs->allocs)
{
free(buf);
}
xdrs->buf = NULL;
xdrs->avail = 0;
xdrs->allocs.resize(0);
xdrs->in_linked_list.resize(0);
xdrs->cur_out.resize(0);
xdrs->last_end = 0;
xdrs->buf_list.resize(0);
}
int xdr_decode(XDR *xdrs, void *buf, unsigned size, xdrproc_t fn, void *data)
{
xdrs->x_op = XDR_DECODE;
xdrs->buf = (uint8_t*)buf;
xdrs->avail = size;
return fn(xdrs, data);
}
int xdr_encode(XDR *xdrs, xdrproc_t fn, void *data)
{
xdrs->x_op = XDR_ENCODE;
return fn(xdrs, data);
}
void xdr_encode_finish(XDR *xdrs, iovec **iov_list, unsigned *iov_count)
{
if (xdrs->last_end < xdrs->cur_out.size())
{
xdrs->buf_list.push_back((iovec){
.iov_base = 0,
.iov_len = xdrs->cur_out.size() - xdrs->last_end,
});
xdrs->last_end = xdrs->cur_out.size();
}
uint8_t *cur_buf = xdrs->cur_out.data();
for (auto & buf: xdrs->buf_list)
{
if (!buf.iov_base)
{
buf.iov_base = cur_buf;
cur_buf += buf.iov_len;
}
}
*iov_list = xdrs->buf_list.data();
*iov_count = xdrs->buf_list.size();
}
void xdr_dump_encoded(XDR *xdrs)
{
for (auto & buf: xdrs->buf_list)
{
for (int i = 0; i < buf.iov_len; i++)
printf("%02x", ((uint8_t*)buf.iov_base)[i]);
}
printf("\n");
}
void xdr_add_malloc(XDR *xdrs, void *buf)
{
xdrs->allocs.push_back(buf);
}
xdr_string_t xdr_copy_string(XDR *xdrs, const std::string & str)
{
char *cp = (char*)malloc_or_die(str.size()+1);
memcpy(cp, str.data(), str.size());
cp[str.size()] = 0;
xdr_add_malloc(xdrs, cp);
return (xdr_string_t){ str.size(), cp };
}
xdr_string_t xdr_copy_string(XDR *xdrs, const char *str)
{
return xdr_copy_string(xdrs, str, strlen(str));
}
xdr_string_t xdr_copy_string(XDR *xdrs, const char *str, size_t len)
{
char *cp = (char*)malloc_or_die(len+1);
memcpy(cp, str, len);
cp[len] = 0;
xdr_add_malloc(xdrs, cp);
return (xdr_string_t){ len, cp };
}

83
src/nfs/xdr_impl.h Normal file

@@ -0,0 +1,83 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Efficient XDR implementation almost compatible with rpcgen (see run-rpcgen.sh)
#pragma once
#include <sys/uio.h>
#include <stdint.h>
#include <string>
#define XDR_COPY_LENGTH 128
struct xdr_string_t
{
size_t size;
char *data;
operator std::string()
{
return std::string(data, size);
}
bool operator == (const char *str)
{
if (!str)
return false;
int i;
for (i = 0; i < size; i++)
if (!str[i] || str[i] != data[i])
return false;
if (str[i])
return false;
return true;
}
bool operator != (const char *str)
{
return !(*this == str);
}
};
typedef uint32_t u_int;
typedef uint32_t enum_t;
typedef uint32_t bool_t;
struct XDR;
typedef int (*xdrproc_t)(XDR *xdrs, void *data);
// Create an empty XDR object
XDR* xdr_create();
// Destroy the XDR object
void xdr_destroy(XDR* xdrs);
// Free resources from any previous xdr_decode/xdr_encode calls
void xdr_reset(XDR *xdrs);
// Try to decode <size> bytes from buffer <buf> using <fn>
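// Result may contain memory allocations that will be valid until the next call to xdr_{reset,destroy,decode,encode}
// Decoded strings and opaques point into <buf> itself, so <buf> must also stay valid while the result is in use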
int xdr_decode(XDR *xdrs, void *buf, unsigned size, xdrproc_t fn, void *data);
// Try to encode <data> using <fn>
// May be mixed with xdr_decode
// May be called multiple times to encode multiple parts of the same message
int xdr_encode(XDR *xdrs, xdrproc_t fn, void *data);
// Get the result of previous xdr_encodes as a list of <struct iovec>'s
// in <iov_list> (start) and <iov_count> (count).
// The resulting iov_list is valid until the next call to xdr_{reset,destroy}.
// It may contain references to the original data, so original data must not
// be freed until the result is fully processed (sent).
void xdr_encode_finish(XDR *xdrs, iovec **iov_list, unsigned *iov_count);
// Remember an allocated buffer to free it later on xdr_reset() or xdr_destroy()
void xdr_add_malloc(XDR *xdrs, void *buf);
xdr_string_t xdr_copy_string(XDR *xdrs, const std::string & str);
xdr_string_t xdr_copy_string(XDR *xdrs, const char *str);
xdr_string_t xdr_copy_string(XDR *xdrs, const char *str, size_t len);
void xdr_dump_encoded(XDR *xdrs);

309
src/nfs/xdr_impl_inline.h Normal file

@@ -0,0 +1,309 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Efficient XDR implementation almost compatible with rpcgen (see run-rpcgen.sh)
// XDR in a nutshell:
//
// int: big endian 32bit
// unsigned: BE 32bit
// enum: BE 32bit
// bool: BE 32bit 0/1
// hyper: BE 64bit
// unsigned hyper: BE 64bit
// float: BE float
// double: BE double
// quadruple: BE long double
// opaque[n] (fixed-length): bytes, padded to !(n%4)
// opaque (variable-length): BE 32bit length, then n bytes, padded to !(n%4)
// string: same as opaque
// array<T>[n] (fixed-length): n items of type T
// vector<T> (variable-length): BE 32bit length, then n items of type T
// struct: components in the same order as specified
// union: BE 32bit variant id, then variant of the union
// void: nothing (empty, 0 byte data)
// optional (XDR T*): BE 32bit 1/0, then T or nothing
// linked list: sequence of optional entries
//
// RPC over TCP:
//
// BE 32bit length, then rpc_msg, then the procedure message itself
#pragma once
#include "xdr_impl.h"
#include <string.h>
#include <endian.h>
#include <vector>
#include "../malloc_or_die.h"
#define FALSE 0
#define TRUE 1
#define XDR_ENCODE 0
#define XDR_DECODE 1
#define BYTES_PER_XDR_UNIT 4
#define IXDR_PUT_U_LONG(a, b)
#define IXDR_GET_U_LONG(a) 0
#define IXDR_PUT_BOOL(a, b)
#define IXDR_GET_BOOL(a) 0
#define XDR_INLINE(xdrs, len) NULL
struct xdr_linked_list_t
{
xdrproc_t fn;
unsigned entry_size, size, cap;
void *base;
unsigned has_next, link_offset;
};
struct XDR
{
int x_op;
// For decoding:
uint8_t *buf = NULL;
unsigned avail = 0;
std::vector<void*> allocs;
std::vector<xdr_linked_list_t> in_linked_list;
// For encoding:
std::vector<uint8_t> cur_out;
unsigned last_end = 0;
std::vector<iovec> buf_list;
};
uint32_t inline len_pad4(uint32_t len)
{
return ((len+3)/4) * 4;
}
inline int xdr_opaque(XDR *xdrs, void *data, uint32_t len)
{
if (len <= 0)
{
return 1;
}
if (xdrs->x_op == XDR_DECODE)
{
uint32_t padded = len_pad4(len);
if (xdrs->avail < padded)
return 0;
memcpy(data, xdrs->buf, len);
xdrs->buf += padded;
xdrs->avail -= padded;
}
else
{
unsigned old = xdrs->cur_out.size();
uint32_t pad = (len & 3) ? (4 - (len & 3)) : 0;
xdrs->cur_out.resize(old + len + pad);
memcpy(xdrs->cur_out.data()+old, data, len);
// Zero the padding bytes that follow the data, not the data itself
for (uint32_t i = 0; i < pad; i++)
xdrs->cur_out[old+len+i] = 0;
}
return 1;
}
inline int xdr_bytes(XDR *xdrs, xdr_string_t *data, uint32_t maxlen)
{
if (xdrs->x_op == XDR_DECODE)
{
if (xdrs->avail < 4)
return 0;
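uint32_t len = be32toh(*((uint32_t*)xdrs->buf));
// Reject oversized lengths before padding so that len_pad4() cannot wrap around
if (len > maxlen || len > xdrs->avail-4)
return 0;
uint32_t padded = len_pad4(len);
if (xdrs->avail < 4+padded)
return 0;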
data->size = len;
data->data = (char*)(xdrs->buf+4);
xdrs->buf += 4+padded;
xdrs->avail -= 4+padded;
}
else
{
if (data->size < XDR_COPY_LENGTH)
{
unsigned old = xdrs->cur_out.size();
xdrs->cur_out.resize(old + 4+data->size);
*(uint32_t*)(xdrs->cur_out.data() + old) = htobe32(data->size);
memcpy(xdrs->cur_out.data()+old+4, data->data, data->size);
}
else
{
unsigned old = xdrs->cur_out.size();
xdrs->cur_out.resize(old + 4);
*(uint32_t*)(xdrs->cur_out.data() + old) = htobe32(data->size);
xdrs->buf_list.push_back((iovec){
.iov_base = 0,
.iov_len = xdrs->cur_out.size() - xdrs->last_end,
});
xdrs->last_end = xdrs->cur_out.size();
xdrs->buf_list.push_back((iovec)
{
.iov_base = (void*)data->data,
.iov_len = data->size,
});
}
if (data->size & 3)
{
int pad = 4-(data->size & 3);
unsigned old = xdrs->cur_out.size();
xdrs->cur_out.resize(old+pad);
for (int i = 0; i < pad; i++)
xdrs->cur_out[old+i] = 0;
}
}
return 1;
}
inline int xdr_string(XDR *xdrs, xdr_string_t *data, uint32_t maxlen)
{
return xdr_bytes(xdrs, data, maxlen);
}
inline int xdr_u_int(XDR *xdrs, void *data)
{
if (xdrs->x_op == XDR_DECODE)
{
if (xdrs->avail < 4)
return 0;
*((uint32_t*)data) = be32toh(*((uint32_t*)xdrs->buf));
xdrs->buf += 4;
xdrs->avail -= 4;
}
else
{
unsigned old = xdrs->cur_out.size();
xdrs->cur_out.resize(old + 4);
*(uint32_t*)(xdrs->cur_out.data() + old) = htobe32(*(uint32_t*)data);
}
return 1;
}
inline int xdr_enum(XDR *xdrs, void *data)
{
return xdr_u_int(xdrs, data);
}
inline int xdr_bool(XDR *xdrs, void *data)
{
return xdr_u_int(xdrs, data);
}
inline int xdr_uint64_t(XDR *xdrs, void *data)
{
if (xdrs->x_op == XDR_DECODE)
{
if (xdrs->avail < 8)
return 0;
*((uint64_t*)data) = be64toh(*((uint64_t*)xdrs->buf));
xdrs->buf += 8;
xdrs->avail -= 8;
}
else
{
unsigned old = xdrs->cur_out.size();
xdrs->cur_out.resize(old + 8);
*(uint64_t*)(xdrs->cur_out.data() + old) = htobe64(*(uint64_t*)data);
}
return 1;
}
// Parse inconvenient shitty linked lists as arrays
inline int xdr_pointer(XDR *xdrs, char **data, unsigned entry_size, xdrproc_t entry_fn)
{
if (xdrs->x_op == XDR_DECODE)
{
if (xdrs->avail < 4)
return 0;
uint32_t has_next = be32toh(*((uint32_t*)xdrs->buf));
xdrs->buf += 4;
xdrs->avail -= 4;
*data = NULL;
if (!xdrs->in_linked_list.size() ||
xdrs->in_linked_list.back().fn != entry_fn)
{
if (has_next)
{
unsigned cap = 2;
void *base = malloc_or_die(entry_size * cap);
xdrs->in_linked_list.push_back((xdr_linked_list_t){
.fn = entry_fn,
.entry_size = entry_size,
.size = 1,
.cap = cap,
.base = base,
.has_next = 0,
.link_offset = 0,
});
*data = (char*)base;
if (!entry_fn(xdrs, base))
return 0;
auto & ll = xdrs->in_linked_list.back();
while (ll.has_next)
{
ll.has_next = 0;
if (ll.size >= ll.cap)
{
ll.cap *= 2;
ll.base = realloc_or_die(ll.base, ll.entry_size * ll.cap);
}
if (!entry_fn(xdrs, (uint8_t*)ll.base + ll.entry_size*ll.size))
return 0;
ll.size++;
}
for (unsigned i = 0; i < ll.size-1; i++)
{
*(void**)((uint8_t*)ll.base + i*ll.entry_size + ll.link_offset) =
(uint8_t*)ll.base + (i+1)*ll.entry_size;
}
xdrs->allocs.push_back(ll.base);
xdrs->in_linked_list.pop_back();
}
}
else
{
auto & ll = xdrs->in_linked_list.back();
xdrs->in_linked_list.back().has_next = has_next;
xdrs->in_linked_list.back().link_offset = (uint8_t*)data - (uint8_t*)ll.base - ll.entry_size*ll.size;
}
}
else
{
unsigned old = xdrs->cur_out.size();
xdrs->cur_out.resize(old + 4);
*(uint32_t*)(xdrs->cur_out.data() + old) = htobe32(*data ? 1 : 0);
// Propagate encoding failures of the entry
if (*data && !entry_fn(xdrs, *data))
return 0;
}
return 1;
}
inline int xdr_array(XDR *xdrs, char **data, uint32_t* len, uint32_t maxlen, uint32_t entry_size, xdrproc_t fn)
{
if (xdrs->x_op == XDR_DECODE)
{
if (xdrs->avail < 4)
return 0;
*len = be32toh(*((uint32_t*)xdrs->buf));
if (*len > maxlen)
return 0;
xdrs->buf += 4;
xdrs->avail -= 4;
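*data = (char*)malloc_or_die((size_t)entry_size * (*len));
// Register the buffer first so that xdr_reset() frees it even if decoding fails
xdrs->allocs.push_back(*data);
for (uint32_t i = 0; i < *len; i++)
if (!fn(xdrs, *data + entry_size*i))
return 0;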
}
else
{
unsigned old = xdrs->cur_out.size();
xdrs->cur_out.resize(old + 4);
*(uint32_t*)(xdrs->cur_out.data() + old) = htobe32(*len);
for (uint32_t i = 0; i < *len; i++)
if (!fn(xdrs, *data + entry_size*i))
return 0;
}
return 1;
}
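
The "XDR in a nutshell" rule for variable-length opaques (big-endian 32-bit length, then the bytes, zero-padded to a multiple of 4) can be sketched independently of the `XDR` struct above. This is a minimal standalone illustration, not the patch's implementation: `put_be32`, `put_opaque` and `get_opaque` are hypothetical names, and unlike `xdr_bytes` the decoder copies the data instead of pointing into the input buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append a 32-bit value in big-endian byte order
static void put_be32(std::vector<uint8_t> & out, uint32_t v)
{
    out.push_back((uint8_t)(v >> 24));
    out.push_back((uint8_t)(v >> 16));
    out.push_back((uint8_t)(v >> 8));
    out.push_back((uint8_t)v);
}

// Encode a variable-length opaque: BE 32-bit length, data, zero padding to 4 bytes
static void put_opaque(std::vector<uint8_t> & out, const std::string & data)
{
    assert(data.size() <= UINT32_MAX);
    put_be32(out, (uint32_t)data.size());
    out.insert(out.end(), data.begin(), data.end());
    size_t pad = (4 - data.size() % 4) % 4;
    out.insert(out.end(), pad, 0);
}

// Decode a variable-length opaque starting at buf[pos].
// Returns false if the input is truncated; advances pos past the padding on success.
static bool get_opaque(const std::vector<uint8_t> & buf, size_t & pos, std::string & data)
{
    if (pos > buf.size() || buf.size() - pos < 4)
        return false;
    uint32_t len = (uint32_t)buf[pos] << 24 | (uint32_t)buf[pos+1] << 16 |
        (uint32_t)buf[pos+2] << 8 | (uint32_t)buf[pos+3];
    size_t avail = buf.size() - pos - 4;
    // Check the raw length first so that the padded length cannot wrap around
    if (len > avail)
        return false;
    size_t padded = ((size_t)len + 3) / 4 * 4;
    if (padded > avail)
        return false;
    data.assign((const char*)buf.data() + pos + 4, len);
    pos += 4 + padded;
    return true;
}
```

Encoding the 5-byte string "hello" this way takes 12 bytes: 4 for the length, 5 for the data and 3 for padding.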

1301
src/nfs_conn.cpp Normal file

File diff suppressed because it is too large

184
src/nfs_portmap.cpp Normal file

@@ -0,0 +1,184 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Portmap service for NFS proxy
#include <netinet/in.h>
#include <string.h>
#include "nfs/portmap.h"
#include "nfs/xdr_impl_inline.h"
#include "malloc_or_die.h"
#include "nfs_portmap.h"
#include "sha256.h"
#include "base64.h"
/*
* The NULL procedure. All protocols/versions must provide a NULL procedure
* as index 0.
* It is used by clients, and rpcinfo, to "ping" a service and verify that
* the service is available and that it does support the indicated version.
*/
static int pmap2_null_proc(portmap_service_t *self, rpc_op_t *rop)
{
rpc_queue_reply(rop);
return 0;
}
/*
* v2 GETPORT.
* This is the lookup function for portmapper version 2.
* A client provides program, version and protocol (tcp or udp)
* and portmapper returns which port that service is available on,
* (or 0 if no such program is registered.)
*/
static int pmap2_getport_proc(portmap_service_t *self, rpc_op_t *rop)
{
PMAP2GETPORTargs *args = (PMAP2GETPORTargs *)rop->request;
uint32_t *reply = (uint32_t *)rop->reply;
auto it = self->reg_ports.lower_bound((portmap_id_t){
.prog = args->prog,
.vers = args->vers,
.udp = args->prot == IPPROTO_UDP,
.ipv6 = false,
});
if (it != self->reg_ports.end() &&
it->prog == args->prog && it->vers == args->vers &&
it->udp == (args->prot == IPPROTO_UDP))
{
*reply = it->port;
}
else
{
*reply = 0;
}
rpc_queue_reply(rop);
return 0;
}
/*
* v2 DUMP.
* This RPC returns a list of all endpoints that are registered with
* portmapper.
*/
static int pmap2_dump_proc(portmap_service_t *self, rpc_op_t *rop)
{
pmap2_mapping_list *list = (pmap2_mapping_list*)malloc_or_die(sizeof(pmap2_mapping_list) * self->reg_ports.size());
xdr_add_malloc(rop->xdrs, list);
PMAP2DUMPres *reply = (PMAP2DUMPres *)rop->reply;
int i = 0;
for (auto it = self->reg_ports.begin(); it != self->reg_ports.end(); it++)
{
if (it->ipv6)
continue;
list[i] = {
.map = {
.prog = it->prog,
.vers = it->vers,
.prot = it->udp ? IPPROTO_UDP : IPPROTO_TCP,
.port = it->port,
},
.next = list+i+1,
};
i++;
}
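// Terminate the list; avoid writing to list[-1] if no entries were added
if (i > 0)
list[i-1].next = NULL;
// Send reply (an empty list is encoded as a NULL pointer)
reply->list = i > 0 ? list : NULL;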
rpc_queue_reply(rop);
return 0;
}
/*
* v3 GETADDR.
* This is the lookup function for portmapper version 3.
*/
static int pmap3_getaddr_proc(portmap_service_t *self, rpc_op_t *rop)
{
PMAP3GETADDRargs *args = (PMAP3GETADDRargs *)rop->request;
PMAP3GETADDRres *reply = (PMAP3GETADDRres *)rop->reply;
portmap_id_t ref = (portmap_id_t){
.prog = args->prog,
.vers = args->vers,
.udp = args->netid == "udp" || args->netid == "udp6",
.ipv6 = args->netid == "tcp6" || args->netid == "udp6",
};
auto it = self->reg_ports.lower_bound(ref);
if (it != self->reg_ports.end() &&
it->prog == ref.prog && it->vers == ref.vers &&
it->udp == ref.udp && it->ipv6 == ref.ipv6)
{
reply->addr = xdr_copy_string(rop->xdrs, it->addr);
}
else
{
reply->addr = {};
}
rpc_queue_reply(rop);
return 0;
}
/*
* v3 DUMP.
* This RPC returns a list of all endpoints that are registered with
* portmapper.
*/
static std::string netid_udp = "udp";
static std::string netid_udp6 = "udp6";
static std::string netid_tcp = "tcp";
static std::string netid_tcp6 = "tcp6";
static int pmap3_dump_proc(portmap_service_t *self, rpc_op_t *rop)
{
PMAP3DUMPres *reply = (PMAP3DUMPres *)rop->reply;
// Allocate whole list entries, not pointers to them
pmap3_mapping_list *list = (pmap3_mapping_list*)malloc_or_die(sizeof(pmap3_mapping_list) * self->reg_ports.size());
xdr_add_malloc(rop->xdrs, list);
int i = 0;
for (auto it = self->reg_ports.begin(); it != self->reg_ports.end(); it++)
{
list[i] = (pmap3_mapping_list){
.map = (pmap3_mapping){
.prog = it->prog,
.vers = it->vers,
.netid = xdr_copy_string(rop->xdrs, it->ipv6
? (it->udp ? netid_udp6 : netid_tcp6)
: (it->udp ? netid_udp : netid_tcp)),
.addr = xdr_copy_string(rop->xdrs, it->addr), // 0.0.0.0.port
.owner = xdr_copy_string(rop->xdrs, it->owner),
},
.next = list+i+1,
};
i++;
}
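// Terminate the list; avoid writing to list[-1] if no entries were added
if (i > 0)
list[i-1].next = NULL;
reply->list = i > 0 ? list : NULL;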
rpc_queue_reply(rop);
return 0;
}
portmap_service_t::portmap_service_t()
{
struct rpc_service_proc_t pt[] = {
{PMAP_PROGRAM, PMAP_V2, PMAP2_NULL, (rpc_handler_t)pmap2_null_proc, NULL, 0, NULL, 0, this},
{PMAP_PROGRAM, PMAP_V2, PMAP2_GETPORT, (rpc_handler_t)pmap2_getport_proc, (xdrproc_t)xdr_PMAP2GETPORTargs, sizeof(PMAP2GETPORTargs), (xdrproc_t)xdr_u_int, sizeof(u_int), this},
{PMAP_PROGRAM, PMAP_V2, PMAP2_DUMP, (rpc_handler_t)pmap2_dump_proc, NULL, 0, (xdrproc_t)xdr_PMAP2DUMPres, sizeof(PMAP2DUMPres), this},
{PMAP_PROGRAM, PMAP_V3, PMAP3_NULL, (rpc_handler_t)pmap2_null_proc, NULL, 0, NULL, 0, this},
{PMAP_PROGRAM, PMAP_V3, PMAP3_GETADDR, (rpc_handler_t)pmap3_getaddr_proc, (xdrproc_t)xdr_PMAP3GETADDRargs, sizeof(PMAP3GETADDRargs), (xdrproc_t)xdr_string, sizeof(xdr_string_t), this},
{PMAP_PROGRAM, PMAP_V3, PMAP3_DUMP, (rpc_handler_t)pmap3_dump_proc, NULL, 0, (xdrproc_t)xdr_PMAP3DUMPres, sizeof(PMAP3DUMPres), this},
};
for (int i = 0; i < sizeof(pt)/sizeof(pt[0]); i++)
{
proc_table.push_back(pt[i]);
}
}
std::string sha256(const std::string & str)
{
std::string hash;
hash.resize(32);
SHA256_CTX ctx;
sha256_init(&ctx);
sha256_update(&ctx, (uint8_t*)str.data(), str.size());
sha256_final(&ctx, (uint8_t*)hash.data());
return hash;
}

39
src/nfs_portmap.h Normal file

@@ -0,0 +1,39 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Portmap service for NFS proxy
#pragma once
#include <string>
#include <set>
#include <vector>
#include "nfs/rpc_impl.h"
struct portmap_id_t
{
unsigned prog, vers;
bool udp;
bool ipv6;
unsigned port;
std::string owner;
std::string addr;
};
class portmap_service_t
{
public:
std::set<portmap_id_t> reg_ports;
std::vector<rpc_service_proc_t> proc_table;
portmap_service_t();
};
inline bool operator < (const portmap_id_t &a, const portmap_id_t &b)
{
return a.prog < b.prog || a.prog == b.prog && a.vers < b.vers ||
a.prog == b.prog && a.vers == b.vers && a.udp < b.udp ||
a.prog == b.prog && a.vers == b.vers && a.udp == b.udp && a.ipv6 < b.ipv6;
}
std::string sha256(const std::string & str);
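
The lookups in nfs_portmap.cpp rely on `operator <` comparing only the key fields, so `lower_bound()` with a probe that has just those fields filled lands on the registered entry if one exists. A minimal standalone sketch of the pattern follows. It assumes a simplified key of prog/vers/udp; `endpoint_t` and `find_port` are hypothetical stand-ins, not the patch's `portmap_id_t` or handlers.

```cpp
#include <cstdint>
#include <set>

// Hypothetical simplified stand-in for portmap_id_t
struct endpoint_t
{
    uint32_t prog, vers;
    bool udp;
    uint32_t port; // payload, not part of the ordering
};

// Order by (prog, vers, udp) only, so entries differing just in port compare equal
inline bool operator < (const endpoint_t & a, const endpoint_t & b)
{
    if (a.prog != b.prog)
        return a.prog < b.prog;
    if (a.vers != b.vers)
        return a.vers < b.vers;
    return a.udp < b.udp;
}

// Return the registered port for (prog, vers, udp) or 0 if nothing is registered
static uint32_t find_port(const std::set<endpoint_t> & reg, uint32_t prog, uint32_t vers, bool udp)
{
    endpoint_t probe = { prog, vers, udp, 0 };
    auto it = reg.lower_bound(probe);
    // lower_bound() returns the first entry not less than the probe,
    // so the key fields still have to be compared for equality
    if (it != reg.end() && it->prog == prog && it->vers == vers && it->udp == udp)
        return it->port;
    return 0;
}
```

The equality check after `lower_bound()` is what turns "first entry not less than the probe" into an exact match, which is also why the handlers in the patch compare the fields again.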

982
src/nfs_proxy.cpp Normal file

@@ -0,0 +1,982 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Simplified NFS proxy
// Presents all images as files
// Keeps image/file list in memory and is thus unsuitable for a large number of files
#define _XOPEN_SOURCE
#include <limits.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "nfs/nfs.h"
#include "nfs/rpc.h"
#include "nfs/portmap.h"
#include "addr_util.h"
#include "base64.h"
#include "nfs_proxy.h"
#include "http_client.h"
#include "cli.h"
#define ETCD_INODE_STATS_WATCH_ID 101
#define ETCD_POOL_STATS_WATCH_ID 102
const char *exe_name = NULL;
nfs_proxy_t::~nfs_proxy_t()
{
if (cmd)
delete cmd;
if (cli)
delete cli;
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
json11::Json::object nfs_proxy_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor NFS 3.0 proxy\n"
"(c) Vitaliy Filippov, 2021-2022 (VNPL-1.1)\n"
"\n"
"USAGE:\n"
" %s [--etcd_address ADDR] [OTHER OPTIONS]\n"
" --subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)\n"
" --portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)\n"
" --bind <IP> bind service to <IP> address (default 0.0.0.0)\n"
" --nfspath <PATH> set NFS export path to <PATH> (default is /)\n"
" --port <PORT> use port <PORT> for NFS services (default is 2049)\n"
" --pool <POOL> use <POOL> as default pool for new files (images)\n"
" --foreground 1 stay in foreground, do not daemonize\n"
"\n"
"NFS proxy is stateless if you use immediate_commit=all in your cluster, so\n"
"you can freely use multiple NFS proxies with L3 load balancing in this case.\n"
"\n"
"Example start and mount commands for a custom NFS port:\n"
" %s --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool\n"
" mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp\n",
exe_name, exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
}
return cfg;
}
void nfs_proxy_t::run(json11::Json cfg)
{
timespec tv;
clock_gettime(CLOCK_REALTIME, &tv);
srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
server_id = (uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62);
// Parse options
bind_address = cfg["bind"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
default_pool = cfg["pool"].as_string();
portmap_enabled = cfg.object_items().find("portmap") == cfg.object_items().end() ||
cfg["portmap"].uint64_value() ||
cfg["portmap"].string_value() == "yes" ||
cfg["portmap"].string_value() == "true";
nfs_port = cfg["port"].uint64_value() & 0xffff;
if (!nfs_port)
nfs_port = 2049;
export_root = cfg["nfspath"].string_value();
if (!export_root.size())
export_root = "/";
name_prefix = cfg["subdir"].string_value();
{
int e = name_prefix.size();
while (e > 0 && name_prefix[e-1] == '/')
e--;
int s = 0;
while (s < e && name_prefix[s] == '/')
s++;
name_prefix = name_prefix.substr(s, e-s);
if (name_prefix.size())
name_prefix += "/";
}
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
cmd = new cli_tool_t();
cmd->ringloop = ringloop;
cmd->epmgr = epmgr;
cmd->cli = cli;
// We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
dir_info[""] = (nfs_dir_t){
.id = 1,
.mod_rev = 0,
};
clock_gettime(CLOCK_REALTIME, &dir_info[""].mtime);
watch_stats();
assert(cli->st_cli.on_inode_change_hook == NULL);
cli->st_cli.on_inode_change_hook = [this](inode_t changed_inode, bool removed)
{
auto inode_cfg_it = cli->st_cli.inode_config.find(changed_inode);
if (inode_cfg_it == cli->st_cli.inode_config.end())
{
return;
}
auto & inode_cfg = inode_cfg_it->second;
std::string full_name = inode_cfg.name;
if (name_prefix != "" && full_name.substr(0, name_prefix.size()) != name_prefix)
{
return;
}
// Calculate directory modification time and revision (used as "cookie verifier")
timespec now;
clock_gettime(CLOCK_REALTIME, &now);
dir_info[""].mod_rev = dir_info[""].mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_info[""].mod_rev;
dir_info[""].mtime = now;
int pos = full_name.find('/', name_prefix.size());
while (pos >= 0)
{
std::string dir = full_name.substr(0, pos);
auto & dinf = dir_info[dir];
if (!dinf.id)
dinf.id = next_dir_id++;
dinf.mod_rev = dinf.mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dinf.mod_rev;
dinf.mtime = now;
dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
pos = full_name.find('/', pos+1);
}
// Alter inode_by_hash
if (removed)
{
auto ino_it = hash_by_inode.find(changed_inode);
if (ino_it != hash_by_inode.end())
{
inode_by_hash.erase(ino_it->second);
hash_by_inode.erase(ino_it);
}
}
else
{
std::string hash = "S"+base64_encode(sha256(full_name));
auto hbi_it = hash_by_inode.find(changed_inode);
if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
{
// inode had a different name, remove old hash=>inode pointer
inode_by_hash.erase(hbi_it->second);
}
inode_by_hash[hash] = changed_inode;
hash_by_inode[changed_inode] = hash;
}
};
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Check default pool
check_default_pool();
// Self-register portmap and NFS
// Universal addresses are "h1.h2.h3.h4.p1.p2", where p1 and p2 are the high and low bytes of the port
std::string nfs_uaddr = "0.0.0.0."+std::to_string(nfs_port >> 8)+"."+std::to_string(nfs_port & 0xff);
pmap.reg_ports.insert((portmap_id_t){
.prog = PMAP_PROGRAM,
.vers = PMAP_V2,
.port = portmap_enabled ? 111 : nfs_port,
.owner = "portmapper-service",
.addr = portmap_enabled ? "0.0.0.0.0.111" : nfs_uaddr,
});
pmap.reg_ports.insert((portmap_id_t){
.prog = PMAP_PROGRAM,
.vers = PMAP_V3,
.port = portmap_enabled ? 111 : nfs_port,
.owner = "portmapper-service",
.addr = portmap_enabled ? "0.0.0.0.0.111" : nfs_uaddr,
});
pmap.reg_ports.insert((portmap_id_t){
.prog = NFS_PROGRAM,
.vers = NFS_V3,
.port = nfs_port,
.owner = "nfs-server",
.addr = nfs_uaddr,
});
pmap.reg_ports.insert((portmap_id_t){
.prog = MOUNT_PROGRAM,
.vers = MOUNT_V3,
.port = nfs_port,
.owner = "rpc.mountd",
.addr = nfs_uaddr,
});
// Create NFS socket and add it to epoll
int nfs_socket = create_and_bind_socket(bind_address, nfs_port, 128, NULL);
fcntl(nfs_socket, F_SETFL, fcntl(nfs_socket, F_GETFL, 0) | O_NONBLOCK);
epmgr->tfd->set_fd_handler(nfs_socket, false, [this](int nfs_socket, int epoll_events)
{
if (epoll_events & EPOLLRDHUP)
{
fprintf(stderr, "Listening NFS socket disconnected, exiting\n");
exit(1);
}
else
{
do_accept(nfs_socket);
}
});
if (portmap_enabled)
{
// Create portmap socket and add it to epoll
int portmap_socket = create_and_bind_socket(bind_address, 111, 128, NULL);
fcntl(portmap_socket, F_SETFL, fcntl(portmap_socket, F_GETFL, 0) | O_NONBLOCK);
epmgr->tfd->set_fd_handler(portmap_socket, false, [this](int portmap_socket, int epoll_events)
{
if (epoll_events & EPOLLRDHUP)
{
fprintf(stderr, "Listening portmap socket disconnected, exiting\n");
exit(1);
}
else
{
do_accept(portmap_socket);
}
});
}
if (cfg["foreground"].is_null())
{
daemonize();
}
while (true)
{
ringloop->loop();
ringloop->wait();
}
/*// Sync at the end
cluster_op_t *close_sync = new cluster_op_t;
close_sync->opcode = OSD_OP_SYNC;
close_sync->callback = [&stop](cluster_op_t *op)
{
stop = true;
delete op;
};
cli->execute(close_sync);*/
// Destroy the client
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
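The `.addr` strings registered with the portmapper in `run()` are meant as rpcbind "universal addresses". For TCP or UDP over IPv4, RFC 5665 gives the form `h1.h2.h3.h4.p1.p2`, where `p1` and `p2` are the high and low bytes of the port in decimal. A minimal sketch of such a formatter follows; `make_uaddr` is a hypothetical helper name used only for illustration and is not part of the proxy.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: build an IPv4 universal address
// from a dotted-quad host string and a port number.
static std::string make_uaddr(const std::string & host, uint16_t port)
{
    // The port is split into its high and low bytes, each printed in decimal
    return host + "." + std::to_string(port >> 8) + "." + std::to_string(port & 0xff);
}
```

With this encoding port 111 becomes `0.111` and port 2049 becomes `8.1`.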
void nfs_proxy_t::watch_stats()
{
assert(cli->st_cli.on_start_watcher_hook == NULL);
cli->st_cli.on_start_watcher_hook = [this](http_co_t *etcd_watch_ws)
{
http_post_message(etcd_watch_ws, WS_TEXT, json11::Json(json11::Json::object {
{ "create_request", json11::Json::object {
{ "key", base64_encode(cli->st_cli.etcd_prefix+"/inode/stats/") },
{ "range_end", base64_encode(cli->st_cli.etcd_prefix+"/inode/stats0") },
{ "start_revision", cli->st_cli.etcd_watch_revision },
{ "watch_id", ETCD_INODE_STATS_WATCH_ID },
{ "progress_notify", true },
} }
}).dump());
http_post_message(etcd_watch_ws, WS_TEXT, json11::Json(json11::Json::object {
{ "create_request", json11::Json::object {
{ "key", base64_encode(cli->st_cli.etcd_prefix+"/pool/stats/") },
{ "range_end", base64_encode(cli->st_cli.etcd_prefix+"/pool/stats0") },
{ "start_revision", cli->st_cli.etcd_watch_revision },
{ "watch_id", ETCD_POOL_STATS_WATCH_ID },
{ "progress_notify", true },
} }
}).dump());
cli->st_cli.etcd_txn_slow(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(cli->st_cli.etcd_prefix+"/inode/stats/") },
{ "range_end", base64_encode(cli->st_cli.etcd_prefix+"/inode/stats0") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(cli->st_cli.etcd_prefix+"/pool/stats/") },
{ "range_end", base64_encode(cli->st_cli.etcd_prefix+"/pool/stats0") },
} }
},
} },
}, [this](std::string err, json11::Json res)
{
for (auto & rsp: res["responses"].array_items())
{
for (auto & item: rsp["response_range"]["kvs"].array_items())
{
etcd_kv_t kv = cli->st_cli.parse_etcd_kv(item);
parse_stats(kv);
}
}
});
};
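cli->st_cli.on_change_hook = [this, old_hook = cli->st_cli.on_change_hook](std::map<std::string, etcd_kv_t> & changes)
{
for (auto & p: changes)
{
parse_stats(p.second);
}
// Chain to the previously installed hook so it still sees the changes
if (old_hook)
old_hook(changes);
};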
}
void nfs_proxy_t::parse_stats(etcd_kv_t & kv)
{
auto & key = kv.key;
if (key.substr(0, cli->st_cli.etcd_prefix.length()+13) == cli->st_cli.etcd_prefix+"/inode/stats/")
{
pool_id_t pool_id = 0;
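// Parse into unsigned long long: the inode number is 64-bit, but "%lu" is only 32-bit wide on 32-bit platforms
unsigned long long inode_num = 0;
char null_byte = 0;
sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+13, "%u/%llu%c", &pool_id, &inode_num, &null_byte);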
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || null_byte != 0)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
inode_stats[INODE_WITH_POOL(pool_id, inode_num)] = kv.value;
}
}
else if (key.substr(0, cli->st_cli.etcd_prefix.length()+12) == cli->st_cli.etcd_prefix+"/pool/stats/")
{
pool_id_t pool_id = 0;
char null_byte = 0;
sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+12, "%u%c", &pool_id, &null_byte);
// Reject keys with trailing garbage after the pool ID, same as for inode stats
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
pool_stats[pool_id] = kv.value;
}
}
}
void nfs_proxy_t::check_default_pool()
{
if (default_pool == "")
{
if (cli->st_cli.pool_config.size() == 1)
{
default_pool = cli->st_cli.pool_config.begin()->second.name;
default_pool_id = cli->st_cli.pool_config.begin()->first;
}
else
{
fprintf(stderr, "There are %zu pools. Please select default pool with --pool option\n", cli->st_cli.pool_config.size());
exit(1);
}
}
else
{
for (auto & p: cli->st_cli.pool_config)
{
if (p.second.name == default_pool)
{
default_pool_id = p.first;
break;
}
}
if (!default_pool_id)
{
fprintf(stderr, "Pool %s is not found\n", default_pool.c_str());
exit(1);
}
}
}
void nfs_proxy_t::do_accept(int listen_fd)
{
struct sockaddr_storage addr;
socklen_t addr_size = sizeof(addr);
int nfs_fd = 0;
while ((nfs_fd = accept(listen_fd, (struct sockaddr *)&addr, &addr_size)) >= 0)
{
fprintf(stderr, "New client %d: connection from %s\n", nfs_fd, addr_to_string(addr).c_str());
fcntl(nfs_fd, F_SETFL, fcntl(nfs_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(nfs_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
auto cli = new nfs_client_t();
cli->parent = this;
cli->nfs_fd = nfs_fd;
for (auto & fn: pmap.proc_table)
{
cli->proc_table.insert(fn);
}
epmgr->tfd->set_fd_handler(nfs_fd, true, [cli](int nfs_fd, int epoll_events)
{
// Handle incoming event
if (epoll_events & EPOLLRDHUP)
{
fprintf(stderr, "Client %d disconnected\n", nfs_fd);
cli->stop();
return;
}
cli->epoll_events |= epoll_events;
if (epoll_events & EPOLLIN)
{
// Something is available for reading
cli->submit_read(0);
}
if (epoll_events & EPOLLOUT)
{
cli->submit_send();
}
});
}
if (nfs_fd < 0 && errno != EAGAIN)
{
fprintf(stderr, "Failed to accept connection: %s\n", strerror(errno));
exit(1);
}
}
// FIXME Move these functions to "rpc_context"
void nfs_client_t::select_read_buffer(unsigned wanted_size)
{
if (free_buffers.size())
{
auto & b = free_buffers.back();
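if (b.size < wanted_size)
{
// The cached buffer is too small and is removed from the list below, so free it to avoid a leak
free(b.buf);
cur_buffer = {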
.buf = (uint8_t*)malloc_or_die(wanted_size),
.size = wanted_size,
};
}
else
{
cur_buffer = {
.buf = b.buf,
.size = b.size,
};
}
free_buffers.pop_back();
}
else
{
unsigned sz = RPC_INIT_BUF_SIZE;
if (sz < wanted_size)
{
sz = wanted_size;
}
cur_buffer = {
.buf = (uint8_t*)malloc_or_die(sz),
.size = sz,
};
}
}
void nfs_client_t::submit_read(unsigned wanted_size)
{
if (read_msg.msg_iovlen)
{
return;
}
io_uring_sqe* sqe = parent->ringloop->get_sqe();
if (!sqe)
{
read_msg.msg_iovlen = 0;
parent->ringloop->wakeup();
return;
}
if (!cur_buffer.buf || cur_buffer.size <= cur_buffer.read_pos)
{
assert(!wanted_size);
if (cur_buffer.buf)
{
if (cur_buffer.refs > 0)
{
used_buffers[cur_buffer.buf] = (rpc_used_buffer_t){
.size = cur_buffer.size,
.refs = cur_buffer.refs,
};
}
else
{
free_buffers.push_back((rpc_free_buffer_t){
.buf = cur_buffer.buf,
.size = cur_buffer.size,
});
}
}
select_read_buffer(wanted_size);
}
assert(wanted_size <= cur_buffer.size-cur_buffer.read_pos);
read_iov = {
.iov_base = cur_buffer.buf+cur_buffer.read_pos,
.iov_len = wanted_size ? wanted_size : cur_buffer.size-cur_buffer.read_pos,
};
read_msg.msg_iov = &read_iov;
read_msg.msg_iovlen = 1;
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this](ring_data_t *data) { handle_read(data->res); };
my_uring_prep_recvmsg(sqe, nfs_fd, &read_msg, 0);
refs++;
}
void nfs_client_t::handle_read(int result)
{
read_msg.msg_iovlen = 0;
if (deref())
return;
if (result <= 0 && result != -EAGAIN && result != -EINTR)
{
fprintf(stderr, "Failed read from client %d: %d (%s)\n", nfs_fd, result, strerror(-result));
stop();
return;
}
if (result > 0)
{
cur_buffer.read_pos += result;
assert(cur_buffer.read_pos <= cur_buffer.size);
// Try to parse incoming RPC messages
uint8_t *data = cur_buffer.buf + cur_buffer.parsed_pos;
unsigned left = cur_buffer.read_pos - cur_buffer.parsed_pos;
while (left > 0)
{
// Assemble all fragments
unsigned fragments = 0;
uint32_t wanted = 0;
while (1)
{
fragments++;
wanted += 4;
if (left < wanted)
{
break;
}
// FIXME: Limit message size
uint32_t frag_size = be32toh(*(uint32_t*)(data + wanted - 4));
wanted += (frag_size & 0x7FFFFFFF);
if (left < wanted || (frag_size & 0x80000000))
{
break;
}
}
if (left >= wanted)
{
if (fragments > 1)
{
// Merge fragments. Fragmented messages are probably not that common,
// so it's probably fine to do an additional memory copy
unsigned frag_offset = 8+be32toh(*(uint32_t*)(data));
unsigned dest_offset = 4+be32toh(*(uint32_t*)(data));
unsigned frag_num = 1;
while (frag_num < fragments)
{
uint32_t frag_size = be32toh(*(uint32_t*)(data + frag_offset - 4)) & 0x7FFFFFFF;
memmove(data + dest_offset, data + frag_offset, frag_size);
frag_offset += 4+frag_size;
dest_offset += frag_size;
frag_num++;
}
}
// Handle full message
int referenced = handle_rpc_message(cur_buffer.buf, data+4, wanted-4*fragments);
cur_buffer.refs += referenced ? 1 : 0;
// Advance by the full on-wire size (including all fragment headers) to stay in sync with `data`
cur_buffer.parsed_pos += wanted;
data += wanted;
left -= wanted;
}
else if (cur_buffer.size >= (data - cur_buffer.buf + wanted))
{
// Read the tail and come back
submit_read(wanted-left);
break;
}
else
{
// No place to put the whole tail
if (cur_buffer.refs > 0)
{
used_buffers[cur_buffer.buf] = (rpc_used_buffer_t){
.size = cur_buffer.size,
.refs = cur_buffer.refs,
};
select_read_buffer(wanted);
memcpy(cur_buffer.buf, data, left);
}
else if (cur_buffer.size < wanted)
{
uint8_t *old_buf = cur_buffer.buf;
select_read_buffer(wanted);
memcpy(cur_buffer.buf, data, left);
free(old_buf);
}
else
{
memmove(cur_buffer.buf, data, left);
}
cur_buffer.read_pos = left;
cur_buffer.parsed_pos = 0;
// Restart from the beginning
submit_read(wanted-left);
break;
}
}
}
}
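The fragment loop in `handle_read()` follows the ONC RPC record marking standard (RFC 5531): over TCP each record is split into fragments, and each fragment starts with a big-endian 32-bit header whose top bit marks the last fragment and whose low 31 bits give the fragment length. The sketch below shows the same idea in isolation. It copies data instead of merging in place, and `extract_rpc_record` is a hypothetical name, not a function of the proxy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: try to extract one complete RPC record from `buf`.
// Returns the number of bytes consumed (0 if the record is still incomplete)
// and appends the record payload, without fragment headers, to `out`.
static size_t extract_rpc_record(const uint8_t *buf, size_t len, std::vector<uint8_t> & out)
{
    size_t pos = 0;
    std::vector<uint8_t> msg;
    while (true)
    {
        if (len - pos < 4)
            return 0; // fragment header not received yet
        // Big-endian header: top bit = "last fragment", low 31 bits = length
        uint32_t hdr = ((uint32_t)buf[pos] << 24) | ((uint32_t)buf[pos+1] << 16) |
            ((uint32_t)buf[pos+2] << 8) | (uint32_t)buf[pos+3];
        uint32_t frag_len = hdr & 0x7FFFFFFF;
        pos += 4;
        if (len - pos < frag_len)
            return 0; // fragment body not received yet
        msg.insert(msg.end(), buf + pos, buf + pos + frag_len);
        pos += frag_len;
        if (hdr & 0x80000000)
            break; // last fragment of the record
    }
    out.insert(out.end(), msg.begin(), msg.end());
    return pos;
}
```

A caller would keep reading from the socket while the function returns 0, and advance its parse position by the returned byte count once a record is complete. A real server would also cap the accepted record size.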
void nfs_client_t::submit_send()
{
if (write_msg.msg_iovlen || !send_list.size())
{
return;
}
io_uring_sqe* sqe = parent->ringloop->get_sqe();
if (!sqe)
{
write_msg.msg_iovlen = 0;
parent->ringloop->wakeup();
return;
}
write_msg.msg_iov = send_list.data();
write_msg.msg_iovlen = send_list.size() < IOV_MAX ? send_list.size() : IOV_MAX;
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this](ring_data_t *data) { handle_send(data->res); };
my_uring_prep_sendmsg(sqe, nfs_fd, &write_msg, 0);
refs++;
}
bool nfs_client_t::deref()
{
refs--;
if (stopped && refs <= 0)
{
stop();
return true;
}
return false;
}
void nfs_client_t::stop()
{
stopped = true;
if (refs <= 0)
{
parent->epmgr->tfd->set_fd_handler(nfs_fd, true, NULL);
close(nfs_fd);
delete this;
}
}
void nfs_client_t::handle_send(int result)
{
write_msg.msg_iovlen = 0;
if (deref())
return;
if (result <= 0 && result != -EAGAIN && result != -EINTR)
{
fprintf(stderr, "Failed send to client %d: %d (%s)\n", nfs_fd, result, strerror(-result));
stop();
return;
}
if (result > 0)
{
int done = 0;
while (result > 0 && done < send_list.size())
{
iovec & iov = send_list[done];
if (iov.iov_len <= result)
{
auto rop = outbox[done];
if (rop)
{
// Reply fully sent
xdr_reset(rop->xdrs);
parent->xdr_pool.push_back(rop->xdrs);
if (rop->buffer && rop->referenced)
{
// Dereference the buffer
if (rop->buffer == cur_buffer.buf)
{
cur_buffer.refs--;
}
else
{
auto & ub = used_buffers.at(rop->buffer);
assert(ub.refs > 0);
ub.refs--;
if (ub.refs == 0)
{
// FIXME Maybe put free_buffers into parent
free_buffers.push_back((rpc_free_buffer_t){
.buf = rop->buffer,
.size = ub.size,
});
used_buffers.erase(rop->buffer);
}
}
}
free(rop);
}
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base = (uint8_t*)iov.iov_base + result;
break;
}
}
if (done > 0)
{
send_list.erase(send_list.begin(), send_list.begin()+done);
outbox.erase(outbox.begin(), outbox.begin()+done);
}
if (next_send_list.size())
{
send_list.insert(send_list.end(), next_send_list.begin(), next_send_list.end());
outbox.insert(outbox.end(), next_outbox.begin(), next_outbox.end());
next_send_list.clear();
next_outbox.clear();
}
if (outbox.size() > 0)
{
submit_send();
}
}
}
void rpc_queue_reply(rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)rop->client;
iovec *iov_list = NULL;
unsigned iov_count = 0;
int r = xdr_encode(rop->xdrs, (xdrproc_t)xdr_rpc_msg, &rop->out_msg);
assert(r);
if (rop->reply_fn != NULL)
{
r = xdr_encode(rop->xdrs, rop->reply_fn, rop->reply);
assert(r);
}
xdr_encode_finish(rop->xdrs, &iov_list, &iov_count);
assert(iov_count > 0);
rop->reply_marker = 0;
for (unsigned i = 0; i < iov_count; i++)
{
rop->reply_marker += iov_list[i].iov_len;
}
rop->reply_marker = htobe32(rop->reply_marker | 0x80000000);
auto & to_send_list = self->write_msg.msg_iovlen ? self->next_send_list : self->send_list;
auto & to_outbox = self->write_msg.msg_iovlen ? self->next_outbox : self->outbox;
to_send_list.push_back((iovec){ .iov_base = &rop->reply_marker, .iov_len = 4 });
to_outbox.push_back(NULL);
for (unsigned i = 0; i < iov_count; i++)
{
to_send_list.push_back(iov_list[i]);
to_outbox.push_back(NULL);
}
to_outbox[to_outbox.size()-1] = rop;
self->submit_send();
}
int nfs_client_t::handle_rpc_message(void *base_buf, void *msg_buf, uint32_t msg_len)
{
// Take an XDR object from the pool
XDR *xdrs;
if (parent->xdr_pool.size())
{
xdrs = parent->xdr_pool.back();
parent->xdr_pool.pop_back();
}
else
{
xdrs = xdr_create();
}
// Decode the RPC header
char inmsg_data[sizeof(rpc_msg)];
rpc_msg *inmsg = (rpc_msg*)&inmsg_data;
if (!xdr_decode(xdrs, msg_buf, msg_len, (xdrproc_t)xdr_rpc_msg, inmsg))
{
// Invalid message, ignore it
xdr_reset(xdrs);
parent->xdr_pool.push_back(xdrs);
return 0;
}
if (inmsg->body.dir != RPC_CALL)
{
// Reply sent to the server? Strange thing. Also ignore it
xdr_reset(xdrs);
parent->xdr_pool.push_back(xdrs);
return 0;
}
if (inmsg->body.cbody.rpcvers != RPC_MSG_VERSION)
{
// Bad RPC version
rpc_op_t *rop = (rpc_op_t*)malloc_or_die(sizeof(rpc_op_t));
u_int x = RPC_MSG_VERSION;
*rop = (rpc_op_t){
.client = this,
.xdrs = xdrs,
.out_msg = (rpc_msg){
.xid = inmsg->xid,
.body = (rpc_msg_body){
.dir = RPC_REPLY,
.rbody = (rpc_reply_body){
.stat = RPC_MSG_DENIED,
.rreply = (rpc_rejected_reply){
.stat = RPC_MISMATCH,
.mismatch_info = (rpc_mismatch_info){
// Without at least one reference to a non-constant value (local variable or something else),
// with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
// FIXME: get rid of this after raising compiler requirement
.min_version = x,
.max_version = RPC_MSG_VERSION,
},
},
},
},
},
};
rpc_queue_reply(rop);
// Incoming buffer isn't needed to handle request, so return 0
return 0;
}
// Find decoder for the request
auto proc_it = proc_table.find((rpc_service_proc_t){
.prog = inmsg->body.cbody.prog,
.vers = inmsg->body.cbody.vers,
.proc = inmsg->body.cbody.proc,
});
if (proc_it == proc_table.end())
{
// Procedure not implemented
uint32_t min_vers = 0, max_vers = 0;
auto prog_it = proc_table.lower_bound((rpc_service_proc_t){
.prog = inmsg->body.cbody.prog,
});
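// lower_bound() may land on a different (greater) program if the requested one is not registered
if (prog_it != proc_table.end() && prog_it->prog == inmsg->body.cbody.prog)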
{
min_vers = prog_it->vers;
auto max_vers_it = proc_table.lower_bound((rpc_service_proc_t){
.prog = inmsg->body.cbody.prog+1,
});
assert(max_vers_it != proc_table.begin());
max_vers_it--;
assert(max_vers_it->prog == inmsg->body.cbody.prog);
max_vers = max_vers_it->vers;
}
rpc_op_t *rop = (rpc_op_t*)malloc_or_die(sizeof(rpc_op_t));
*rop = (rpc_op_t){
.client = this,
.xdrs = xdrs,
.out_msg = (rpc_msg){
.xid = inmsg->xid,
.body = (rpc_msg_body){
.dir = RPC_REPLY,
.rbody = (rpc_reply_body){
.stat = RPC_MSG_ACCEPTED,
.areply = (rpc_accepted_reply){
.reply_data = (rpc_accepted_reply_body){
.stat = (min_vers == 0
? RPC_PROG_UNAVAIL
: (min_vers <= inmsg->body.cbody.vers &&
max_vers >= inmsg->body.cbody.vers
? RPC_PROC_UNAVAIL
: RPC_PROG_MISMATCH)),
.mismatch_info = (rpc_mismatch_info){ .min_version = min_vers, .max_version = max_vers },
},
},
},
},
},
};
rpc_queue_reply(rop);
// Incoming buffer isn't needed to handle request, so return 0
return 0;
}
// Allocate memory
rpc_op_t *rop = (rpc_op_t*)malloc_or_die(
sizeof(rpc_op_t) + proc_it->req_size + proc_it->resp_size
);
rpc_reply_stat x = RPC_MSG_ACCEPTED;
*rop = (rpc_op_t){
.client = this,
.buffer = (uint8_t*)base_buf,
.xdrs = xdrs,
.out_msg = (rpc_msg){
.xid = inmsg->xid,
.body = (rpc_msg_body){
.dir = RPC_REPLY,
.rbody = (rpc_reply_body){
// Without at least one reference to a non-constant value (local variable or something else),
// with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
// FIXME: get rid of this after raising compiler requirement
.stat = x,
},
},
},
.request = ((uint8_t*)rop) + sizeof(rpc_op_t),
.reply = ((uint8_t*)rop) + sizeof(rpc_op_t) + proc_it->req_size,
};
memcpy(&rop->in_msg, inmsg, sizeof(rpc_msg));
// Try to decode the request
// req_fn may be NULL, that means function has no arguments
if (proc_it->req_fn && !proc_it->req_fn(xdrs, rop->request))
{
// Invalid request
rop->out_msg.body.rbody.areply.reply_data.stat = RPC_GARBAGE_ARGS;
rpc_queue_reply(rop);
// Incoming buffer isn't needed to handle request, so return 0
return 0;
}
rop->out_msg.body.rbody.areply.reply_data.stat = RPC_SUCCESS;
rop->reply_fn = proc_it->resp_fn;
int ref = proc_it->handler_fn(proc_it->opaque, rop);
rop->referenced = ref ? 1 : 0;
return ref;
}
void nfs_proxy_t::daemonize()
{
if (fork())
exit(0);
setsid();
if (fork())
exit(0);
if (chdir("/") != 0)
fprintf(stderr, "Warning: Failed to chdir into /\n");
close(0);
close(1);
close(2);
open("/dev/null", O_RDONLY);
open("/dev/null", O_WRONLY);
open("/dev/null", O_WRONLY);
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
nfs_proxy_t *p = new nfs_proxy_t();
p->run(nfs_proxy_t::parse_args(narg, args));
delete p;
return 0;
}

124
src/nfs_proxy.h Normal file

@@ -0,0 +1,124 @@
#pragma once
#include "cluster_client.h"
#include "epoll_manager.h"
#include "nfs_portmap.h"
#include "nfs/xdr_impl.h"
#define RPC_INIT_BUF_SIZE 32768
class cli_tool_t;
struct nfs_dir_t
{
uint64_t id;
uint64_t mod_rev;
timespec mtime;
};
class nfs_proxy_t
{
public:
std::string bind_address;
std::string name_prefix;
uint64_t fsid = 1;
uint64_t server_id = 0;
std::string default_pool;
std::string export_root;
bool portmap_enabled;
unsigned nfs_port;
pool_id_t default_pool_id = 0;
portmap_service_t pmap;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
cli_tool_t *cmd = NULL;
std::vector<XDR*> xdr_pool;
// filehandle = "S"+base64(sha256(full name with prefix)) or "roothandle" for mount root
uint64_t next_dir_id = 2;
// filehandle => dir with name_prefix
std::map<std::string, std::string> dir_by_hash;
// dir with name_prefix => dir info
std::map<std::string, nfs_dir_t> dir_info;
// filehandle => inode ID
std::map<std::string, inode_t> inode_by_hash;
// inode ID => filehandle
std::map<inode_t, std::string> hash_by_inode;
// inode ID => statistics
std::map<inode_t, json11::Json> inode_stats;
// pool ID => statistics
std::map<pool_id_t, json11::Json> pool_stats;
~nfs_proxy_t();
static json11::Json::object parse_args(int narg, const char *args[]);
void run(json11::Json cfg);
void watch_stats();
void parse_stats(etcd_kv_t & kv);
void check_default_pool();
void do_accept(int listen_fd);
void daemonize();
};
struct rpc_cur_buffer_t
{
uint8_t *buf;
unsigned size;
unsigned read_pos;
unsigned parsed_pos;
int refs;
};
struct rpc_used_buffer_t
{
unsigned size;
int refs;
};
struct rpc_free_buffer_t
{
uint8_t *buf;
unsigned size;
};
class nfs_client_t
{
public:
nfs_proxy_t *parent = NULL;
int nfs_fd;
int epoll_events = 0;
int refs = 0;
bool stopped = false;
std::set<rpc_service_proc_t> proc_table;
// Read state
rpc_cur_buffer_t cur_buffer = { 0 };
std::map<uint8_t*, rpc_used_buffer_t> used_buffers;
std::vector<rpc_free_buffer_t> free_buffers;
iovec read_iov;
msghdr read_msg = { 0 };
// Write state
msghdr write_msg = { 0 };
std::vector<iovec> send_list, next_send_list;
std::vector<rpc_op_t*> outbox, next_outbox;
nfs_client_t();
~nfs_client_t();
void select_read_buffer(unsigned wanted_size);
void submit_read(unsigned wanted_size);
void handle_read(int result);
void submit_send();
void handle_send(int result);
int handle_rpc_message(void *base_buf, void *msg_buf, uint32_t msg_len);
bool deref();
void stop();
};
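The file handle scheme noted in `nfs_proxy_t` ("S" + base64 of a SHA-256 digest) keeps handles within the 64-byte limit of NFSv3: a 32-byte digest encodes to 44 base64 characters, so a handle is 45 bytes. The sketch below checks that arithmetic. It assumes standard padded base64, which may differ from the project's `base64_encode`, and it takes a precomputed digest; `b64` and `make_handle` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Standard-alphabet base64 with '=' padding (illustrative, not the project's base64.h)
static std::string b64(const uint8_t *data, size_t len)
{
    static const char tbl[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (size_t i = 0; i < len; i += 3)
    {
        // Pack up to 3 input bytes into a 24-bit group
        uint32_t v = (uint32_t)data[i] << 16;
        if (i+1 < len) v |= (uint32_t)data[i+1] << 8;
        if (i+2 < len) v |= data[i+2];
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += (i+1 < len) ? tbl[(v >> 6) & 63] : '=';
        out += (i+2 < len) ? tbl[v & 63] : '=';
    }
    return out;
}

// Hypothetical helper: build a handle from an already computed 32-byte SHA-256 digest
static std::string make_handle(const uint8_t digest[32])
{
    return "S" + b64(digest, 32);
}
```

Because the handle depends only on the image name, any proxy instance can resolve it without shared state.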


@@ -189,7 +189,7 @@ void osd_t::report_statistics()
for (auto kv: bs->get_inode_space_stats())
{
pool_id_t pool_id = INODE_POOL(kv.first);
-uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
+uint64_t only_inode_num = INODE_NO_POOL(kv.first);
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
@@ -207,7 +207,7 @@ void osd_t::report_statistics()
for (auto kv: inode_stats)
{
pool_id_t pool_id = INODE_POOL(kv.first);
-uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
+uint64_t only_inode_num = (kv.first & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)


@@ -9,7 +9,7 @@
#define POOL_ID_MAX 0x10000
#define POOL_ID_BITS 16
#define INODE_POOL(inode) (pool_id_t)((inode) >> (64 - POOL_ID_BITS))
-#define INODE_NO_POOL(inode) (inode_t)(inode & ((1l << (64-POOL_ID_BITS)) - 1))
+#define INODE_NO_POOL(inode) (inode_t)(inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1))
#define INODE_WITH_POOL(pool_id, inode) (((inode_t)(pool_id) << (64-POOL_ID_BITS)) | INODE_NO_POOL(inode))
// Pool ID is 16 bits long
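The macros in this hunk pack a 16-bit pool ID into the top bits of a 64-bit inode number. The change from `1l` to `(uint64_t)1` matters where `long` is 32 bits wide, because shifting a 32-bit `1` left by 48 is undefined. The sketch below restates the macros as functions to show the packing; the function names are illustrative, and the `typedef`s are assumptions about the project's types.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t inode_t;   // assumed to match the project's inode_t
typedef uint32_t pool_id_t; // assumed to match the project's pool_id_t

static const int POOL_ID_BITS = 16;

// Upper POOL_ID_BITS bits hold the pool ID
static pool_id_t inode_pool(inode_t inode)
{
    return (pool_id_t)(inode >> (64 - POOL_ID_BITS));
}

// Lower 48 bits hold the inode number within the pool
static inode_t inode_no_pool(inode_t inode)
{
    // (uint64_t)1 keeps the shift 64 bits wide even where long is 32 bits
    return inode & (((uint64_t)1 << (64 - POOL_ID_BITS)) - 1);
}

static inode_t inode_with_pool(pool_id_t pool_id, inode_t inode)
{
    return ((inode_t)pool_id << (64 - POOL_ID_BITS)) | inode_no_pool(inode);
}
```

For example, pool 2 and inode 5 combine into `0x0002000000000005`.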


@@ -437,7 +437,7 @@ void pg_t::calc_object_states(int log_level)
st.walk();
if (this->state & (PG_DEGRADED|PG_LEFT_ON_DEAD))
{
-assert(epoch != ((1ul << PG_EPOCH_BITS)-1));
+assert(epoch != (((uint64_t)1 << PG_EPOCH_BITS)-1));
epoch++;
}
}


@@ -144,9 +144,9 @@ resume_3:
}
else
{
-if ((op_data->fact_ver & (1ul<<(64-PG_EPOCH_BITS) - 1)) == (1ul<<(64-PG_EPOCH_BITS) - 1))
+if ((op_data->fact_ver & ((uint64_t)1 << (64-PG_EPOCH_BITS) - 1)) == ((uint64_t)1 << (64-PG_EPOCH_BITS) - 1))
{
-assert(pg.epoch != ((1ul << PG_EPOCH_BITS)-1));
+assert(pg.epoch != (((uint64_t)1 << PG_EPOCH_BITS)-1));
pg.epoch++;
}
op_data->target_ver = op_data->fact_ver + 1;
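A note on the `if` condition in this hunk: in C and C++ binary `-` binds tighter than `<<`, so `x << n - 1` parses as `x << (n - 1)`, both before and after the change. If a mask of the low bits is intended there (an assumption, since the intent is not stated in the diff), the parenthesized form is required. The sketch below contrasts the two spellings; `BITS` is an arbitrary stand-in value chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Silence the "suggest parentheses" warning: the ambiguity is the point of the example
#pragma GCC diagnostic ignored "-Wparentheses"

static const int BITS = 16; // stand-in for 64-PG_EPOCH_BITS, illustrative only

// Parsed as 1 << (BITS - 1): a single bit
static const uint64_t single_bit = (uint64_t)1 << BITS - 1;
// Parenthesized: a mask of the low BITS bits
static const uint64_t low_mask = ((uint64_t)1 << BITS) - 1;
```

Here `single_bit` is `0x8000` while `low_mask` is `0xFFFF`.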


@@ -262,7 +262,7 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
client->pool = qdict_get_try_int(options, "pool", 0);
if (client->pool)
{
-client->inode = (client->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
+client->inode = (client->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
}
client->size = qdict_get_try_int(options, "size", 0);
}

158
src/sha256.c Normal file

@@ -0,0 +1,158 @@
/*********************************************************************
* Filename: sha256.c
* Author: Brad Conte (brad AT bradconte.com)
* Copyright:
* Disclaimer: This code is presented "as is" without any guarantees.
* Details: Implementation of the SHA-256 hashing algorithm.
SHA-256 is one of the three algorithms in the SHA2
specification. The others, SHA-384 and SHA-512, are not
offered in this implementation.
Algorithm specification can be found here:
* http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf
This implementation uses little endian byte order.
*********************************************************************/
/*************************** HEADER FILES ***************************/
#include <stdlib.h>
#include <memory.h>
#include "sha256.h"
/****************************** MACROS ******************************/
#define ROTLEFT(a,b) (((a) << (b)) | ((a) >> (32-(b))))
#define ROTRIGHT(a,b) (((a) >> (b)) | ((a) << (32-(b))))
#define CH(x,y,z) (((x) & (y)) ^ (~(x) & (z)))
#define MAJ(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define EP0(x) (ROTRIGHT(x,2) ^ ROTRIGHT(x,13) ^ ROTRIGHT(x,22))
#define EP1(x) (ROTRIGHT(x,6) ^ ROTRIGHT(x,11) ^ ROTRIGHT(x,25))
#define SIG0(x) (ROTRIGHT(x,7) ^ ROTRIGHT(x,18) ^ ((x) >> 3))
#define SIG1(x) (ROTRIGHT(x,17) ^ ROTRIGHT(x,19) ^ ((x) >> 10))
/**************************** VARIABLES *****************************/
static const WORD k[64] = {
0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};
/*********************** FUNCTION DEFINITIONS ***********************/
void sha256_transform(SHA256_CTX *ctx, const BYTE data[])
{
WORD a, b, c, d, e, f, g, h, i, j, t1, t2, m[64];
for (i = 0, j = 0; i < 16; ++i, j += 4)
m[i] = (data[j] << 24) | (data[j + 1] << 16) | (data[j + 2] << 8) | (data[j + 3]);
for ( ; i < 64; ++i)
m[i] = SIG1(m[i - 2]) + m[i - 7] + SIG0(m[i - 15]) + m[i - 16];
a = ctx->state[0];
b = ctx->state[1];
c = ctx->state[2];
d = ctx->state[3];
e = ctx->state[4];
f = ctx->state[5];
g = ctx->state[6];
h = ctx->state[7];
for (i = 0; i < 64; ++i) {
t1 = h + EP1(e) + CH(e,f,g) + k[i] + m[i];
t2 = EP0(a) + MAJ(a,b,c);
h = g;
g = f;
f = e;
e = d + t1;
d = c;
c = b;
b = a;
a = t1 + t2;
}
ctx->state[0] += a;
ctx->state[1] += b;
ctx->state[2] += c;
ctx->state[3] += d;
ctx->state[4] += e;
ctx->state[5] += f;
ctx->state[6] += g;
ctx->state[7] += h;
}
void sha256_init(SHA256_CTX *ctx)
{
ctx->datalen = 0;
ctx->bitlen = 0;
ctx->state[0] = 0x6a09e667;
ctx->state[1] = 0xbb67ae85;
ctx->state[2] = 0x3c6ef372;
ctx->state[3] = 0xa54ff53a;
ctx->state[4] = 0x510e527f;
ctx->state[5] = 0x9b05688c;
ctx->state[6] = 0x1f83d9ab;
ctx->state[7] = 0x5be0cd19;
}
void sha256_update(SHA256_CTX *ctx, const BYTE data[], size_t len)
{
// Use size_t so the loop also terminates for inputs of 4 GiB and larger
size_t i;
for (i = 0; i < len; ++i) {
ctx->data[ctx->datalen] = data[i];
ctx->datalen++;
if (ctx->datalen == 64) {
sha256_transform(ctx, ctx->data);
ctx->bitlen += 512;
ctx->datalen = 0;
}
}
}
void sha256_final(SHA256_CTX *ctx, BYTE hash[])
{
WORD i;
i = ctx->datalen;
// Pad whatever data is left in the buffer.
if (ctx->datalen < 56) {
ctx->data[i++] = 0x80;
while (i < 56)
ctx->data[i++] = 0x00;
}
else {
ctx->data[i++] = 0x80;
while (i < 64)
ctx->data[i++] = 0x00;
sha256_transform(ctx, ctx->data);
memset(ctx->data, 0, 56);
}
// Append to the padding the total message's length in bits and transform.
ctx->bitlen += ctx->datalen * 8;
ctx->data[63] = ctx->bitlen;
ctx->data[62] = ctx->bitlen >> 8;
ctx->data[61] = ctx->bitlen >> 16;
ctx->data[60] = ctx->bitlen >> 24;
ctx->data[59] = ctx->bitlen >> 32;
ctx->data[58] = ctx->bitlen >> 40;
ctx->data[57] = ctx->bitlen >> 48;
ctx->data[56] = ctx->bitlen >> 56;
sha256_transform(ctx, ctx->data);
// Since this implementation uses little endian byte ordering and SHA uses big endian,
// reverse all the bytes when copying the final state to the output hash.
for (i = 0; i < 4; ++i) {
hash[i] = (ctx->state[0] >> (24 - i * 8)) & 0x000000ff;
hash[i + 4] = (ctx->state[1] >> (24 - i * 8)) & 0x000000ff;
hash[i + 8] = (ctx->state[2] >> (24 - i * 8)) & 0x000000ff;
hash[i + 12] = (ctx->state[3] >> (24 - i * 8)) & 0x000000ff;
hash[i + 16] = (ctx->state[4] >> (24 - i * 8)) & 0x000000ff;
hash[i + 20] = (ctx->state[5] >> (24 - i * 8)) & 0x000000ff;
hash[i + 24] = (ctx->state[6] >> (24 - i * 8)) & 0x000000ff;
hash[i + 28] = (ctx->state[7] >> (24 - i * 8)) & 0x000000ff;
}
}

41
src/sha256.h Normal file

@@ -0,0 +1,41 @@
/*********************************************************************
* Filename: sha256.h
* Author: Brad Conte (brad AT bradconte.com)
* Copyright:
* Disclaimer: This code is presented "as is" without any guarantees.
* Details: Defines the API for the corresponding SHA-256 implementation.
*********************************************************************/
#ifndef SHA256_H
#define SHA256_H
/*************************** HEADER FILES ***************************/
#include <stddef.h>
/****************************** MACROS ******************************/
#define SHA256_BLOCK_SIZE 32 // SHA256 outputs a 32 byte digest
#ifdef __cplusplus
extern "C" {
#endif
/**************************** DATA TYPES ****************************/
typedef unsigned char BYTE; // 8-bit byte
typedef unsigned int WORD; // 32-bit word, change to "long" for 16-bit machines
typedef struct {
BYTE data[64];
WORD datalen;
unsigned long long bitlen;
WORD state[8];
} SHA256_CTX;
/*********************** FUNCTION DECLARATIONS **********************/
void sha256_init(SHA256_CTX *ctx);
void sha256_update(SHA256_CTX *ctx, const BYTE data[], size_t len);
void sha256_final(SHA256_CTX *ctx, BYTE hash[]);
#ifdef __cplusplus
}
#endif
#endif // SHA256_H

View File

@@ -406,7 +406,7 @@ uint64_t crush(uint64_t key, int count, uint64_t *weights)
seed = (key + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (j + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
seed = -log(((double)seed) / (1ul << 32) / (1ul << 32)) * weights[j];
seed = -log(((double)seed) / ((uint64_t)1 << 32) / ((uint64_t)1 << 32)) * weights[j];
if (seed > max)
{
max = seed;
@@ -439,8 +439,8 @@ void crush3(uint64_t key, int count, uint64_t *weights, uint64_t *r, uint64_t to
seed ^= (k2 + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed ^= (k3 + 0xc6a4a7935bd1e995 + (seed << 6) + (seed >> 2));
seed = 2862933555777941757ull*seed + 3037000493ull; // LCPRNG
//seed = ((double)seed) / (1ul << 32) / (1ul << 32) * (weights[k1] + weights[k2] + weights[k3]);
seed = ((double)seed) / (1ul << 32) / (1ul << 32) * (1 -
//seed = ((double)seed) / ((uint64_t)1 << 32) / ((uint64_t)1 << 32) * (weights[k1] + weights[k2] + weights[k3]);
seed = ((double)seed) / ((uint64_t)1 << 32) / ((uint64_t)1 << 32) * (1 -
(1 - 1.0*weights[k1]/total_weight)*
(1 - 1.0*weights[k2]/total_weight)*
(1 - 1.0*weights[k3]/total_weight)
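The change in these two hunks matters on targets where `unsigned long` is 32 bits wide: there `1ul << 32` shifts by the full width of the type, which is undefined behavior in C, whereas `(uint64_t)1 << 32` is always 4294967296. A minimal sketch of the scaling step, assuming only that the intent is to divide a 64-bit value by 2^64; the helper name `seed_to_unit` is hypothetical and does not appear in the source.

```c
#include <stdint.h>

/* Hypothetical helper: scale a 64-bit seed into the range [0, 1] by
 * dividing by 2^32 twice. The divisor is built from a uint64_t so the
 * shift is well defined even where unsigned long is only 32 bits. */
static double seed_to_unit(uint64_t seed)
{
    return ((double)seed) / ((uint64_t)1 << 32) / ((uint64_t)1 << 32);
}
```

The upper bound is inclusive because converting the largest 64-bit values to double can round up to 2^64.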

View File

@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor
Description: Vitastor client library
Version: 0.6.16
Version: 0.6.17
Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir}

View File

@@ -4,7 +4,7 @@ PG_COUNT=16
. `dirname $0`/run_3osds.sh
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -end_fsync=1 \
-rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10

View File

@@ -37,7 +37,7 @@ if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '(
format_error "FAILED: 16 PGS NOT UP"
fi
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=128M -cluster_log_level=10

View File

@@ -5,7 +5,7 @@ ETCD_COUNT=5
. `dirname $0`/run_3osds.sh
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=randwrite \
-etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10
@@ -26,7 +26,7 @@ kill_etcds()
kill_etcds &
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 -rw=randwrite \
-etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10 -runtime=30

View File

@@ -4,7 +4,7 @@
IMG_SIZE=960
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=16 -fsync=16 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=${IMG_SIZE}M -cluster_log_level=10

View File

@@ -21,7 +21,7 @@ if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | le
format_error "Failed to start the PG active+degraded"
fi
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10

View File

@@ -3,7 +3,7 @@
PG_COUNT=16
. `dirname $0`/run_3osds.sh
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 \
-end_fsync=1 -fsync=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10

View File

@@ -6,7 +6,7 @@
$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg","size":'$((32*1024*1024))'}'
LD_PRELOAD="libasan.so.5 build/src/libfio_vitastor.so" \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write \
-etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
@@ -14,11 +14,11 @@ $ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg@0","size":'$((32*1024*
$ETCDCTL put /vitastor/config/inode/1/3 '{"parent_id":2,"name":"testimg","size":'$((32*1024*1024))'}'
# Preload build/src/libfio_vitastor.so so libasan detects all symbols
LD_PRELOAD="libasan.so.5 build/src/libfio_vitastor.so" \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=32 -buffer_pattern=0xdeadface \
-rw=randwrite -etcd=$ETCD_URL -image=testimg -number_ios=1024
LD_PRELOAD="libasan.so.5 build/src/libfio_vitastor.so" \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -rw=read -etcd=$ETCD_URL -pool=1 -inode=3 -size=32M
qemu-img convert -S 4096 -p \

View File

@@ -2,6 +2,7 @@
OSD_COUNT=2
PG_SIZE=2
PG_MINSIZE=1
SCHEME=replicated
. `dirname $0`/run_3osds.sh
@@ -13,7 +14,7 @@ sleep 2
# Write
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 \
-rw=randwrite -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -runtime=10 -number_ios=100

View File

@@ -7,20 +7,20 @@
# Random writes without immediate_commit were stalling OSDs
LD_PRELOAD=libasan.so.5 \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=124k -direct=1 -numjobs=16 -iodepth=4 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=68k -direct=1 -numjobs=16 -iodepth=4 \
-rw=randwrite -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -runtime=10
# Many parallel syncs were crashing the primary OSD at some point
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -numjobs=64 -iodepth=1 -fsync=1 \
-rw=randwrite -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -number_ios=100
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=32 -buffer_pattern=0xdeadface \
-rw=randwrite -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -number_ios=1024

View File

@@ -11,7 +11,7 @@ GLOBAL_CONF='{"immediate_commit":"all"}'
# Test basic write
LD_PRELOAD=libasan.so.5 \
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=1G -cluster_log_level=10
format_green OK