Compare commits

...

44 Commits

Author SHA1 Message Date
Vitaliy Filippov cb282d25e0 Release 0.6.5
- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header
2021-07-10 11:01:21 +03:00
Vitaliy Filippov 8b2a4c9539 Fix centos builds (yum-builddep stopped working in el7, cmake in el8..) 2021-07-10 11:01:21 +03:00
Vitaliy Filippov b66a079892 State basic OpenStack support 2021-07-10 01:11:20 +03:00
Vitaliy Filippov e90bbe6385 Implement OpenStack Cinder driver for Vitastor
It can't delete snapshots yet because Vitastor layer merge isn't
implemented yet. You can only delete volumes with all snapshots.
This will be fixed in the near future.
2021-07-10 01:06:29 +03:00
Vitaliy Filippov 4be761254c Move patches to patches/ 2021-07-09 21:51:19 +03:00
Vitaliy Filippov 7a45c5f86c buster-backports has broken mesa 2021-07-09 12:29:39 +03:00
Vitaliy Filippov bff413584d Fix qemuBlockStorageSourceGetVitastorProps 2021-07-09 02:09:47 +03:00
Vitaliy Filippov bb31050ab5 Add missing image, config_path options to QEMU QAPI 2021-07-09 02:09:47 +03:00
Vitaliy Filippov b52dd6843a Rename qemu_rbd_unescape and qemu_rbd_next_tok to *_vitastor_* 2021-07-03 23:14:44 +03:00
Vitaliy Filippov b66160a7ad Aggregate per-pool statistics in mon 2021-07-03 23:14:44 +03:00
Vitaliy Filippov 30bb602681 Add _VITASTOR to missing switches in libvirt 7.0 patch 2021-06-28 22:00:23 +03:00
Vitaliy Filippov eb0a3adafc Patch libvirt schema, add an example to test libvirt 2021-06-28 01:20:55 +03:00
Vitaliy Filippov 24301b116c Add libvirt 5.0 patch 2021-06-27 18:43:29 +03:00
Vitaliy Filippov 1d00c17d68 Add libvirt 7.5 patch 2021-06-27 10:58:12 +03:00
Vitaliy Filippov 24f19c4b80 Add libvirt 7.0 patch 2021-06-27 00:58:56 +03:00
Vitaliy Filippov dfdf5c1f9c Fix comments in mon.js 2021-06-20 00:23:56 +03:00
Vitaliy Filippov aad7792d3f Check for loops in parent inode chains 2021-06-20 00:23:03 +03:00
Vitaliy Filippov 6ca8afffe5 Add CAS version parameter to the C wrapper 2021-06-19 01:00:52 +03:00
Vitaliy Filippov 511a89948b Rework qemu_proxy into a C wrapper library with public header 2021-06-19 00:39:11 +03:00
Vitaliy Filippov 3de553ecd7 Add a test for CAS write operation 2021-06-15 00:12:35 +03:00
Vitaliy Filippov 9c45d43e74 Extract common 3 OSD code from several test scripts 2021-06-15 00:12:35 +03:00
Vitaliy Filippov 891250d355 Implement CAS writes
From now on, reads will return the server-side object version numbers
and writes and deletes will have an additional "version" parameter
which, if set to a non-zero value, will be atomically compared with
the current version of the object plus 1 and the modification will
fail if it doesn't match.

This feature opens the road to correct online flattening of snapshot
layers and other interesting things.
2021-06-15 00:12:35 +03:00
Vitaliy Filippov f9fe72d40a Release 0.6.4
- Implement a basic Kubernetes CSI driver
- Minor fixes for vitastor-nbd
- Fix build without RDMA broken in 0.6.3
2021-05-16 01:38:01 +03:00
Vitaliy Filippov 10ee4f7c1d Add notes about CSI to README 2021-05-16 01:38:01 +03:00
Vitaliy Filippov fd8244699b Implement basic CSI driver
Currently can create and remove volumes, but resizing and snapshots is not supported yet
2021-05-16 01:15:43 +03:00
Vitaliy Filippov eaac1fc5d1 Log to stderr in etcd_state_client, too 2021-05-16 01:09:25 +03:00
Vitaliy Filippov 57be1923d3 Daemonize NBD_DO_IT process, correctly cleanup unmounted NBD clients 2021-05-16 01:09:25 +03:00
Vitaliy Filippov c467acc388 Fix /v3 appendage to etcd URLs without /v3 2021-05-15 19:22:24 +03:00
Vitaliy Filippov bf591ba3ee Fix nbd module load check 2021-05-15 19:22:24 +03:00
Vitaliy Filippov 699a0fbbc7 Log to stderr instead of stdout in client 2021-05-15 19:22:24 +03:00
Vitaliy Filippov 6b2dd50f27 Fix build without RDMA 2021-05-08 18:20:43 +03:00
Vitaliy Filippov caf2f3c56f Release 0.6.3
- RDMA support
- Client performance optimisations (4k randread ~120k -> ~180k on 1 core)
- JSON configuration file (/etc/vitastor/vitastor.conf) support
- Bug fixes
2021-05-02 17:47:43 +03:00
Vitaliy Filippov 9174f188b1 Build packages with libibverbs
For CentOS 7 it also requires newer rdma-core as CentOS 7's native version doesn't have
implicit ODP support. The updated version is already uploaded into the vitastor repo.
2021-05-02 17:47:16 +03:00
Vitaliy Filippov d3978c6d0e Do not print RDMA connection messages when log_level=0
By the way, it's 1 by default in the OSD, so these messages will still be there in OSD logs
2021-05-01 00:26:09 +03:00
Vitaliy Filippov 4a7365660d Do not wait for down OSDs during sync
Fixes a hang introduced in 0.5.11 in the non-immediate_commit mode
2021-05-01 00:26:07 +03:00
Vitaliy Filippov 818ae5d61d Some config parsing fixes 2021-05-01 00:20:01 +03:00
Vitaliy Filippov 6810e93c3f Add RDMA options to mon.js list 2021-04-30 01:23:22 +03:00
Vitaliy Filippov f6f35f4127 Pass options correctly to not override /etc/vitastor/vitastor.conf 2021-04-30 01:17:44 +03:00
Vitaliy Filippov 72aa2fd819 Make OSD and client read common configuration from /etc/vitastor/vitastor.conf 2021-04-30 01:11:27 +03:00
Vitaliy Filippov 5010b0dd75 Use json11 instead of blockstore_config_t 2021-04-30 00:52:46 +03:00
Vitaliy Filippov 483c5ab380 Negotiate max_msg instead of max_sge, make buffer settings more conservative :-) 2021-04-29 11:10:35 +03:00
Vitaliy Filippov 6a6fd6544d Add RDMA options to the QEMU driver 2021-04-29 11:02:49 +03:00
Vitaliy Filippov 971aa4ae4f Implement RDMA receive with memory copying (send remains zero-copy)
This is the simplest and, as usual, the best implementation :)

100% zero-copy implementation is also possible (see rdma-zerocopy branch),
but it requires to create A LOT of queues (~128 per client) to use QPN as a 'tag'
because of the lack of receive tags and the server may simply run out of queues.
Hardware limit is 262144 on Mellanox ConnectX-4 which amounts to only 2048
'connections' per host. And even with that amount of queues it's still less optimal
than the non-zerocopy one.

In fact, newest hardware like Mellanox ConnectX-5 does have Tag Matching
support, but it's still unsuitable for us because it doesn't support scatter/gather
(tm_caps.max_sge=1).
2021-04-29 02:34:45 +03:00
Vitaliy Filippov 9e6cbc6ebc Negotiate max_sge between RDMA client & server 2021-04-29 02:15:20 +03:00
88 changed files with 6472 additions and 931 deletions

View File

@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8)
project(vitastor) project(vitastor)
set(VERSION "0.6.2") set(VERSION "0.6.5")
add_subdirectory(src) add_subdirectory(src)

View File

@ -22,7 +22,6 @@ Vitastor на данный момент находится в статусе п
Однако следующее уже реализовано: Однако следующее уже реализовано:
0.5.x (стабильная версия):
- Базовая часть - надёжное кластерное блочное хранилище без единой точки отказа - Базовая часть - надёжное кластерное блочное хранилище без единой точки отказа
- Производительность ;-D - Производительность ;-D
- Несколько схем отказоустойчивости: репликация, XOR n+1 (1 диск чётности), коды коррекции ошибок - Несколько схем отказоустойчивости: репликация, XOR n+1 (1 диск чётности), коды коррекции ошибок
@ -43,24 +42,26 @@ Vitastor на данный момент находится в статусе п
- NBD-прокси для монтирования образов ядром ("блочное устройство в режиме пользователя") - NBD-прокси для монтирования образов ядром ("блочное устройство в режиме пользователя")
- Утилита удаления образов/инодов (vitastor-rm) - Утилита удаления образов/инодов (vitastor-rm)
- Пакеты для Debian и CentOS - Пакеты для Debian и CentOS
0.6.x (master-ветка):
- Статистика операций ввода/вывода и занятого места в разрезе инодов - Статистика операций ввода/вывода и занятого места в разрезе инодов
- Именование инодов через хранение их метаданных в etcd - Именование инодов через хранение их метаданных в etcd
- Снапшоты и copy-on-write клоны - Снапшоты и copy-on-write клоны
- Сглаживание производительности случайной записи в SSD+HDD конфигурациях - Сглаживание производительности случайной записи в SSD+HDD конфигурациях
- Поддержка RDMA/RoCEv2 через libibverbs
- CSI-плагин для Kubernetes
- Базовая поддержка OpenStack: драйвер Cinder, патчи для Nova и libvirt
## Планы развития ## Планы развития
- Поддержка удаления снапшотов (слияния слоёв)
- Более корректные скрипты разметки дисков и автоматического запуска OSD - Более корректные скрипты разметки дисков и автоматического запуска OSD
- Другие инструменты администрирования - Другие инструменты администрирования
- Плагины для OpenStack, Kubernetes, OpenNebula, Proxmox и других облачных систем - Плагины для OpenNebula, Proxmox и других облачных систем
- iSCSI-прокси - iSCSI-прокси
- Более быстрое переключение при отказах - Более быстрое переключение при отказах
- Фоновая проверка целостности без контрольных сумм (сверка реплик) - Фоновая проверка целостности без контрольных сумм (сверка реплик)
- Контрольные суммы - Контрольные суммы
- Поддержка SSD-кэширования (tiered storage) - Поддержка SSD-кэширования (tiered storage)
- Поддержка RDMA и NVDIMM - Поддержка NVDIMM
- Web-интерфейс - Web-интерфейс
- Возможно, сжатие - Возможно, сжатие
- Возможно, поддержка кэширования данных через системный page cache - Возможно, поддержка кэширования данных через системный page cache
@ -371,7 +372,7 @@ Vitastor с однопоточной NBD прокси на том же стен
- Установите gcc и g++ 8.x или новее. - Установите gcc и g++ 8.x или новее.
- Склонируйте данный репозиторий с подмодулями: `git clone https://yourcmc.ru/git/vitalif/vitastor/`. - Склонируйте данный репозиторий с подмодулями: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.
- Желательно пересобрать QEMU с патчем, который делает необязательным запуск через LD_PRELOAD. - Желательно пересобрать QEMU с патчем, который делает необязательным запуск через LD_PRELOAD.
См `qemu-*.*-vitastor.patch` - выберите версию, наиболее близкую вашей версии QEMU. См `patches/qemu-*.*-vitastor.patch` - выберите версию, наиболее близкую вашей версии QEMU.
- Установите QEMU 3.0 или новее, возьмите исходные коды установленного пакета, начните его пересборку, - Установите QEMU 3.0 или новее, возьмите исходные коды установленного пакета, начните его пересборку,
через некоторое время остановите её и скопируйте следующие заголовки: через некоторое время остановите её и скопируйте следующие заголовки:
- `<qemu>/include` &rarr; `<vitastor>/qemu/include` - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
@ -510,6 +511,21 @@ vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
Для обращения по номеру инода, аналогично другим командам, можно использовать опции Для обращения по номеру инода, аналогично другим командам, можно использовать опции
`--pool <POOL> --inode <INODE> --size <SIZE>` вместо `--image testimg`. `--pool <POOL> --inode <INODE> --size <SIZE>` вместо `--image testimg`.
### Kubernetes
У Vitastor есть CSI-плагин для Kubernetes, поддерживающий RWO-тома.
Для установки возьмите манифесты из директории [csi/deploy/](csi/deploy/), поместите
вашу конфигурацию подключения к Vitastor в [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
настройте StorageClass в [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
и примените все `NNN-*.yaml` к вашей инсталляции Kubernetes.
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
После этого вы сможете создавать PersistentVolume. Пример смотрите в файле [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
## Известные проблемы ## Известные проблемы
- Запросы удаления объектов могут в данный момент приводить к "неполным" объектам в EC-пулах, - Запросы удаления объектов могут в данный момент приводить к "неполным" объектам в EC-пулах,

View File

@ -16,7 +16,6 @@ with configurable redundancy (replication or erasure codes/XOR).
Vitastor is currently a pre-release, a lot of features are missing and you can still expect Vitastor is currently a pre-release, a lot of features are missing and you can still expect
breaking changes in the future. However, the following is implemented: breaking changes in the future. However, the following is implemented:
0.5.x (stable):
- Basic part: highly-available block storage with symmetric clustering and no SPOF - Basic part: highly-available block storage with symmetric clustering and no SPOF
- Performance ;-D - Performance ;-D
- Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes - Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
@ -37,24 +36,26 @@ breaking changes in the future. However, the following is implemented:
- NBD proxy for kernel mounts - NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm) - Inode removal tool (vitastor-rm)
- Packaging for Debian and CentOS - Packaging for Debian and CentOS
0.6.x (master):
- Per-inode I/O and space usage statistics - Per-inode I/O and space usage statistics
- Inode metadata storage in etcd - Inode metadata storage in etcd
- Snapshots and copy-on-write image clones - Snapshots and copy-on-write image clones
- Write throttling to smooth random write workloads in SSD+HDD configurations - Write throttling to smooth random write workloads in SSD+HDD configurations
- RDMA/RoCEv2 support via libibverbs
- CSI plugin for Kubernetes
- Basic OpenStack support: Cinder driver, Nova and libvirt patches
## Roadmap ## Roadmap
- Snapshot deletion (layer merge) support
- Better OSD creation and auto-start tools - Better OSD creation and auto-start tools
- Other administrative tools - Other administrative tools
- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems - Plugins for OpenNebula, Proxmox and other cloud systems
- iSCSI proxy - iSCSI proxy
- Faster failover - Faster failover
- Scrubbing without checksums (verification of replicas) - Scrubbing without checksums (verification of replicas)
- Checksums - Checksums
- Tiered storage - Tiered storage
- RDMA and NVDIMM support - NVDIMM support
- Web GUI - Web GUI
- Compression (possibly) - Compression (possibly)
- Read caching using system page cache (possibly) - Read caching using system page cache (possibly)
@ -339,7 +340,7 @@ Vitastor with single-thread NBD on the same hardware:
* For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h` * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` and `qapi` are required because they contain generated headers - `config-host.h` and `qapi` are required because they contain generated headers
- You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver. - You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
See `qemu-*.*-vitastor.patch`. See `patches/qemu-*.*-vitastor.patch`.
- Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`. - Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
- Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`. - Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL. Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
@ -460,6 +461,21 @@ It will output the device name, like /dev/nbd0 which you can then format and mou
Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` insteaf of `--image <IMAGE>` if you want. Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` insteaf of `--image <IMAGE>` if you want.
### Kubernetes
Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
configure storage class in [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
and apply all `NNN-*.yaml` manifests to your Kubernetes installation:
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
After that you'll be able to create PersistentVolumes. See example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
## Known Problems ## Known Problems
- Object deletion requests may currently lead to 'incomplete' objects in EC pools - Object deletion requests may currently lead to 'incomplete' objects in EC pools

3
csi/.dockerignore Normal file
View File

@ -0,0 +1,3 @@
vitastor-csi
go.sum
Dockerfile

32
csi/Dockerfile Normal file
View File

@ -0,0 +1,32 @@
# Compile stage
FROM golang:buster AS build
ADD go.mod /app/
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
ADD . /app
RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
# Final stage
FROM debian:buster
LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
LABEL description="Vitastor CSI Driver"
ENV NODE_ID=""
ENV CSI_ENDPOINT=""
RUN apt-get update && \
apt-get install -y wget && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
(echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update && \
apt-get install -y e2fsprogs xfsprogs vitastor kmod && \
apt-get clean && \
(echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
COPY --from=build /app/vitastor-csi /bin/
ENTRYPOINT ["/bin/vitastor-csi"]

9
csi/Makefile Normal file
View File

@ -0,0 +1,9 @@
VERSION ?= v0.6.5
all: build push
build:
@docker build --rm -t vitalif/vitastor-csi:$(VERSION) .
push:
@docker push vitalif/vitastor-csi:$(VERSION)

View File

@ -0,0 +1,5 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: vitastor-system

View File

@ -0,0 +1,9 @@
---
apiVersion: v1
kind: ConfigMap
data:
vitastor.conf: |-
{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
metadata:
namespace: vitastor-system
name: vitastor-config

View File

@ -0,0 +1,37 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get"]
# allow to read Vault Token and connection options from the Tenants namespace
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-csi-nodeplugin
apiGroup: rbac.authorization.k8s.io

View File

@ -0,0 +1,72 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
hostNetwork: true
hostPID: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/run/mount'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
- pathPrefix: '/var/lib/kubelet/pods'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins/csi.vitastor.io'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins_registry'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins'
readOnly: false
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-nodeplugin-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-nodeplugin-psp
apiGroup: rbac.authorization.k8s.io

View File

@ -0,0 +1,140 @@
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor
spec:
selector:
matchLabels:
app: csi-vitastor
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor
spec:
serviceAccountName: vitastor-csi-nodeplugin
hostNetwork: true
hostPID: true
priorityClassName: system-node-critical
# to use e.g. Rook orchestrated cluster, and mons' FQDN is
# resolved through k8s service, set dns policy to cluster first
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: driver-registrar
# This is necessary only for systems with SELinux, where
# non-privileged sidecar containers cannot access unix domain socket
# created by privileged CSI driver container.
securityContext:
privileged: true
image: k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.2.0
args:
- "--v=5"
- "--csi-address=/csi/csi.sock"
- "--kubelet-registration-path=/var/lib/kubelet/plugins/csi.vitastor.io/csi.sock"
env:
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: registration-dir
mountPath: /registration
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.6.5
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi.sock
imagePullPolicy: "IfNotPresent"
ports:
- containerPort: 9898
name: healthz
protocol: TCP
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: healthz
initialDelaySeconds: 10
timeoutSeconds: 3
periodSeconds: 2
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /run/mount
name: host-mount
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
- name: plugin-dir
mountPath: /var/lib/kubelet/plugins
mountPropagation: "Bidirectional"
- name: mountpoint-dir
mountPath: /var/lib/kubelet/pods
mountPropagation: "Bidirectional"
- name: liveness-probe
securityContext:
privileged: true
image: quay.io/k8scsi/livenessprobe:v1.1.0
args:
- "--csi-address=$(CSI_ENDPOINT)"
- "--health-port=9898"
env:
- name: CSI_ENDPOINT
value: unix://csi/csi.sock
volumeMounts:
- mountPath: /csi
name: socket-dir
volumes:
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi.vitastor.io
type: DirectoryOrCreate
- name: plugin-dir
hostPath:
path: /var/lib/kubelet/plugins
type: Directory
- name: mountpoint-dir
hostPath:
path: /var/lib/kubelet/pods
type: DirectoryOrCreate
- name: registration-dir
hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: host-mount
hostPath:
path: /run/mount
- name: lib-modules
hostPath:
path: /lib/modules
- name: vitastor-config
configMap:
name: vitastor-config

View File

@ -0,0 +1,102 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-runner
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["persistentvolumeclaims/status"]
verbs: ["update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshots"]
verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents"]
verbs: ["create", "get", "list", "watch", "update", "delete"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments/status"]
verbs: ["patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["csinodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-role
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-cfg
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "watch", "list", "delete", "update", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-role-cfg
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-external-provisioner-cfg
apiGroup: rbac.authorization.k8s.io

View File

@ -0,0 +1,60 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-provisioner-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-psp
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-provisioner-psp
apiGroup: rbac.authorization.k8s.io

View File

@ -0,0 +1,159 @@
---
kind: Service
apiVersion: v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
labels:
app: csi-metrics
spec:
selector:
app: csi-vitastor-provisioner
ports:
- name: http-metrics
port: 8080
protocol: TCP
targetPort: 8680
---
kind: Deployment
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
spec:
replicas: 3
selector:
matchLabels:
app: csi-vitastor-provisioner
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor-provisioner
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- csi-vitastor-provisioner
topologyKey: "kubernetes.io/hostname"
serviceAccountName: vitastor-csi-provisioner
priorityClassName: system-cluster-critical
containers:
- name: csi-provisioner
image: k8s.gcr.io/sig-storage/csi-provisioner:v2.2.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--retry-interval-start=500ms"
- "--leader-election=true"
# set it to true to use topology based provisioning
- "--feature-gates=Topology=false"
# if fstype is not specified in storageclass, ext4 is default
- "--default-fstype=ext4"
- "--extra-create-metadata=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-snapshotter
image: k8s.gcr.io/sig-storage/csi-snapshotter:v4.0.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
securityContext:
privileged: true
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-attacher
image: k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
args:
- "--v=5"
- "--csi-address=$(ADDRESS)"
- "--leader-election=true"
- "--retry-interval-start=500ms"
env:
- name: ADDRESS
value: /csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-resizer
image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election"
- "--retry-interval-start=500ms"
- "--handle-volume-inuse-error=false"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.6.5
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
volumes:
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: lib-modules
hostPath:
path: /lib/modules
- name: socket-dir
emptyDir: {
medium: "Memory"
}
- name: vitastor-config
configMap:
name: vitastor-config

View File

@ -0,0 +1,11 @@
---
# if Kubernetes version is less than 1.18 change
# apiVersion to storage.k8s.io/v1betav1
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
namespace: vitastor-system
name: csi.vitastor.io
spec:
attachRequired: true
podInfoOnMount: false

View File

@ -0,0 +1,19 @@
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
namespace: vitastor-system
name: vitastor
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vitastor.io
volumeBindingMode: Immediate
parameters:
etcdVolumePrefix: ""
poolId: "1"
# you can choose other configuration file if you have it in the config map
#configPath: "/etc/vitastor/vitastor.conf"
# you can also specify etcdUrl here, maybe to connect to another Vitastor cluster
# multiple etcdUrls may be specified, delimited by comma
#etcdUrl: "http://192.168.7.2:2379"
#etcdPrefix: "/vitastor"

View File

@ -0,0 +1,12 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc
spec:
storageClassName: vitastor
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi

35
csi/go.mod Normal file
View File

@ -0,0 +1,35 @@
module vitastor.io/csi
go 1.15
require (
github.com/container-storage-interface/spec v1.4.0
github.com/coreos/bbolt v0.0.0-00010101000000-000000000000 // indirect
github.com/coreos/etcd v3.3.25+incompatible // indirect
github.com/coreos/go-semver v0.3.0 // indirect
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf // indirect
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f // indirect
github.com/dustin/go-humanize v1.0.0 // indirect
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
github.com/gorilla/websocket v1.4.2 // indirect
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0 // indirect
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 // indirect
github.com/grpc-ecosystem/grpc-gateway v1.16.0 // indirect
github.com/jonboulle/clockwork v0.2.2 // indirect
github.com/kubernetes-csi/csi-lib-utils v0.9.1
github.com/soheilhy/cmux v0.1.5 // indirect
github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802 // indirect
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2 // indirect
go.etcd.io/bbolt v0.0.0-00010101000000-000000000000 // indirect
go.etcd.io/etcd v3.3.25+incompatible
golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb
google.golang.org/grpc v1.33.1
k8s.io/klog v1.0.0
k8s.io/utils v0.0.0-20210305010621-2afb4311ab10
)
replace github.com/coreos/bbolt => go.etcd.io/bbolt v1.3.5
replace go.etcd.io/bbolt => github.com/coreos/bbolt v1.3.5
replace google.golang.org/grpc => google.golang.org/grpc v1.25.1

22
csi/src/config.go Normal file
View File

@ -0,0 +1,22 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.6.5"
)
// Config struct fills the parameters of request or user input
type Config struct
{
Endpoint string
NodeID string
}
// NewConfig returns config struct to initialize new driver
func NewConfig() *Config
{
return &Config{}
}

530
csi/src/controllerserver.go Normal file
View File

@ -0,0 +1,530 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"encoding/json"
"strings"
"bytes"
"strconv"
"time"
"fmt"
"os"
"os/exec"
"io/ioutil"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"go.etcd.io/etcd/clientv3"
"github.com/container-storage-interface/spec/lib/go/csi"
)
const (
KB int64 = 1024
MB int64 = 1024 * KB
GB int64 = 1024 * MB
TB int64 = 1024 * GB
ETCD_TIMEOUT time.Duration = 15*time.Second
)
type InodeIndex struct
{
Id uint64 `json:"id"`
PoolId uint64 `json:"pool_id"`
}
type InodeConfig struct
{
Name string `json:"name"`
Size uint64 `json:"size,omitempty"`
ParentPool uint64 `json:"parent_pool,omitempty"`
ParentId uint64 `json:"parent_id,omitempty"`
Readonly bool `json:"readonly,omitempty"`
}
type ControllerServer struct
{
*Driver
}
// NewControllerServer create new instance controller
func NewControllerServer(driver *Driver) *ControllerServer
{
return &ControllerServer{
Driver: driver,
}
}
func GetConnectionParams(params map[string]string) (map[string]string, []string, string)
{
ctxVars := make(map[string]string)
configPath := params["configPath"]
if (configPath == "")
{
configPath = "/etc/vitastor/vitastor.conf"
}
else
{
ctxVars["configPath"] = configPath
}
config := make(map[string]interface{})
if configFD, err := os.Open(configPath); err == nil
{
defer configFD.Close()
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
}
// Try to load prefix & etcd URL from the config
var etcdUrl []string
if (params["etcdUrl"] != "")
{
ctxVars["etcdUrl"] = params["etcdUrl"]
etcdUrl = strings.Split(params["etcdUrl"], ",")
}
if (len(etcdUrl) == 0)
{
switch config["etcd_address"].(type)
{
case string:
etcdUrl = strings.Split(config["etcd_address"].(string), ",")
case []string:
etcdUrl = config["etcd_address"].([]string)
}
}
etcdPrefix := params["etcdPrefix"]
if (etcdPrefix == "")
{
etcdPrefix, _ = config["etcd_prefix"].(string)
if (etcdPrefix == "")
{
etcdPrefix = "/vitastor"
}
}
else
{
ctxVars["etcdPrefix"] = etcdPrefix
}
return ctxVars, etcdUrl, etcdPrefix
}
// Create the volume
func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
{
klog.Infof("received controller create volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request cannot be empty")
}
if (req.GetName() == "")
{
return nil, status.Error(codes.InvalidArgument, "name is a required field")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volume capabilities is a required field")
}
etcdVolumePrefix := req.Parameters["etcdVolumePrefix"]
poolId, _ := strconv.ParseUint(req.Parameters["poolId"], 10, 64)
if (poolId == 0)
{
return nil, status.Error(codes.InvalidArgument, "poolId is missing in storage class configuration")
}
volName := etcdVolumePrefix + req.GetName()
volSize := 1 * GB
if capRange := req.GetCapacityRange(); capRange != nil
{
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
}
// FIXME: The following should PROBABLY be implemented externally in a management tool
ctxVars, etcdUrl, etcdPrefix := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Connect to etcd
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
var imageId uint64 = 0
for
{
// Check if the image exists
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) > 0)
{
kv := resp.Kvs[0]
var v InodeIndex
err := json.Unmarshal(kv.Value, &v)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
poolId = v.PoolId
imageId = v.Id
inodeCfgKey := fmt.Sprintf("/config/inode/%d/%d", poolId, imageId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.Internal, "missing "+inodeCfgKey+" key in etcd")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
if (inodeCfg.Size < uint64(volSize))
{
return nil, status.Error(codes.Internal, "image "+volName+" is already created, but size is less than expected")
}
}
else
{
// Find a free ID
// Create image metadata in a transaction verifying that the image doesn't exist yet AND ID is still free
maxIdKey := fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, maxIdKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
var modRev int64
var nextId uint64
if (len(resp.Kvs) > 0)
{
var err error
nextId, err = strconv.ParseUint(string(resp.Kvs[0].Value), 10, 64)
if (err != nil)
{
return nil, status.Error(codes.Internal, maxIdKey+" contains invalid ID")
}
modRev = resp.Kvs[0].ModRevision
nextId++
}
else
{
nextId = 1
}
inodeIdxJson, _ := json.Marshal(InodeIndex{
Id: nextId,
PoolId: poolId,
})
inodeCfgJson, _ := json.Marshal(InodeConfig{
Name: volName,
Size: uint64(volSize),
})
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).If(
clientv3.Compare(clientv3.ModRevision(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)), "=", modRev),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)), "=", 0),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId)), "=", 0),
).Then(
clientv3.OpPut(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId), fmt.Sprintf("%d", nextId)),
clientv3.OpPut(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName), string(inodeIdxJson)),
clientv3.OpPut(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId), string(inodeCfgJson)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to commit transaction in etcd: "+err.Error())
}
if (txnResp.Succeeded)
{
imageId = nextId
break
}
// Start over if the transaction fails
}
}
ctxVars["name"] = volName
volumeIdJson, _ := json.Marshal(ctxVars)
return &csi.CreateVolumeResponse{
Volume: &csi.Volume{
// Ugly, but VolumeContext isn't passed to DeleteVolume :-(
VolumeId: string(volumeIdJson),
CapacityBytes: volSize,
},
}, nil
}
// DeleteVolume deletes the given volume
func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error)
{
klog.Infof("received controller delete volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Error(codes.InvalidArgument, "request cannot be empty")
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
// Find inode by name
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var idx InodeIndex
err = json.Unmarshal(resp.Kvs[0].Value, &idx)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
// Get inode config
inodeCfgKey := fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err = cli.Get(ctx, inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
// Delete inode data by invoking vitastor-rm
args := []string{
"--etcd_address", strings.Join(etcdUrl, ","),
"--pool", fmt.Sprintf("%d", idx.PoolId),
"--inode", fmt.Sprintf("%d", idx.Id),
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-rm", args...)
var stderr bytes.Buffer
c.Stdout = nil
c.Stderr = &stderr
err = c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-rm failed: %s, status %s\n", stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
// Delete inode config in etcd
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).Then(
clientv3.OpDelete(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)),
clientv3.OpDelete(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: "+err.Error())
}
if (!txnResp.Succeeded)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: transaction failed")
}
return &csi.DeleteVolumeResponse{}, nil
}
// ControllerPublishVolume return Unimplemented error
func (cs *ControllerServer) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerUnpublishVolume return Unimplemented error
func (cs *ControllerServer) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ValidateVolumeCapabilities checks whether the volume capabilities requested are supported.
func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error)
{
klog.Infof("received controller validate volume capability request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request is nil")
}
volumeID := req.GetVolumeId()
if (volumeID == "")
{
return nil, status.Error(codes.InvalidArgument, "volumeId is nil")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volumeCapabilities is nil")
}
var volumeCapabilityAccessModes []*csi.VolumeCapability_AccessMode
for _, mode := range []csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
} {
volumeCapabilityAccessModes = append(volumeCapabilityAccessModes, &csi.VolumeCapability_AccessMode{Mode: mode})
}
capabilitySupport := false
for _, capability := range volumeCapabilities
{
for _, volumeCapabilityAccessMode := range volumeCapabilityAccessModes
{
if (volumeCapabilityAccessMode.Mode == capability.AccessMode.Mode)
{
capabilitySupport = true
}
}
}
if (!capabilitySupport)
{
return nil, status.Errorf(codes.NotFound, "%v not supported", req.GetVolumeCapabilities())
}
return &csi.ValidateVolumeCapabilitiesResponse{
Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
VolumeCapabilities: req.VolumeCapabilities,
},
}, nil
}
// ListVolumes returns a list of volumes
func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// GetCapacity returns the capacity of the storage pool
func (cs *ControllerServer) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetCapabilities returns the capabilities of the controller service.
func (cs *ControllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error)
{
functionControllerServerCapabilities := func(cap csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability
{
return &csi.ControllerServiceCapability{
Type: &csi.ControllerServiceCapability_Rpc{
Rpc: &csi.ControllerServiceCapability_RPC{
Type: cap,
},
},
}
}
var controllerServerCapabilities []*csi.ControllerServiceCapability
for _, capability := range []csi.ControllerServiceCapability_RPC_Type{
csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT,
} {
controllerServerCapabilities = append(controllerServerCapabilities, functionControllerServerCapabilities(capability))
}
return &csi.ControllerGetCapabilitiesResponse{
Capabilities: controllerServerCapabilities,
}, nil
}
// CreateSnapshot create snapshot of an existing PV
func (cs *ControllerServer) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// DeleteSnapshot delete provided snapshot of a PV
func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteSnapshotRequest) (*csi.DeleteSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ListSnapshots list the snapshots of a PV
func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnapshotsRequest) (*csi.ListSnapshotsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerExpandVolume resizes a volume
func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetVolume get volume info
func (cs *ControllerServer) ControllerGetVolume(ctx context.Context, req *csi.ControllerGetVolumeRequest) (*csi.ControllerGetVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}

137
csi/src/grpc.go Normal file
View File

@ -0,0 +1,137 @@
/*
Copyright 2017 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package vitastor
import (
"fmt"
"net"
"os"
"strings"
"sync"
"github.com/golang/glog"
"golang.org/x/net/context"
"google.golang.org/grpc"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
)
// Defines Non blocking GRPC server interfaces
type NonBlockingGRPCServer interface {
// Start services at the endpoint
Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer)
// Waits for the service to stop
Wait()
// Stops the service gracefully
Stop()
// Stops the service forcefully
ForceStop()
}
func NewNonBlockingGRPCServer() NonBlockingGRPCServer {
return &nonBlockingGRPCServer{}
}
// NonBlocking server
type nonBlockingGRPCServer struct {
wg sync.WaitGroup
server *grpc.Server
}
func (s *nonBlockingGRPCServer) Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
s.wg.Add(1)
go s.serve(endpoint, ids, cs, ns)
return
}
func (s *nonBlockingGRPCServer) Wait() {
s.wg.Wait()
}
func (s *nonBlockingGRPCServer) Stop() {
s.server.GracefulStop()
}
func (s *nonBlockingGRPCServer) ForceStop() {
s.server.Stop()
}
func (s *nonBlockingGRPCServer) serve(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
proto, addr, err := ParseEndpoint(endpoint)
if err != nil {
glog.Fatal(err.Error())
}
if proto == "unix" {
addr = "/" + addr
if err := os.Remove(addr); err != nil && !os.IsNotExist(err) {
glog.Fatalf("Failed to remove %s, error: %s", addr, err.Error())
}
}
listener, err := net.Listen(proto, addr)
if err != nil {
glog.Fatalf("Failed to listen: %v", err)
}
opts := []grpc.ServerOption{
grpc.UnaryInterceptor(logGRPC),
}
server := grpc.NewServer(opts...)
s.server = server
if ids != nil {
csi.RegisterIdentityServer(server, ids)
}
if cs != nil {
csi.RegisterControllerServer(server, cs)
}
if ns != nil {
csi.RegisterNodeServer(server, ns)
}
glog.Infof("Listening for connections on address: %#v", listener.Addr())
server.Serve(listener)
}
func ParseEndpoint(ep string) (string, string, error) {
if strings.HasPrefix(strings.ToLower(ep), "unix://") || strings.HasPrefix(strings.ToLower(ep), "tcp://") {
s := strings.SplitN(ep, "://", 2)
if s[1] != "" {
return s[0], s[1], nil
}
}
return "", "", fmt.Errorf("Invalid endpoint: %v", ep)
}
func logGRPC(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
glog.V(3).Infof("GRPC call: %s", info.FullMethod)
glog.V(5).Infof("GRPC request: %s", protosanitizer.StripSecrets(req))
resp, err := handler(ctx, req)
if err != nil {
glog.Errorf("GRPC error: %v", err)
} else {
glog.V(5).Infof("GRPC response: %s", protosanitizer.StripSecrets(resp))
}
return resp, err
}

60
csi/src/identityserver.go Normal file
View File

@ -0,0 +1,60 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"github.com/container-storage-interface/spec/lib/go/csi"
)
// IdentityServer struct of Vitastor CSI driver with supported methods of CSI identity server spec.
type IdentityServer struct
{
*Driver
}
// NewIdentityServer create new instance identity
func NewIdentityServer(driver *Driver) *IdentityServer
{
return &IdentityServer{
Driver: driver,
}
}
// GetPluginInfo returns metadata of the plugin
func (is *IdentityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error)
{
klog.Infof("received identity plugin info request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginInfoResponse{
Name: vitastorCSIDriverName,
VendorVersion: vitastorCSIDriverVersion,
}, nil
}
// GetPluginCapabilities returns available capabilities of the plugin
func (is *IdentityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error)
{
klog.Infof("received identity plugin capabilities request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginCapabilitiesResponse{
Capabilities: []*csi.PluginCapability{
{
Type: &csi.PluginCapability_Service_{
Service: &csi.PluginCapability_Service{
Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
},
},
},
},
}, nil
}
// Probe returns the health and readiness of the plugin
func (is *IdentityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error)
{
return &csi.ProbeResponse{}, nil
}

279
csi/src/nodeserver.go Normal file
View File

@ -0,0 +1,279 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"os"
"os/exec"
"encoding/json"
"strings"
"bytes"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"k8s.io/utils/mount"
utilexec "k8s.io/utils/exec"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
)
// NodeServer struct of Vitastor CSI driver with supported methods of CSI node server spec.
type NodeServer struct
{
*Driver
mounter mount.Interface
}
// NewNodeServer create new instance node
func NewNodeServer(driver *Driver) *NodeServer
{
return &NodeServer{
Driver: driver,
mounter: mount.New(""),
}
}
// NodeStageVolume mounts the volume to a staging path on the node.
func (ns *NodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error)
{
return &csi.NodeStageVolumeResponse{}, nil
}
// NodeUnstageVolume unstages the volume from the staging path
func (ns *NodeServer) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error)
{
return &csi.NodeUnstageVolumeResponse{}, nil
}
func Contains(list []string, s string) bool
{
for i := 0; i < len(list); i++
{
if (list[i] == s)
{
return true
}
}
return false
}
// NodePublishVolume mounts the volume mounted to the staging path to the target path
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
{
klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
// Check that it's not already mounted
free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
if (error != nil)
{
if (os.IsNotExist(error))
{
error := os.MkdirAll(targetPath, 0777)
if (error != nil)
{
return nil, status.Error(codes.Internal, error.Error())
}
free = true
}
else
{
return nil, status.Error(codes.Internal, error.Error())
}
}
if (!free)
{
return &csi.NodePublishVolumeResponse{}, nil
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--etcd_address", strings.Join(etcdUrl, ","),
"--etcd_prefix", etcdPrefix,
"--image", volName,
};
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (req.GetReadonly())
{
args = append(args, "--readonly", "1")
}
c := exec.Command("/usr/bin/vitastor-nbd", args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err = c.Run()
stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err)
return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
}
devicePath := strings.TrimSpace(stdoutStr)
// Check existing format
diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
existingFormat, err := diskMounter.GetDiskFormat(devicePath)
if (err != nil)
{
klog.Errorf("failed to get disk format for path %s, error: %v", err)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, err
}
// Format the device (ext4 or xfs)
fsType := req.GetVolumeCapability().GetMount().GetFsType()
isBlock := req.GetVolumeCapability().GetBlock() != nil
opt := req.GetVolumeCapability().GetMount().GetMountFlags()
opt = append(opt, "_netdev")
if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||
req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY) &&
!Contains(opt, "ro"))
{
opt = append(opt, "ro")
}
if (fsType == "xfs")
{
opt = append(opt, "nouuid")
}
readOnly := Contains(opt, "ro")
if (existingFormat == "" && !readOnly)
{
args := []string{}
switch fsType
{
case "ext4":
args = []string{"-m0", "-Enodiscard,lazy_itable_init=1,lazy_journal_init=1", devicePath}
case "xfs":
args = []string{"-K", devicePath}
}
if (len(args) > 0)
{
cmdOut, cmdErr := diskMounter.Exec.Command("mkfs."+fsType, args...).CombinedOutput()
if (cmdErr != nil)
{
klog.Errorf("failed to run mkfs error: %v, output: %v", cmdErr, string(cmdOut))
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, cmdErr.Error())
}
}
}
if (isBlock)
{
opt = append(opt, "bind")
err = diskMounter.Mount(devicePath, targetPath, fsType, opt)
}
else
{
err = diskMounter.FormatAndMount(devicePath, targetPath, fsType, opt)
}
if (err != nil)
{
klog.Errorf(
"failed to mount device path (%s) to path (%s) for volume (%s) error: %s",
devicePath, targetPath, volName, err,
)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, err.Error())
}
return &csi.NodePublishVolumeResponse{}, nil
}
// NodeUnpublishVolume unmounts the volume from the target path
func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error)
{
klog.Infof("received node unpublish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
devicePath, refCount, err := mount.GetDeviceNameFromMount(ns.mounter, targetPath)
if (err != nil)
{
if (os.IsNotExist(err))
{
return nil, status.Error(codes.NotFound, "Target path not found")
}
return nil, status.Error(codes.Internal, err.Error())
}
if (devicePath == "")
{
return nil, status.Error(codes.NotFound, "Volume not mounted")
}
// unmount
err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
if (err != nil)
{
return nil, status.Error(codes.Internal, err.Error())
}
// unmap NBD device
if (refCount == 1)
{
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
}
return &csi.NodeUnpublishVolumeResponse{}, nil
}
// NodeGetVolumeStats returns volume capacity statistics available for the volume
func (ns *NodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeExpandVolume expanding the file system on the node
func (ns *NodeServer) NodeExpandVolume(ctx context.Context, req *csi.NodeExpandVolumeRequest) (*csi.NodeExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeGetCapabilities returns the supported capabilities of the node server
func (ns *NodeServer) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error)
{
return &csi.NodeGetCapabilitiesResponse{}, nil
}
// NodeGetInfo returns NodeGetInfoResponse for CO.
func (ns *NodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error)
{
klog.Infof("received node get info request %+v", protosanitizer.StripSecrets(req))
return &csi.NodeGetInfoResponse{
NodeId: ns.NodeID,
}, nil
}

36
csi/src/server.go Normal file
View File

@ -0,0 +1,36 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"k8s.io/klog"
)
type Driver struct
{
*Config
}
// NewDriver create new instance driver
func NewDriver(config *Config) (*Driver, error)
{
if (config == nil)
{
klog.Errorf("Vitastor CSI driver initialization failed")
return nil, nil
}
driver := &Driver{
Config: config,
}
klog.Infof("Vitastor CSI driver initialized")
return driver, nil
}
// Start server
func (driver *Driver) Run()
{
server := NewNonBlockingGRPCServer()
server.Start(driver.Endpoint, NewIdentityServer(driver), NewControllerServer(driver), NewNodeServer(driver))
server.Wait()
}

39
csi/vitastor-csi.go Normal file
View File

@ -0,0 +1,39 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package main
import (
"flag"
"fmt"
"os"
"k8s.io/klog"
"vitastor.io/csi/src"
)
func main()
{
var config = vitastor.NewConfig()
flag.StringVar(&config.Endpoint, "endpoint", "", "CSI endpoint")
flag.StringVar(&config.NodeID, "node", "", "Node ID")
flag.Parse()
if (config.Endpoint == "")
{
config.Endpoint = os.Getenv("CSI_ENDPOINT")
}
if (config.NodeID == "")
{
config.NodeID = os.Getenv("NODE_ID")
}
if (config.Endpoint == "" && config.NodeID == "")
{
fmt.Fprintf(os.Stderr, "Please set -endpoint and -node / CSI_ENDPOINT & NODE_ID env vars\n")
os.Exit(1)
}
drv, err := vitastor.NewDriver(config)
if (err != nil)
{
klog.Fatalln(err)
}
drv.Run()
}

14
debian/changelog vendored
View File

@ -1,8 +1,18 @@
vitastor (0.6.2-1) unstable; urgency=medium vitastor (0.6.5-1) unstable; urgency=medium
* RDMA support
* Bugfixes * Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Tue, 02 Feb 2021 23:01:24 +0300 -- Vitaliy Filippov <vitalif@yourcmc.ru> Sat, 01 May 2021 18:46:10 +0300
vitastor (0.6.0-1) unstable; urgency=medium
* Snapshots and Copy-on-Write clones
* Image metadata in etcd (name, size)
* Image I/O and space statistics in etcd
* Write throttling for smoothing random write workloads in SSD+HDD configurations
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sun, 11 Apr 2021 00:49:18 +0300
vitastor (0.5.1-1) unstable; urgency=medium vitastor (0.5.1-1) unstable; urgency=medium

2
debian/control vendored
View File

@ -2,7 +2,7 @@ Source: vitastor
Section: admin Section: admin
Priority: optional Priority: optional
Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru> Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru>
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev, libibverbs-dev
Standards-Version: 4.5.0 Standards-Version: 4.5.0
Homepage: https://vitastor.io/ Homepage: https://vitastor.io/
Rules-Requires-Root: no Rules-Requires-Root: no

View File

@ -11,6 +11,10 @@ RUN if [ "$REL" = "buster" ]; then \
echo 'Package: *' >> /etc/apt/preferences; \ echo 'Package: *' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \ echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \ echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
echo >> /etc/apt/preferences; \
echo 'Package: libglvnd* libgles* libglx* libgl1 libegl* libopengl* mesa*' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 50' >> /etc/apt/preferences; \
fi; \ fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \ grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \ echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
@ -20,20 +24,22 @@ RUN apt-get update
RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep qemu RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio RUN apt-get -y build-dep fio
# To build a custom version
#RUN cp /root/packages/qemu-orig/* /root
RUN apt-get --download-only source qemu RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio RUN apt-get --download-only source fio
ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/ ADD patches/qemu-5.0-vitastor.patch patches/qemu-5.1-vitastor.patch /root/vitastor/patches/
RUN set -e; \ RUN set -e; \
mkdir -p /root/packages/qemu-$REL; \ mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \ rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \ cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \ dpkg-source -x /root/qemu*.dsc; \
if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \ if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
cp /root/vitastor/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \ cp /root/vitastor/patches/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \ echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
else \ else \
cp /root/vitastor/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \ cp /root/vitastor/patches/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \ P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
echo qemu-5.1-vitastor.patch >> $P/series; \ echo qemu-5.1-vitastor.patch >> $P/series; \
fi; \ fi; \

View File

@ -22,7 +22,7 @@ RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio RUN apt-get -y build-dep fio
RUN apt-get --download-only source qemu RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio RUN apt-get --download-only source fio
RUN apt-get -y install libjerasure-dev cmake RUN apt-get update && apt-get -y install libjerasure-dev cmake libibverbs-dev
ADD . /root/vitastor ADD . /root/vitastor
RUN set -e -x; \ RUN set -e -x; \
@ -40,10 +40,10 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \ mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \ rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.6.2; \ cp -r /root/vitastor vitastor-0.6.5; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.2/qemu; \ ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.5/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.6.2/fio; \ ln -s /root/fio-build/fio-*/ vitastor-0.6.5/fio; \
cd vitastor-0.6.2; \ cd vitastor-0.6.5; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \ sh copy-qemu-includes.sh; \
@ -59,8 +59,8 @@ RUN set -e -x; \
echo "dep:fio=$FIO" > debian/substvars; \ echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \ echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.2.orig.tar.xz vitastor-0.6.2; \ tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.5.orig.tar.xz vitastor-0.6.5; \
cd vitastor-0.6.2; \ cd vitastor-0.6.5; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \ DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \ DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@ -244,6 +244,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
{ {
return null; return null;
} }
// FIXME: use parity_chunks with parity_space instead of pg_minsize
const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length) const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+ Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space; + Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
const pg_count = prev_int_pgs.length; const pg_count = prev_int_pgs.length;

View File

@ -36,11 +36,19 @@ const etcd_allow = new RegExp('^'+[
'history/last_clean_pgs', 'history/last_clean_pgs',
'inode/stats/[1-9]\\d*/[1-9]\\d*', 'inode/stats/[1-9]\\d*/[1-9]\\d*',
'stats', 'stats',
'index/image/.*',
'index/maxid/[1-9]\\d*',
].join('$|^')+'$'); ].join('$|^')+'$');
const etcd_tree = { const etcd_tree = {
config: { config: {
/* global: { /* global: {
// WARNING: NOT ALL OF THESE ARE ACTUALLY CONFIGURABLE HERE
// THIS IS JUST A POOR MAN'S CONFIG DOCUMENTATION
// etcd connection
config_path: "/etc/vitastor/vitastor.conf",
etcd_address: "10.0.115.10:2379/v3",
etcd_prefix: "/vitastor",
// mon // mon
etcd_mon_ttl: 30, // min: 10 etcd_mon_ttl: 30, // min: 10
etcd_mon_timeout: 1000, // ms. min: 0 etcd_mon_timeout: 1000, // ms. min: 0
@ -50,7 +58,17 @@ const etcd_tree = {
osd_out_time: 600, // seconds. min: 0 osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... }, placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
// client and osd // client and osd
tcp_header_buffer_size: 65536,
use_sync_send_recv: false, use_sync_send_recv: false,
use_rdma: true,
rdma_device: null, // for example, "rocep5s0f0"
rdma_port_num: 1,
rdma_gid_index: 0,
rdma_mtu: 4096,
rdma_max_sge: 128,
rdma_max_send: 32,
rdma_max_recv: 8,
rdma_max_msg: 1048576,
log_level: 0, log_level: 0,
block_size: 131072, block_size: 131072,
disk_alignment: 4096, disk_alignment: 4096,
@ -241,14 +259,26 @@ const etcd_tree = {
}, },
inode: { inode: {
stats: { stats: {
/* <inode_t>: { /* <pool_id>: {
<inode_t>: {
raw_used: uint64_t, // raw used bytes on OSDs raw_used: uint64_t, // raw used bytes on OSDs
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t }, read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t }, write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t }, delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
},
}, */ }, */
}, },
}, },
pool: {
stats: {
/* <pool_id>: {
used_raw_tb: float, // used raw space in the pool
total_raw_tb: float, // maximum amount of space in the pool
raw_to_usable: float, // raw to usable ratio
space_efficiency: float, // 0..1
} */
},
},
stats: { stats: {
/* op_stats: { /* op_stats: {
<string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t }, <string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
@ -271,6 +301,17 @@ const etcd_tree = {
history: { history: {
last_clean_pgs: {}, last_clean_pgs: {},
}, },
index: {
image: {
/* <name>: {
id: uint64_t,
pool_id: uint64_t,
}, */
},
maxid: {
/* <pool_id>: uint64_t, */
},
},
}; };
// FIXME Split into several files // FIXME Split into several files
@ -345,6 +386,11 @@ class Mon
{ {
this.config.mon_stats_timeout = 100; this.config.mon_stats_timeout = 100;
} }
this.config.mon_stats_interval = Number(this.config.mon_stats_interval) || 5000;
if (this.config.mon_stats_interval < 100)
{
this.config.mon_stats_interval = 100;
}
// After this number of seconds, a dead OSD will be removed from PG distribution // After this number of seconds, a dead OSD will be removed from PG distribution
this.config.osd_out_time = Number(this.config.osd_out_time) || 0; this.config.osd_out_time = Number(this.config.osd_out_time) || 0;
if (!this.config.osd_out_time) if (!this.config.osd_out_time)
@ -1009,6 +1055,17 @@ class Mon
} }); } });
} }
LPOptimizer.print_change_stats(optimize_result); LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
this.state.pool.stats[pool_id] = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
total_raw_tb: optimize_result.space,
raw_to_usable: pg_effsize / (pool_cfg.pg_size - (pool_cfg.parity_chunks||0)),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history); this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history);
} }
this.state.config.pgs.hash = tree_hash; this.state.config.pgs.hash = tree_hash;
@ -1115,7 +1172,7 @@ class Mon
}, this.config.mon_change_timeout || 1000); }, this.config.mon_change_timeout || 1000);
} }
sum_stats() sum_op_stats()
{ {
const op_stats = {}, subop_stats = {}, recovery_stats = {}; const op_stats = {}, subop_stats = {}, recovery_stats = {};
for (const osd in this.state.osd.stats) for (const osd in this.state.osd.stats)
@ -1176,18 +1233,31 @@ class Mon
write: { count: 0n, usec: 0n, bytes: 0n }, write: { count: 0n, usec: 0n, bytes: 0n },
delete: { count: 0n, usec: 0n, bytes: 0n }, delete: { count: 0n, usec: 0n, bytes: 0n },
}); });
for (const pool_id in this.state.config.pools)
{
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || {};
this.state.pool.stats[pool_id].used_raw_tb = 0n;
}
for (const osd_num in this.state.osd.space) for (const osd_num in this.state.osd.space)
{ {
for (const pool_id in this.state.osd.space[osd_num]) for (const pool_id in this.state.osd.space[osd_num])
{ {
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || { used_raw_tb: 0n };
inode_stats[pool_id] = inode_stats[pool_id] || {}; inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in this.state.osd.space[osd_num][pool_id]) for (const inode_num in this.state.osd.space[osd_num][pool_id])
{ {
const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub(); inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0); inode_stats[pool_id][inode_num].raw_used += u;
this.state.pool.stats[pool_id].used_raw_tb += u;
} }
} }
} }
for (const pool_id in this.state.config.pools)
{
const used = this.state.pool.stats[pool_id].used_raw_tb;
this.state.pool.stats[pool_id].used_raw_tb = Number(used)/1024/1024/1024/1024;
}
for (const osd_num in this.state.osd.inodestats) for (const osd_num in this.state.osd.inodestats)
{ {
const ist = this.state.osd.inodestats[osd_num]; const ist = this.state.osd.inodestats[osd_num];
@ -1259,7 +1329,7 @@ class Mon
async update_total_stats() async update_total_stats()
{ {
const txn = []; const txn = [];
const stats = this.sum_stats(); const stats = this.sum_op_stats();
const object_counts = this.sum_object_counts(); const object_counts = this.sum_object_counts();
const inode_stats = this.sum_inode_stats(); const inode_stats = this.sum_inode_stats();
this.fix_stat_overflows(stats, (this.prev_stats = this.prev_stats || {})); this.fix_stat_overflows(stats, (this.prev_stats = this.prev_stats || {}));
@ -1278,6 +1348,13 @@ class Mon
} }); } });
} }
} }
for (const pool_id in this.state.pool.stats)
{
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
}
if (txn.length) if (txn.length)
{ {
await this.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0); await this.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
@ -1291,11 +1368,17 @@ class Mon
clearTimeout(this.stats_timer); clearTimeout(this.stats_timer);
this.stats_timer = null; this.stats_timer = null;
} }
let sleep = (this.stats_update_next||0) - Date.now();
if (sleep < this.config.mon_stats_timeout)
{
sleep = this.config.mon_stats_timeout;
}
this.stats_timer = setTimeout(() => this.stats_timer = setTimeout(() =>
{ {
this.stats_timer = null; this.stats_timer = null;
this.stats_update_next = Date.now() + this.config.mon_stats_interval;
this.update_total_stats().catch(console.error); this.update_total_stats().catch(console.error);
}, this.config.mon_stats_timeout || 1000); }, sleep);
} }
parse_kv(kv) parse_kv(kv)

948
patches/cinder-vitastor.py Normal file
View File

@ -0,0 +1,948 @@
# Vitastor Driver for OpenStack Cinder
#
# --------------------------------------------
# Install as cinder/volume/drivers/vitastor.py
# --------------------------------------------
#
# Copyright 2020 Vitaliy Filippov
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
"""Cinder Vitastor Driver"""
import binascii
import base64
import errno
import json
import math
import os
import tempfile
from castellan import key_manager
from oslo_config import cfg
from oslo_log import log as logging
from oslo_service import loopingcall
from oslo_concurrency import processutils
from oslo_utils import encodeutils
from oslo_utils import excutils
from oslo_utils import fileutils
from oslo_utils import units
import six
from six.moves.urllib import request
from cinder import exception
from cinder.i18n import _
from cinder.image import image_utils
from cinder import interface
from cinder import objects
from cinder.objects import fields
from cinder import utils
from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.6.5'
LOG = logging.getLogger(__name__)
VITASTOR_OPTS = [
cfg.StrOpt(
'vitastor_config_path',
default='/etc/vitastor/vitastor.conf',
help='Vitastor configuration file path'
),
cfg.StrOpt(
'vitastor_etcd_address',
default='',
help='Vitastor etcd address(es)'),
cfg.StrOpt(
'vitastor_etcd_prefix',
default='/vitastor',
help='Vitastor etcd prefix'
),
cfg.StrOpt(
'vitastor_pool_id',
default='',
help='Vitastor pool ID to use for volumes'
),
# FIXME exclusive_cinder_pool ?
]
CONF = cfg.CONF
CONF.register_opts(VITASTOR_OPTS, group = configuration.SHARED_CONF_GROUP)
class VitastorDriverException(exception.VolumeDriverException):
message = _("Vitastor Cinder driver failure: %(reason)s")
@interface.volumedriver
class VitastorDriver(driver.CloneableImageVD,
driver.ManageableVD, driver.ManageableSnapshotsVD,
driver.BaseVD):
"""Implements Vitastor volume commands."""
cfg = {}
_etcd_urls = []
def __init__(self, active_backend_id = None, *args, **kwargs):
super(VitastorDriver, self).__init__(*args, **kwargs)
self.configuration.append_config_values(VITASTOR_OPTS)
@classmethod
def get_driver_options(cls):
additional_opts = cls._get_oslo_driver_opts(
'reserved_percentage',
'max_over_subscription_ratio',
'volume_dd_blocksize'
)
return VITASTOR_OPTS + additional_opts
def do_setup(self, context):
"""Performs initialization steps that could raise exceptions."""
super(VitastorDriver, self).do_setup(context)
# Make sure configuration is in UTF-8
for attr in [ 'config_path', 'etcd_address', 'etcd_prefix', 'pool_id' ]:
val = self.configuration.safe_get('vitastor_'+attr)
if val is not None:
self.cfg[attr] = utils.convert_str(val)
self.cfg = self._load_config(self.cfg)
def _load_config(self, cfg):
# Try to load configuration file
try:
f = open(cfg['config_path'] or '/etc/vitastor/vitastor.conf')
conf = json.loads(f.read())
f.close()
for k in conf:
cfg[k] = cfg.get(k, conf[k])
except:
pass
if isinstance(cfg['etcd_address'], str):
cfg['etcd_address'] = cfg['etcd_address'].split(',')
# Sanitize etcd URLs
for i, etcd_url in enumerate(cfg['etcd_address']):
ssl = False
if etcd_url.lower().startswith('http://'):
etcd_url = etcd_url[7:]
elif etcd_url.lower().startswith('https://'):
etcd_url = etcd_url[8:]
ssl = True
if etcd_url.find('/') < 0:
etcd_url += '/v3'
if ssl:
etcd_url = 'https://'+etcd_url
else:
etcd_url = 'http://'+etcd_url
cfg['etcd_address'][i] = etcd_url
return cfg
def check_for_setup_error(self):
"""Returns an error if prerequisites aren't met."""
def _encode_etcd_key(self, key):
if not isinstance(key, bytes):
key = str(key).encode('utf-8')
return base64.b64encode(self.cfg['etcd_prefix'].encode('utf-8')+b'/'+key).decode('utf-8')
def _encode_etcd_value(self, value):
if not isinstance(value, bytes):
value = str(value).encode('utf-8')
return base64.b64encode(value).decode('utf-8')
def _encode_etcd_requests(self, obj):
for v in obj:
for rt in v:
if 'key' in v[rt]:
v[rt]['key'] = self._encode_etcd_key(v[rt]['key'])
if 'range_end' in v[rt]:
v[rt]['range_end'] = self._encode_etcd_key(v[rt]['range_end'])
if 'value' in v[rt]:
v[rt]['value'] = self._encode_etcd_value(v[rt]['value'])
def _etcd_txn(self, params):
if 'compare' in params:
for v in params['compare']:
if 'key' in v:
v['key'] = self._encode_etcd_key(v['key'])
if 'failure' in params:
self._encode_etcd_requests(params['failure'])
if 'success' in params:
self._encode_etcd_requests(params['success'])
body = json.dumps(params).encode('utf-8')
headers = {
'Content-Type': 'application/json'
}
err = None
for etcd_url in self.cfg['etcd_address']:
try:
resp = request.urlopen(request.Request(etcd_url+'/kv/txn', body, headers), timeout = 5)
data = json.loads(resp.read())
if 'responses' not in data:
data['responses'] = []
for i, resp in enumerate(data['responses']):
if 'response_range' in resp:
if 'kvs' not in resp['response_range']:
resp['response_range']['kvs'] = []
for kv in resp['response_range']['kvs']:
kv['key'] = base64.b64decode(kv['key'].encode('utf-8')).decode('utf-8')
if kv['key'].startswith(self.cfg['etcd_prefix']+'/'):
kv['key'] = kv['key'][len(self.cfg['etcd_prefix'])+1 : ]
kv['value'] = json.loads(base64.b64decode(kv['value'].encode('utf-8')))
if len(resp.keys()) != 1:
LOG.exception('unknown responses['+str(i)+'] format: '+json.dumps(resp))
else:
resp = data['responses'][i] = resp[list(resp.keys())[0]]
return data
except Exception as e:
LOG.exception('error calling etcd transaction: '+body.decode('utf-8')+'\nerror: '+str(e))
err = e
raise err
def _etcd_foreach(self, prefix, add_fn):
total = 0
batch = 1000
begin = prefix+'/'
while True:
resp = self._etcd_txn({ 'success': [
{ 'request_range': {
'key': begin,
'range_end': prefix+'0',
'limit': batch+1,
} },
] })
i = 0
while i < batch and i < len(resp['responses'][0]['kvs']):
kv = resp['responses'][0]['kvs'][i]
add_fn(kv)
i += 1
if len(resp['responses'][0]['kvs']) <= batch:
break
begin = resp['responses'][0]['kvs'][batch]['key']
return total
def _update_volume_stats(self):
location_info = json.dumps({
'config': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'pool_id': self.configuration.vitastor_pool_id,
})
stats = {
'vendor_name': 'Vitastor',
'driver_version': self.VERSION,
'storage_protocol': 'vitastor',
'total_capacity_gb': 'unknown',
'free_capacity_gb': 'unknown',
# FIXME check if safe_get is required
'reserved_percentage': self.configuration.safe_get('reserved_percentage'),
'multiattach': True,
'thin_provisioning_support': True,
'max_over_subscription_ratio': self.configuration.safe_get('max_over_subscription_ratio'),
'location_info': location_info,
'backend_state': 'down',
'volume_backend_name': self.configuration.safe_get('volume_backend_name') or 'vitastor',
'replication_enabled': False,
}
try:
pool_stats = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'pool/stats/'+str(self.cfg['pool_id']) } }
] })
total_provisioned = 0
def add_total(kv):
nonlocal total_provisioned
if kv['key'].find('@') >= 0:
total_provisioned += kv['value']['size']
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']), lambda kv: add_total(kv))
stats['provisioned_capacity_gb'] = round(total_provisioned/1024.0/1024.0/1024.0, 2)
pool_stats = pool_stats['responses'][0]['kvs']
if len(pool_stats):
pool_stats = pool_stats[0]
stats['free_capacity_gb'] = round(1024.0*(pool_stats['total_raw_tb']-pool_stats['used_raw_tb'])/pool_stats['raw_to_usable'], 2)
stats['total_capacity_gb'] = round(1024.0*pool_stats['total_raw_tb'], 2)
stats['backend_state'] = 'up'
except Exception as e:
# just log and return unknown capacities
LOG.exception('error getting vitastor pool stats: '+str(e))
self._stats = stats
def _next_id(self, resp):
if len(resp['kvs']) == 0:
return (1, 0)
else:
return (1 + resp['kvs'][0]['value'], resp['kvs'][0]['mod_revision'])
def create_volume(self, volume):
"""Creates a logical volume."""
size = int(volume.size) * units.Gi
# FIXME: Check if convert_str is really required
vol_name = utils.convert_str(volume.name)
if vol_name.find('@') >= 0 or vol_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
LOG.debug("creating volume '%s'", vol_name)
self._create_image(vol_name, { 'size': size })
if volume.encryption_key_id:
self._create_encrypted_volume(volume, volume.obj_context)
volume_update = {}
return volume_update
def _create_encrypted_volume(self, volume, context):
"""Create a new LUKS encrypted image directly in Vitastor."""
vol_name = utils.convert_str(volume.name)
f, opts = self._encrypt_opts(volume, context)
# FIXME: Check if it works at all :-)
self._execute(
'qemu-img', 'convert', '-f', 'luks', *opts,
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'%sM' % (volume.size * 1024)
)
f.close()
def _encrypt_opts(self, volume, context):
encryption = volume_utils.check_encryption_provider(self.db, volume, context)
# Fetch the key associated with the volume and decode the passphrase
keymgr = key_manager.API(CONF)
key = keymgr.get(context, encryption['encryption_key_id'])
passphrase = binascii.hexlify(key.get_encoded()).decode('utf-8')
# Decode the dm-crypt style cipher spec into something qemu-img can use
cipher_spec = image_utils.decode_cipher(encryption['cipher'], encryption['key_size'])
tmp_dir = volume_utils.image_conversion_dir()
f = tempfile.NamedTemporaryFile(prefix = 'luks_', dir = tmp_dir)
f.write(passphrase)
f.flush()
return (f, [
'--object', 'secret,id=luks_sec,format=raw,file=%(passfile)s' % {'passfile': f.name},
'-o', 'key-secret=luks_sec,cipher-alg=%(cipher_alg)s,cipher-mode=%(cipher_mode)s,ivgen-alg=%(ivgen_alg)s' % cipher_spec,
])
def create_snapshot(self, snapshot):
"""Creates a volume snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
if snap_name.find('@') >= 0 or snap_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
self._create_snapshot(vol_name, vol_name+'@'+snap_name)
def snapshot_revert_use_temp_snapshot(self):
"""Disable the use of a temporary snapshot on revert."""
return False
def revert_to_snapshot(self, context, volume, snapshot):
"""Revert a volume to a given snapshot."""
# FIXME Delete the image, then recreate it from the snapshot
def delete_snapshot(self, snapshot):
"""Deletes a snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
# Find the snapshot
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name+'@'+snap_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
parents = {}
parents[(pool_id << 48) | (inode_id & 0xffffffffffff)] = True
# Check if there are child volumes
children = self._child_count(parents)
if children > 0:
raise exception.SnapshotIsBusy(snapshot_name = snap_name)
# FIXME: We can't delete snapshots because we can't merge layers yet
raise exception.VolumeBackendAPIException(data = 'Snapshot delete (layer merge) is not implemented yet')
def _child_count(self, parents):
children = 0
def add_child(kv):
nonlocal children
children += self._check_parent(kv, parents)
self._etcd_foreach('config/inode', lambda kv: add_child(kv))
return children
def _check_parent(self, kv, parents):
if 'parent_id' not in kv['value']:
return 0
parent_id = kv['value']['parent_id']
_, _, pool_id, inode_id = kv['key'].split('/')
parent_pool_id = pool_id
if 'parent_pool_id' in kv['value'] and kv['value']['parent_pool_id']:
parent_pool_id = kv['value']['parent_pool_id']
inode = (int(pool_id) << 48) | (int(inode_id) & 0xffffffffffff)
parent = (int(parent_pool_id) << 48) | (int(parent_id) & 0xffffffffffff)
if parent in parents and inode not in parents:
return 1
return 0
def create_cloned_volume(self, volume, src_vref):
"""Create a cloned volume from another volume."""
size = int(volume.size) * units.Gi
src_name = utils.convert_str(src_vref.name)
dest_name = utils.convert_str(volume.name)
if dest_name.find('@') >= 0 or dest_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
# FIXME Do full copy if requested (cfg.disable_clone)
if src_vref.admin_metadata.get('readonly') == 'True':
# source volume is a volume-image cache entry or other readonly volume
# clone without intermediate snapshot
src = self._get_image(src_name)
LOG.debug("creating image '%s' from '%s'", dest_name, src_name)
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': src['idx']['id'],
'parent_pool_id': src['idx']['pool_id'],
})
return {}
clone_snap = "%s@%s.clone_snap" % (src_name, dest_name)
make_img = True
if (volume.display_name and
volume.display_name.startswith('image-') and
src_vref.project_id != volume.project_id):
# idiotic openstack creates image-volume cache entries
# as clones of normal VM volumes... :-X prevent it :-D
clone_snap = dest_name
make_img = False
LOG.debug("creating layer '%s' under '%s'", clone_snap, src_name)
new_cfg = self._create_snapshot(src_name, clone_snap, True)
if make_img:
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': new_cfg['parent_id'],
'parent_pool_id': new_cfg['parent_pool_id'],
})
return {}
def create_volume_from_snapshot(self, volume, snapshot):
"""Creates a cloned volume from an existing snapshot."""
vol_name = utils.convert_str(volume.name)
snap_name = utils.convert_str(snapshot.name)
snap = self._get_image(vol_name+'@'+snap_name)
if not snap:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
snap_inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
snap_pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
size = snap['cfg']['size']
if int(volume.size):
size = int(volume.size) * units.Gi
new_cfg = self._create_image(vol_name, {
'size': size,
'parent_id': snap['idx']['id'],
'parent_pool_id': snap['idx']['pool_id'],
})
return {}
def _vitastor_args(self):
args = []
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
if v:
args.extend(['--'+k, v])
return args
def _qemu_args(self):
args = ''
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
kk = k
if kk == 'etcd_address':
# FIXME use etcd_address in qemu driver
kk = 'etcd_host'
if v:
args += ':'+kk+'='+v.replace(':', '\\:')
return args
def delete_volume(self, volume):
"""Deletes a logical volume."""
vol_name = utils.convert_str(volume.name)
# Find the volume and all its snapshots
range_end = b'index/image/' + vol_name.encode('utf-8')
range_end = range_end[0 : len(range_end)-1] + six.int2byte(range_end[len(range_end)-1] + 1)
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name, 'range_end': range_end } },
] })
if len(resp['responses'][0]['kvs']) == 0:
# already deleted
LOG.info("volume %s no longer exists in backend", vol_name)
return
layers = resp['responses'][0]['kvs']
layer_ids = {}
for kv in layers:
inode_id = int(kv['value']['id'])
pool_id = int(kv['value']['pool_id'])
inode_pool_id = (pool_id << 48) | (inode_id & 0xffffffffffff)
layer_ids[inode_pool_id] = True
# Check if the volume has clones and raise 'busy' if so
children = self._child_count(layer_ids)
if children > 0:
raise exception.VolumeIsBusy(volume_name = vol_name)
# Clear data
for kv in layers:
args = [
'vitastor-rm', '--pool', str(kv['value']['pool_id']),
'--inode', str(kv['value']['id']), '--progress', '0',
*(self._vitastor_args())
]
try:
self._execute(*args)
except processutils.ProcessExecutionError as exc:
LOG.error("Failed to remove layer "+kv['key']+": "+exc)
raise exception.VolumeBackendAPIException(data = exc.stderr)
# Delete all layers from etcd
requests = []
for kv in layers:
requests.append({ 'request_delete_range': { 'key': kv['key'] } })
requests.append({ 'request_delete_range': { 'key': 'config/inode/'+str(kv['value']['pool_id'])+'/'+str(kv['value']['id']) } })
self._etcd_txn({ 'success': requests })
def retype(self, context, volume, new_type, diff, host):
"""Change extra type specifications for a volume."""
# FIXME Maybe (in the future) support multiple pools as different types
return True, {}
def ensure_export(self, context, volume):
"""Synchronously recreates an export for a logical volume."""
pass
def create_export(self, context, volume, connector):
"""Exports the volume."""
pass
def remove_export(self, context, volume):
"""Removes an export for a logical volume."""
pass
def _create_image(self, vol_name, cfg):
pool_s = str(self.cfg['pool_id'])
image_id = 0
while image_id == 0:
# check if the image already exists and find a free ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/maxid/'+pool_s } },
] })
if len(resp['responses'][0]['kvs']) > 0:
# already exists
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' already exists')
image_id, id_mod = self._next_id(resp['responses'][1])
# try to create the image
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+pool_s },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+pool_s+'/'+str(image_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+pool_s, 'value': image_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': image_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+pool_s+'/'+str(image_id), 'value': json.dumps({
**cfg, 'name': vol_name,
}) } },
] })
if not resp.get('succeeded'):
# repeat
image_id = 0
def _create_snapshot(self, vol_name, snap_vol_name, allow_existing = False):
while True:
# check if the image already exists and snapshot doesn't
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/image/'+snap_vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
if len(resp['responses'][1]['kvs']) > 0:
if allow_existing:
snap_idx = resp['responses'][1]['kvs'][0]['value']
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(snap_idx['pool_id'])+'/'+str(snap_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data =
'Volume '+snap_vol_name+' is already indexed, but does not exist'
)
return resp['responses'][0]['kvs'][0]['value']
raise exception.VolumeBackendAPIException(
data = 'Volume '+snap_vol_name+' already exists'
)
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config and find a new ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
{ 'request_range': { 'key': 'index/maxid/'+str(self.cfg['pool_id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_mod = resp['responses'][0]['kvs'][0]['mod_revision']
new_id, id_mod = self._next_id(resp['responses'][1])
# try to redirect image to the new inode
new_cfg = {
**vol_cfg, 'name': vol_name, 'parent_id': vol_idx['id'], 'parent_pool_id': vol_idx['pool_id']
}
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol_idx_mod, 'key': 'index/image/'+vol_name },
{ 'target': 'MOD', 'mod_revision': vol_mod, 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) },
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+str(self.cfg['pool_id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+snap_vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+str(self.cfg['pool_id']), 'value': new_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': new_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id), 'value': json.dumps(new_cfg) } },
{ 'request_put': { 'key': 'index/image/'+snap_vol_name, 'value': json.dumps({
'id': vol_idx['id'], 'pool_id': vol_idx['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']), 'value': json.dumps({
**vol_cfg, 'name': snap_vol_name, 'readonly': True
}) } }
] })
if resp.get('succeeded'):
return new_cfg
def initialize_connection(self, volume, connector):
data = {
'driver_volume_type': 'vitastor',
'data': {
'config_path': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'name': volume.name,
'logical_block_size': 512,
'physical_block_size': 4096,
}
}
LOG.debug('connection data: %s', data)
return data
def terminate_connection(self, volume, connector, **kwargs):
pass
def clone_image(self, context, volume, image_location, image_meta, image_service):
if image_location:
# Note: image_location[0] is glance image direct_url.
# image_location[1] contains the list of all locations (including
# direct_url) or None if show_multiple_locations is False in
# glance configuration.
if image_location[1]:
url_locations = [location['url'] for location in image_location[1]]
else:
url_locations = [image_location[0]]
# iterate all locations to look for a cloneable one.
for url_location in url_locations:
if url_location and url_location.startswith('cinder://'):
# The idea is to use cinder://<volume-id> Glance volumes as base images
base_vol = self.db.volume_get(context, url_location[len('cinder://') : ])
if not base_vol or base_vol.volume_type_id != volume.volume_type_id:
continue
size = int(volume.size) * units.Gi
dest_name = utils.convert_str(volume.name)
# Find or create the base snapshot
snap_cfg = self._create_snapshot(base_vol.name, base_vol.name+'@.clone_snap', True)
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': snap_cfg['parent_id'],
'parent_pool_id': snap_cfg['parent_pool_id'],
})
return ({}, True)
return ({}, False)
def copy_image_to_encrypted_volume(self, context, volume, image_service, image_id):
self.copy_image_to_volume(context, volume, image_service, image_id, encrypted = True)
def copy_image_to_volume(self, context, volume, image_service, image_id, encrypted = False):
tmp_dir = volume_utils.image_conversion_dir()
with tempfile.NamedTemporaryFile(dir = tmp_dir) as tmp:
image_utils.fetch_to_raw(
context, image_service, image_id, tmp.name,
self.configuration.volume_dd_blocksize, size = volume.size
)
out_format = [ '-O', 'raw' ]
if encrypted:
key_file, opts = self._encrypt_opts(volume, context)
out_format = [ '-O', 'luks', *opts ]
dest_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw', tmp.name, *out_format,
'vitastor:image='+dest_name.replace(':', '\\:')+self._qemu_args()
)
if encrypted:
key_file.close()
def copy_volume_to_image(self, context, volume, image_service, image_meta):
tmp_dir = volume_utils.image_conversion_dir()
tmp_file = os.path.join(tmp_dir, volume.name + '-' + image_meta['id'])
with fileutils.remove_path_on_error(tmp_file):
vol_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw',
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'-O', 'raw', tmp_file
)
# FIXME: Copy directly if the destination image is also in Vitastor
volume_utils.upload_volume(context, image_service, image_meta, tmp_file, volume)
os.unlink(tmp_file)
def _get_image(self, vol_name):
# find the image
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_cfg_mod = resp['responses'][0]['kvs'][0]['mod_revision']
return {
'cfg': vol_cfg,
'cfg_mod': vol_cfg_mod,
'idx': vol_idx,
'idx_mod': vol_idx_mod,
}
def extend_volume(self, volume, new_size):
"""Extend an existing volume."""
vol_name = utils.convert_str(volume.name)
while True:
vol = self._get_image(vol_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
# change size
size = int(new_size) * units.Gi
if size == vol['cfg']['size']:
break
resp = self._etcd_txn({ 'compare': [ {
'target': 'MOD',
'mod_revision': vol['cfg_mod'],
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
} ], 'success': [
{ 'request_put': {
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'size': size }),
} },
] })
if resp.get('succeeded'):
break
LOG.debug(
"Extend volume from %(old_size)s GB to %(new_size)s GB.",
{'old_size': volume.size, 'new_size': new_size}
)
def _add_manageable_volume(self, kv, manageable_volumes, cinder_ids):
cfg = kv['value']
if kv['key'].find('@') >= 0:
# snapshot
return
image_id = volume_utils.extract_id_from_volume_name(cfg['name'])
image_info = {
'reference': {'source-name': image_name},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
}
if image_id in cinder_ids:
image_info['cinder_id'] = image_id
image_info['safe_to_manage'] = False
image_info['reason_not_safe'] = 'already managed'
else:
image_info['safe_to_manage'] = True
image_info['reason_not_safe'] = None
manageable_volumes.append(image_info)
def get_manageable_volumes(self, cinder_volumes, marker, limit, offset, sort_keys, sort_dirs):
manageable_volumes = []
cinder_ids = [resource['id'] for resource in cinder_volumes]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_volume(kv, manageable_volumes, cinder_ids))
return volume_utils.paginate_entries_list(
manageable_volumes, marker, limit, offset, sort_keys, sort_dirs)
def _get_existing_name(existing_ref):
if not isinstance(existing_ref, dict):
existing_ref = {"source-name": existing_ref}
if 'source-name' not in existing_ref:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
src_name = utils.convert_str(existing_ref['source-name'])
if not src_name:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
return src_name
def manage_existing_get_size(self, volume, existing_ref):
"""Return size of an existing image for manage_existing.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
src_name = self._get_existing_name(existing_ref)
vol = self._get_image(src_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+src_name+' does not exist')
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing(self, volume, existing_ref):
"""Manages an existing image.
Renames the image name to match the expected name for the volume.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
from_name = self._get_existing_name(existing_ref)
to_name = utils.convert_str(volume.name)
self._rename(from_name, to_name)
def _rename(self, from_name, to_name):
while True:
vol = self._get_image(from_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+from_name+' does not exist')
to = self._get_image(to_name)
if to:
raise exception.VolumeBackendAPIException(data = 'Volume '+to_name+' already exists')
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol['idx_mod'], 'key': 'index/image/'+vol['cfg']['name'] },
{ 'target': 'MOD', 'mod_revision': vol['cfg_mod'], 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+to_name },
], 'success': [
{ 'request_delete_range': { 'key': 'index/image/'+vol['cfg']['name'] } },
{ 'request_put': { 'key': 'index/image/'+to_name, 'value': json.dumps(vol['idx']) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'name': to_name }) } },
] })
if resp.get('succeeded'):
break
def unmanage(self, volume):
pass
def _add_manageable_snapshot(self, kv, manageable_snapshots, cinder_ids):
cfg = kv['value']
dog = kv['key'].find('@')
if dog < 0:
# snapshot
return
image_name = kv['key'][0 : dog]
snap_name = kv['key'][dog+1 : ]
snapshot_id = volume_utils.extract_id_from_snapshot_name(snap_name)
snapshot_info = {
'reference': {'source-name': snap_name},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
'safe_to_manage': False,
'reason_not_safe': None,
'source_reference': {'source-name': image_name}
}
if snapshot_id in cinder_ids:
# Exclude snapshots already managed.
snapshot_info['reason_not_safe'] = ('already managed')
snapshot_info['cinder_id'] = snapshot_id
elif snap_name.endswith('.clone_snap'):
# Exclude clone snapshot.
snapshot_info['reason_not_safe'] = ('used for clone snap')
else:
snapshot_info['safe_to_manage'] = True
manageable_snapshots.append(snapshot_info)
def get_manageable_snapshots(self, cinder_snapshots, marker, limit, offset, sort_keys, sort_dirs):
"""List manageable snapshots in Vitastor."""
manageable_snapshots = []
cinder_snapshot_ids = [resource['id'] for resource in cinder_snapshots]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_volume(kv, manageable_snapshots, cinder_snapshot_ids))
return volume_utils.paginate_entries_list(
manageable_snapshots, marker, limit, offset, sort_keys, sort_dirs)
def manage_existing_snapshot_get_size(self, snapshot, existing_ref):
"""Return size of an existing image for manage_existing.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
vol = self._get_image(vol_name+'@'+snap_name)
if not vol:
raise exception.ManageExistingInvalidReference(
existing_ref=snapshot_name, reason='Specified snapshot does not exist.'
)
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing_snapshot(self, snapshot, existing_ref):
"""Manages an existing snapshot.
Renames the snapshot name to match the expected name for the snapshot.
Error checking done by manage_existing_get_size is not repeated.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
from_name = vol_name+'@'+snap_name
to_name = vol_name+'@'+utils.convert_str(snapshot.name)
self._rename(from_name, to_name)
def unmanage_snapshot(self, snapshot):
"""Removes the specified snapshot from Cinder management."""
pass
def _dumps(self, obj):
return json.dumps(obj, separators=(',', ':'), sort_keys=True)

View File

@ -0,0 +1,23 @@
# Devstack configuration for bridged networking
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
HOST_IP=10.0.2.15
Q_USE_SECGROUP=True
FLOATING_RANGE="10.0.2.0/24"
IPV4_ADDRS_SAFE_TO_USE="10.0.5.0/24"
Q_FLOATING_ALLOCATION_POOL=start=10.0.2.50,end=10.0.2.100
PUBLIC_NETWORK_GATEWAY=10.0.2.2
PUBLIC_INTERFACE=ens3
Q_USE_PROVIDERNET_FOR_PUBLIC=True
Q_AGENT=linuxbridge
Q_ML2_PLUGIN_MECHANISM_DRIVERS=linuxbridge
LB_PHYSICAL_INTERFACE=ens3
PUBLIC_PHYSICAL_NETWORK=default
LB_INTERFACE_MAPPINGS=default:ens3
Q_SERVICE_PLUGIN_CLASSES=
Q_ML2_PLUGIN_TYPE_DRIVERS=flat
Q_ML2_PLUGIN_EXT_DRIVERS=

View File

@ -0,0 +1,609 @@
commit bd283191b3e7a4c6d1c100d3d96e348a1ebffe55
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Sun Jun 27 12:52:40 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index aa50eac..082b4f8 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1728,6 +1728,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -1851,6 +1880,7 @@
<ref name="diskSourceNetworkProtocolHTTP"/>
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 4bf2b5f..dbc011b 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -240,6 +240,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER = 1 << 16,
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 222bb8c..685d255 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8653,6 +8653,10 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
goto cleanup;
}
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
+ src->relPath = virXMLPropString(node, "query");
+ }
+
if ((haveTLS = virXMLPropString(node, "tls")) &&
(src->haveTLS = virTristateBoolTypeFromString(haveTLS)) <= 0) {
virReportError(VIR_ERR_XML_ERROR,
@@ -23849,6 +23853,10 @@ virDomainDiskSourceFormatNetwork(virBufferPtr attrBuf,
virBufferEscapeString(attrBuf, " name='%s'", path ? path : src->path);
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR && src->relPath != NULL) {
+ virBufferEscapeString(attrBuf, " query='%s'", src->relPath);
+ }
+
VIR_FREE(path);
if (src->haveTLS != VIR_TRISTATE_BOOL_ABSENT &&
@@ -30930,6 +30938,7 @@ virDomainDiskTranslateSourcePool(virDomainDiskDefPtr def)
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 55db7a9..7cbe937 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -58,7 +58,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage")
+ "zfs", "vstorage", "vitastor")
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
VIR_STORAGE_POOL_FS_LAST,
@@ -232,6 +232,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -434,6 +446,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
char *format = virXPathString("string(./format/@type)", ctxt);
@@ -1009,6 +1026,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index dc0aa2a..ed4983d 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -91,6 +91,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -422,6 +423,7 @@ VIR_ENUM_DECL(virStoragePartedFs)
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 6ea6a97..3ba45b9 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1478,6 +1478,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1971,6 +1972,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2ea3e94..d5d2273 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
*
* Returns the number of storage pools found or -1 and sets @pools to
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 73e988a..ab7bb81 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -905,6 +905,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index cbf0aa4..096700d 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -959,6 +959,42 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src)
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ virBuffer buf = VIR_BUFFER_INITIALIZER;
+ char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ goto cleanup;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "s:driver", "vitastor",
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->relPath,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ goto cleanup;
+
+cleanup:
+ VIR_FREE(etcd);
+ virBufferFreeAndReset(&buf);
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1174,6 +1210,11 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
return NULL;
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 822d5f8..e375cef 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -975,6 +975,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->relPath)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->relPath);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index ec6b340..f399efa 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10881,6 +10881,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 1d96170..2d24396 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -14687,6 +14687,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external inactive snapshots are not supported on "
@@ -14764,6 +14765,7 @@ qemuDomainSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDefPtr snapdisk
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external active snapshots are not supported on "
@@ -14887,6 +14889,7 @@ qemuDomainSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("internal inactive snapshots are not supported on "
diff --git a/src/qemu/qemu_parse_command.c b/src/qemu/qemu_parse_command.c
index c4650f0..551da41 100644
--- a/src/qemu/qemu_parse_command.c
+++ b/src/qemu/qemu_parse_command.c
@@ -2184,6 +2184,7 @@ qemuParseCommandLine(virFileCachePtr capsCache,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_LAST:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_NONE:
/* ignored for now */
break;
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 4a13e90..33301c7 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1568,6 +1568,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
ignore_value(VIR_STRDUP(stable_path, data->path));
break;
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index bd4b027..b323cd6 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -84,7 +84,8 @@ VIR_ENUM_IMPL(virStorageNetProtocol, VIR_STORAGE_NET_PROTOCOL_LAST,
"ftps",
"tftp",
"ssh",
- "vxhs")
+ "vxhs",
+ "vitastor")
VIR_ENUM_IMPL(virStorageNetHostTransport, VIR_STORAGE_NET_HOST_TRANS_LAST,
"tcp",
@@ -2839,6 +2840,83 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ if (VIR_STRDUP(options, colonstr) < 0)
+ return -1;
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ if (VIR_STRDUP(src->path, p + strlen("image=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ if (VIR_STRDUP(src->relPath, p + strlen("etcd_prefix=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "config_file=")) {
+ if (VIR_STRDUP(src->configFile, p + strlen("config_file=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ goto error;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ goto error;
+ }
+
+ return 0;
+
+error:
+ VIR_FREE(options);
+ return -1;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2942,6 +3020,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
goto cleanup;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3441,6 +3524,56 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return ret;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ int opaque ATTRIBUTE_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ if (VIR_STRDUP(src->path, image) < 0 ||
+ VIR_STRDUP(src->configFile, conf) < 0 ||
+ VIR_STRDUP(src->relPath, etcd_prefix) < 0)
+ return -1;
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ if (VIR_ALLOC_N(src->hosts, nservers) < 0)
+ return -1;
+
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3507,6 +3640,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", virStorageSourceParseBackingJSONSSH, 0},
{"rbd", virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", virStorageSourceParseBackingJSONVitastor, 0},
{"raw", virStorageSourceParseBackingJSONRaw, 0},
{"vxhs", virStorageSourceParseBackingJSONVxHS, 0},
};
@@ -4276,6 +4410,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 1d6161a..8d83bf3 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -134,6 +134,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_TFTP,
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/xenconfig/xen_xl.c b/src/xenconfig/xen_xl.c
index accfc3a..a18f9c3 100644
--- a/src/xenconfig/xen_xl.c
+++ b/src/xenconfig/xen_xl.c
@@ -1535,6 +1535,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 70ca39b..9caef51 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1219,6 +1219,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd ATTRIBUTE_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@ -0,0 +1,657 @@
commit 41cdfe8317d98f70aadedfdbb381effed2641bdd
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Fri Jul 9 01:31:57 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 7dc419b..875433b 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1827,6 +1827,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2083,6 +2112,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 01b7187..c6e9702 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8261,7 +8261,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -31392,6 +31393,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSourcePtr src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 0c50529..fe97574 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -249,6 +249,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -551,6 +563,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1217,6 +1234,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index ffd406e..8868a05 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -110,6 +110,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -474,6 +475,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 9fe8b3f..bf595b0 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1491,6 +1491,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1990,6 +1991,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 6a8ae27..a735bc6 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -942,6 +942,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index 17b93d0..c5a0084 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1601,6 +1601,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index f9c6da2..922dde5 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -938,6 +938,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src,
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1224,6 +1256,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2183,6 +2221,7 @@ qemuBlockGetBackingStoreString(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2560,6 +2599,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 6f970a3..10b39ca 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1034,6 +1034,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 0765dc7..4cff344 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4610,7 +4610,8 @@ qemuDomainValidateStorageSource(virStorageSourcePtr src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -9704,6 +9705,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index ee333c3..674aa58 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -403,6 +403,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -493,6 +494,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObjPtr vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -623,6 +625,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 16bc53a..1e5d820 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1645,6 +1645,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index 29c4c86..a27ad94 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7096,6 +7096,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index 0d3c2af..36e3afc 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -91,6 +91,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
VIR_ENUM_IMPL(virStorageNetHostTransport,
@@ -2880,6 +2881,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2992,6 +3062,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3581,6 +3656,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3759,6 +3882,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
@@ -4503,6 +4627,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 5689c39..3eb4e3c 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -136,6 +136,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 967d1f2..1e8ff7a 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 7835fa6..8841fcf 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1237,6 +1237,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@ -0,0 +1,661 @@
commit c6e1958a1b4974828e8e5852beb252ce6594e670
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Mon Jun 28 01:20:19 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 5ea14b6..a9df168 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1859,6 +1859,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2115,6 +2144,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index d78f846..f7222e3 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8251,7 +8251,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30775,6 +30776,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 2aa9a3d..166ca1f 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -246,6 +246,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -546,6 +558,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1182,6 +1199,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index 76efaac..928149a 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -106,6 +106,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -465,6 +466,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index 5ca06fa..05ded49 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -85,6 +85,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1262,6 +1263,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index 389c7b5..dbf02e3 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -127,6 +127,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 24957d6..4520a73 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1487,6 +1487,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1986,6 +1987,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 56cb9ab..dfb31b9 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -972,6 +972,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index c0905b0..c172378 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1540,6 +1540,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 6627d04..c33f428 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -928,6 +928,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1218,6 +1250,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2231,6 +2269,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2608,6 +2647,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index ea51369..8258632 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1074,6 +1074,43 @@ qemuBuildNetworkDriveStr(virStorageSource *src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index fc60e15..5ab410d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4829,7 +4829,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -10027,6 +10028,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index 4e74ddd..14e5f2e 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -402,6 +402,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -494,6 +495,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObj *vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -647,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index c2ff4b8..70d0689 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1644,6 +1644,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index e48ae72..d7a9b72 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -284,6 +284,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
@@ -396,6 +465,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -984,6 +1058,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
@@ -1162,6 +1284,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index ef0ddab..2173dc3 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7131,6 +7131,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 449b745..7f95cc8 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 18f3839..c8e1436 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1231,6 +1231,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@ -0,0 +1,32 @@
<!-- Example libvirt VM configuration with Vitastor disk -->
<domain type='kvm'>
<name>debian9</name>
<uuid>96f277fb-fd9c-49da-bf21-a5cfd54eb162</uuid>
<memory unit="KiB">524288</memory>
<currentMemory>524288</currentMemory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64'>hvm</type>
<boot dev='hd' />
</os>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='network' device='disk'>
<target dev='vda' bus='virtio' />
<driver name='qemu' type='raw' />
<!-- name is Vitastor image name -->
<!-- config (optional) is the path to Vitastor's configuration file -->
<!-- query (optional) is Vitastor's etcd_prefix -->
<source protocol='vitastor' name='debian9' query='/vitastor' config='/etc/vitastor/vitastor.conf'>
<!-- hosts = etcd addresses -->
<host name='192.168.7.2' port='2379' />
</source>
<!-- required because Vitastor only supports 4k physical sectors -->
<blockio physical_block_size="4096" logical_block_size="512" />
</disk>
<interface type='network'>
<source network='default' />
</interface>
<graphics type='vnc' port='-1' />
</devices>
</domain>

287
patches/nova-20.diff Normal file
View File

@ -0,0 +1,287 @@
diff --git a/nova/virt/image/model.py b/nova/virt/image/model.py
index 971f7e9c07..70ed70d5e2 100644
--- a/nova/virt/image/model.py
+++ b/nova/virt/image/model.py
@@ -129,3 +129,22 @@ class RBDImage(Image):
self.user = user
self.password = password
self.servers = servers
+
+
+class VitastorImage(Image):
+ """Class for images in a remote Vitastor cluster"""
+
+ def __init__(self, name, etcd_address = None, etcd_prefix = None, config_path = None):
+ """Create a new Vitastor image object
+
+ :param name: name of the image
+ :param etcd_address: etcd URL(s) (optional)
+ :param etcd_prefix: etcd prefix (optional)
+ :param config_path: path to the configuration (optional)
+ """
+ super(RBDImage, self).__init__(FORMAT_RAW)
+
+ self.name = name
+ self.etcd_address = etcd_address
+ self.etcd_prefix = etcd_prefix
+ self.config_path = config_path
diff --git a/nova/virt/images.py b/nova/virt/images.py
index 5358f3766a..ebe3d6effb 100644
--- a/nova/virt/images.py
+++ b/nova/virt/images.py
@@ -41,7 +41,7 @@ IMAGE_API = glance.API()
def qemu_img_info(path, format=None):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.unprivileged_qemu_img_info(path, format=format)
@@ -50,7 +50,7 @@ def qemu_img_info(path, format=None):
def privileged_qemu_img_info(path, format=None, output_format='json'):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.privileged_qemu_img_info(path, format=format)
diff --git a/nova/virt/libvirt/config.py b/nova/virt/libvirt/config.py
index f9475776b3..51573fe41d 100644
--- a/nova/virt/libvirt/config.py
+++ b/nova/virt/libvirt/config.py
@@ -1060,6 +1060,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
self.driver_iommu = False
self.source_path = None
self.source_protocol = None
+ self.source_query = None
+ self.source_config = None
self.source_name = None
self.source_hosts = []
self.source_ports = []
@@ -1186,7 +1188,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
elif self.source_type == "mount":
dev.append(etree.Element("source", dir=self.source_path))
elif self.source_type == "network" and self.source_protocol:
- source = etree.Element("source", protocol=self.source_protocol)
+ source = etree.Element("source", protocol=self.source_protocol,
+ query=self.source_query, config=self.source_config)
if self.source_name is not None:
source.set('name', self.source_name)
hosts_info = zip(self.source_hosts, self.source_ports)
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 391231c527..34dc60dcdd 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -179,6 +179,7 @@ VOLUME_DRIVERS = {
'local': 'nova.virt.libvirt.volume.volume.LibvirtVolumeDriver',
'fake': 'nova.virt.libvirt.volume.volume.LibvirtFakeVolumeDriver',
'rbd': 'nova.virt.libvirt.volume.net.LibvirtNetVolumeDriver',
+ 'vitastor': 'nova.virt.libvirt.volume.vitastor.LibvirtVitastorVolumeDriver',
'nfs': 'nova.virt.libvirt.volume.nfs.LibvirtNFSVolumeDriver',
'smbfs': 'nova.virt.libvirt.volume.smbfs.LibvirtSMBFSVolumeDriver',
'fibre_channel': 'nova.virt.libvirt.volume.fibrechannel.LibvirtFibreChannelVolumeDriver', # noqa:E501
@@ -385,10 +386,10 @@ class LibvirtDriver(driver.ComputeDriver):
# This prevents the risk of one test setting a capability
# which bleeds over into other tests.
- # LVM and RBD require raw images. If we are not configured to
+ # LVM, RBD, Vitastor require raw images. If we are not configured to
# force convert images into raw format, then we _require_ raw
# images only.
- raw_only = ('rbd', 'lvm')
+ raw_only = ('rbd', 'lvm', 'vitastor')
requires_raw_image = (CONF.libvirt.images_type in raw_only and
not CONF.force_raw_images)
requires_ploop_image = CONF.libvirt.virt_type == 'parallels'
@@ -775,12 +776,12 @@ class LibvirtDriver(driver.ComputeDriver):
# Some imagebackends are only able to import raw disk images,
# and will fail if given any other format. See the bug
# https://bugs.launchpad.net/nova/+bug/1816686 for more details.
- if CONF.libvirt.images_type in ('rbd',):
+ if CONF.libvirt.images_type in ('rbd', 'vitastor'):
if not CONF.force_raw_images:
msg = _("'[DEFAULT]/force_raw_images = False' is not "
- "allowed with '[libvirt]/images_type = rbd'. "
+ "allowed with '[libvirt]/images_type = rbd' or 'vitastor'. "
"Please check the two configs and if you really "
- "do want to use rbd as images_type, set "
+ "do want to use rbd or vitastor as images_type, set "
"force_raw_images to True.")
raise exception.InvalidConfiguration(msg)
@@ -2603,6 +2604,16 @@ class LibvirtDriver(driver.ComputeDriver):
if connection_info['data'].get('auth_enabled'):
username = connection_info['data']['auth_username']
path = f"rbd:{volume_name}:id={username}"
+ elif connection_info['driver_volume_type'] == 'vitastor':
+ volume_name = connection_info['data']['name']
+ path = 'vitastor:image='+volume_name.replace(':', '\\:')
+ for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
+ if k in connection_info['data']:
+ kk = k
+ if kk == 'etcd_address':
+ # FIXME use etcd_address in qemu driver
+ kk = 'etcd_host'
+ path += ":"+kk+"="+connection_info['data'][k].replace(':', '\\:')
else:
path = 'unknown'
raise exception.DiskNotFound(location='unknown')
@@ -2827,8 +2838,8 @@ class LibvirtDriver(driver.ComputeDriver):
image_format = CONF.libvirt.snapshot_image_format or source_type
- # NOTE(bfilippov): save lvm and rbd as raw
- if image_format == 'lvm' or image_format == 'rbd':
+ # NOTE(bfilippov): save lvm and rbd and vitastor as raw
+ if image_format == 'lvm' or image_format == 'rbd' or image_format == 'vitastor':
image_format = 'raw'
metadata = self._create_snapshot_metadata(instance.image_meta,
@@ -2899,7 +2910,7 @@ class LibvirtDriver(driver.ComputeDriver):
expected_state=task_states.IMAGE_UPLOADING)
# TODO(nic): possibly abstract this out to the root_disk
- if source_type == 'rbd' and live_snapshot:
+ if (source_type == 'rbd' or source_type == 'vitastor') and live_snapshot:
# Standard snapshot uses qemu-img convert from RBD which is
# not safe to run with live_snapshot.
live_snapshot = False
@@ -4099,7 +4110,7 @@ class LibvirtDriver(driver.ComputeDriver):
# cleanup rescue volume
lvm.remove_volumes([lvmdisk for lvmdisk in self._lvm_disks(instance)
if lvmdisk.endswith('.rescue')])
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
filter_fn = lambda disk: (disk.startswith(instance.uuid) and
disk.endswith('.rescue'))
rbd_utils.RBDDriver().cleanup_volumes(filter_fn)
@@ -4356,6 +4367,8 @@ class LibvirtDriver(driver.ComputeDriver):
# TODO(mikal): there is a bug here if images_type has
# changed since creation of the instance, but I am pretty
# sure that this bug already exists.
+ if CONF.libvirt.images_type == 'vitastor':
+ return 'vitastor'
return 'rbd' if CONF.libvirt.images_type == 'rbd' else 'raw'
@staticmethod
@@ -4764,10 +4777,10 @@ class LibvirtDriver(driver.ComputeDriver):
finally:
# NOTE(mikal): if the config drive was imported into RBD,
# then we no longer need the local copy
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
LOG.info('Deleting local config drive %(path)s '
- 'because it was imported into RBD.',
- {'path': config_disk_local_path},
+ 'because it was imported into %(type).',
+ {'path': config_disk_local_path, 'type': CONF.libvirt.images_type},
instance=instance)
os.unlink(config_disk_local_path)
diff --git a/nova/virt/libvirt/utils.py b/nova/virt/libvirt/utils.py
index da2a6e8b8a..52c02e72f1 100644
--- a/nova/virt/libvirt/utils.py
+++ b/nova/virt/libvirt/utils.py
@@ -340,6 +340,10 @@ def find_disk(guest: libvirt_guest.Guest) -> ty.Tuple[str, ty.Optional[str]]:
disk_path = disk.source_name
if disk_path:
disk_path = 'rbd:' + disk_path
+ elif not disk_path and disk.source_protocol == 'vitastor':
+ disk_path = disk.source_name
+ if disk_path:
+ disk_path = 'vitastor:' + disk_path
if not disk_path:
raise RuntimeError(_("Can't retrieve root device path "
@@ -354,6 +358,8 @@ def get_disk_type_from_path(path: str) -> ty.Optional[str]:
return 'lvm'
elif path.startswith('rbd:'):
return 'rbd'
+ elif path.startswith('vitastor:'):
+ return 'vitastor'
elif (os.path.isdir(path) and
os.path.exists(os.path.join(path, "DiskDescriptor.xml"))):
return 'ploop'
diff --git a/nova/virt/libvirt/volume/vitastor.py b/nova/virt/libvirt/volume/vitastor.py
new file mode 100644
index 0000000000..0256df62c1
--- /dev/null
+++ b/nova/virt/libvirt/volume/vitastor.py
@@ -0,0 +1,75 @@
+# Copyright (c) 2021+, Vitaliy Filippov <vitalif@yourcmc.ru>
+#
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License. You may obtain
+# a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+from os_brick import exception as os_brick_exception
+from os_brick import initiator
+from os_brick.initiator import connector
+from oslo_log import log as logging
+
+import nova.conf
+from nova import utils
+from nova.virt.libvirt.volume import volume as libvirt_volume
+
+
+CONF = nova.conf.CONF
+LOG = logging.getLogger(__name__)
+
+
+class LibvirtVitastorVolumeDriver(libvirt_volume.LibvirtBaseVolumeDriver):
+ """Driver to attach Vitastor volumes to libvirt."""
+ def __init__(self, host):
+ super(LibvirtVitastorVolumeDriver, self).__init__(host, is_block_dev=False)
+
+ def connect_volume(self, connection_info, instance):
+ pass
+
+ def disconnect_volume(self, connection_info, instance):
+ pass
+
+ def get_config(self, connection_info, disk_info):
+ """Returns xml for libvirt."""
+ conf = super(LibvirtVitastorVolumeDriver, self).get_config(connection_info, disk_info)
+ conf.source_type = 'network'
+ conf.source_protocol = 'vitastor'
+ conf.source_name = connection_info['data'].get('name')
+ conf.source_query = connection_info['data'].get('etcd_prefix') or None
+ conf.source_config = connection_info['data'].get('config_path') or None
+ conf.source_hosts = []
+ conf.source_ports = []
+ addresses = connection_info['data'].get('etcd_address', '')
+ if addresses:
+ if not isinstance(addresses, list):
+ addresses = addresses.split(',')
+ for addr in addresses:
+ if addr.startswith('https://'):
+ raise NotImplementedError('Vitastor block driver does not support SSL for etcd communication yet')
+ if addr.startswith('http://'):
+ addr = addr[7:]
+ addr = addr.rstrip('/')
+ if addr.endswith('/v3'):
+ addr = addr[0:-3]
+ p = addr.find('/')
+ if p > 0:
+ raise NotImplementedError('libvirt does not support custom URL paths for Vitastor etcd yet. Use /etc/vitastor/vitastor.conf')
+ p = addr.find(':')
+ port = '2379'
+ if p > 0:
+ port = addr[p+1:]
+ addr = addr[0:p]
+ conf.source_hosts.append(addr)
+ conf.source_ports.append(port)
+ return conf
+
+ def extend_volume(self, connection_info, instance, requested_size):
+ raise NotImplementedError

View File

@ -11,7 +11,7 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks',
'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog',
@@ -3367,6 +3367,24 @@ @@ -3367,6 +3367,28 @@
'*tag': 'str' } } '*tag': 'str' } }
## ##
@ -19,17 +19,21 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
+# +#
+# Driver specific block device options for vitastor +# Driver specific block device options for vitastor
+# +#
+# @image: Image name
+# @inode: Inode number +# @inode: Inode number
+# @pool: Pool ID +# @pool: Pool ID
+# @size: Desired image size in bytes +# @size: Desired image size in bytes
+# @etcd_host: etcd connection address +# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix +# @etcd_prefix: etcd key/value prefix
+## +##
+{ 'struct': 'BlockdevOptionsVitastor', +{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64', + 'data': { '*inode': 'uint64',
+ 'pool': 'uint64', + '*pool': 'uint64',
+ 'size': 'uint64', + '*size': 'uint64',
+ 'etcd_host': 'str', + '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } } + '*etcd_prefix': 'str' } }
+ +
+## +##

View File

@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] } 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
## ##
@@ -3725,6 +3725,24 @@ @@ -3725,6 +3725,28 @@
'*tag': 'str' } } '*tag': 'str' } }
## ##
@ -19,17 +19,21 @@ Index: qemu/qapi/block-core.json
+# +#
+# Driver specific block device options for vitastor +# Driver specific block device options for vitastor
+# +#
+# @image: Image name
+# @inode: Inode number +# @inode: Inode number
+# @pool: Pool ID +# @pool: Pool ID
+# @size: Desired image size in bytes +# @size: Desired image size in bytes
+# @etcd_host: etcd connection address +# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix +# @etcd_prefix: etcd key/value prefix
+## +##
+{ 'struct': 'BlockdevOptionsVitastor', +{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64', + 'data': { '*inode': 'uint64',
+ 'pool': 'uint64', + '*pool': 'uint64',
+ 'size': 'uint64', + '*size': 'uint64',
+ 'etcd_host': 'str', + '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } } + '*etcd_prefix': 'str' } }
+ +
+## +##

View File

@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] } 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
## ##
@@ -3635,6 +3635,24 @@ @@ -3635,6 +3635,28 @@
'*tag': 'str' } } '*tag': 'str' } }
## ##
@ -19,17 +19,21 @@ Index: qemu/qapi/block-core.json
+# +#
+# Driver specific block device options for vitastor +# Driver specific block device options for vitastor
+# +#
+# @image: Image name
+# @inode: Inode number +# @inode: Inode number
+# @pool: Pool ID +# @pool: Pool ID
+# @size: Desired image size in bytes +# @size: Desired image size in bytes
+# @etcd_host: etcd connection address +# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix +# @etcd_prefix: etcd key/value prefix
+## +##
+{ 'struct': 'BlockdevOptionsVitastor', +{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64', + 'data': { '*inode': 'uint64',
+ 'pool': 'uint64', + '*pool': 'uint64',
+ 'size': 'uint64', + '*size': 'uint64',
+ 'etcd_host': 'str', + '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } } + '*etcd_prefix': 'str' } }
+ +
+## +##

View File

@ -11,7 +11,7 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
## ##
@@ -3644,6 +3644,24 @@ @@ -3644,6 +3644,28 @@
'*tag': 'str' } } '*tag': 'str' } }
## ##
@ -19,17 +19,21 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
+# +#
+# Driver specific block device options for vitastor +# Driver specific block device options for vitastor
+# +#
+# @image: Image name
+# @inode: Inode number +# @inode: Inode number
+# @pool: Pool ID +# @pool: Pool ID
+# @size: Desired image size in bytes +# @size: Desired image size in bytes
+# @etcd_host: etcd connection address +# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix +# @etcd_prefix: etcd key/value prefix
+## +##
+{ 'struct': 'BlockdevOptionsVitastor', +{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64', + 'data': { '*inode': 'uint64',
+ 'pool': 'uint64', + '*pool': 'uint64',
+ 'size': 'uint64', + '*size': 'uint64',
+ 'etcd_host': 'str', + '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } } + '*etcd_prefix': 'str' } }
+ +
+## +##

View File

@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'` QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.6.2/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.2$(rpm --eval '%dist').tar.gz * tar --transform 's#^#vitastor-0.6.5/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.5$(rpm --eval '%dist').tar.gz *

View File

@ -11,7 +11,7 @@ RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='centos-a
RUN rpm --nomd5 -i qemu*.src.rpm RUN rpm --nomd5 -i qemu*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec
ADD qemu-*-vitastor.patch /root/vitastor/ ADD patches/qemu-*-vitastor.patch /root/vitastor/patches/
RUN set -e; \ RUN set -e; \
mkdir -p /root/packages/qemu-el8; \ mkdir -p /root/packages/qemu-el8; \
@ -25,7 +25,7 @@ RUN set -e; \
echo "Patch$((PN+1)): qemu-4.2-vitastor.patch" >> qemu-kvm.spec; \ echo "Patch$((PN+1)): qemu-4.2-vitastor.patch" >> qemu-kvm.spec; \
tail -n +2 xx01 >> qemu-kvm.spec; \ tail -n +2 xx01 >> qemu-kvm.spec; \
perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \ perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \ cp /root/vitastor/patches/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
rpmbuild --nocheck -ba qemu-kvm.spec; \ rpmbuild --nocheck -ba qemu-kvm.spec; \
cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \ cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/ cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/

View File

@ -15,8 +15,9 @@ RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
RUN rpm --nomd5 -i qemu*.src.rpm RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm RUN rpm --nomd5 -i fio*.src.rpm
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing qemu-kvm.spec RUN cd ~/rpmbuild/SPECS && yum-builddep -y qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing fio.spec RUN cd ~/rpmbuild/SPECS && yum-builddep -y fio.spec
RUN yum -y install rdma-core-devel
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@ -37,7 +38,7 @@ ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.6.2.el7.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.6.5.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \

View File

@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.6.2 Version: 0.6.5
Release: 1%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.6.2.el7.tar.gz Source0: vitastor-0.6.5.el7.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel
@ -14,6 +14,7 @@ BuildRequires: rh-nodejs12
BuildRequires: rh-nodejs12-npm BuildRequires: rh-nodejs12-npm
BuildRequires: jerasure-devel BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake BuildRequires: cmake
Requires: fio = 3.7-1.el7 Requires: fio = 3.7-1.el7
Requires: qemu-kvm = 2.0.0-1.el7.6 Requires: qemu-kvm = 2.0.0-1.el7.6
@ -61,8 +62,9 @@ cp -r mon %buildroot/usr/lib/vitastor/mon
%_libdir/libfio_vitastor.so %_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so %_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so %_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so %_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so %_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
/usr/lib/vitastor /usr/lib/vitastor

View File

@ -15,6 +15,7 @@ RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake
RUN yum -y install libibverbs-devel libarchive
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@ -35,7 +36,7 @@ ADD . /root/vitastor
RUN set -e; \ RUN set -e; \
cd /root/vitastor/rpm; \ cd /root/vitastor/rpm; \
sh build-tarball.sh; \ sh build-tarball.sh; \
cp /root/vitastor-0.6.2.el8.tar.gz ~/rpmbuild/SOURCES; \ cp /root/vitastor-0.6.5.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \ cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \ cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \ rpmbuild -ba vitastor.spec; \

View File

@ -1,11 +1,11 @@
Name: vitastor Name: vitastor
Version: 0.6.2 Version: 0.6.5
Release: 1%{?dist} Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1 License: Vitastor Network Public License 1.1
URL: https://vitastor.io/ URL: https://vitastor.io/
Source0: vitastor-0.6.2.el8.tar.gz Source0: vitastor-0.6.5.el8.tar.gz
BuildRequires: liburing-devel >= 0.6 BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel BuildRequires: gperftools-devel
@ -13,6 +13,7 @@ BuildRequires: gcc-toolset-9-gcc-c++
BuildRequires: nodejs >= 10 BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake BuildRequires: cmake
Requires: fio = 3.7-3.el8 Requires: fio = 3.7-3.el8
Requires: qemu-kvm = 4.2.0-29.el8.6 Requires: qemu-kvm = 4.2.0-29.el8.6
@ -58,8 +59,9 @@ cp -r mon %buildroot/usr/lib/vitastor
%_libdir/libfio_vitastor.so %_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so %_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so %_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so %_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so %_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
/usr/lib/vitastor /usr/lib/vitastor

View File

@ -4,6 +4,8 @@ project(vitastor)
include(GNUInstallDirs) include(GNUInstallDirs)
set(WITH_QEMU true CACHE BOOL "Build QEMU driver")
set(WITH_FIO true CACHE BOOL "Build FIO driver")
set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)") set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)")
set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer") set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer")
if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$") if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
@ -13,7 +15,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}") set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif() endif()
add_definitions(-DVERSION="0.6.2") add_definitions(-DVERSION="0.6.5")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -I ${CMAKE_SOURCE_DIR}/src) add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN}) if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer) add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@ -36,7 +38,9 @@ string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CM
find_package(PkgConfig) find_package(PkgConfig)
pkg_check_modules(LIBURING REQUIRED liburing) pkg_check_modules(LIBURING REQUIRED liburing)
pkg_check_modules(GLIB REQUIRED glib-2.0) if (${WITH_QEMU})
pkg_check_modules(GLIB REQUIRED glib-2.0)
endif (${WITH_QEMU})
pkg_check_modules(IBVERBS libibverbs) pkg_check_modules(IBVERBS libibverbs)
if (IBVERBS_LIBRARIES) if (IBVERBS_LIBRARIES)
add_definitions(-DWITH_RDMA) add_definitions(-DWITH_RDMA)
@ -62,24 +66,27 @@ target_link_libraries(vitastor_blk
) )
set_target_properties(vitastor_blk PROPERTIES VERSION ${VERSION} SOVERSION 0) set_target_properties(vitastor_blk PROPERTIES VERSION ${VERSION} SOVERSION 0)
# libfio_vitastor_blk.so if (${WITH_FIO})
add_library(fio_vitastor_blk SHARED # libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp fio_engine.cpp
../json11/json11.cpp ../json11/json11.cpp
) )
target_link_libraries(fio_vitastor_blk target_link_libraries(fio_vitastor_blk
vitastor_blk vitastor_blk
) )
endif (${WITH_FIO})
# libvitastor_common.a # libvitastor_common.a
set(MSGR_RDMA "")
if (IBVERBS_LIBRARIES)
set(MSGR_RDMA "msgr_rdma.cpp")
endif (IBVERBS_LIBRARIES)
add_library(vitastor_common STATIC add_library(vitastor_common STATIC
epoll_manager.cpp etcd_state_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp ${MSGR_RDMA}
) )
if (IBVERBS_LIBRARIES)
target_sources(vitastor_common PRIVATE msgr_rdma.cpp)
endif (IBVERBS_LIBRARIES)
target_compile_options(vitastor_common PUBLIC -fPIC) target_compile_options(vitastor_common PUBLIC -fPIC)
# vitastor-osd # vitastor-osd
@ -95,19 +102,23 @@ target_link_libraries(vitastor-osd
${IBVERBS_LIBRARIES} ${IBVERBS_LIBRARIES}
) )
# libfio_vitastor_sec.so if (${WITH_FIO})
add_library(fio_vitastor_sec SHARED # libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp fio_sec_osd.cpp
rw_blocking.cpp rw_blocking.cpp
) )
target_link_libraries(fio_vitastor_sec target_link_libraries(fio_vitastor_sec
tcmalloc_minimal tcmalloc_minimal
) )
endif (${WITH_FIO})
# libvitastor_client.so # libvitastor_client.so
add_library(vitastor_client SHARED add_library(vitastor_client SHARED
cluster_client.cpp cluster_client.cpp
vitastor_c.cpp
) )
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client target_link_libraries(vitastor_client
vitastor_common vitastor_common
tcmalloc_minimal tcmalloc_minimal
@ -116,13 +127,15 @@ target_link_libraries(vitastor_client
) )
set_target_properties(vitastor_client PROPERTIES VERSION ${VERSION} SOVERSION 0) set_target_properties(vitastor_client PROPERTIES VERSION ${VERSION} SOVERSION 0)
# libfio_vitastor.so if (${WITH_FIO})
add_library(fio_vitastor SHARED # libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp fio_cluster.cpp
) )
target_link_libraries(fio_vitastor target_link_libraries(fio_vitastor
vitastor_client vitastor_client
) )
endif (${WITH_FIO})
# vitastor-nbd # vitastor-nbd
add_executable(vitastor-nbd add_executable(vitastor-nbd
@ -145,27 +158,24 @@ add_executable(vitastor-dump-journal
dump_journal.cpp crc32c.c dump_journal.cpp crc32c.c
) )
# qemu_driver.so if (${WITH_QEMU})
add_library(qemu_proxy STATIC qemu_proxy.cpp) # qemu_driver.so
target_compile_options(qemu_proxy PUBLIC -fPIC) add_library(qemu_vitastor SHARED
target_include_directories(qemu_proxy PUBLIC qemu_driver.c
)
target_include_directories(qemu_vitastor PUBLIC
../qemu/b/qemu ../qemu/b/qemu
../qemu/include ../qemu/include
${GLIB_INCLUDE_DIRS} ${GLIB_INCLUDE_DIRS}
) )
target_link_libraries(qemu_proxy target_link_libraries(qemu_vitastor
vitastor_client vitastor_client
) )
add_library(qemu_vitastor SHARED set_target_properties(qemu_vitastor PROPERTIES
qemu_driver.c
)
target_link_libraries(qemu_vitastor
qemu_proxy
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX "" PREFIX ""
OUTPUT_NAME "block-vitastor" OUTPUT_NAME "block-vitastor"
) )
endif (${WITH_QEMU})
### Test stubs ### Test stubs
@ -199,6 +209,14 @@ target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator # test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp) add_executable(test_allocator test_allocator.cpp allocator.cpp)
# test_cas
add_executable(test_cas
test_cas.cpp
)
target_link_libraries(test_cas
vitastor_client
)
# test_cluster_client # test_cluster_client
add_executable(test_cluster_client add_executable(test_cluster_client
test_cluster_client.cpp test_cluster_client.cpp
@ -217,5 +235,14 @@ target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mo
### Install ### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}) install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec vitastor_blk vitastor_client LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}) install(
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR}) TARGETS vitastor_blk vitastor_client
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
if (${WITH_FIO})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
endif (${WITH_FIO})
if (${WITH_QEMU})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})
endif (${WITH_QEMU})

View File

@ -43,11 +43,6 @@ int blockstore_t::read_bitmap(object_id oid, uint64_t target_version, void *bitm
return impl->read_bitmap(oid, target_version, bitmap, result_version); return impl->read_bitmap(oid, target_version, bitmap, result_version);
} }
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
{
return impl->unstable_writes;
}
std::map<uint64_t, uint64_t> & blockstore_t::get_inode_space_stats() std::map<uint64_t, uint64_t> & blockstore_t::get_inode_space_stats()
{ {
return impl->inode_space_stats; return impl->inode_space_stats;

View File

@ -183,9 +183,6 @@ public:
// Simplified synchronous operation: get object bitmap & current version // Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL); int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes();
// Get per-inode space usage statistics // Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats(); std::map<uint64_t, uint64_t> & get_inode_space_stats();

View File

@ -16,6 +16,8 @@
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config) cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{ {
config = osd_messenger_t::read_config(config);
this->ringloop = ringloop; this->ringloop = ringloop;
this->tfd = tfd; this->tfd = tfd;
this->config = config; this->config = config;
@ -49,7 +51,7 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
msgr.exec_op = [this](osd_op_t *op) msgr.exec_op = [this](osd_op_t *op)
{ {
// Garbage in // Garbage in
printf("Incoming garbage from peer %d\n", op->peer_fd); fprintf(stderr, "Incoming garbage from peer %d\n", op->peer_fd);
msgr.stop_client(op->peer_fd); msgr.stop_client(op->peer_fd);
delete op; delete op;
}; };
@ -631,6 +633,13 @@ resume_1:
// Slice the operation into parts // Slice the operation into parts
slice_rw(op); slice_rw(op);
op->needs_reslice = false; op->needs_reslice = false;
if (op->opcode == OSD_OP_WRITE && op->version && op->parts.size() > 1)
{
// Atomic writes to multiple stripes are unsupported
op->retval = -EINVAL;
erase_op(op);
return 1;
}
resume_2: resume_2:
// Send unsent parts, if they're not subject to change // Send unsent parts, if they're not subject to change
op->state = 3; op->state = 3;
@ -686,13 +695,16 @@ resume_3:
// Check parent inode // Check parent inode
auto ino_it = st_cli.inode_config.find(op->cur_inode); auto ino_it = st_cli.inode_config.find(op->cur_inode);
while (ino_it != st_cli.inode_config.end() && ino_it->second.parent_id && while (ino_it != st_cli.inode_config.end() && ino_it->second.parent_id &&
INODE_POOL(ino_it->second.parent_id) == INODE_POOL(op->cur_inode)) INODE_POOL(ino_it->second.parent_id) == INODE_POOL(op->cur_inode) &&
// Check for loops
ino_it->second.parent_id != op->inode)
{ {
// Skip parents from the same pool // Skip parents from the same pool
ino_it = st_cli.inode_config.find(ino_it->second.parent_id); ino_it = st_cli.inode_config.find(ino_it->second.parent_id);
} }
if (ino_it != st_cli.inode_config.end() && if (ino_it != st_cli.inode_config.end() &&
ino_it->second.parent_id) ino_it->second.parent_id &&
ino_it->second.parent_id != op->inode)
{ {
// Continue reading from the parent inode // Continue reading from the parent inode
op->cur_inode = ino_it->second.parent_id; op->cur_inode = ino_it->second.parent_id;
@ -920,6 +932,7 @@ bool cluster_client_t::try_send(cluster_op_t *op, int i)
.offset = part->offset, .offset = part->offset,
.len = part->len, .len = part->len,
.meta_revision = meta_rev, .meta_revision = meta_rev,
.version = op->opcode == OSD_OP_WRITE ? op->version : 0,
} }, } },
.bitmap = op->opcode == OSD_OP_WRITE ? NULL : op->part_bitmaps + pg_bitmap_size*i, .bitmap = op->opcode == OSD_OP_WRITE ? NULL : op->part_bitmaps + pg_bitmap_size*i,
.bitmap_len = (unsigned)(op->opcode == OSD_OP_WRITE ? 0 : pg_bitmap_size), .bitmap_len = (unsigned)(op->opcode == OSD_OP_WRITE ? 0 : pg_bitmap_size),
@ -952,13 +965,14 @@ int cluster_client_t::continue_sync(cluster_op_t *op)
return 1; return 1;
} }
// Check that all OSD connections are still alive // Check that all OSD connections are still alive
for (auto sync_osd: dirty_osds) for (auto do_it = dirty_osds.begin(); do_it != dirty_osds.end(); )
{ {
osd_num_t sync_osd = *do_it;
auto peer_it = msgr.osd_peer_fds.find(sync_osd); auto peer_it = msgr.osd_peer_fds.find(sync_osd);
if (peer_it == msgr.osd_peer_fds.end()) if (peer_it == msgr.osd_peer_fds.end())
{ dirty_osds.erase(do_it++);
return 0; else
} do_it++;
} }
// Post sync to affected OSDs // Post sync to affected OSDs
for (auto & prev_op: dirty_buffers) for (auto & prev_op: dirty_buffers)
@ -1069,10 +1083,6 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
if (part->op.reply.hdr.retval != expected) if (part->op.reply.hdr.retval != expected)
{ {
// Operation failed, retry // Operation failed, retry
printf(
"%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
if (part->op.reply.hdr.retval == -EPIPE) if (part->op.reply.hdr.retval == -EPIPE)
{ {
// Mark op->up_wait = true before stopping the client // Mark op->up_wait = true before stopping the client
@ -1091,7 +1101,14 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
// Don't overwrite other errors with -EPIPE // Don't overwrite other errors with -EPIPE
op->retval = part->op.reply.hdr.retval; op->retval = part->op.reply.hdr.retval;
} }
if (op->retval != -EINTR && op->retval != -EIO)
{
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
msgr.stop_client(part->op.peer_fd); msgr.stop_client(part->op.peer_fd);
}
part->flags |= PART_ERROR; part->flags |= PART_ERROR;
} }
else else
@ -1103,6 +1120,7 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
if (op->opcode == OSD_OP_READ) if (op->opcode == OSD_OP_READ)
{ {
copy_part_bitmap(op, part); copy_part_bitmap(op, part);
op->version = op->parts.size() == 1 ? part->op.reply.rw.version : 0;
} }
} }
if (op->inflight_count == 0) if (op->inflight_count == 0)

View File

@ -31,6 +31,9 @@ struct cluster_op_t
uint64_t inode; uint64_t inode;
uint64_t offset; uint64_t offset;
uint64_t len; uint64_t len;
// for reads and writes within a single object (stripe),
// reads can return current version and writes can use "CAS" semantics
uint64_t version = 0;
int retval; int retval;
osd_op_buf_list_t iov; osd_op_buf_list_t iov;
std::function<void(cluster_op_t*)> callback; std::function<void(cluster_op_t*)> callback;

View File

@ -35,7 +35,7 @@ etcd_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err); kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
if (json_err != "") if (json_err != "")
{ {
printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str()); fprintf(stderr, "Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
kv.key = ""; kv.key = "";
} }
else else
@ -50,6 +50,11 @@ void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, std::function<
void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback) void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback)
{ {
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()]; std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path; std::string etcd_api_path;
int pos = etcd_address.find('/'); int pos = etcd_address.find('/');
@ -76,16 +81,16 @@ void etcd_state_client_t::add_etcd_url(std::string addr)
addr = addr.substr(7); addr = addr.substr(7);
else if (strtolower(addr.substr(0, 8)) == "https://") else if (strtolower(addr.substr(0, 8)) == "https://")
{ {
printf("HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n"); fprintf(stderr, "HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
exit(1); exit(1);
} }
if (addr.find('/') < 0) if (addr.find('/') == std::string::npos)
addr += "/v3"; addr += "/v3";
this->etcd_addresses.push_back(addr); this->etcd_addresses.push_back(addr);
} }
} }
void etcd_state_client_t::parse_config(json11::Json & config) void etcd_state_client_t::parse_config(const json11::Json & config)
{ {
this->etcd_addresses.clear(); this->etcd_addresses.clear();
if (config["etcd_address"].is_string()) if (config["etcd_address"].is_string())
@ -122,6 +127,11 @@ void etcd_state_client_t::parse_config(json11::Json & config)
void etcd_state_client_t::start_etcd_watcher() void etcd_state_client_t::start_etcd_watcher()
{ {
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()]; std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path; std::string etcd_api_path;
int pos = etcd_address.find('/'); int pos = etcd_address.find('/');
@ -139,7 +149,7 @@ void etcd_state_client_t::start_etcd_watcher()
json11::Json data = json11::Json::parse(msg->body, json_err); json11::Json data = json11::Json::parse(msg->body, json_err);
if (json_err != "") if (json_err != "")
{ {
printf("Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str()); fprintf(stderr, "Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
} }
else else
{ {
@ -165,7 +175,7 @@ void etcd_state_client_t::start_etcd_watcher()
{ {
if (this->log_level > 3) if (this->log_level > 3)
{ {
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.value.dump().c_str()); fprintf(stderr, "Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.value.dump().c_str());
} }
parse_state(kv.second); parse_state(kv.second);
} }
@ -240,7 +250,7 @@ void etcd_state_client_t::load_global_config()
{ {
if (err != "") if (err != "")
{ {
printf("Error reading OSD configuration from etcd: %s\n", err.c_str()); fprintf(stderr, "Error reading OSD configuration from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id) tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{ {
load_global_config(); load_global_config();
@ -313,7 +323,7 @@ void etcd_state_client_t::load_pgs()
{ {
if (err != "") if (err != "")
{ {
printf("Error loading PGs from etcd: %s\n", err.c_str()); fprintf(stderr, "Error loading PGs from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id) tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{ {
load_pgs(); load_pgs();
@ -342,7 +352,7 @@ void etcd_state_client_t::load_pgs()
}); });
} }
#else #else
void etcd_state_client_t::parse_config(json11::Json & config) void etcd_state_client_t::parse_config(const json11::Json & config)
{ {
} }
@ -376,7 +386,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte); sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0) if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{ {
printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX); fprintf(stderr, "Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue; continue;
} }
pc.id = pool_id; pc.id = pool_id;
@ -384,7 +394,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.name = pool_item.second["name"].string_value(); pc.name = pool_item.second["name"].string_value();
if (pc.name == "") if (pc.name == "")
{ {
printf("Pool %u has empty name, skipping pool\n", pool_id); fprintf(stderr, "Pool %u has empty name, skipping pool\n", pool_id);
continue; continue;
} }
// Failure Domain // Failure Domain
@ -398,7 +408,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.scheme = POOL_SCHEME_JERASURE; pc.scheme = POOL_SCHEME_JERASURE;
else else
{ {
printf("Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id); fprintf(stderr, "Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
continue; continue;
} }
// PG Size // PG Size
@ -408,7 +418,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) || (pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) ||
pool_item.second["pg_size"].uint64_value() > 256) pool_item.second["pg_size"].uint64_value() > 256)
{ {
printf("Pool %u has invalid pg_size, skipping pool\n", pool_id); fprintf(stderr, "Pool %u has invalid pg_size, skipping pool\n", pool_id);
continue; continue;
} }
// Parity Chunks // Parity Chunks
@ -417,7 +427,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{ {
if (pc.parity_chunks > 1) if (pc.parity_chunks > 1)
{ {
printf("Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id); fprintf(stderr, "Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
continue; continue;
} }
pc.parity_chunks = 1; pc.parity_chunks = 1;
@ -425,7 +435,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
if (pc.scheme == POOL_SCHEME_JERASURE && if (pc.scheme == POOL_SCHEME_JERASURE &&
(pc.parity_chunks < 1 || pc.parity_chunks > pc.pg_size-2)) (pc.parity_chunks < 1 || pc.parity_chunks > pc.pg_size-2))
{ {
printf("Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id); fprintf(stderr, "Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
continue; continue;
} }
// PG MinSize // PG MinSize
@ -434,14 +444,14 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) && (pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) &&
pc.pg_minsize < (pc.pg_size-pc.parity_chunks)) pc.pg_minsize < (pc.pg_size-pc.parity_chunks))
{ {
printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id); fprintf(stderr, "Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
continue; continue;
} }
// PG Count // PG Count
pc.pg_count = pool_item.second["pg_count"].uint64_value(); pc.pg_count = pool_item.second["pg_count"].uint64_value();
if (pc.pg_count < 1) if (pc.pg_count < 1)
{ {
printf("Pool %u has invalid pg_count, skipping pool\n", pool_id); fprintf(stderr, "Pool %u has invalid pg_count, skipping pool\n", pool_id);
continue; continue;
} }
// Max OSD Combinations // Max OSD Combinations
@ -450,7 +460,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.max_osd_combinations = 10000; pc.max_osd_combinations = 10000;
if (pc.max_osd_combinations > 0 && pc.max_osd_combinations < 100) if (pc.max_osd_combinations > 0 && pc.max_osd_combinations < 100)
{ {
printf("Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id); fprintf(stderr, "Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
continue; continue;
} }
// PG Stripe Size // PG Stripe Size
@ -468,7 +478,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{ {
if (pg_item.second.target_set.size() != parsed_cfg.pg_size) if (pg_item.second.target_set.size() != parsed_cfg.pg_size)
{ {
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n", fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_item.first, pg_item.second.target_set.size(), parsed_cfg.pg_size); pool_id, pg_item.first, pg_item.second.target_set.size(), parsed_cfg.pg_size);
pg_item.second.pause = true; pg_item.second.pause = true;
} }
@ -491,7 +501,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte); sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0) if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{ {
printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX); fprintf(stderr, "Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue; continue;
} }
for (auto & pg_item: pool_item.second.object_items()) for (auto & pg_item: pool_item.second.object_items())
@ -500,7 +510,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte); sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
if (!pg_num || null_byte != 0) if (!pg_num || null_byte != 0)
{ {
printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str()); fprintf(stderr, "Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
continue; continue;
} }
auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num]; auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
@ -514,7 +524,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
} }
if (parsed_cfg.target_set.size() != pool_config[pool_id].pg_size) if (parsed_cfg.target_set.size() != pool_config[pool_id].pg_size)
{ {
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n", fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_num, parsed_cfg.target_set.size(), pool_config[pool_id].pg_size); pool_id, pg_num, parsed_cfg.target_set.size(), pool_config[pool_id].pg_size);
parsed_cfg.pause = true; parsed_cfg.pause = true;
} }
@ -527,8 +537,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{ {
if (pg_it->second.exists && pg_it->first != ++n) if (pg_it->second.exists && pg_it->first != ++n)
{ {
printf( fprintf(
"Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n", stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
pool_item.second.id, pool_item.second.pg_config.size() pool_item.second.id, pool_item.second.pg_config.size()
); );
for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++) for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
@ -551,7 +561,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte); sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0) if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{ {
printf("Bad etcd key %s, ignoring\n", key.c_str()); fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
} }
else else
{ {
@ -590,7 +600,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte); sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0) if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{ {
printf("Bad etcd key %s, ignoring\n", key.c_str()); fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
} }
else if (value.is_null()) else if (value.is_null())
{ {
@ -614,7 +624,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
} }
if (i >= pg_state_bit_count) if (i >= pg_state_bit_count)
{ {
printf("Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str()); fprintf(stderr, "Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
return; return;
} }
} }
@ -623,7 +633,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
(state & PG_PEERING) && state != PG_PEERING || (state & PG_PEERING) && state != PG_PEERING ||
(state & PG_INCOMPLETE) && state != PG_INCOMPLETE) (state & PG_INCOMPLETE) && state != PG_INCOMPLETE)
{ {
printf("Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str()); fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
return; return;
} }
this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary; this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
@ -661,7 +671,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte); sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0) if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
{ {
printf("Bad etcd key %s, ignoring\n", key.c_str()); fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
} }
else else
{ {
@ -696,8 +706,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
parent_inode_num |= pool_id << (64-POOL_ID_BITS); parent_inode_num |= pool_id << (64-POOL_ID_BITS);
else if (parent_pool_id >= POOL_ID_MAX) else if (parent_pool_id >= POOL_ID_MAX)
{ {
printf( fprintf(
"Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n", stderr, "Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1) inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
); );
parent_inode_num = 0; parent_inode_num = 0;

View File

@ -106,7 +106,7 @@ public:
void load_global_config(); void load_global_config();
void load_pgs(); void load_pgs();
void parse_state(const etcd_kv_t & kv); void parse_state(const etcd_kv_t & kv);
void parse_config(json11::Json & config); void parse_config(const json11::Json & config);
inode_watch_t* watch_inode(std::string name); inode_watch_t* watch_inode(std::string name);
void close_watch(inode_watch_t* watch); void close_watch(inode_watch_t* watch);
~etcd_state_client_t(); ~etcd_state_client_t();

View File

@ -24,28 +24,25 @@
#include <netinet/tcp.h> #include <netinet/tcp.h>
#include <vector> #include <vector>
#include <unordered_map>
#include "epoll_manager.h" #include "vitastor_c.h"
#include "cluster_client.h"
#include "fio_headers.h" #include "fio_headers.h"
struct sec_data struct sec_data
{ {
ring_loop_t *ringloop = NULL; vitastor_c *cli = NULL;
epoll_manager_t *epmgr = NULL; void *watch = NULL;
cluster_client_t *cli = NULL;
inode_watch_t *watch = NULL;
bool last_sync = false; bool last_sync = false;
/* The list of completed io_u structs. */ /* The list of completed io_u structs. */
std::vector<io_u*> completed; std::vector<io_u*> completed;
uint64_t op_n = 0, inflight = 0; uint64_t inflight = 0;
bool trace = false; bool trace = false;
}; };
struct sec_options struct sec_options
{ {
int __pad; int __pad;
char *config_path = NULL;
char *etcd_host = NULL; char *etcd_host = NULL;
char *etcd_prefix = NULL; char *etcd_prefix = NULL;
char *image = NULL; char *image = NULL;
@ -54,12 +51,22 @@ struct sec_options
int cluster_log = 0; int cluster_log = 0;
int trace = 0; int trace = 0;
int use_rdma = 0; int use_rdma = 0;
char *rdma_device = NULL;
int rdma_port_num = 0; int rdma_port_num = 0;
int rdma_gid_index = 0; int rdma_gid_index = 0;
int rdma_mtu = 0; int rdma_mtu = 0;
}; };
static struct fio_option options[] = { static struct fio_option options[] = {
{
.name = "conf",
.lname = "Vitastor config path",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, config_path),
.help = "Vitastor config path",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{ {
.name = "etcd", .name = "etcd",
.lname = "etcd address", .lname = "etcd address",
@ -127,10 +134,49 @@ static struct fio_option options[] = {
}, },
{ {
.name = "use_rdma", .name = "use_rdma",
.lname = "OSD trace", .lname = "Use RDMA",
.type = FIO_OPT_BOOL, .type = FIO_OPT_BOOL,
.off1 = offsetof(struct sec_options, use_rdma), .off1 = offsetof(struct sec_options, use_rdma),
.help = "Use RDMA", .help = "Use RDMA",
.def = "-1",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_device",
.lname = "RDMA device name",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, rdma_device),
.help = "RDMA device name",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_port_num",
.lname = "RDMA port number",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_port_num),
.help = "RDMA port number",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_gid_index",
.lname = "RDMA gid index",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_gid_index),
.help = "RDMA gid index",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_mtu",
.lname = "RDMA path MTU",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_mtu),
.help = "RDMA path MTU",
.def = "0", .def = "0",
.category = FIO_OPT_C_ENGINE, .category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME, .group = FIO_OPT_G_FILENAME,
@ -140,17 +186,17 @@ static struct fio_option options[] = {
}, },
}; };
static void watch_callback(void *opaque, long watch)
{
struct sec_data *bsd = (struct sec_data*)opaque;
bsd->watch = (void*)watch;
}
static int sec_setup(struct thread_data *td) static int sec_setup(struct thread_data *td)
{ {
sec_options *o = (sec_options*)td->eo; sec_options *o = (sec_options*)td->eo;
sec_data *bsd; sec_data *bsd;
if (!o->etcd_host)
{
td_verror(td, EINVAL, "etcd address is missing");
return 1;
}
bsd = new sec_data; bsd = new sec_data;
if (!bsd) if (!bsd)
{ {
@ -166,13 +212,6 @@ static int sec_setup(struct thread_data *td)
td->o.open_files++; td->o.open_files++;
} }
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(o->etcd_host) },
{ "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
{ "log_level", o->cluster_log },
{ "use_rdma", o->use_rdma },
};
if (!o->image) if (!o->image)
{ {
if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1))) if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
@ -194,20 +233,20 @@ static int sec_setup(struct thread_data *td)
{ {
o->inode = 0; o->inode = 0;
} }
bsd->ringloop = new ring_loop_t(512); bsd->cli = vitastor_c_create_uring(o->config_path, o->etcd_host, o->etcd_prefix,
bsd->epmgr = new epoll_manager_t(bsd->ringloop); o->use_rdma, o->rdma_device, o->rdma_port_num, o->rdma_gid_index, o->rdma_mtu, o->cluster_log);
bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
if (o->image) if (o->image)
{ {
while (!bsd->cli->is_ready()) bsd->watch = NULL;
vitastor_c_watch_inode(bsd->cli, o->image, watch_callback, bsd);
while (true)
{ {
bsd->ringloop->loop(); vitastor_c_uring_handle_events(bsd->cli);
if (bsd->cli->is_ready()) if (bsd->watch)
break; break;
bsd->ringloop->wait(); vitastor_c_uring_wait_events(bsd->cli);
} }
bsd->watch = bsd->cli->st_cli.watch_inode(std::string(o->image)); td->files[0]->real_file_size = vitastor_c_inode_get_size(bsd->watch);
td->files[0]->real_file_size = bsd->watch->cfg.size;
} }
bsd->trace = o->trace ? true : false; bsd->trace = o->trace ? true : false;
@ -222,11 +261,9 @@ static void sec_cleanup(struct thread_data *td)
{ {
if (bsd->watch) if (bsd->watch)
{ {
bsd->cli->st_cli.close_watch(bsd->watch); vitastor_c_close_watch(bsd->cli, bsd->watch);
} }
delete bsd->cli; vitastor_c_destroy(bsd->cli);
delete bsd->epmgr;
delete bsd->ringloop;
delete bsd; delete bsd;
} }
} }
@ -237,12 +274,31 @@ static int sec_init(struct thread_data *td)
return 0; return 0;
} }
static void io_callback(void *opaque, long retval)
{
struct io_u *io = (struct io_u*)opaque;
io->error = retval < 0 ? -retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s 0x%lx retval=%ld\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), (uint64_t)io, retval);
}
}
static void read_callback(void *opaque, long retval, uint64_t version)
{
io_callback(opaque, retval);
}
/* Begin read or write request. */ /* Begin read or write request. */
static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io) static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
{ {
sec_options *opt = (sec_options*)td->eo; sec_options *opt = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data; sec_data *bsd = (sec_data*)td->io_ops_data;
int n = bsd->op_n; struct iovec iov;
fio_ro_check(td, io); fio_ro_check(td, io);
if (io->ddir == DDIR_SYNC && bsd->last_sync) if (io->ddir == DDIR_SYNC && bsd->last_sync)
@ -251,32 +307,29 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
} }
io->engine_data = bsd; io->engine_data = bsd;
cluster_op_t *op = new cluster_op_t; io->error = 0;
bsd->inflight++;
op->inode = opt->image ? bsd->watch->cfg.num : opt->inode; uint64_t inode = opt->image ? vitastor_c_inode_get_num(bsd->watch) : opt->inode;
switch (io->ddir) switch (io->ddir)
{ {
case DDIR_READ: case DDIR_READ:
op->opcode = OSD_OP_READ; iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
op->offset = io->offset; vitastor_c_read(bsd->cli, inode, io->offset, io->xfer_buflen, &iov, 1, read_callback, io);
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
bsd->last_sync = false; bsd->last_sync = false;
break; break;
case DDIR_WRITE: case DDIR_WRITE:
if (opt->image && bsd->watch->cfg.readonly) if (opt->image && vitastor_c_inode_get_readonly(bsd->watch))
{ {
io->error = EROFS; io->error = EROFS;
return FIO_Q_COMPLETED; return FIO_Q_COMPLETED;
} }
op->opcode = OSD_OP_WRITE; iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
op->offset = io->offset; vitastor_c_write(bsd->cli, inode, io->offset, io->xfer_buflen, 0, &iov, 1, io_callback, io);
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
bsd->last_sync = false; bsd->last_sync = false;
break; break;
case DDIR_SYNC: case DDIR_SYNC:
op->opcode = OSD_OP_SYNC; vitastor_c_sync(bsd->cli, io_callback, io);
bsd->last_sync = true; bsd->last_sync = true;
break; break;
default: default:
@ -284,39 +337,20 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
return FIO_Q_COMPLETED; return FIO_Q_COMPLETED;
} }
op->callback = [io, n](cluster_op_t *op)
{
io->error = op->retval < 0 ? -op->retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s n=%d retval=%d\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), n, op->retval);
}
delete op;
};
if (opt->trace) if (opt->trace)
{ {
if (io->ddir == DDIR_SYNC) if (io->ddir == DDIR_SYNC)
{ {
printf("+++ SYNC # %d\n", n); printf("+++ SYNC 0x%lx\n", (uint64_t)io);
} }
else else
{ {
printf("+++ %s # %d 0x%llx+%llx\n", printf("+++ %s 0x%lx 0x%llx+%llx\n",
io->ddir == DDIR_READ ? "READ" : "WRITE", io->ddir == DDIR_READ ? "READ" : "WRITE",
n, io->offset, io->xfer_buflen); (uint64_t)io, io->offset, io->xfer_buflen);
} }
} }
io->error = 0;
bsd->inflight++;
bsd->op_n++;
bsd->cli->execute(op);
if (io->error != 0) if (io->error != 0)
return FIO_Q_COMPLETED; return FIO_Q_COMPLETED;
return FIO_Q_QUEUED; return FIO_Q_QUEUED;
@ -327,10 +361,10 @@ static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int
sec_data *bsd = (sec_data*)td->io_ops_data; sec_data *bsd = (sec_data*)td->io_ops_data;
while (true) while (true)
{ {
bsd->ringloop->loop(); vitastor_c_uring_handle_events(bsd->cli);
if (bsd->completed.size() >= min) if (bsd->completed.size() >= min)
break; break;
bsd->ringloop->wait(); vitastor_c_uring_wait_events(bsd->cli);
} }
return bsd->completed.size(); return bsd->completed.size();
} }

View File

@ -21,11 +21,13 @@ void osd_messenger_t::init()
); );
if (!rdma_context) if (!rdma_context)
{ {
printf("[OSD %lu] Couldn't initialize RDMA, proceeding with TCP only\n", osd_num); fprintf(stderr, "[OSD %lu] Couldn't initialize RDMA, proceeding with TCP only\n", osd_num);
} }
else else
{ {
printf("[OSD %lu] RDMA initialized successfully\n", osd_num); rdma_max_sge = rdma_max_sge < rdma_context->attrx.orig_attr.max_sge
? rdma_max_sge : rdma_context->attrx.orig_attr.max_sge;
fprintf(stderr, "[OSD %lu] RDMA initialized successfully\n", osd_num);
fcntl(rdma_context->channel->fd, F_SETFL, fcntl(rdma_context->channel->fd, F_GETFL, 0) | O_NONBLOCK); fcntl(rdma_context->channel->fd, F_SETFL, fcntl(rdma_context->channel->fd, F_GETFL, 0) | O_NONBLOCK);
tfd->set_fd_handler(rdma_context->channel->fd, false, [this](int notify_fd, int epoll_events) tfd->set_fd_handler(rdma_context->channel->fd, false, [this](int notify_fd, int epoll_events)
{ {
@ -53,7 +55,7 @@ void osd_messenger_t::init()
if (!cl->ping_time_remaining) if (!cl->ping_time_remaining)
{ {
// Ping timed out, stop the client // Ping timed out, stop the client
printf("Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd); fprintf(stderr, "Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
to_stop.push_back(cl->peer_fd); to_stop.push_back(cl->peer_fd);
} }
} }
@ -80,7 +82,7 @@ void osd_messenger_t::init()
delete op; delete op;
if (fail_fd >= 0) if (fail_fd >= 0)
{ {
printf("Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd); fprintf(stderr, "Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
stop_client(fail_fd, true); stop_client(fail_fd, true);
} }
}; };
@ -129,39 +131,46 @@ void osd_messenger_t::parse_config(const json11::Json & config)
{ {
#ifdef WITH_RDMA #ifdef WITH_RDMA
if (!config["use_rdma"].is_null()) if (!config["use_rdma"].is_null())
{
// RDMA is on by default in RDMA-enabled builds
this->use_rdma = config["use_rdma"].bool_value() || config["use_rdma"].uint64_value() != 0; this->use_rdma = config["use_rdma"].bool_value() || config["use_rdma"].uint64_value() != 0;
}
this->rdma_device = config["rdma_device"].string_value(); this->rdma_device = config["rdma_device"].string_value();
this->rdma_port_num = (uint8_t)config["rdma_port_num"].uint64_value(); this->rdma_port_num = (uint8_t)config["rdma_port_num"].uint64_value();
if (!this->rdma_port_num) if (!this->rdma_port_num)
this->rdma_port_num = 1; this->rdma_port_num = 1;
this->rdma_gid_index = (uint8_t)config["rdma_gid_index"].uint64_value(); this->rdma_gid_index = (uint8_t)config["rdma_gid_index"].uint64_value();
this->rdma_mtu = (uint32_t)config["rdma_mtu"].uint64_value(); this->rdma_mtu = (uint32_t)config["rdma_mtu"].uint64_value();
this->rdma_max_sge = config["rdma_max_sge"].uint64_value();
if (!this->rdma_max_sge)
this->rdma_max_sge = 128;
this->rdma_max_send = config["rdma_max_send"].uint64_value();
if (!this->rdma_max_send)
this->rdma_max_send = 32;
this->rdma_max_recv = config["rdma_max_recv"].uint64_value();
if (!this->rdma_max_recv)
this->rdma_max_recv = 8;
this->rdma_max_msg = config["rdma_max_msg"].uint64_value();
if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024)
this->rdma_max_msg = 1024*1024;
#endif #endif
this->bs_bitmap_granularity = strtoull(config["bitmap_granularity"].string_value().c_str(), NULL, 10); this->receive_buffer_size = (uint32_t)config["tcp_header_buffer_size"].uint64_value();
if (!this->bs_bitmap_granularity) if (!this->receive_buffer_size || this->receive_buffer_size > 1024*1024*1024)
this->bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY; this->receive_buffer_size = 65536;
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() || this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value(); config["use_sync_send_recv"].uint64_value();
this->peer_connect_interval = config["peer_connect_interval"].uint64_value(); this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!this->peer_connect_interval) if (!this->peer_connect_interval)
{ this->peer_connect_interval = 5;
this->peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
this->peer_connect_timeout = config["peer_connect_timeout"].uint64_value(); this->peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!this->peer_connect_timeout) if (!this->peer_connect_timeout)
{ this->peer_connect_timeout = 5;
this->peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
this->osd_idle_timeout = config["osd_idle_timeout"].uint64_value(); this->osd_idle_timeout = config["osd_idle_timeout"].uint64_value();
if (!this->osd_idle_timeout) if (!this->osd_idle_timeout)
{ this->osd_idle_timeout = 5;
this->osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->osd_ping_timeout = config["osd_ping_timeout"].uint64_value(); this->osd_ping_timeout = config["osd_ping_timeout"].uint64_value();
if (!this->osd_ping_timeout) if (!this->osd_ping_timeout)
{ this->osd_ping_timeout = 5;
this->osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->log_level = config["log_level"].uint64_value(); this->log_level = config["log_level"].uint64_value();
} }
@ -252,7 +261,7 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
{ {
osd_num_t peer_osd = clients.at(peer_fd)->osd_num; osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true); stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EIO); on_connect_peer(peer_osd, -EPIPE);
return; return;
}); });
} }
@ -296,7 +305,7 @@ void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
if (epoll_events & EPOLLRDHUP) if (epoll_events & EPOLLRDHUP)
{ {
// Stop client // Stop client
printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd); fprintf(stderr, "[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
stop_client(peer_fd, true); stop_client(peer_fd, true);
} }
else if (epoll_events & EPOLLIN) else if (epoll_events & EPOLLIN)
@ -321,7 +330,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
wp.connecting = false; wp.connecting = false;
if (peer_fd < 0) if (peer_fd < 0)
{ {
printf("Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd)); fprintf(stderr, "Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
if (wp.address_changed) if (wp.address_changed)
{ {
wp.address_changed = false; wp.address_changed = false;
@ -348,7 +357,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
} }
if (log_level > 0) if (log_level > 0)
{ {
printf("[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd); fprintf(stderr, "[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
} }
wanted_peers.erase(peer_osd); wanted_peers.erase(peer_osd);
repeer_pgs(peer_osd); repeer_pgs(peer_osd);
@ -356,9 +365,6 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
void osd_messenger_t::check_peer_config(osd_client_t *cl) void osd_messenger_t::check_peer_config(osd_client_t *cl)
{ {
#ifdef WITH_RDMA
msgr_rdma_connection_t *rdma_conn = NULL;
#endif
osd_op_t *op = new osd_op_t(); osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT; op->op_type = OSD_OP_OUT;
op->peer_fd = cl->peer_fd; op->peer_fd = cl->peer_fd;
@ -374,11 +380,12 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
#ifdef WITH_RDMA #ifdef WITH_RDMA
if (rdma_context) if (rdma_context)
{ {
cl->rdma_conn = msgr_rdma_connection_t::create(rdma_context, max_rdma_send, max_rdma_recv, max_rdma_sge); cl->rdma_conn = msgr_rdma_connection_t::create(rdma_context, rdma_max_send, rdma_max_recv, rdma_max_sge, rdma_max_msg);
if (cl->rdma_conn) if (cl->rdma_conn)
{ {
json11::Json payload = json11::Json::object { json11::Json payload = json11::Json::object {
{ "connect_rdma", cl->rdma_conn->addr.to_string() }, { "connect_rdma", cl->rdma_conn->addr.to_string() },
{ "rdma_max_msg", cl->rdma_conn->max_msg },
}; };
std::string payload_str = payload.dump(); std::string payload_str = payload.dump();
op->req.show_conf.json_len = payload_str.size(); op->req.show_conf.json_len = payload_str.size();
@ -388,11 +395,7 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
} }
} }
#endif #endif
op->callback = [this, cl op->callback = [this, cl](osd_op_t *op)
#ifdef WITH_RDMA
, rdma_conn
#endif
](osd_op_t *op)
{ {
std::string json_err; std::string json_err;
json11::Json config; json11::Json config;
@ -400,7 +403,7 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (op->reply.hdr.retval < 0) if (op->reply.hdr.retval < 0)
{ {
err = true; err = true;
printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval); fprintf(stderr, "Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
} }
else else
{ {
@ -408,18 +411,18 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (json_err != "") if (json_err != "")
{ {
err = true; err = true;
printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str()); fprintf(stderr, "Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
} }
else if (config["osd_num"].uint64_value() != cl->osd_num) else if (config["osd_num"].uint64_value() != cl->osd_num)
{ {
err = true; err = true;
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num); fprintf(stderr, "Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
} }
else if (config["protocol_version"].uint64_value() != OSD_PROTOCOL_VERSION) else if (config["protocol_version"].uint64_value() != OSD_PROTOCOL_VERSION)
{ {
err = true; err = true;
printf( fprintf(
"OSD %lu protocol version is %lu, but only version %u is supported.\n" stderr, "OSD %lu protocol version is %lu, but only version %u is supported.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n", " If you need to upgrade from 0.5.x please request it via the issue tracker.\n",
cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
); );
@ -440,8 +443,8 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (!msgr_rdma_address_t::from_string(config["rdma_address"].string_value().c_str(), &addr) || if (!msgr_rdma_address_t::from_string(config["rdma_address"].string_value().c_str(), &addr) ||
cl->rdma_conn->connect(&addr) != 0) cl->rdma_conn->connect(&addr) != 0)
{ {
printf( fprintf(
"Failed to connect to OSD %lu (address %s) using RDMA\n", stderr, "Failed to connect to OSD %lu (address %s) using RDMA\n",
cl->osd_num, config["rdma_address"].string_value().c_str() cl->osd_num, config["rdma_address"].string_value().c_str()
); );
delete cl->rdma_conn; delete cl->rdma_conn;
@ -455,7 +458,15 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
} }
else else
{ {
printf("Connected to OSD %lu using RDMA\n", cl->osd_num); uint64_t server_max_msg = config["rdma_max_msg"].uint64_value();
if (cl->rdma_conn->max_msg > server_max_msg)
{
cl->rdma_conn->max_msg = server_max_msg;
}
if (log_level > 0)
{
fprintf(stderr, "Connected to OSD %lu using RDMA\n", cl->osd_num);
}
cl->peer_state = PEER_RDMA; cl->peer_state = PEER_RDMA;
tfd->set_fd_handler(cl->peer_fd, false, NULL); tfd->set_fd_handler(cl->peer_fd, false, NULL);
// Add the initial receive request // Add the initial receive request
@ -480,7 +491,7 @@ void osd_messenger_t::accept_connections(int listen_fd)
{ {
assert(peer_fd != 0); assert(peer_fd != 0);
char peer_str[256]; char peer_str[256];
printf("[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd, fprintf(stderr, "[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port)); inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK); fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1; int one = 1;
@ -505,7 +516,58 @@ void osd_messenger_t::accept_connections(int listen_fd)
} }
} }
#ifdef WITH_RDMA
bool osd_messenger_t::is_rdma_enabled() bool osd_messenger_t::is_rdma_enabled()
{ {
return rdma_context != NULL; return rdma_context != NULL;
} }
#endif
json11::Json osd_messenger_t::read_config(const json11::Json & config)
{
const char *config_path = config["config_path"].string_value() != ""
? config["config_path"].string_value().c_str() : VITASTOR_CONFIG_PATH;
int fd = open(config_path, O_RDONLY);
if (fd < 0)
{
if (errno != ENOENT)
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
return config;
}
struct stat st;
if (fstat(fd, &st) != 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
std::string buf;
buf.resize(st.st_size);
int done = 0;
while (done < st.st_size)
{
int r = read(fd, (void*)buf.data()+done, st.st_size-done);
if (r < 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
done += r;
}
close(fd);
std::string json_err;
json11::Json::object file_config = json11::Json::parse(buf, json_err).object_items();
if (json_err != "")
{
fprintf(stderr, "Invalid JSON in %s: %s\n", config_path, json_err.c_str());
return config;
}
file_config.erase("config_path");
file_config.erase("osd_num");
for (auto kv: config.object_items())
{
file_config[kv.first] = kv.second;
}
return file_config;
}

View File

@ -33,10 +33,8 @@
#define PEER_RDMA 4 #define PEER_RDMA 4
#define PEER_STOPPED 5 #define PEER_STOPPED 5
#define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5
#define DEFAULT_OSD_PING_TIMEOUT 5
#define DEFAULT_BITMAP_GRANULARITY 4096 #define DEFAULT_BITMAP_GRANULARITY 4096
#define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"
#define MSGR_SENDP_HDR 1 #define MSGR_SENDP_HDR 1
#define MSGR_SENDP_FREE 2 #define MSGR_SENDP_FREE 2
@ -122,13 +120,11 @@ struct osd_messenger_t
protected: protected:
int keepalive_timer_id = -1; int keepalive_timer_id = -1;
// FIXME: make receive_buffer_size configurable uint32_t receive_buffer_size = 0;
int receive_buffer_size = 64*1024; int peer_connect_interval = 0;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL; int peer_connect_timeout = 0;
int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT; int osd_idle_timeout = 0;
int osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT; int osd_ping_timeout = 0;
int osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
uint32_t bs_bitmap_granularity = 0;
int log_level = 0; int log_level = 0;
bool use_sync_send_recv = false; bool use_sync_send_recv = false;
@ -137,7 +133,8 @@ protected:
std::string rdma_device; std::string rdma_device;
uint64_t rdma_port_num = 1, rdma_gid_index = 0, rdma_mtu = 0; uint64_t rdma_port_num = 1, rdma_gid_index = 0, rdma_mtu = 0;
msgr_rdma_context_t *rdma_context = NULL; msgr_rdma_context_t *rdma_context = NULL;
int max_rdma_sge = 128, max_rdma_send = 32, max_rdma_recv = 32; uint64_t rdma_max_sge = 0, rdma_max_send = 0, rdma_max_recv = 8;
uint64_t rdma_max_msg = 0;
#endif #endif
std::vector<int> read_ready_clients; std::vector<int> read_ready_clients;
@ -168,9 +165,11 @@ public:
void accept_connections(int listen_fd); void accept_connections(int listen_fd);
~osd_messenger_t(); ~osd_messenger_t();
static json11::Json read_config(const json11::Json & config);
#ifdef WITH_RDMA #ifdef WITH_RDMA
bool is_rdma_enabled(); bool is_rdma_enabled();
bool connect_rdma(int peer_fd, std::string rdma_address); bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
#endif #endif
protected: protected:
@ -188,6 +187,7 @@ protected:
void handle_send(int result, osd_client_t *cl); void handle_send(int result, osd_client_t *cl);
bool handle_read(int result, osd_client_t *cl); bool handle_read(int result, osd_client_t *cl);
bool handle_read_buffer(osd_client_t *cl, void *curbuf, int remain);
bool handle_finished_read(osd_client_t *cl); bool handle_finished_read(osd_client_t *cl);
void handle_op_hdr(osd_client_t *cl); void handle_op_hdr(osd_client_t *cl);
bool handle_reply_hdr(osd_client_t *cl); bool handle_reply_hdr(osd_client_t *cl);

View File

@ -42,3 +42,8 @@ void osd_messenger_t::read_requests()
void osd_messenger_t::send_replies() void osd_messenger_t::send_replies()
{ {
} }
json11::Json osd_messenger_t::read_config(const json11::Json & config)
{
return config;
}

View File

@ -76,7 +76,7 @@ struct osd_op_buf_list_t
buf = (iovec*)malloc(sizeof(iovec) * alloc); buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf) if (!buf)
{ {
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc); fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1); exit(1);
} }
memcpy(buf, inline_buf, sizeof(iovec) * old); memcpy(buf, inline_buf, sizeof(iovec) * old);
@ -87,7 +87,7 @@ struct osd_op_buf_list_t
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc); buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf) if (!buf)
{ {
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc); fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1); exit(1);
} }
} }
@ -109,7 +109,7 @@ struct osd_op_buf_list_t
buf = (iovec*)malloc(sizeof(iovec) * alloc); buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf) if (!buf)
{ {
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc); fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1); exit(1);
} }
memcpy(buf, inline_buf, sizeof(iovec)*old); memcpy(buf, inline_buf, sizeof(iovec)*old);
@ -120,7 +120,7 @@ struct osd_op_buf_list_t
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc); buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf) if (!buf)
{ {
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc); fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1); exit(1);
} }
} }

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <stdio.h> #include <stdio.h>
#include <stdlib.h> #include <stdlib.h>
#include "msgr_rdma.h" #include "msgr_rdma.h"
@ -163,7 +166,8 @@ cleanup:
return NULL; return NULL;
} }
msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge) msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx, uint32_t max_send,
uint32_t max_recv, uint32_t max_sge, uint32_t max_msg)
{ {
msgr_rdma_connection_t *conn = new msgr_rdma_connection_t; msgr_rdma_connection_t *conn = new msgr_rdma_connection_t;
@ -173,6 +177,7 @@ msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx,
conn->max_send = max_send; conn->max_send = max_send;
conn->max_recv = max_recv; conn->max_recv = max_recv;
conn->max_sge = max_sge; conn->max_sge = max_sge;
conn->max_msg = max_msg;
ctx->used_max_cqe += max_send+max_recv; ctx->used_max_cqe += max_send+max_recv;
if (ctx->used_max_cqe > ctx->max_cqe) if (ctx->used_max_cqe > ctx->max_cqe)
@ -293,21 +298,25 @@ int msgr_rdma_connection_t::connect(msgr_rdma_address_t *dest)
return 0; return 0;
} }
bool osd_messenger_t::connect_rdma(int peer_fd, std::string rdma_address) bool osd_messenger_t::connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg)
{ {
// Try to connect to the peer using RDMA // Try to connect to the peer using RDMA
msgr_rdma_address_t addr; msgr_rdma_address_t addr;
if (msgr_rdma_address_t::from_string(rdma_address.c_str(), &addr)) if (msgr_rdma_address_t::from_string(rdma_address.c_str(), &addr))
{ {
auto rdma_conn = msgr_rdma_connection_t::create(rdma_context, max_rdma_send, max_rdma_recv, max_rdma_sge); if (client_max_msg > rdma_max_msg)
{
client_max_msg = rdma_max_msg;
}
auto rdma_conn = msgr_rdma_connection_t::create(rdma_context, rdma_max_send, rdma_max_recv, rdma_max_sge, client_max_msg);
if (rdma_conn) if (rdma_conn)
{ {
int r = rdma_conn->connect(&addr); int r = rdma_conn->connect(&addr);
if (r != 0) if (r != 0)
{ {
delete rdma_conn; delete rdma_conn;
printf( fprintf(
"Failed to connect RDMA queue pair to %s (client %d)\n", stderr, "Failed to connect RDMA queue pair to %s (client %d)\n",
addr.to_string().c_str(), peer_fd addr.to_string().c_str(), peer_fd
); );
} }
@ -337,7 +346,7 @@ static void try_send_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
int err = ibv_post_send(cl->rdma_conn->qp, &wr, &bad_wr); int err = ibv_post_send(cl->rdma_conn->qp, &wr, &bad_wr);
if (err || bad_wr) if (err || bad_wr)
{ {
printf("RDMA send failed: %s\n", strerror(err)); fprintf(stderr, "RDMA send failed: %s\n", strerror(err));
exit(1); exit(1);
} }
cl->rdma_conn->cur_send++; cl->rdma_conn->cur_send++;
@ -351,46 +360,23 @@ bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
// Only send one batch at a time // Only send one batch at a time
return true; return true;
} }
int op_size = 0, op_sge = 0, op_max = rc->max_sge*bs_bitmap_granularity; uint64_t op_size = 0, op_sge = 0;
// FIXME: rc->max_sge should be negotiated between client & server
ibv_sge sge[rc->max_sge]; ibv_sge sge[rc->max_sge];
while (rc->send_pos < cl->send_list.size()) while (rc->send_pos < cl->send_list.size())
{ {
iovec & iov = cl->send_list[rc->send_pos]; iovec & iov = cl->send_list[rc->send_pos];
if (cl->outbox[rc->send_pos].flags & MSGR_SENDP_HDR) if (op_size >= rc->max_msg || op_sge >= rc->max_sge)
{
if (op_sge > 0)
{ {
try_send_rdma_wr(cl, sge, op_sge); try_send_rdma_wr(cl, sge, op_sge);
op_sge = 0; op_sge = 0;
op_size = 0; op_size = 0;
if (rc->cur_send >= rc->max_send) if (rc->cur_send >= rc->max_send)
break;
}
assert(rc->send_buf_pos == 0);
sge[0] = {
.addr = (uintptr_t)iov.iov_base,
.length = (uint32_t)iov.iov_len,
.lkey = rc->ctx->mr->lkey,
};
try_send_rdma_wr(cl, sge, 1);
rc->send_pos++;
if (rc->cur_send >= rc->max_send)
break;
}
else
{ {
if (op_size >= op_max || op_sge >= rc->max_sge)
{
try_send_rdma_wr(cl, sge, op_sge);
op_sge = 0;
op_size = 0;
if (rc->cur_send >= rc->max_send)
break; break;
} }
// Fragment all messages into parts no longer than (max_sge*4k) = 120k on ConnectX-4 }
// Otherwise the client may not be able to receive them in small parts uint32_t len = (uint32_t)(op_size+iov.iov_len-rc->send_buf_pos < rc->max_msg
uint32_t len = (uint32_t)(op_size+iov.iov_len-rc->send_buf_pos < op_max ? iov.iov_len-rc->send_buf_pos : op_max-op_size); ? iov.iov_len-rc->send_buf_pos : rc->max_msg-op_size);
sge[op_sge++] = { sge[op_sge++] = {
.addr = (uintptr_t)(iov.iov_base+rc->send_buf_pos), .addr = (uintptr_t)(iov.iov_base+rc->send_buf_pos),
.length = len, .length = len,
@ -404,7 +390,6 @@ bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
rc->send_buf_pos = 0; rc->send_buf_pos = 0;
} }
} }
}
if (op_sge > 0) if (op_sge > 0)
{ {
try_send_rdma_wr(cl, sge, op_sge); try_send_rdma_wr(cl, sge, op_sge);
@ -423,7 +408,7 @@ static void try_recv_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
int err = ibv_post_recv(cl->rdma_conn->qp, &wr, &bad_wr); int err = ibv_post_recv(cl->rdma_conn->qp, &wr, &bad_wr);
if (err || bad_wr) if (err || bad_wr)
{ {
printf("RDMA receive failed: %s\n", strerror(err)); fprintf(stderr, "RDMA receive failed: %s\n", strerror(err));
exit(1); exit(1);
} }
cl->rdma_conn->cur_recv++; cl->rdma_conn->cur_recv++;
@ -432,53 +417,16 @@ static void try_recv_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
bool osd_messenger_t::try_recv_rdma(osd_client_t *cl) bool osd_messenger_t::try_recv_rdma(osd_client_t *cl)
{ {
auto rc = cl->rdma_conn; auto rc = cl->rdma_conn;
if (rc->cur_recv > 0) while (rc->cur_recv < rc->max_recv)
{ {
return true; void *buf = malloc_or_die(rc->max_msg);
} rc->recv_buffers.push_back(buf);
if (!cl->recv_list.get_size()) ibv_sge sge = {
{ .addr = (uintptr_t)buf,
cl->recv_list.reset(); .length = (uint32_t)rc->max_msg,
cl->read_op = new osd_op_t;
cl->read_op->peer_fd = cl->peer_fd;
cl->read_op->op_type = OSD_OP_IN;
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->read_remaining = OSD_PACKET_SIZE;
cl->read_state = CL_READ_HDR;
}
int op_size = 0, op_sge = 0, op_max = rc->max_sge*bs_bitmap_granularity;
iovec *segments = cl->recv_list.get_iovec();
// FIXME: rc->max_sge should be negotiated between client & server
ibv_sge sge[rc->max_sge];
while (rc->recv_pos < cl->recv_list.get_size())
{
iovec & iov = segments[rc->recv_pos];
if (op_size >= op_max || op_sge >= rc->max_sge)
{
try_recv_rdma_wr(cl, sge, op_sge);
op_sge = 0;
op_size = 0;
if (rc->cur_recv >= rc->max_recv)
break;
}
// Receive in identical (max_sge*4k) fragments
uint32_t len = (uint32_t)(op_size+iov.iov_len-rc->recv_buf_pos < op_max ? iov.iov_len-rc->recv_buf_pos : op_max-op_size);
sge[op_sge++] = {
.addr = (uintptr_t)(iov.iov_base+rc->recv_buf_pos),
.length = len,
.lkey = rc->ctx->mr->lkey, .lkey = rc->ctx->mr->lkey,
}; };
op_size += len; try_recv_rdma_wr(cl, &sge, 1);
rc->recv_buf_pos += len;
if (rc->recv_buf_pos >= iov.iov_len)
{
rc->recv_pos++;
rc->recv_buf_pos = 0;
}
}
if (op_sge > 0)
{
try_recv_rdma_wr(cl, sge, op_sge);
} }
return true; return true;
} }
@ -497,7 +445,7 @@ void osd_messenger_t::handle_rdma_events()
} }
if (ibv_req_notify_cq(rdma_context->cq, 0) != 0) if (ibv_req_notify_cq(rdma_context->cq, 0) != 0)
{ {
printf("Failed to request RDMA completion notification, exiting\n"); fprintf(stderr, "Failed to request RDMA completion notification, exiting\n");
exit(1); exit(1);
} }
ibv_wc wc[RDMA_EVENTS_AT_ONCE]; ibv_wc wc[RDMA_EVENTS_AT_ONCE];
@ -517,37 +465,23 @@ void osd_messenger_t::handle_rdma_events()
osd_client_t *cl = cl_it->second; osd_client_t *cl = cl_it->second;
if (wc[i].status != IBV_WC_SUCCESS) if (wc[i].status != IBV_WC_SUCCESS)
{ {
printf("RDMA work request failed for client %d", client_id); fprintf(stderr, "RDMA work request failed for client %d", client_id);
if (cl->osd_num) if (cl->osd_num)
{ {
printf(" (OSD %lu)", cl->osd_num); fprintf(stderr, " (OSD %lu)", cl->osd_num);
} }
printf(" with status: %s, stopping client\n", ibv_wc_status_str(wc[i].status)); fprintf(stderr, " with status: %s, stopping client\n", ibv_wc_status_str(wc[i].status));
stop_client(client_id); stop_client(client_id);
continue; continue;
} }
if (!is_send) if (!is_send)
{ {
cl->rdma_conn->cur_recv--; cl->rdma_conn->cur_recv--;
if (!cl->rdma_conn->cur_recv) handle_read_buffer(cl, cl->rdma_conn->recv_buffers[0], wc[i].byte_len);
{ free(cl->rdma_conn->recv_buffers[0]);
cl->recv_list.done += cl->rdma_conn->recv_pos; cl->rdma_conn->recv_buffers.erase(cl->rdma_conn->recv_buffers.begin(), cl->rdma_conn->recv_buffers.begin()+1);
cl->rdma_conn->recv_pos = 0;
if (!cl->recv_list.get_size())
{
cl->read_remaining = 0;
if (handle_finished_read(cl))
{
try_recv_rdma(cl); try_recv_rdma(cl);
} }
}
else
{
// Continue to receive data
try_recv_rdma(cl);
}
}
}
else else
{ {
cl->rdma_conn->cur_send--; cl->rdma_conn->cur_send--;

View File

@ -1,3 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once #pragma once
#include <infiniband/verbs.h> #include <infiniband/verbs.h>
#include <string> #include <string>
@ -43,11 +46,13 @@ struct msgr_rdma_connection_t
msgr_rdma_address_t addr; msgr_rdma_address_t addr;
int max_send = 0, max_recv = 0, max_sge = 0; int max_send = 0, max_recv = 0, max_sge = 0;
int cur_send = 0, cur_recv = 0; int cur_send = 0, cur_recv = 0;
uint64_t max_msg = 0;
int send_pos = 0, send_buf_pos = 0; int send_pos = 0, send_buf_pos = 0;
int recv_pos = 0, recv_buf_pos = 0; int recv_pos = 0, recv_buf_pos = 0;
std::vector<void*> recv_buffers;
~msgr_rdma_connection_t(); ~msgr_rdma_connection_t();
static msgr_rdma_connection_t *create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge); static msgr_rdma_connection_t *create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge, uint32_t max_msg);
int connect(msgr_rdma_address_t *dest); int connect(msgr_rdma_address_t *dest);
}; };

View File

@ -72,7 +72,7 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
// this is a client socket, so don't panic on error. just disconnect it // this is a client socket, so don't panic on error. just disconnect it
if (result != 0) if (result != 0)
{ {
printf("Client %d socket read error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result)); fprintf(stderr, "Client %d socket read error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
} }
stop_client(cl->peer_fd); stop_client(cl->peer_fd);
return false; return false;
@ -91,9 +91,38 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
{ {
if (cl->read_iov.iov_base == cl->in_buf) if (cl->read_iov.iov_base == cl->in_buf)
{ {
if (!handle_read_buffer(cl, cl->in_buf, result))
{
goto fin;
}
}
else
{
// Long data
cl->read_remaining -= result;
cl->recv_list.eat(result);
if (cl->recv_list.done >= cl->recv_list.count)
{
handle_finished_read(cl);
}
}
if (result >= cl->read_iov.iov_len)
{
ret = true;
}
}
fin:
for (auto cb: set_immediate)
{
cb();
}
set_immediate.clear();
return ret;
}
bool osd_messenger_t::handle_read_buffer(osd_client_t *cl, void *curbuf, int remain)
{
// Compose operation(s) from the buffer // Compose operation(s) from the buffer
int remain = result;
void *curbuf = cl->in_buf;
while (remain > 0) while (remain > 0)
{ {
if (!cl->read_op) if (!cl->read_op)
@ -130,33 +159,11 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
{ {
if (!handle_finished_read(cl)) if (!handle_finished_read(cl))
{ {
goto fin; return false;
} }
} }
} }
} return true;
else
{
// Long data
cl->read_remaining -= result;
cl->recv_list.eat(result);
if (cl->recv_list.done >= cl->recv_list.count)
{
handle_finished_read(cl);
}
}
if (result >= cl->read_iov.iov_len)
{
ret = true;
}
}
fin:
for (auto cb: set_immediate)
{
cb();
}
set_immediate.clear();
return ret;
} }
bool osd_messenger_t::handle_finished_read(osd_client_t *cl) bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
@ -170,7 +177,7 @@ bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
handle_op_hdr(cl); handle_op_hdr(cl);
else else
{ {
printf("Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd); fprintf(stderr, "Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
stop_client(cl->peer_fd); stop_client(cl->peer_fd);
return false; return false;
} }
@ -285,7 +292,7 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
if (req_it == cl->sent_ops.end()) if (req_it == cl->sent_ops.end())
{ {
// Command out of sync. Drop connection // Command out of sync. Drop connection
printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id); fprintf(stderr, "Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id);
stop_client(cl->peer_fd); stop_client(cl->peer_fd);
return false; return false;
} }
@ -300,7 +307,7 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
if (op->reply.hdr.retval >= 0 && (op->reply.hdr.retval != expected_size || bmp_len > op->bitmap_len)) if (op->reply.hdr.retval >= 0 && (op->reply.hdr.retval != expected_size || bmp_len > op->bitmap_len))
{ {
// Check reply length to not overflow the buffer // Check reply length to not overflow the buffer
printf("Client %d read reply of different length: expected %u+%u, got %ld+%u\n", fprintf(stderr, "Client %d read reply of different length: expected %u+%u, got %ld+%u\n",
cl->peer_fd, expected_size, op->bitmap_len, op->reply.hdr.retval, bmp_len); cl->peer_fd, expected_size, op->bitmap_len, op->reply.hdr.retval, bmp_len);
cl->sent_ops[op->req.hdr.id] = op; cl->sent_ops[op->req.hdr.id] = op;
stop_client(cl->peer_fd); stop_client(cl->peer_fd);

View File

@ -227,7 +227,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
if (result < 0 && result != -EAGAIN) if (result < 0 && result != -EAGAIN)
{ {
// this is a client socket, so don't panic. just disconnect it // this is a client socket, so don't panic. just disconnect it
printf("Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result)); fprintf(stderr, "Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
stop_client(cl->peer_fd); stop_client(cl->peer_fd);
return; return;
} }
@ -272,7 +272,10 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
{ {
// FIXME: Do something better than just forgetting the FD // FIXME: Do something better than just forgetting the FD
// FIXME: Ignore pings during RDMA state transition // FIXME: Ignore pings during RDMA state transition
printf("Successfully connected with client %d using RDMA\n", cl->peer_fd); if (log_level > 0)
{
fprintf(stderr, "Successfully connected with client %d using RDMA\n", cl->peer_fd);
}
cl->peer_state = PEER_RDMA; cl->peer_state = PEER_RDMA;
tfd->set_fd_handler(cl->peer_fd, false, NULL); tfd->set_fd_handler(cl->peer_fd, false, NULL);
// Add the initial receive request // Add the initial receive request

View File

@ -58,11 +58,11 @@ void osd_messenger_t::stop_client(int peer_fd, bool force)
{ {
if (cl->osd_num) if (cl->osd_num)
{ {
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num); fprintf(stderr, "[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
} }
else else
{ {
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd); fprintf(stderr, "[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
} }
} }
// First set state to STOPPED so another stop_client() call doesn't try to free it again // First set state to STOPPED so another stop_client() call doesn't try to free it again

View File

@ -10,6 +10,7 @@
#include <netinet/tcp.h> #include <netinet/tcp.h>
#include <arpa/inet.h> #include <arpa/inet.h>
#include <sys/un.h> #include <sys/un.h>
#include <sys/epoll.h>
#include <unistd.h> #include <unistd.h>
#include <fcntl.h> #include <fcntl.h>
#include <signal.h> #include <signal.h>
@ -116,7 +117,7 @@ public:
"Vitastor NBD proxy\n" "Vitastor NBD proxy\n"
"(c) Vitaliy Filippov, 2020-2021 (VNPL-1.1)\n\n" "(c) Vitaliy Filippov, 2020-2021 (VNPL-1.1)\n\n"
"USAGE:\n" "USAGE:\n"
" %s map --etcd_address <etcd_address> (--image <image> | --pool <pool> --inode <inode> --size <size in bytes>)\n" " %s map [--etcd_address <etcd_address>] (--image <image> | --pool <pool> --inode <inode> --size <size in bytes>)\n"
" %s unmap /dev/nbd0\n" " %s unmap /dev/nbd0\n"
" %s list [--json]\n", " %s list [--json]\n",
exe_name, exe_name, exe_name exe_name, exe_name, exe_name
@ -146,11 +147,6 @@ public:
void start(json11::Json cfg) void start(json11::Json cfg)
{ {
// Check options // Check options
if (cfg["etcd_address"].string_value() == "")
{
fprintf(stderr, "etcd_address is missing\n");
exit(1);
}
if (cfg["image"].string_value() != "") if (cfg["image"].string_value() != "")
{ {
// Use image name // Use image name
@ -205,9 +201,10 @@ public:
fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK); fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
nbd_fd = sockfd[0]; nbd_fd = sockfd[0];
load_module(); load_module();
bool bg = cfg["foreground"].is_null();
if (!cfg["dev_num"].is_null()) if (!cfg["dev_num"].is_null())
{ {
if (run_nbd(sockfd, cfg["dev_num"].int64_value(), device_size, NBD_FLAG_SEND_FLUSH, 30) < 0) if (run_nbd(sockfd, cfg["dev_num"].int64_value(), device_size, NBD_FLAG_SEND_FLUSH, 30, bg) < 0)
{ {
perror("run_nbd"); perror("run_nbd");
exit(1); exit(1);
@ -219,7 +216,7 @@ public:
int i = 0; int i = 0;
while (true) while (true)
{ {
int r = run_nbd(sockfd, i, device_size, NBD_FLAG_SEND_FLUSH, 30); int r = run_nbd(sockfd, i, device_size, NBD_FLAG_SEND_FLUSH, 30, bg);
if (r == 0) if (r == 0)
{ {
printf("/dev/nbd%d\n", i); printf("/dev/nbd%d\n", i);
@ -242,7 +239,7 @@ public:
} }
} }
} }
if (cfg["foreground"].is_null()) if (bg)
{ {
daemonize(); daemonize();
} }
@ -259,22 +256,47 @@ public:
}; };
ringloop->register_consumer(&consumer); ringloop->register_consumer(&consumer);
// Add FD to epoll // Add FD to epoll
epmgr->tfd->set_fd_handler(sockfd[0], false, [this](int peer_fd, int epoll_events) bool stop = false;
epmgr->tfd->set_fd_handler(sockfd[0], false, [this, &stop](int peer_fd, int epoll_events)
{
if (epoll_events & EPOLLRDHUP)
{
close(peer_fd);
stop = true;
}
else
{ {
read_ready++; read_ready++;
submit_read(); submit_read();
}
}); });
while (1) while (!stop)
{ {
ringloop->loop(); ringloop->loop();
ringloop->wait(); ringloop->wait();
} }
// FIXME: Cleanup when exiting stop = false;
cluster_op_t *close_sync = new cluster_op_t;
close_sync->opcode = OSD_OP_SYNC;
close_sync->callback = [this, &stop](cluster_op_t *op)
{
stop = true;
delete op;
};
cli->execute(close_sync);
while (!stop)
{
ringloop->loop();
ringloop->wait();
}
delete cli;
delete epmgr;
delete ringloop;
} }
void load_module() void load_module()
{ {
if (access("/sys/module/nbd", F_OK)) if (access("/sys/module/nbd", F_OK) == 0)
{ {
return; return;
} }
@ -416,7 +438,7 @@ public:
} }
protected: protected:
int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout) int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout, bool bg)
{ {
// Check handle size // Check handle size
assert(sizeof(cur_req.handle) == 8); assert(sizeof(cur_req.handle) == 8);
@ -464,11 +486,14 @@ protected:
{ {
// Run in child // Run in child
close(sockfd[0]); close(sockfd[0]);
if (bg)
{
daemonize();
}
r = ioctl(nbd, NBD_DO_IT); r = ioctl(nbd, NBD_DO_IT);
if (r < 0) if (r < 0)
{ {
fprintf(stderr, "NBD device terminated with error: %s\n", strerror(errno)); fprintf(stderr, "NBD device terminated with error: %s\n", strerror(errno));
kill(getppid(), SIGTERM);
} }
close(sockfd[1]); close(sockfd[1]);
ioctl(nbd, NBD_CLEAR_QUE); ioctl(nbd, NBD_CLEAR_QUE);

View File

@ -10,31 +10,39 @@
#include "osd.h" #include "osd.h"
#include "http_client.h" #include "http_client.h"
osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop) static blockstore_config_t json_to_bs(const json11::Json::object & config)
{ {
bs_block_size = strtoull(config["block_size"].c_str(), NULL, 10); blockstore_config_t bs;
bs_bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10); for (auto kv: config)
if (!bs_block_size) {
bs_block_size = DEFAULT_BLOCK_SIZE; if (kv.second.is_string())
if (!bs_bitmap_granularity) bs[kv.first] = kv.second.string_value();
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY; else
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8; bs[kv.first] = kv.second.dump();
}
return bs;
}
osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
{
zero_buffer_size = 1<<20; zero_buffer_size = 1<<20;
zero_buffer = malloc_or_die(zero_buffer_size); zero_buffer = malloc_or_die(zero_buffer_size);
memset(zero_buffer, 0, zero_buffer_size); memset(zero_buffer, 0, zero_buffer_size);
this->config = config;
this->ringloop = ringloop; this->ringloop = ringloop;
this->config = msgr.read_config(config).object_items();
if (this->config.find("log_level") == this->config.end())
this->config["log_level"] = 1;
parse_config(this->config);
epmgr = new epoll_manager_t(ringloop); epmgr = new epoll_manager_t(ringloop);
// FIXME: Use timerfd_interval based directly on io_uring // FIXME: Use timerfd_interval based directly on io_uring
this->tfd = epmgr->tfd; this->tfd = epmgr->tfd;
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config // FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
this->bs = new blockstore_t(config, ringloop, tfd); auto bs_cfg = json_to_bs(this->config);
this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
parse_config(config);
this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id) this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{ {
@ -66,63 +74,71 @@ osd_t::~osd_t()
free(zero_buffer); free(zero_buffer);
} }
void osd_t::parse_config(blockstore_config_t & config) void osd_t::parse_config(const json11::Json & config)
{ {
if (config.find("log_level") == config.end()) st_cli.parse_config(config);
config["log_level"] = "1"; msgr.parse_config(config);
log_level = strtoull(config["log_level"].c_str(), NULL, 10); // OSD number
// Initial startup configuration osd_num = config["osd_num"].uint64_value();
json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config);
etcd_report_interval = strtoull(config["etcd_report_interval"].c_str(), NULL, 10);
if (etcd_report_interval <= 0)
etcd_report_interval = 30;
osd_num = strtoull(config["osd_num"].c_str(), NULL, 10);
if (!osd_num) if (!osd_num)
throw std::runtime_error("osd_num is required in the configuration"); throw std::runtime_error("osd_num is required in the configuration");
msgr.osd_num = osd_num; msgr.osd_num = osd_num;
// Vital Blockstore parameters
bs_block_size = config["block_size"].uint64_value();
if (!bs_block_size)
bs_block_size = DEFAULT_BLOCK_SIZE;
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_bitmap_granularity)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
// Bind address
bind_address = config["bind_address"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = config["bind_port"].uint64_value();
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
// OSD configuration
log_level = config["log_level"].uint64_value();
etcd_report_interval = config["etcd_report_interval"].uint64_value();
if (etcd_report_interval <= 0)
etcd_report_interval = 30;
readonly = config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes";
run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no"; run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no";
no_rebalance = config["no_rebalance"] == "true" || config["no_rebalance"] == "1" || config["no_rebalance"] == "yes"; no_rebalance = config["no_rebalance"] == "true" || config["no_rebalance"] == "1" || config["no_rebalance"] == "yes";
no_recovery = config["no_recovery"] == "true" || config["no_recovery"] == "1" || config["no_recovery"] == "yes"; no_recovery = config["no_recovery"] == "true" || config["no_recovery"] == "1" || config["no_recovery"] == "yes";
allow_test_ops = config["allow_test_ops"] == "true" || config["allow_test_ops"] == "1" || config["allow_test_ops"] == "yes"; allow_test_ops = config["allow_test_ops"] == "true" || config["allow_test_ops"] == "1" || config["allow_test_ops"] == "yes";
// Cluster configuration
bind_address = config["bind_address"];
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = stoull_full(config["bind_port"]);
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
if (config["immediate_commit"] == "all") if (config["immediate_commit"] == "all")
immediate_commit = IMMEDIATE_ALL; immediate_commit = IMMEDIATE_ALL;
else if (config["immediate_commit"] == "small") else if (config["immediate_commit"] == "small")
immediate_commit = IMMEDIATE_SMALL; immediate_commit = IMMEDIATE_SMALL;
if (config.find("autosync_interval") != config.end()) else
immediate_commit = IMMEDIATE_NONE;
if (!config["autosync_interval"].is_null())
{ {
autosync_interval = strtoull(config["autosync_interval"].c_str(), NULL, 10); // Allow to set it to 0
autosync_interval = config["autosync_interval"].uint64_value();
if (autosync_interval > MAX_AUTOSYNC_INTERVAL) if (autosync_interval > MAX_AUTOSYNC_INTERVAL)
autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; autosync_interval = DEFAULT_AUTOSYNC_INTERVAL;
} }
if (config.find("client_queue_depth") != config.end()) if (!config["client_queue_depth"].is_null())
{ {
client_queue_depth = strtoull(config["client_queue_depth"].c_str(), NULL, 10); client_queue_depth = config["client_queue_depth"].uint64_value();
if (client_queue_depth < 128) if (client_queue_depth < 128)
client_queue_depth = 128; client_queue_depth = 128;
} }
recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10); recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE) if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sync_batch = strtoull(config["recovery_sync_batch"].c_str(), NULL, 10); recovery_sync_batch = config["recovery_sync_batch"].uint64_value();
if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE) if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE)
recovery_sync_batch = DEFAULT_RECOVERY_BATCH; recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes") print_stats_interval = config["print_stats_interval"].uint64_value();
readonly = true;
print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
if (!print_stats_interval) if (!print_stats_interval)
print_stats_interval = 3; print_stats_interval = 3;
slow_log_interval = strtoull(config["slow_log_interval"].c_str(), NULL, 10); slow_log_interval = config["slow_log_interval"].uint64_value();
if (!slow_log_interval) if (!slow_log_interval)
slow_log_interval = 10; slow_log_interval = 10;
msgr.parse_config(json_config);
} }
void osd_t::bind_socket() void osd_t::bind_socket()

View File

@ -92,7 +92,7 @@ class osd_t
{ {
// config // config
blockstore_config_t config; json11::Json::object config;
int etcd_report_interval = 30; int etcd_report_interval = 30;
bool readonly = false; bool readonly = false;
@ -167,7 +167,7 @@ class osd_t
uint64_t recovery_stat_bytes[2][2] = { 0 }; uint64_t recovery_stat_bytes[2][2] = { 0 };
// cluster connection // cluster connection
void parse_config(blockstore_config_t & config); void parse_config(const json11::Json & config);
void init_cluster(); void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd); void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num); void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
@ -268,7 +268,7 @@ class osd_t
} }
public: public:
osd_t(blockstore_config_t & config, ring_loop_t *ringloop); osd_t(const json11::Json & config, ring_loop_t *ringloop);
~osd_t(); ~osd_t();
void force_stop(int exitcode); void force_stop(int exitcode);
bool shutdown(); bool shutdown();

View File

@ -21,7 +21,7 @@ void osd_t::init_cluster()
{ {
// Test version of clustering code with 1 pool, 1 PG and 2 peers // Test version of clustering code with 1 pool, 1 PG and 2 peers
// Example: peers = 2:127.0.0.1:11204,3:127.0.0.1:11205 // Example: peers = 2:127.0.0.1:11204,3:127.0.0.1:11205
std::string peerstr = config["peers"]; std::string peerstr = config["peers"].string_value();
while (peerstr.size()) while (peerstr.size())
{ {
int pos = peerstr.find(','); int pos = peerstr.find(',');
@ -340,21 +340,10 @@ void osd_t::on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num)
void osd_t::on_load_config_hook(json11::Json::object & global_config) void osd_t::on_load_config_hook(json11::Json::object & global_config)
{ {
blockstore_config_t osd_config = this->config; json11::Json::object osd_config = this->config;
for (auto & cfg_var: global_config) for (auto & kv: global_config)
{ if (osd_config.find(kv.first) == osd_config.end())
if (this->config.find(cfg_var.first) == this->config.end()) osd_config[kv.first] = kv.second;
{
if (cfg_var.second.is_string())
{
osd_config[cfg_var.first] = cfg_var.second.string_value();
}
else
{
osd_config[cfg_var.first] = cfg_var.second.dump();
}
}
}
parse_config(osd_config); parse_config(osd_config);
bind_socket(); bind_socket();
acquire_lease(); acquire_lease();
@ -380,7 +369,7 @@ void osd_t::acquire_lease()
etcd_lease_id = data["ID"].string_value(); etcd_lease_id = data["ID"].string_value();
create_osd_state(); create_osd_state();
}); });
printf("[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num, config["etcd_address"].c_str(), etcd_report_interval); printf("[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num, config["etcd_address"].string_value().c_str(), etcd_report_interval);
tfd->set_timer(etcd_report_interval*1000, true, [this](int timer_id) tfd->set_timer(etcd_report_interval*1000, true, [this](int timer_id)
{ {
renew_lease(); renew_lease();

View File

@ -29,13 +29,13 @@ int main(int narg, char *args[])
perror("BUG: too small packet size"); perror("BUG: too small packet size");
return 1; return 1;
} }
blockstore_config_t config; json11::Json::object config;
for (int i = 1; i < narg; i++) for (int i = 1; i < narg; i++)
{ {
if (args[i][0] == '-' && args[i][1] == '-' && i < narg-1) if (args[i][0] == '-' && args[i][1] == '-' && i < narg-1)
{ {
char *opt = args[i]+2; char *opt = args[i]+2;
config[opt] = args[++i]; config[std::string(opt)] = std::string(args[++i]);
} }
} }
signal(SIGINT, handle_sigint); signal(SIGINT, handle_sigint);

View File

@ -191,6 +191,9 @@ struct __attribute__((__packed__)) osd_op_rw_t
uint32_t flags; uint32_t flags;
// inode metadata revision // inode metadata revision
uint64_t meta_revision; uint64_t meta_revision;
// object version for atomic "CAS" (compare-and-set) writes
// writes and deletes fail with -EINTR if object version differs from (version-1)
uint64_t version;
}; };
struct __attribute__((__packed__)) osd_reply_rw_t struct __attribute__((__packed__)) osd_reply_rw_t
@ -199,6 +202,8 @@ struct __attribute__((__packed__)) osd_reply_rw_t
// for reads: bitmap length // for reads: bitmap length
uint32_t bitmap_len; uint32_t bitmap_len;
uint32_t pad0; uint32_t pad0;
// for reads: object version
uint64_t version;
}; };
// sync to the primary OSD // sync to the primary OSD

View File

@ -67,7 +67,9 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
} }
// Find parents from the same pool. Optimized reads only work within pools // Find parents from the same pool. Optimized reads only work within pools
while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id && while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id &&
INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id) INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id &&
// Check for loops
inode_it->second.parent_id != cur_op->req.rw.inode)
{ {
chain_size++; chain_size++;
inode_it = st_cli.inode_config.find(inode_it->second.parent_id); inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
@ -123,7 +125,10 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
int chain_num = 0; int chain_num = 0;
op_data->read_chain[chain_num++] = cur_op->req.rw.inode; op_data->read_chain[chain_num++] = cur_op->req.rw.inode;
auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode); auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode);
while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id) while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id &&
INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id &&
// Check for loops
inode_it->second.parent_id != cur_op->req.rw.inode)
{ {
op_data->read_chain[chain_num++] = inode_it->second.parent_id; op_data->read_chain[chain_num++] = inode_it->second.parent_id;
inode_it = st_cli.inode_config.find(inode_it->second.parent_id); inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
@ -222,6 +227,7 @@ resume_2:
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO); finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
return; return;
} }
cur_op->reply.rw.version = op_data->fact_ver;
cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size; cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
if (op_data->degraded) if (op_data->degraded)
{ {
@ -343,6 +349,12 @@ resume_3:
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO); pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return; return;
} }
// Check CAS version
if (cur_op->req.rw.version && op_data->fact_ver != (cur_op->req.rw.version-1))
{
cur_op->reply.hdr.retval = -EINTR;
goto continue_others;
}
// Save version override for parallel reads // Save version override for parallel reads
pg.ver_override[op_data->oid] = op_data->fact_ver; pg.ver_override[op_data->oid] = op_data->fact_ver;
// Submit deletes // Submit deletes
@ -370,6 +382,8 @@ resume_5:
free_object_state(pg, &op_data->object_state); free_object_state(pg, &op_data->object_state);
} }
pg.total_count--; pg.total_count--;
cur_op->reply.hdr.retval = 0;
continue_others:
osd_op_t *next_op = NULL; osd_op_t *next_op = NULL;
auto next_it = pg.write_queue.find(op_data->oid); auto next_it = pg.write_queue.find(op_data->oid);
if (next_it != pg.write_queue.end() && next_it->second == cur_op) if (next_it != pg.write_queue.end() && next_it->second == cur_op)
@ -378,7 +392,7 @@ resume_5:
if (next_it != pg.write_queue.end() && next_it->first == op_data->oid) if (next_it != pg.write_queue.end() && next_it->first == op_data->oid)
next_op = next_it->second; next_op = next_it->second;
} }
finish_op(cur_op, cur_op->req.rw.len); finish_op(cur_op, cur_op->reply.hdr.retval);
if (next_op) if (next_op)
{ {
// Continue next write to the same object // Continue next write to the same object

View File

@ -65,7 +65,10 @@ int osd_t::read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state)
auto vo_it = pg.ver_override.find(cur_oid); auto vo_it = pg.ver_override.find(cur_oid);
auto read_version = (vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX); auto read_version = (vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX);
// Read bitmap synchronously from the local database // Read bitmap synchronously from the local database
bs->read_bitmap(cur_oid, read_version, op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size, NULL); bs->read_bitmap(
cur_oid, read_version, op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
!chain_num ? &cur_op->reply.rw.version : NULL
);
} }
} }
else else
@ -228,7 +231,10 @@ int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
// Read bitmap synchronously from the local database // Read bitmap synchronously from the local database
for (int j = prev; j <= i; j++) for (int j = prev; j <= i; j++)
{ {
bs->read_bitmap((*bitmap_requests)[j].oid, (*bitmap_requests)[j].version, (*bitmap_requests)[j].bmp_buf, NULL); bs->read_bitmap(
(*bitmap_requests)[j].oid, (*bitmap_requests)[j].version, (*bitmap_requests)[j].bmp_buf,
(*bitmap_requests)[j].oid.inode == cur_op->req.rw.inode ? &cur_op->reply.rw.version : NULL
);
} }
} }
else else
@ -264,6 +270,10 @@ int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
for (int j = prev; j <= i; j++) for (int j = prev; j <= i; j++)
{ {
memcpy((*bitmap_requests)[j].bmp_buf, cur_buf, clean_entry_bitmap_size); memcpy((*bitmap_requests)[j].bmp_buf, cur_buf, clean_entry_bitmap_size);
if ((*bitmap_requests)[j].oid.inode == cur_op->req.rw.inode)
{
memcpy(&cur_op->reply.rw.version, cur_buf-8, 8);
}
cur_buf += 8 + clean_entry_bitmap_size; cur_buf += 8 + clean_entry_bitmap_size;
} }
} }

View File

@ -96,6 +96,12 @@ resume_3:
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO); pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return; return;
} }
// Check CAS version
if (cur_op->req.rw.version && op_data->fact_ver != (cur_op->req.rw.version-1))
{
cur_op->reply.hdr.retval = -EINTR;
goto continue_others;
}
if (op_data->scheme == POOL_SCHEME_REPLICATED) if (op_data->scheme == POOL_SCHEME_REPLICATED)
{ {
// Set bitmap bits // Set bitmap bits
@ -265,7 +271,7 @@ continue_others:
next_op = next_it->second; next_op = next_it->second;
} }
// finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :) // finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :)
finish_op(cur_op, cur_op->req.rw.len); finish_op(cur_op, cur_op->reply.hdr.retval);
if (next_op) if (next_op)
{ {
// Continue next write to the same object // Continue next write to the same object

View File

@ -169,11 +169,12 @@ void osd_t::exec_show_config(osd_op_t *cur_op)
if (req_json["connect_rdma"].is_string()) if (req_json["connect_rdma"].is_string())
{ {
// Peer is trying to connect using RDMA, try to satisfy him // Peer is trying to connect using RDMA, try to satisfy him
bool ok = msgr.connect_rdma(cur_op->peer_fd, req_json["connect_rdma"].string_value()); bool ok = msgr.connect_rdma(cur_op->peer_fd, req_json["connect_rdma"].string_value(), req_json["rdma_max_msg"].uint64_value());
if (ok) if (ok)
{ {
wire_config["rdma_connected"] = true; auto rc = msgr.clients.at(cur_op->peer_fd)->rdma_conn;
wire_config["rdma_address"] = msgr.clients.at(cur_op->peer_fd)->rdma_conn->addr.to_string(); wire_config["rdma_address"] = rc->addr.to_string();
wire_config["rdma_max_msg"] = rc->max_msg;
} }
} }
} }

View File

@ -26,7 +26,7 @@
#define qobject_unref QDECREF #define qobject_unref QDECREF
#endif #endif
#include "qemu_proxy.h" #include "vitastor_c.h"
void qemu_module_dummy(void) void qemu_module_dummy(void)
{ {
@ -40,6 +40,7 @@ typedef struct VitastorClient
{ {
void *proxy; void *proxy;
void *watch; void *watch;
char *config_path;
char *etcd_host; char *etcd_host;
char *etcd_prefix; char *etcd_prefix;
char *image; char *image;
@ -47,6 +48,11 @@ typedef struct VitastorClient
uint64_t pool; uint64_t pool;
uint64_t size; uint64_t size;
long readonly; long readonly;
int use_rdma;
char *rdma_device;
int rdma_port_num;
int rdma_gid_index;
int rdma_mtu;
QemuMutex mutex; QemuMutex mutex;
} VitastorClient; } VitastorClient;
@ -60,10 +66,11 @@ typedef struct VitastorRPC
} VitastorRPC; } VitastorRPC;
static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task); static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task);
static void vitastor_co_generic_bh_cb(long retval, void *opaque); static void vitastor_co_generic_bh_cb(void *opaque, long retval);
static void vitastor_co_read_cb(void *opaque, long retval, uint64_t version);
static void vitastor_close(BlockDriverState *bs); static void vitastor_close(BlockDriverState *bs);
static char *qemu_rbd_next_tok(char *src, char delim, char **p) static char *qemu_vitastor_next_tok(char *src, char delim, char **p)
{ {
char *end; char *end;
*p = NULL; *p = NULL;
@ -82,7 +89,7 @@ static char *qemu_rbd_next_tok(char *src, char delim, char **p)
return src; return src;
} }
static void qemu_rbd_unescape(char *src) static void qemu_vitastor_unescape(char *src)
{ {
char *p; char *p;
for (p = src; *src; ++src, ++p) for (p = src; *src; ++src, ++p)
@ -95,7 +102,8 @@ static void qemu_rbd_unescape(char *src)
} }
// vitastor[:key=value]* // vitastor[:key=value]*
// vitastor:etcd_host=127.0.0.1:inode=1:pool=1 // vitastor[:etcd_host=127.0.0.1]:inode=1:pool=1[:rdma_gid_index=3]
// vitastor:config_path=/etc/vitastor/vitastor.conf:image=testimg
static void vitastor_parse_filename(const char *filename, QDict *options, Error **errp) static void vitastor_parse_filename(const char *filename, QDict *options, Error **errp)
{ {
const char *start; const char *start;
@ -114,16 +122,22 @@ static void vitastor_parse_filename(const char *filename, QDict *options, Error
while (p) while (p)
{ {
char *name, *value; char *name, *value;
name = qemu_rbd_next_tok(p, '=', &p); name = qemu_vitastor_next_tok(p, '=', &p);
if (!p) if (!p)
{ {
error_setg(errp, "conf option %s has no value", name); error_setg(errp, "conf option %s has no value", name);
break; break;
} }
qemu_rbd_unescape(name); qemu_vitastor_unescape(name);
value = qemu_rbd_next_tok(p, ':', &p); value = qemu_vitastor_next_tok(p, ':', &p);
qemu_rbd_unescape(value); qemu_vitastor_unescape(value);
if (!strcmp(name, "inode") || !strcmp(name, "pool") || !strcmp(name, "size")) if (!strcmp(name, "inode") ||
!strcmp(name, "pool") ||
!strcmp(name, "size") ||
!strcmp(name, "use_rdma") ||
!strcmp(name, "rdma_port_num") ||
!strcmp(name, "rdma_gid_index") ||
!strcmp(name, "rdma_mtu"))
{ {
unsigned long long num_val; unsigned long long num_val;
if (parse_uint_full(value, &num_val, 0)) if (parse_uint_full(value, &num_val, 0))
@ -157,11 +171,6 @@ static void vitastor_parse_filename(const char *filename, QDict *options, Error
goto out; goto out;
} }
} }
if (!qdict_get_str(options, "etcd_host"))
{
error_setg(errp, "etcd_host is missing");
goto out;
}
out: out:
g_free(buf); g_free(buf);
@ -175,7 +184,7 @@ static void coroutine_fn vitastor_co_get_metadata(VitastorRPC *task)
task->co = qemu_coroutine_self(); task->co = qemu_coroutine_self();
qemu_mutex_lock(&client->mutex); qemu_mutex_lock(&client->mutex);
vitastor_proxy_watch_metadata(client->proxy, client->image, vitastor_co_generic_bh_cb, task); vitastor_c_watch_inode(client->proxy, client->image, vitastor_co_generic_bh_cb, task);
qemu_mutex_unlock(&client->mutex); qemu_mutex_unlock(&client->mutex);
while (!task->complete) while (!task->complete)
@ -189,9 +198,19 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
VitastorClient *client = bs->opaque; VitastorClient *client = bs->opaque;
int64_t ret = 0; int64_t ret = 0;
qemu_mutex_init(&client->mutex); qemu_mutex_init(&client->mutex);
client->config_path = g_strdup(qdict_get_try_str(options, "config_path"));
// FIXME: Rename to etcd_address
client->etcd_host = g_strdup(qdict_get_try_str(options, "etcd_host")); client->etcd_host = g_strdup(qdict_get_try_str(options, "etcd_host"));
client->etcd_prefix = g_strdup(qdict_get_try_str(options, "etcd_prefix")); client->etcd_prefix = g_strdup(qdict_get_try_str(options, "etcd_prefix"));
client->proxy = vitastor_proxy_create(bdrv_get_aio_context(bs), client->etcd_host, client->etcd_prefix); client->use_rdma = qdict_get_try_int(options, "use_rdma", -1);
client->rdma_device = g_strdup(qdict_get_try_str(options, "rdma_device"));
client->rdma_port_num = qdict_get_try_int(options, "rdma_port_num", 0);
client->rdma_gid_index = qdict_get_try_int(options, "rdma_gid_index", 0);
client->rdma_mtu = qdict_get_try_int(options, "rdma_mtu", 0);
client->proxy = vitastor_c_create_qemu(
(QEMUSetFDHandler*)aio_set_fd_handler, bdrv_get_aio_context(bs), client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
client->image = g_strdup(qdict_get_try_str(options, "image")); client->image = g_strdup(qdict_get_try_str(options, "image"));
client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0; client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
if (client->image) if (client->image)
@ -210,9 +229,9 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
} }
BDRV_POLL_WHILE(bs, !task.complete); BDRV_POLL_WHILE(bs, !task.complete);
client->watch = (void*)task.ret; client->watch = (void*)task.ret;
client->readonly = client->readonly || vitastor_proxy_get_readonly(client->watch); client->readonly = client->readonly || vitastor_c_inode_get_readonly(client->watch);
client->size = vitastor_proxy_get_size(client->watch); client->size = vitastor_c_inode_get_size(client->watch);
if (!vitastor_proxy_get_inode_num(client->watch)) if (!vitastor_c_inode_get_num(client->watch))
{ {
error_setg(errp, "image does not exist"); error_setg(errp, "image does not exist");
vitastor_close(bs); vitastor_close(bs);
@ -241,6 +260,12 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
} }
bs->total_sectors = client->size / BDRV_SECTOR_SIZE; bs->total_sectors = client->size / BDRV_SECTOR_SIZE;
//client->aio_context = bdrv_get_aio_context(bs); //client->aio_context = bdrv_get_aio_context(bs);
qdict_del(options, "use_rdma");
qdict_del(options, "rdma_mtu");
qdict_del(options, "rdma_gid_index");
qdict_del(options, "rdma_port_num");
qdict_del(options, "rdma_device");
qdict_del(options, "config_path");
qdict_del(options, "etcd_host"); qdict_del(options, "etcd_host");
qdict_del(options, "etcd_prefix"); qdict_del(options, "etcd_prefix");
qdict_del(options, "image"); qdict_del(options, "image");
@ -253,8 +278,11 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
static void vitastor_close(BlockDriverState *bs) static void vitastor_close(BlockDriverState *bs)
{ {
VitastorClient *client = bs->opaque; VitastorClient *client = bs->opaque;
vitastor_proxy_destroy(client->proxy); vitastor_c_destroy(client->proxy);
qemu_mutex_destroy(&client->mutex); qemu_mutex_destroy(&client->mutex);
if (client->config_path)
g_free(client->config_path);
if (client->etcd_host)
g_free(client->etcd_host); g_free(client->etcd_host);
if (client->etcd_prefix) if (client->etcd_prefix)
g_free(client->etcd_prefix); g_free(client->etcd_prefix);
@ -365,7 +393,7 @@ static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task)
}; };
} }
static void vitastor_co_generic_bh_cb(long retval, void *opaque) static void vitastor_co_generic_bh_cb(void *opaque, long retval)
{ {
VitastorRPC *task = opaque; VitastorRPC *task = opaque;
task->ret = retval; task->ret = retval;
@ -381,6 +409,11 @@ static void vitastor_co_generic_bh_cb(long retval, void *opaque)
} }
} }
static void vitastor_co_read_cb(void *opaque, long retval, uint64_t version)
{
vitastor_co_generic_bh_cb(opaque, retval);
}
static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags) static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *iov, int flags)
{ {
VitastorClient *client = bs->opaque; VitastorClient *client = bs->opaque;
@ -388,9 +421,9 @@ static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset
vitastor_co_init_task(bs, &task); vitastor_co_init_task(bs, &task);
task.iov = iov; task.iov = iov;
uint64_t inode = client->watch ? vitastor_proxy_get_inode_num(client->watch) : client->inode; uint64_t inode = client->watch ? vitastor_c_inode_get_num(client->watch) : client->inode;
qemu_mutex_lock(&client->mutex); qemu_mutex_lock(&client->mutex);
vitastor_proxy_rw(0, client->proxy, inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task); vitastor_c_read(client->proxy, inode, offset, bytes, iov->iov, iov->niov, vitastor_co_read_cb, &task);
qemu_mutex_unlock(&client->mutex); qemu_mutex_unlock(&client->mutex);
while (!task.complete) while (!task.complete)
@ -408,9 +441,9 @@ static int coroutine_fn vitastor_co_pwritev(BlockDriverState *bs, uint64_t offse
vitastor_co_init_task(bs, &task); vitastor_co_init_task(bs, &task);
task.iov = iov; task.iov = iov;
uint64_t inode = client->watch ? vitastor_proxy_get_inode_num(client->watch) : client->inode; uint64_t inode = client->watch ? vitastor_c_inode_get_num(client->watch) : client->inode;
qemu_mutex_lock(&client->mutex); qemu_mutex_lock(&client->mutex);
vitastor_proxy_rw(1, client->proxy, inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task); vitastor_c_write(client->proxy, inode, offset, bytes, 0, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
qemu_mutex_unlock(&client->mutex); qemu_mutex_unlock(&client->mutex);
while (!task.complete) while (!task.complete)
@ -440,7 +473,7 @@ static int coroutine_fn vitastor_co_flush(BlockDriverState *bs)
vitastor_co_init_task(bs, &task); vitastor_co_init_task(bs, &task);
qemu_mutex_lock(&client->mutex); qemu_mutex_lock(&client->mutex);
vitastor_proxy_sync(client->proxy, vitastor_co_generic_bh_cb, &task); vitastor_c_sync(client->proxy, vitastor_co_generic_bh_cb, &task);
qemu_mutex_unlock(&client->mutex); qemu_mutex_unlock(&client->mutex);
while (!task.complete) while (!task.complete)
@ -478,6 +511,7 @@ static QEMUOptionParameter vitastor_create_opts[] = {
static const char *vitastor_strong_runtime_opts[] = { static const char *vitastor_strong_runtime_opts[] = {
"inode", "inode",
"pool", "pool",
"config_path",
"etcd_host", "etcd_host",
"etcd_prefix", "etcd_prefix",

View File

@ -1,163 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// C-C++ proxy for the QEMU driver
// (QEMU headers don't compile with g++)
#include <sys/epoll.h>
#include "cluster_client.h"
typedef void* AioContext;
#include "qemu_proxy.h"
extern "C"
{
// QEMU
typedef void IOHandler(void *opaque);
void aio_set_fd_handler(AioContext *ctx, int fd, int is_external, IOHandler *fd_read, IOHandler *fd_write, void *poll_fn, void *opaque);
}
struct QemuProxyData
{
int fd;
std::function<void(int, int)> callback;
};
class QemuProxy
{
std::map<int, QemuProxyData> handlers;
public:
timerfd_manager_t *tfd;
cluster_client_t *cli;
AioContext *ctx;
QemuProxy(AioContext *ctx, const char *etcd_host, const char *etcd_prefix)
{
this->ctx = ctx;
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(etcd_host) },
{ "etcd_prefix", std::string(etcd_prefix ? etcd_prefix : "/vitastor") },
};
tfd = new timerfd_manager_t([this](int fd, bool wr, std::function<void(int, int)> callback) { set_fd_handler(fd, wr, callback); });
cli = new cluster_client_t(NULL, tfd, cfg);
}
~QemuProxy()
{
delete cli;
delete tfd;
}
void set_fd_handler(int fd, bool wr, std::function<void(int, int)> callback)
{
if (callback != NULL)
{
handlers[fd] = { .fd = fd, .callback = callback };
aio_set_fd_handler(ctx, fd, false, &QemuProxy::read_handler, wr ? &QemuProxy::write_handler : NULL, NULL, &handlers[fd]);
}
else
{
handlers.erase(fd);
aio_set_fd_handler(ctx, fd, false, NULL, NULL, NULL, NULL);
}
}
static void read_handler(void *opaque)
{
QemuProxyData *data = (QemuProxyData *)opaque;
data->callback(data->fd, EPOLLIN);
}
static void write_handler(void *opaque)
{
QemuProxyData *data = (QemuProxyData *)opaque;
data->callback(data->fd, EPOLLOUT);
}
};
extern "C" {
void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix)
{
QemuProxy *p = new QemuProxy(ctx, etcd_host, etcd_prefix);
return p;
}
void vitastor_proxy_destroy(void *client)
{
QemuProxy *p = (QemuProxy*)client;
delete p;
}
void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque)
{
QemuProxy *p = (QemuProxy*)client;
cluster_op_t *op = new cluster_op_t;
op->opcode = write ? OSD_OP_WRITE : OSD_OP_READ;
op->inode = inode;
op->offset = offset;
op->len = len;
for (int i = 0; i < iovcnt; i++)
{
op->iov.push_back(iov[i].iov_base, iov[i].iov_len);
}
op->callback = [cb, opaque](cluster_op_t *op)
{
cb(op->retval, opaque);
delete op;
};
p->cli->execute(op);
}
void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque)
{
QemuProxy *p = (QemuProxy*)client;
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_SYNC;
op->callback = [cb, opaque](cluster_op_t *op)
{
cb(op->retval, opaque);
delete op;
};
p->cli->execute(op);
}
void vitastor_proxy_watch_metadata(void *client, char *image, VitastorIOHandler cb, void *opaque)
{
QemuProxy *p = (QemuProxy*)client;
p->cli->on_ready([=]()
{
auto watch = p->cli->st_cli.watch_inode(std::string(image));
cb((long)watch, opaque);
});
}
void vitastor_proxy_close_watch(void *client, void *watch)
{
QemuProxy *p = (QemuProxy*)client;
p->cli->st_cli.close_watch((inode_watch_t*)watch);
}
uint64_t vitastor_proxy_get_size(void *watch_ptr)
{
inode_watch_t *watch = (inode_watch_t*)watch_ptr;
return watch->cfg.size;
}
uint64_t vitastor_proxy_get_inode_num(void *watch_ptr)
{
inode_watch_t *watch = (inode_watch_t*)watch_ptr;
return watch->cfg.num;
}
int vitastor_proxy_get_readonly(void *watch_ptr)
{
inode_watch_t *watch = (inode_watch_t*)watch_ptr;
return watch->cfg.readonly;
}
}

View File

@ -1,34 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#ifndef VITASTOR_QEMU_PROXY_H
#define VITASTOR_QEMU_PROXY_H
#ifndef POOL_ID_BITS
#define POOL_ID_BITS 16
#endif
#include <stdint.h>
#include <sys/uio.h>
#ifdef __cplusplus
extern "C" {
#endif
// Our exports
typedef void VitastorIOHandler(long retval, void *opaque);
void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix);
void vitastor_proxy_destroy(void *client);
void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque);
void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque);
void vitastor_proxy_watch_metadata(void *client, char *image, VitastorIOHandler cb, void *opaque);
void vitastor_proxy_close_watch(void *client, void *watch);
uint64_t vitastor_proxy_get_size(void *watch);
uint64_t vitastor_proxy_get_inode_num(void *watch);
int vitastor_proxy_get_readonly(void *watch);
#ifdef __cplusplus
}
#endif
#endif

View File

@ -87,7 +87,7 @@ public:
"Vitastor inode removal tool\n" "Vitastor inode removal tool\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.1)\n\n" "(c) Vitaliy Filippov, 2020 (VNPL-1.1)\n\n"
"USAGE:\n" "USAGE:\n"
" %s --etcd_address <etcd_address> --pool <pool> --inode <inode> [--wait-list]\n", " %s [--etcd_address <etcd_address>] --pool <pool> --inode <inode> [--wait-list]\n",
exe_name exe_name
); );
exit(0); exit(0);
@ -95,11 +95,6 @@ public:
void run(json11::Json cfg) void run(json11::Json cfg)
{ {
if (cfg["etcd_address"].string_value() == "")
{
fprintf(stderr, "etcd_address is missing\n");
exit(1);
}
inode = cfg["inode"].uint64_value(); inode = cfg["inode"].uint64_value();
pool_id = cfg["pool"].uint64_value(); pool_id = cfg["pool"].uint64_value();
if (pool_id) if (pool_id)

135
src/test_cas.cpp Normal file
View File

@ -0,0 +1,135 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <stdio.h>
#include <stdlib.h>
#include "epoll_manager.h"
#include "cluster_client.h"
void send_read(cluster_client_t *cli, uint64_t inode, std::function<void(int, uint64_t)> cb)
{
cluster_op_t *op = new cluster_op_t();
op->opcode = OSD_OP_READ;
op->inode = inode;
op->offset = 0;
op->len = 4096;
op->iov.push_back(malloc_or_die(op->len), op->len);
op->callback = [cb](cluster_op_t *op)
{
uint64_t version = op->version;
int retval = op->retval;
if (retval == op->len)
retval = 0;
free(op->iov.buf[0].iov_base);
delete op;
if (cb != NULL)
cb(retval, version);
};
cli->execute(op);
}
void send_write(cluster_client_t *cli, uint64_t inode, int byte, uint64_t version, std::function<void(int)> cb)
{
cluster_op_t *op = new cluster_op_t();
op->opcode = OSD_OP_WRITE;
op->inode = inode;
op->offset = 0;
op->len = 4096;
op->version = version;
op->iov.push_back(malloc_or_die(op->len), op->len);
memset(op->iov.buf[0].iov_base, byte, op->len);
op->callback = [cb](cluster_op_t *op)
{
int retval = op->retval;
if (retval == op->len)
retval = 0;
free(op->iov.buf[0].iov_base);
delete op;
if (cb != NULL)
cb(retval);
};
cli->execute(op);
}
int main(int narg, char *args[])
{
json11::Json::object cfgo;
for (int i = 1; i < narg; i++)
{
if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfgo[opt] = i == narg-1 ? "1" : args[++i];
}
}
json11::Json cfg(cfgo);
uint64_t inode = (cfg["pool_id"].uint64_value() << (64-POOL_ID_BITS))
| cfg["inode_id"].uint64_value();
uint64_t base_ver = 0;
// Create client
auto ringloop = new ring_loop_t(512);
auto epmgr = new epoll_manager_t(ringloop);
auto cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
cli->on_ready([&]()
{
send_read(cli, inode, [&](int r, uint64_t v)
{
if (r < 0)
{
fprintf(stderr, "Initial read operation failed\n");
exit(1);
}
base_ver = v;
// CAS v=1 = compare with zero, non-existing object
send_write(cli, inode, 0x01, base_ver+1, [&](int r)
{
if (r < 0)
{
fprintf(stderr, "CAS for non-existing object failed\n");
exit(1);
}
// Check that read returns the new version
send_read(cli, inode, [&](int r, uint64_t v)
{
if (r < 0)
{
fprintf(stderr, "Read operation failed after write\n");
exit(1);
}
if (v != base_ver+1)
{
fprintf(stderr, "Read operation failed to return the new version number\n");
exit(1);
}
// CAS v=2 = compare with v=1, existing object
send_write(cli, inode, 0x02, base_ver+2, [&](int r)
{
if (r < 0)
{
fprintf(stderr, "CAS for existing object failed\n");
exit(1);
}
// CAS v=2 again = compare with v=1, but version is 2. Must fail with -EINTR
send_write(cli, inode, 0x03, base_ver+2, [&](int r)
{
if (r != -EINTR)
{
fprintf(stderr, "CAS conflict detection failed\n");
exit(1);
}
printf("Basic CAS test succeeded\n");
exit(0);
});
});
});
});
});
});
while (1)
{
ringloop->loop();
ringloop->wait();
}
return 0;
}

254
src/vitastor_c.cpp Normal file
View File

@ -0,0 +1,254 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// Simplified C client library for QEMU, fio and other external drivers
// Also acts as a C-C++ proxy for the QEMU driver (QEMU headers don't compile with g++)
#include <sys/epoll.h>
#include "ringloop.h"
#include "epoll_manager.h"
#include "cluster_client.h"
#include "vitastor_c.h"
struct vitastor_qemu_fd_t
{
int fd;
std::function<void(int, int)> callback;
};
struct vitastor_c
{
std::map<int, vitastor_qemu_fd_t> handlers;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
timerfd_manager_t *tfd = NULL;
cluster_client_t *cli = NULL;
QEMUSetFDHandler *aio_set_fd_handler = NULL;
void *aio_ctx = NULL;
};
extern "C" {
static json11::Json vitastor_c_common_config(const char *config_path, const char *etcd_host, const char *etcd_prefix,
int use_rdma, const char *rdma_device, int rdma_port_num, int rdma_gid_index, int rdma_mtu, int log_level)
{
json11::Json::object cfg;
if (config_path)
cfg["config_path"] = std::string(config_path);
if (etcd_host)
cfg["etcd_address"] = std::string(etcd_host);
if (etcd_prefix)
cfg["etcd_prefix"] = std::string(etcd_prefix);
// -1 means unspecified
if (use_rdma >= 0)
cfg["use_rdma"] = use_rdma > 0;
if (rdma_device)
cfg["rdma_device"] = std::string(rdma_device);
if (rdma_port_num)
cfg["rdma_port_num"] = rdma_port_num;
if (rdma_gid_index)
cfg["rdma_gid_index"] = rdma_gid_index;
if (rdma_mtu)
cfg["rdma_mtu"] = rdma_mtu;
if (log_level)
cfg["log_level"] = log_level;
return json11::Json(cfg);
}
static void vitastor_c_read_handler(void *opaque)
{
vitastor_qemu_fd_t *data = (vitastor_qemu_fd_t *)opaque;
data->callback(data->fd, EPOLLIN);
}
static void vitastor_c_write_handler(void *opaque)
{
vitastor_qemu_fd_t *data = (vitastor_qemu_fd_t *)opaque;
data->callback(data->fd, EPOLLOUT);
}
vitastor_c *vitastor_c_create_qemu(QEMUSetFDHandler *aio_set_fd_handler, void *aio_context,
const char *config_path, const char *etcd_host, const char *etcd_prefix,
bool use_rdma, const char *rdma_device, int rdma_port_num, int rdma_gid_index, int rdma_mtu, int log_level)
{
json11::Json cfg_json = vitastor_c_common_config(
config_path, etcd_host, etcd_prefix, use_rdma,
rdma_device, rdma_port_num, rdma_gid_index, rdma_mtu, log_level
);
vitastor_c *self = new vitastor_c;
self->aio_set_fd_handler = aio_set_fd_handler;
self->aio_ctx = aio_context;
self->tfd = new timerfd_manager_t([self](int fd, bool wr, std::function<void(int, int)> callback)
{
if (callback != NULL)
{
self->handlers[fd] = { .fd = fd, .callback = callback };
self->aio_set_fd_handler(self->aio_ctx, fd, false,
vitastor_c_read_handler, wr ? vitastor_c_write_handler : NULL, NULL, &self->handlers[fd]);
}
else
{
self->handlers.erase(fd);
self->aio_set_fd_handler(self->aio_ctx, fd, false, NULL, NULL, NULL, NULL);
}
});
self->cli = new cluster_client_t(NULL, self->tfd, cfg_json);
return self;
}
vitastor_c *vitastor_c_create_uring(const char *config_path, const char *etcd_host, const char *etcd_prefix,
int use_rdma, const char *rdma_device, int rdma_port_num, int rdma_gid_index, int rdma_mtu, int log_level)
{
json11::Json cfg_json = vitastor_c_common_config(
config_path, etcd_host, etcd_prefix, use_rdma,
rdma_device, rdma_port_num, rdma_gid_index, rdma_mtu, log_level
);
vitastor_c *self = new vitastor_c;
self->ringloop = new ring_loop_t(512);
self->epmgr = new epoll_manager_t(self->ringloop);
self->cli = new cluster_client_t(self->ringloop, self->epmgr->tfd, cfg_json);
return self;
}
vitastor_c *vitastor_c_create_uring_json(const char **options, int options_len)
{
json11::Json::object cfg;
for (int i = 0; i < options_len-1; i += 2)
{
cfg[options[i]] = std::string(options[i+1]);
}
json11::Json cfg_json(cfg);
vitastor_c *self = new vitastor_c;
self->ringloop = new ring_loop_t(512);
self->epmgr = new epoll_manager_t(self->ringloop);
self->cli = new cluster_client_t(self->ringloop, self->epmgr->tfd, cfg_json);
return self;
}
void vitastor_c_destroy(vitastor_c *client)
{
delete client->cli;
if (client->epmgr)
delete client->epmgr;
else
delete client->tfd;
if (client->ringloop)
delete client->ringloop;
delete client;
}
int vitastor_c_is_ready(vitastor_c *client)
{
return client->cli->is_ready();
}
void vitastor_c_uring_wait_ready(vitastor_c *client)
{
while (!client->cli->is_ready())
{
client->ringloop->loop();
if (client->cli->is_ready())
break;
client->ringloop->wait();
}
}
void vitastor_c_uring_handle_events(vitastor_c *client)
{
client->ringloop->loop();
}
void vitastor_c_uring_wait_events(vitastor_c *client)
{
client->ringloop->wait();
}
void vitastor_c_read(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len,
struct iovec *iov, int iovcnt, VitastorReadHandler cb, void *opaque)
{
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_READ;
op->inode = inode;
op->offset = offset;
op->len = len;
for (int i = 0; i < iovcnt; i++)
{
op->iov.push_back(iov[i].iov_base, iov[i].iov_len);
}
op->callback = [cb, opaque](cluster_op_t *op)
{
cb(opaque, op->retval, op->version);
delete op;
};
client->cli->execute(op);
}
void vitastor_c_write(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len, uint64_t check_version,
struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque)
{
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_WRITE;
op->inode = inode;
op->offset = offset;
op->len = len;
op->version = check_version;
for (int i = 0; i < iovcnt; i++)
{
op->iov.push_back(iov[i].iov_base, iov[i].iov_len);
}
op->callback = [cb, opaque](cluster_op_t *op)
{
cb(opaque, op->retval);
delete op;
};
client->cli->execute(op);
}
void vitastor_c_sync(vitastor_c *client, VitastorIOHandler cb, void *opaque)
{
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_SYNC;
op->callback = [cb, opaque](cluster_op_t *op)
{
cb(opaque, op->retval);
delete op;
};
client->cli->execute(op);
}
void vitastor_c_watch_inode(vitastor_c *client, char *image, VitastorIOHandler cb, void *opaque)
{
client->cli->on_ready([=]()
{
auto watch = client->cli->st_cli.watch_inode(std::string(image));
cb(opaque, (long)watch);
});
}
void vitastor_c_close_watch(vitastor_c *client, void *handle)
{
client->cli->st_cli.close_watch((inode_watch_t*)handle);
}
uint64_t vitastor_c_inode_get_size(void *handle)
{
inode_watch_t *watch = (inode_watch_t*)handle;
return watch->cfg.size;
}
uint64_t vitastor_c_inode_get_num(void *handle)
{
inode_watch_t *watch = (inode_watch_t*)handle;
return watch->cfg.num;
}
int vitastor_c_inode_get_readonly(void *handle)
{
inode_watch_t *watch = (inode_watch_t*)handle;
return watch->cfg.readonly;
}
}

55
src/vitastor_c.h Normal file
View File

@ -0,0 +1,55 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
// Simplified C client library for QEMU, fio and other external drivers
#ifndef VITASTOR_QEMU_PROXY_H
#define VITASTOR_QEMU_PROXY_H
#ifndef POOL_ID_BITS
#define POOL_ID_BITS 16
#endif
#include <stdint.h>
#include <sys/uio.h>
#ifdef __cplusplus
extern "C" {
#endif
struct vitastor_c;
typedef struct vitastor_c vitastor_c;
typedef void VitastorReadHandler(void *opaque, long retval, uint64_t version);
typedef void VitastorIOHandler(void *opaque, long retval);
// QEMU
typedef void IOHandler(void *opaque);
typedef void QEMUSetFDHandler(void *ctx, int fd, int is_external, IOHandler *fd_read, IOHandler *fd_write, void *poll_fn, void *opaque);
vitastor_c *vitastor_c_create_qemu(QEMUSetFDHandler *aio_set_fd_handler, void *aio_context,
const char *config_path, const char *etcd_host, const char *etcd_prefix,
bool use_rdma, const char *rdma_device, int rdma_port_num, int rdma_gid_index, int rdma_mtu, int log_level);
vitastor_c *vitastor_c_create_uring(const char *config_path, const char *etcd_host, const char *etcd_prefix,
int use_rdma, const char *rdma_device, int rdma_port_num, int rdma_gid_index, int rdma_mtu, int log_level);
vitastor_c *vitastor_c_create_uring_json(const char **options, int options_len);
void vitastor_c_destroy(vitastor_c *client);
int vitastor_c_is_ready(vitastor_c *client);
void vitastor_c_uring_wait_ready(vitastor_c *client);
void vitastor_c_uring_handle_events(vitastor_c *client);
void vitastor_c_uring_wait_events(vitastor_c *client);
void vitastor_c_read(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len,
struct iovec *iov, int iovcnt, VitastorReadHandler cb, void *opaque);
void vitastor_c_write(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len, uint64_t check_version,
struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque);
void vitastor_c_sync(vitastor_c *client, VitastorIOHandler cb, void *opaque);
void vitastor_c_watch_inode(vitastor_c *client, char *image, VitastorIOHandler cb, void *opaque);
void vitastor_c_close_watch(vitastor_c *client, void *handle);
uint64_t vitastor_c_inode_get_size(void *handle);
uint64_t vitastor_c_inode_get_num(void *handle);
int vitastor_c_inode_get_readonly(void *handle);
#ifdef __cplusplus
}
#endif
#endif

43
tests/run_3osds.sh Normal file
View File

@ -0,0 +1,43 @@
#!/bin/bash -ex
. `dirname $0`/common.sh
OSD_SIZE=${OSD_SIZE:-1024}
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
OSD1_PID=$!
build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
OSD2_PID=$!
build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
OSD3_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
if [ -n "$GLOBAL_CONF" ]; then
$ETCDCTL put /vitastor/config/global "$GLOBAL_CONF"
fi
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
format_error "FAILED: 1 PG NOT CONFIGURED"
fi
if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
format_error "FAILED: 1 PG NOT UP"
fi
if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
fi

7
tests/test_cas.sh Executable file
View File

@ -0,0 +1,7 @@
#!/bin/bash -ex
. `dirname $0`/run_3osds.sh
build/src/test_cas --pool_id 1 --inode_id 1 --etcd_address $ETCD_URL
format_green OK

View File

@ -1,40 +1,6 @@
#!/bin/bash -ex #!/bin/bash -ex
. `dirname $0`/common.sh . `dirname $0`/run_3osds.sh
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
OSD1_PID=$!
build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
OSD2_PID=$!
build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
OSD3_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
format_error "FAILED: 1 PG NOT CONFIGURED"
fi
if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
format_error "FAILED: 1 PG NOT UP"
fi
if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
fi
# Test basic write and snapshot # Test basic write and snapshot

View File

@ -1,40 +1,8 @@
#!/bin/bash -ex #!/bin/bash -ex
. `dirname $0`/common.sh OSD_SIZE=2048
dd if=/dev/zero of=./testdata/test_osd1.bin bs=2048 count=1 seek=$((1024*1024-1)) . `dirname $0`/run_3osds.sh
dd if=/dev/zero of=./testdata/test_osd2.bin bs=2048 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=2048 count=1 seek=$((1024*1024-1))
build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
OSD1_PID=$!
build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
OSD2_PID=$!
build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
OSD3_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
format_error "FAILED: 1 PG NOT CONFIGURED"
fi
if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
format_error "FAILED: 1 PG NOT UP"
fi
if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
fi
$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"debian9","size":'$((2048*1024*1024))'}' $ETCDCTL put /vitastor/config/inode/1/1 '{"name":"debian9","size":'$((2048*1024*1024))'}'
@ -46,7 +14,7 @@ $ETCDCTL put /vitastor/config/inode/1/1 '{"name":"debian9@0","size":'$((2048*102
$ETCDCTL put /vitastor/config/inode/1/2 '{"parent_id":1,"name":"debian9","size":'$((2048*1024*1024))'}' $ETCDCTL put /vitastor/config/inode/1/2 '{"parent_id":1,"name":"debian9","size":'$((2048*1024*1024))'}'
qemu-system-x86_64 -enable-kvm -m 1024 \ qemu-system-x86_64 -enable-kvm -m 1024 \
-drive 'file=vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=debian9',format=raw,if=none,id=drive-virtio-disk0,cache=none \ -drive 'file=vitastor:etcd_host=127.0.0.1\:'$ETCD_PORT'/v3:image=debian9',format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 \
-vnc 0.0.0.0:0 -vnc 0.0.0.0:0

View File

@ -1,44 +1,10 @@
#!/bin/bash -ex #!/bin/bash -ex
. `dirname $0`/common.sh . `dirname $0`/run_3osds.sh
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
OSD1_PID=$!
build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
OSD2_PID=$!
build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
OSD3_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
format_error "FAILED: 1 PG NOT CONFIGURED"
fi
if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
format_error "FAILED: 1 PG NOT UP"
fi
#LD_PRELOAD=libasan.so.5 \ #LD_PRELOAD=libasan.so.5 \
# fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M # fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M
if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
fi
# A lot of parallel syncs was crashing the primary OSD at some point # A lot of parallel syncs was crashing the primary OSD at some point
LD_PRELOAD=libasan.so.5 \ LD_PRELOAD=libasan.so.5 \

View File

@ -1,40 +1,10 @@
#!/bin/bash -ex #!/bin/bash -ex
# Test the `no_same_sector_overwrites` mode # Test the `no_same_sector_overwrites` mode
. `dirname $0`/common.sh OSD_ARGS="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all"
GLOBAL_CONF='{"immediate_commit":"all"}'
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1)) . `dirname $0`/run_3osds.sh
dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all"
build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
OSD1_PID=$!
build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
OSD2_PID=$!
build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 $NO_SAME --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
OSD3_PID=$!
cd mon
npm install
cd ..
node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
MON_PID=$!
$ETCDCTL put /vitastor/config/global '{"immediate_commit":"all"}'
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
sleep 2
if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
format_error "FAILED: 1 PG NOT CONFIGURED"
fi
if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
format_error "FAILED: 1 PG NOT UP"
fi
#LSAN_OPTIONS=report_objects=true:suppressions=`pwd`/testdata/lsan-suppress.txt LD_PRELOAD=libasan.so.5 \ #LSAN_OPTIONS=report_objects=true:suppressions=`pwd`/testdata/lsan-suppress.txt LD_PRELOAD=libasan.so.5 \
# fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M # fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M