Compare commits

...

87 Commits

Author SHA1 Message Date
9ad6822353 Release 1.5.0
After half a year of hard work, VitastorFS is finally here ! :-)

New features:
- VitastorFS, a full-featured clustered (read-write-many) file system.
  Documentation: [VitastorFS](docs/usage/nfs.en.md)
- Embedded key-value database implementation based on Parallel Optimistic B-Tree
  algorithm and used for the metadata of VitastorFS
- Pool management commands in vitastor-cli (create-pool, list-pools, rm-pool, modify-pool).
  Thanks MIND Software (https://mindsw.io) for their contribution!
  [Documentation](docs/usage/cli.en.md#create-pool)

Bug fixes:
- Fix a very rare "infinite loop" in the client library
- Fix a rare OSD hang on during start when zeroing out bad metadata entries left from the previous run
2024-03-16 15:35:10 +03:00
2043b4e374 Fix build errors for gcc 8 2024-03-16 15:35:10 +03:00
de840e6fe3 Reduce kv-cli loadjson load parallelism to 16 2024-03-16 15:35:10 +03:00
b5e04bf809 Fix build warning 2024-03-16 15:35:10 +03:00
8807a1623b Fix markdown tables 2024-03-16 15:35:10 +03:00
f12855c31b Add vitastor-kv to packages 2024-03-16 15:35:10 +03:00
e75dcc9a71 Add documentation for VitastorFS 2024-03-16 15:16:43 +03:00
88516ab4bd Remove extra log 2024-03-16 13:24:36 +03:00
6221126b4f Allow to print simple-offsets just given the device size 2024-03-16 13:24:36 +03:00
6783d4a13c Implement fool protection for FS pools 2024-03-16 13:24:36 +03:00
dcbe1afac3 Store pool ID in inode metadata 2024-03-16 13:24:36 +03:00
0bde28c24a Make nfs_do_rmw a library function 2024-03-16 13:24:36 +03:00
bb8ca6184e Support setattr guard 2024-03-16 13:24:36 +03:00
87310ef7bb Support ctime 2024-03-16 13:24:36 +03:00
4f4b2dab80 Log NFS liveness checks 2024-03-16 13:24:36 +03:00
f70da82317 Add loadjson command to vitastor-kv 2024-03-16 13:24:36 +03:00
e42148f347 Allow to specify KV commands on command line 2024-03-16 13:24:36 +03:00
c289584469 Add JSON dump format 2024-03-16 13:24:36 +03:00
018e89f867 Erase verf key left from creation from ientries on every modification 2024-03-16 13:24:36 +03:00
603dc68f11 Implement async mtime change 2024-03-16 13:24:36 +03:00
7b12342933 Allow to specify additional NFS mount options 2024-03-16 13:24:36 +03:00
44bf0f16ee Fix malloc/free in nfs_kv_read/write 2024-03-16 13:24:36 +03:00
8840c84572 Fix "bad key in etcd" in mon for FS pools 2024-03-16 13:24:36 +03:00
5b747c12ec Check if already mounted before mounting 2024-03-16 13:24:36 +03:00
05f5f46162 Fix zero used space, update mtime when moving/changing inode 2024-03-16 13:24:36 +03:00
b5604191c8 Ignore ECANCELED in nfs-proxy (happens in io_uring on fork) 2024-03-16 13:24:36 +03:00
e871de27de Support unaligned shared_offsets, align shared file data instead of header 2024-03-16 13:24:36 +03:00
f600ce98e2 Implement auto-unmount local NFS server mode for vitastor-nfs 2024-03-16 13:24:36 +03:00
57605a5c13 Return error on failed shrink 2024-03-16 13:24:36 +03:00
29bd4561bb Implement rename over an existing file/directory 2024-03-16 13:24:36 +03:00
7142460ec8 Support --logfile in nfs-proxy 2024-03-16 13:24:36 +03:00
d03f19ebe5 Fix shared file overlap, add FIXMEs 2024-03-16 13:24:36 +03:00
88f9d18be3 Create inode, then direntry, not direntry, then inode; retry ID collisions 2024-03-16 13:24:36 +03:00
6213fbd8c6 Fix NFS shared/aligned write FIXMEs 2024-03-16 13:24:36 +03:00
3aee37eadd Allow to disable per-inode stats for VitastorFS pools 2024-03-16 13:24:36 +03:00
ecfc753e93 Add basic NFS tests, fix bugs 2024-03-16 13:24:36 +03:00
a574f9ad71 Return block NFS implementation back as an option too 2024-03-16 13:24:36 +03:00
7c235c9103 Move KV FS header into a separate file 2024-03-16 13:24:36 +03:00
e5bb986164 Implement packing small files into shared inodes 2024-03-16 13:24:36 +03:00
181795d748 Split new NFS proxy implementation into multiple files 2024-03-16 13:24:36 +03:00
8cdc38805b WIP VitastorFS with metadata storage in VitastorKV 2024-03-16 13:24:36 +03:00
0cd455d17f First just recheck version without actually re-reading block in vitastor-kv 2024-03-16 13:24:36 +03:00
32ba653ba6 Fix vitastor-kv hang on reopen & unfinished closed listing 2024-03-16 13:24:36 +03:00
231d4b15fc Add loadable dump format to vitastor-kv (dump) 2024-03-16 13:24:36 +03:00
9dc4d5fd7b Fix freeing r/w buffers on errors in kv_db 2024-03-16 13:24:36 +03:00
e58538fa47 Fix eviction when random_pos selects the end 2024-03-16 13:24:36 +03:00
11ac9e7024 Implement min/max list_count to make listings during performance test reasonable 2024-03-16 13:24:36 +03:00
511bc3df1c Fix and improve parallel allocation
- Do not try to allocate more DB blocks in an inode block until it's "confirmed" and "locked" by the first write
- Do not recheck for new zero DB blocks on first write into an inode block - a CAS failure means someone else is already writing into it
- Throw new allocation blocks away regardless of whether the known_version is 0 on a CAS failure
2024-03-16 13:24:36 +03:00
a64f0d1f73 Implement key_prefix for K/V stress test 2024-03-16 13:24:36 +03:00
ec5f7c6b87 More fixes
- do not overwrite a block with older version if known version is newer
  (read may start before update and end after update)
- invalidated block versions can't be remembered and trusted
- right boundary for split blocks is right_half when diving down, not key_lt
- restart update also when block is "invalidated", not just on version mismatch
- copy callback in listings to avoid closure destruction bugs too
2024-03-16 13:24:36 +03:00
3ebed9a749 Add logging and one more assert 2024-03-16 13:24:36 +03:00
eab67a6e8f Make get_block() wait for updating when unrelated block is found along the path 2024-03-16 13:24:36 +03:00
20993d9b7a Fix a race condition where changed blocks were parsed over existing cached blocks and getting a mix of data 2024-03-16 13:24:36 +03:00
5cf9b343c0 Simplify code by removing an unneeded "optimisation" 2024-03-16 13:24:36 +03:00
79ae0aadcd Add kv_log_level, print warnings on level 1, trace ops on level 10 2024-03-16 13:24:36 +03:00
605afc3583 Fix duplicate keys in listings on parallel updates -- do not rewind key "iterator position" 2024-03-16 13:24:36 +03:00
c0681d8242 Implement key suffix to avoid collisions of multiple test workers 2024-03-16 13:24:36 +03:00
763e77b4f4 Do not complain on empty first block 2024-03-16 13:24:36 +03:00
19426aa4c5 Add JSON output for stress-tester 2024-03-16 13:24:36 +03:00
08f586bcec Print total stats 2024-03-16 13:24:36 +03:00
f1cd87473a Do not send more than op_count operations (fix segfault on finish) 2024-03-16 13:24:36 +03:00
1bd8d2da56 Add some more resiliency to serialize() 2024-03-16 13:24:36 +03:00
a7396d2baf Invalidate blocks being updated too 2024-03-16 13:24:36 +03:00
e98a38810d Change new block allocation method: make each writer choose multiple empty PG blocks and place blocks in them 2024-03-16 13:24:36 +03:00
28c4324c36 Remove blocks from cache on unsuccessful updates 2024-03-16 13:24:36 +03:00
31ec3fa8f5 Allow to track multiple updates per block (it should never happen though) 2024-03-16 13:24:36 +03:00
e4fa26f60a Do not call stop_updating after failed write_new_block and after clear_block (both delete the item) 2024-03-16 13:24:36 +03:00
59ae27f9e5 Track versions of parent blocks and recheck if changed during update 2024-03-16 13:24:36 +03:00
2c6a301d9b Fix resume_split condition (key_lt can also be "") 2024-03-16 13:24:36 +03:00
01558349f8 Experiment: transform offsets for better sharding 2024-03-16 13:24:36 +03:00
36f4717d0d More post-stress-test fixes
- Prevent _split types of new blocks
- Stop updating new blocks only after the whole update, otherwise pointers
  may become invalid
- Use recheck_none for updates initially
- Use UINT64_MAX as initial block version when postponing ops, otherwise the
  check fails when the block is initially empty. This for example leads to
  writing both leaf items & block pointers (which is incorrect) into the root
  block when starting stress-test with --parallelism 32
- Fix -EINTR comparison
2024-03-16 13:24:36 +03:00
babaf2a0ce Print operation statistics 2024-03-16 13:24:36 +03:00
5773f1a375 K/V fixes after stress-test :-)
- track block versions correctly - per inode block (128kb) instead of tree block (4kb)
- prevent multiple parallel CAS writes of the same inode block
- add logging for EILSEQ which means invalid data in the tree
- fix get_block updated flag which was true for blocks already in cache and was leading to infinite loops on "unrelated block" errors
- apply changes to blocks in cache only after successful writes (using "virtual changes")
- do not replace cached block with an older version from disk
- recheck "unrelated blocks" (read/update collisions) until data stops changing
- track tree path correctly - do not treat split block as parent of its right half
- correctly move blocks when finding new empty place on disk
- restart updates from the beginning when one of blocks is changed by a parallel update
- fix delete using SET opcode and setting key to the empty value instead
- prevent changing the same key more than 1 time in parallel
- fix listing verification
- resume continue_updates in update_find (required because it uses continue_update itself)
- add allow_old_cached parameter to get()
2024-03-16 13:24:36 +03:00
57222a9f79 Implement K/V DB stress tester 2024-03-16 13:24:36 +03:00
61ef000c6e Evict blocks based on memory limit & block usage 2024-03-16 13:24:36 +03:00
7d5e1cc393 Track blocks per level 2024-03-16 13:24:36 +03:00
5e7f27a02d Track block level 2024-03-16 13:24:36 +03:00
fd1d8a8520 Experimental B-Tree Vitastor embedded K/V database implementation! 2024-03-16 13:24:36 +03:00
c364e14c40 Stop then retry, not retry then stop 2024-03-16 13:24:36 +03:00
3ebbfa0428 Fix another rare OSD hang on zeroing out entries on start 2024-03-16 13:24:36 +03:00
aa79d1db1c Fix incorrect "changing scheme" message in modify-pool 2024-03-06 00:41:35 +03:00
a1fecb7eff Move callback away when calling it in cluster_client 2024-03-06 00:41:35 +03:00
ff74b19423 Fix rare OSD hang on zeroing out bad entries on start 2024-03-06 00:41:35 +03:00
4cf6dceed7 Merge branch 'rel-1.4' 2024-02-29 09:59:01 +03:00
4eab26f968 Add documentation and a very basic test for pool management commands 2024-02-28 13:08:04 +03:00
86243b7101 Rework & fix pool-create / pool-modify / pool-ls 2024-02-28 13:08:04 +03:00
idelson
dc92851322 vitastor-cli: add commands to control pools: pool-create, pool-ls, pool-modify, pool-rm
PR #59 - https://github.com/vitalif/vitastor/pull/58/commits

By MIND Software LLC

By submitting this pull request, I accept Vitastor CLA
2024-02-28 13:08:04 +03:00
95 changed files with 11512 additions and 937 deletions

View File

@@ -22,7 +22,7 @@ RUN apt-get update
RUN apt-get -y install etcd qemu-system-x86 qemu-block-extra qemu-utils fio libasan5 \
liburing1 liburing-dev libgoogle-perftools-dev devscripts libjerasure-dev cmake libibverbs-dev libisal-dev
RUN apt-get -y build-dep fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`
RUN apt-get -y install jq lp-solve sudo
RUN apt-get -y install jq lp-solve sudo nfs-common
RUN apt-get --download-only source fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`
RUN set -ex; \

View File

@@ -856,3 +856,21 @@ jobs:
echo ""
done
test_nfs:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 3
run: /root/vitastor/tests/test_nfs.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done

View File

@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)
project(vitastor)
set(VERSION "1.4.8")
set(VERSION "1.5.0")
add_subdirectory(src)

View File

@@ -6,8 +6,8 @@
Вернём былую скорость кластерному блочному хранилищу!
Vitastor - распределённая блочная SDS (программная СХД), прямой аналог Ceph RBD и
внутренних СХД популярных облачных провайдеров. Однако, в отличие от них, Vitastor
Vitastor - распределённая блочная и файловая SDS (программная СХД), прямой аналог Ceph RBD и CephFS,
а также внутренних СХД популярных облачных провайдеров. Однако, в отличие от них, Vitastor
быстрый и при этом простой. Только пока маленький :-).
Vitastor архитектурно похож на Ceph, что означает атомарность и строгую консистентность,
@@ -63,7 +63,7 @@ Vitastor поддерживает QEMU-драйвер, протоколы NBD и
- [fio](docs/usage/fio.ru.md) для тестов производительности
- [NBD](docs/usage/nbd.ru.md) для монтирования ядром
- [QEMU и qemu-img](docs/usage/qemu.ru.md)
- [NFS](docs/usage/nfs.ru.md)-прокси для VMWare и подобных
- [NFS](docs/usage/nfs.ru.md) кластерная файловая система и псевдо-ФС прокси
- Производительность
- [Понимание сути производительности](docs/performance/understanding.ru.md)
- [Теоретический максимум](docs/performance/theoretical.ru.md)

View File

@@ -6,9 +6,9 @@
Make Clustered Block Storage Fast Again.
Vitastor is a distributed block SDS, direct replacement of Ceph RBD and internal SDS's
of public clouds. However, in contrast to them, Vitastor is fast and simple at the same time.
The only thing is it's slightly young :-).
Vitastor is a distributed block and file SDS, direct replacement of Ceph RBD and CephFS,
and also internal SDS's of public clouds. However, in contrast to them, Vitastor is fast
and simple at the same time. The only thing is it's slightly young :-).
Vitastor is architecturally similar to Ceph which means strong consistency,
primary-replication, symmetric clustering and automatic data distribution over any
@@ -63,7 +63,7 @@ Read more details below in the documentation.
- [fio](docs/usage/fio.en.md) for benchmarks
- [NBD](docs/usage/nbd.en.md) for kernel mounts
- [QEMU and qemu-img](docs/usage/qemu.en.md)
- [NFS](docs/usage/nfs.en.md) emulator for VMWare and similar
- [NFS](docs/usage/nfs.en.md) clustered file system and pseudo-FS proxy
- Performance
- [Understanding storage performance](docs/performance/understanding.en.md)
- [Theoretical performance](docs/performance/theoretical.en.md)

View File

@@ -1,4 +1,4 @@
VERSION ?= v1.4.8
VERSION ?= v1.5.0
all: build push

View File

@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v1.4.8
image: vitalif/vitastor-csi:v1.5.0
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -121,7 +121,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v1.4.8
image: vitalif/vitastor-csi:v1.5.0
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "1.4.8"
vitastorCSIDriverVersion = "1.5.0"
)
// Config struct fills the parameters of request or user input

2
debian/changelog vendored
View File

@@ -1,4 +1,4 @@
vitastor (1.4.8-1) unstable; urgency=medium
vitastor (1.5.0-1) unstable; urgency=medium
* Bugfixes

View File

@@ -3,4 +3,6 @@ usr/bin/vitastor-cli
usr/bin/vitastor-rm
usr/bin/vitastor-nbd
usr/bin/vitastor-nfs
usr/bin/vitastor-kv
usr/bin/vitastor-kv-stress
usr/lib/*/libvitastor*.so*

View File

@@ -37,8 +37,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-1.4.8; \
cd vitastor-1.4.8; \
cp -r /root/vitastor vitastor-1.5.0; \
cd vitastor-1.5.0; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -51,8 +51,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.4.8.orig.tar.xz vitastor-1.4.8; \
cd vitastor-1.4.8; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.5.0.orig.tar.xz vitastor-1.5.0; \
cd vitastor-1.5.0; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -41,6 +41,7 @@ Parameters:
- [osd_tags](#osd_tags)
- [primary_affinity_tags](#primary_affinity_tags)
- [scrub_interval](#scrub_interval)
- [used_for_fs](#used_for_fs)
Examples:
@@ -154,8 +155,25 @@ That is, if it becomes impossible to place PG data on at least (pg_minsize)
OSDs, PG is deactivated for both read and write. So you know that a fresh
write always goes to at least (pg_minsize) OSDs (disks).
That is, pg_size minus pg_minsize sets the number of disk failures to tolerate
without temporary downtime (for [osd_out_time](monitor.en.md#osd_out_time)).
For example, the difference between pg_minsize 2 and 1 in a 3-way replicated
pool (pg_size=3) is:
- If 2 hosts go down with pg_minsize=2, the pool becomes inactive and remains
inactive for [osd_out_time](monitor.en.md#osd_out_time) (10 minutes). After
this timeout, the monitor selects replacement hosts/OSDs and the pool comes
up and starts to heal. Therefore, if you don't have replacement OSDs, i.e.
if you only have 3 hosts with OSDs and 2 of them are down, the pool remains
inactive until you add or return at least 1 host (or change failure_domain
to "osd").
- If 2 hosts go down with pg_minsize=1, the pool only experiences a short
I/O pause until the monitor notices that OSDs are down (5-10 seconds with
the default [etcd_report_interval](osd.en.md#etcd_report_interval)). After
this pause, I/O resumes, but new data is temporarily written in only 1 copy.
Then, after osd_out_time, the monitor also selects replacement OSDs and the
pool starts to heal.
So, pg_minsize regulates the number of failures that a pool can tolerate
without temporary downtime for [osd_out_time](monitor.en.md#osd_out_time),
but at a cost of slightly reduced storage reliability.
FIXME: pg_minsize behaviour may be changed in the future to only make PGs
read-only instead of deactivating them.
@@ -168,8 +186,8 @@ read-only instead of deactivating them.
Number of PGs for this pool. The value should be big enough for the monitor /
LP solver to be able to optimize data placement.
"Enough" is usually around 64-128 PGs per OSD, i.e. you set pg_count for pool
to (total OSD count * 100 / pg_size). You can round it to the closest power of 2,
"Enough" is usually around 10-100 PGs per OSD, i.e. you set pg_count for pool
to (total OSD count * 10 / pg_size). You can round it to the closest power of 2,
because it makes it easier to reduce or increase PG count later by dividing or
multiplying it by 2.
@@ -282,6 +300,25 @@ of the OSDs containing a data chunk for a PG.
Automatic scrubbing interval for this pool. Overrides
[global scrub_interval setting](osd.en.md#scrub_interval).
## used_for_fs
- Type: string
If non-empty, the pool is marked as used for VitastorFS with metadata stored
in block image (regular Vitastor volume) named as the value of this pool parameter.
When a pool is marked as used for VitastorFS, regular block volume creation in it
is disabled (vitastor-cli refuses to create images without --force) to protect
the user from block volume and FS file ID collisions and data loss.
[vitastor-nfs](../usage/nfs.ru.md), in its turn, refuses to use pools not marked
for the corresponding FS when starting. This also implies that you can use one
pool only for one VitastorFS.
The second thing that is disabled for VitastorFS pools is reporting per-inode space
usage statistics in etcd because a FS pool may store a very large number of files
and statistics for them all would take a lot of space in etcd.
# Examples
## Replicated pool

View File

@@ -40,6 +40,7 @@
- [osd_tags](#osd_tags)
- [primary_affinity_tags](#primary_affinity_tags)
- [scrub_interval](#scrub_interval)
- [used_for_fs](#used_for_fs)
Примеры:
@@ -157,9 +158,25 @@
OSD, PG деактивируется на чтение и запись. Иными словами, всегда известно,
что новые блоки данных всегда записываются как минимум на pg_minsize дисков.
По сути, разница pg_size и pg_minsize задаёт число отказов дисков, которые пул
может пережить без временной (на [osd_out_time](monitor.ru.md#osd_out_time))
остановки обслуживания.
Для примера, разница между pg_minsize 2 и 1 в реплицированном пуле с 3 копиями
данных (pg_size=3), проявляется следующим образом:
- Если 2 сервера отключаются при pg_minsize=2, пул становится неактивным и
остаётся неактивным в течение [osd_out_time](monitor.en.md#osd_out_time)
(10 минут), после чего монитор назначает другие OSD/серверы на замену, пул
поднимается и начинает восстанавливать недостающие копии данных. Соответственно,
если OSD на замену нет - то есть, если у вас всего 3 сервера с OSD и 2 из них
недоступны - пул так и остаётся недоступным до тех пор, пока вы не вернёте
или не добавите хотя бы 1 сервер (или не переключите failure_domain на "osd").
- Если 2 сервера отключаются при pg_minsize=1, ввод-вывод лишь приостанавливается
на короткое время, до тех пор, пока монитор не поймёт, что OSD отключены
(что занимает 5-10 секунд при стандартном [etcd_report_interval](osd.en.md#etcd_report_interval)).
После этого ввод-вывод восстанавливается, но новые данные временно пишутся
всего в 1 копии. Когда же проходит osd_out_time, монитор точно так же назначает
другие OSD на замену выбывшим и пул начинает восстанавливать копии данных.
То есть, pg_minsize регулирует число отказов, которые пул может пережить без
временной остановки обслуживания на [osd_out_time](monitor.ru.md#osd_out_time),
но ценой немного пониженных гарантий надёжности.
FIXME: Поведение pg_minsize может быть изменено в будущем с полной деактивации
PG на перевод их в режим только для чтения.
@@ -172,8 +189,8 @@ PG на перевод их в режим только для чтения.
Число PG для данного пула. Число должно быть достаточно большим, чтобы монитор
мог равномерно распределить по ним данные.
Обычно это означает примерно 64-128 PG на 1 OSD, т.е. pg_count можно устанавливать
равным (общему числу OSD * 100 / pg_size). Значение можно округлить до ближайшей
Обычно это означает примерно 10-100 PG на 1 OSD, т.е. pg_count можно устанавливать
равным (общему числу OSD * 10 / pg_size). Значение можно округлить до ближайшей
степени 2, чтобы потом было легче уменьшать или увеличивать число PG, умножая
или деля его на 2.
@@ -290,6 +307,27 @@ OSD с "all".
Интервал скраба, то есть, автоматической фоновой проверки данных для данного пула.
Переопределяет [глобальную настройку scrub_interval](osd.ru.md#scrub_interval).
## used_for_fs
- Type: string
Если непусто, пул помечается как используемый для файловой системы VitastorFS с
метаданными, хранимыми в блочном образе Vitastor с именем, равным значению
этого параметра.
Когда пул помечается как используемый для VitastorFS, создание обычных блочных
образов в нём отключается (vitastor-cli отказывается создавать образы без --force),
чтобы защитить пользователя от коллизий ID файлов и блочных образов и, таким
образом, от потери данных.
[vitastor-nfs](../usage/nfs.ru.md), в свою очередь, при запуске отказывается
использовать для ФС пулы, не выделенные для неё. Это также означает, что один
пул может использоваться только для одной VitastorFS.
Также для ФС-пулов отключается передача статистики в etcd по отдельным инодам,
так как ФС-пул может содержать очень много файлов и статистика по ним всем
заняла бы очень много места в etcd.
# Примеры
## Реплицированный пул

View File

@@ -33,6 +33,7 @@
- [Checksums](../config/layout-osd.en.md#data_csum_type)
- [Client write-back cache](../config/client.en.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
- [Clustered file system](../usage/nfs.en.md#vitastorfs)
## Plugins and tools
@@ -46,13 +47,12 @@
- [CSI plugin for Kubernetes](../installation/kubernetes.en.md)
- [OpenStack support: Cinder driver, Nova and libvirt patches](../installation/openstack.en.md)
- [Proxmox storage plugin and packages](../installation/proxmox.en.md)
- [Simplified NFS proxy for file-based image access emulation (suitable for VMWare)](../usage/nfs.en.md)
- [Simplified NFS proxy for file-based image access emulation (suitable for VMWare)](../usage/nfs.en.md#pseudo-fs)
## Roadmap
The following features are planned for the future:
- File system
- Control plane optimisation
- Other administrative tools
- Web GUI

View File

@@ -35,6 +35,7 @@
- [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
- [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
- [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)
- [Кластерная файловая система](../usage/nfs.ru.md#vitastorfs)
## Драйверы и инструменты
@@ -48,11 +49,10 @@
- [CSI-плагин для Kubernetes](../installation/kubernetes.ru.md)
- [Базовая поддержка OpenStack: драйвер Cinder, патчи для Nova и libvirt](../installation/openstack.ru.md)
- [Плагин для Proxmox](../installation/proxmox.ru.md)
- [Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)](../usage/nfs.ru.md)
- [Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)](../usage/nfs.ru.md#псевдо-фс)
## Планы развития
- Файловая система
- Оптимизация слоя управления
- Другие инструменты администрирования
- Web-интерфейс

View File

@@ -14,6 +14,7 @@
- [Check cluster status](#check-cluster-status)
- [Create an image](#create-an-image)
- [Install plugins](#install-plugins)
- [Create VitastorFS](#create-vitastorfs)
## Preparation
@@ -75,18 +76,16 @@ On the monitor hosts:
## Create a pool
Create pool configuration in etcd:
Create a pool using vitastor-cli:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
"scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --pg_size 2 --pg_count 256
```
For EC pools the configuration should look like the following:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
"scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --ec 2+2 --pg_count 256
```
After you do this, one of the monitors will configure PGs and OSDs will start them.
@@ -116,3 +115,9 @@ After that, you can [run benchmarks](../usage/fio.en.md) or [start QEMU manually
- [Proxmox](../installation/proxmox.en.md)
- [OpenStack](../installation/openstack.en.md)
- [Kubernetes CSI](../installation/kubernetes.en.md)
## Create VitastorFS
If you want to use clustered file system in addition to VM or container images:
- [Follow the instructions here](../usage/nfs.en.md#vitastorfs)

View File

@@ -14,6 +14,7 @@
- [Проверьте состояние кластера](#проверьте-состояние-кластера)
- [Создайте образ](#создайте-образ)
- [Установите плагины](#установите-плагины)
- [Создайте VitastorFS](#создайте-vitastorfs)
## Подготовка
@@ -77,18 +78,16 @@
## Создайте пул
Создайте конфигурацию пула с помощью etcdctl:
Создайте пул с помощью vitastor-cli:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
"scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --pg_size 2 --pg_count 256
```
Для пулов с кодами коррекции ошибок конфигурация должна выглядеть примерно так:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
"scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --ec 2+2 --pg_count 256
```
После этого один из мониторов должен сконфигурировать PG, а OSD должны запустить их.
@@ -118,3 +117,10 @@ vitastor-cli create -s 10G testimg
- [Proxmox](../installation/proxmox.ru.md)
- [OpenStack](../installation/openstack.ru.md)
- [Kubernetes CSI](../installation/kubernetes.ru.md)
## Создайте VitastorFS
Если вы хотите использовать не только блочные образы виртуальных машин или контейнеров,
а также кластерную файловую систему, то:
- [Следуйте инструкциям](../usage/nfs.en.md#vitastorfs)

View File

@@ -24,6 +24,10 @@ It supports the following commands:
- [fix](#fix)
- [alloc-osd](#alloc-osd)
- [rm-osd](#rm-osd)
- [create-pool](#create-pool)
- [modify-pool](#modify-pool)
- [ls-pools](#ls-pools)
- [rm-pool](#rm-pool)
Global options:
@@ -137,8 +141,8 @@ Rename, resize image or change its readonly status. Images with children can't b
If the new size is smaller than the old size, extra data will be purged.
You should resize file system in the image, if present, before shrinking it.
| `-f|--force` | Proceed with shrinking or setting readwrite flag even if the image has children. |
| `--down-ok` | Proceed with shrinking even if some data will be left on unavailable OSDs. |
* `-f|--force` - Proceed with shrinking or setting readwrite flag even if the image has children.
* `--down-ok` - Proceed with shrinking even if some data will be left on unavailable OSDs.
## rm
@@ -152,7 +156,7 @@ In other cases parent layers are always merged into children.
Other options:
| `--down-ok` | Continue deletion/merging even if some data will be left on unavailable OSDs. |
* `--down-ok` - Continue deletion/merging even if some data will be left on unavailable OSDs.
## flatten
@@ -241,3 +245,91 @@ Refuses to remove OSDs with data without `--force` and `--allow-data-loss`.
With `--dry-run` only checks if deletion is possible without data loss and
redundancy degradation.
## create-pool
`vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]`
Create a pool. Required parameters:
| <!-- --> | <!-- --> |
|--------------------------|---------------------------------------------------------------------------------------|
| `-s R` or `--pg_size R` | Number of replicas for replicated pools |
| `--ec N+K` | Number of data (N) and parity (K) chunks for erasure-coded pools |
| `-n N` or `--pg_count N` | PG count for the new pool (start with 10*<OSD count>/pg_size rounded to a power of 2) |
Optional parameters:
| <!-- --> | <!-- --> |
|--------------------------------|----------------------------------------------------------------------------|
| `--pg_minsize <number>` | R or N+K minus number of failures to tolerate without downtime ([details](../config/pool.en.md#pg_minsize)) |
| `--failure_domain host` | Failure domain: host, osd or a level from placement_levels. Default: host |
| `--root_node <node>` | Put pool only on child OSDs of this placement tree node |
| `--osd_tags <tag>[,<tag>]...` | Put pool only on OSDs tagged with all specified tags |
| `--block_size 128k` | Put pool only on OSDs with this data block size |
| `--bitmap_granularity 4k` | Put pool only on OSDs with this logical sector size |
| `--immediate_commit none` | Put pool only on OSDs with this or larger immediate_commit (none < small < all) |
| `--primary_affinity_tags tags` | Prefer to put primary copies on OSDs with all specified tags |
| `--scrub_interval <time>` | Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y |
| `--used_for_fs <name>` | Mark pool as used for VitastorFS with metadata in image <name> |
| `--pg_stripe_size <number>` | Increase object grouping stripe |
| `--max_osd_combinations 10000` | Maximum number of random combinations for LP solver input |
| `--wait` | Wait for the new pool to come online |
| `-f` or `--force` | Do not check that cluster has enough OSDs to create the pool |
See also [Pool configuration](../config/pool.en.md) for detailed parameter descriptions.
Examples:
`vitastor-cli create-pool test_x4 -s 4 -n 32`
`vitastor-cli create-pool test_ec42 --ec 4+2 -n 32`
## modify-pool
`vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]`
Modify an existing pool. Modifiable parameters:
```
[-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]
[--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--no_inode_stats 0|1]
[--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]
```
Non-modifiable parameters (changing them WILL lead to data loss):
```
[--block_size <size>] [--bitmap_granularity <size>]
[--immediate_commit <all|small|none>] [--pg_stripe_size <size>]
```
These, however, can still be modified with -f|--force.
See [create-pool](#create-pool) for parameter descriptions.
Examples:
`vitastor-cli modify-pool pool_A --name pool_B`
`vitastor-cli modify-pool 2 --pg_size 4 -n 128`
## rm-pool
`vitastor-cli rm-pool|pool-rm [--force] <id|name>`
Remove a pool. Refuses to remove pools with images without `--force`.
## ls-pools
`vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]`
List pools (only matching <glob> patterns if passed).
| <!-- --> | <!-- --> |
|----------------------|-------------------------------------------------------|
| `-l` or `--long` | Also report I/O statistics |
| `--detail` | Use list format (not table), show all details |
| `--sort FIELD` | Sort by specified field (see fields in --json output) |
| `-r` or `--reverse` | Sort in descending order |
| `-n` or `--count N` | Only list first N items |

View File

@@ -23,6 +23,10 @@ vitastor-cli - интерфейс командной строки для адм
- [merge-data](#merge-data)
- [alloc-osd](#alloc-osd)
- [rm-osd](#rm-osd)
- [create-pool](#create-pool)
- [modify-pool](#modify-pool)
- [ls-pools](#ls-pools)
- [rm-pool](#rm-pool)
Глобальные опции:
@@ -85,8 +89,8 @@ kaveri 2/1 32 0 B 10 G 0 B 100% 0%
`vitastor-cli ls [-l] [-p POOL] [--sort FIELD] [-r] [-n N] [<glob> ...]`
Показать список образов, если переданы шаблоны `<glob>`, то только с именами,
соответствующими этим шаблонам (стандартные ФС-шаблоны с * и ?).
Показать список образов, если передан(ы) шаблон(ы) `<glob>`, то только с именами,
соответствующими одному из шаблонов (стандартные ФС-шаблоны с * и ?).
Опции:
@@ -140,8 +144,8 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
Если новый размер меньше старого, "лишние" данные будут удалены, поэтому перед уменьшением
образа сначала уменьшите файловую систему в нём.
| -f|--force | Разрешить уменьшение или перевод в чтение-запись образа, у которого есть клоны. |
| --down-ok | Разрешить уменьшение, даже если часть данных останется неудалённой на недоступных OSD. |
* `-f|--force` - Разрешить уменьшение или перевод в чтение-запись образа, у которого есть клоны.
* `--down-ok` - Разрешить уменьшение, даже если часть данных останется неудалённой на недоступных OSD.
## rm
@@ -159,7 +163,7 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
Другие опции:
| `--down-ok` | Продолжать удаление/слияние, даже если часть данных останется неудалённой на недоступных OSD. |
* `--down-ok` - Продолжать удаление/слияние, даже если часть данных останется неудалённой на недоступных OSD.
## flatten
@@ -258,3 +262,91 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
С опцией `--dry-run` только проверяет, возможно ли удаление без потери данных и деградации
избыточности.
## create-pool
`vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]`
Создать пул. Обязательные параметры:
| <!-- --> | <!-- --> |
|---------------------------|---------------------------------------------------------------------------------------------|
| `-s R` или `--pg_size R` | Число копий данных для реплицированных пулов |
| `--ec N+K` | Число частей данных (N) и чётности (K) для пулов с кодами коррекции ошибок |
| `-n N` или `--pg_count N` | Число PG для нового пула (начните с 10*<число OSD>/pg_size, округлённого до степени двойки) |
Необязательные параметры:
| <!-- --> | <!-- --> |
|--------------------------------|----------------------------------------------------------------------------|
| `--pg_minsize <number>` | (R или N+K) минус число разрешённых отказов без остановки пула ([подробнее](../config/pool.ru.md#pg_minsize)) |
| `--failure_domain host` | Домен отказа: host, osd или другой из placement_levels. По умолчанию: host |
| `--root_node <node>` | Использовать для пула только дочерние OSD этого узла дерева размещения |
| `--osd_tags <tag>[,<tag>]...` | ...только OSD со всеми заданными тегами |
| `--block_size 128k` | ...только OSD с данным размером блока |
| `--bitmap_granularity 4k` | ...только OSD с данным размером логического сектора |
| `--immediate_commit none` | ...только OSD с этим или большим immediate_commit (none < small < all) |
| `--primary_affinity_tags tags` | Предпочитать OSD со всеми данными тегами для роли первичных |
| `--scrub_interval <time>` | Включить скрабы с заданным интервалом времени (число + единица s/m/h/d/M/y) |
| `--pg_stripe_size <number>` | Увеличить блок группировки объектов по PG |
| `--max_osd_combinations 10000` | Максимальное число случайных комбинаций OSD для ЛП-солвера |
| `--wait` | Подождать, пока новый пул будет активирован |
| `-f` или `--force` | Не проверять, что в кластере достаточно доменов отказа для создания пула |
Подробно о параметрах см. [Конфигурация пулов](../config/pool.ru.md).
Примеры:
`vitastor-cli create-pool test_x4 -s 4 -n 32`
`vitastor-cli create-pool test_ec42 --ec 4+2 -n 32`
## modify-pool
`vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]`
Изменить настройки существующего пула. Изменяемые параметры:
```
[-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]
[--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>]
[--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]
```
Неизменяемые параметры (их изменение ПРИВЕДЁТ к потере данных):
```
[--block_size <size>] [--bitmap_granularity <size>]
[--immediate_commit <all|small|none>] [--pg_stripe_size <size>]
```
Эти параметры можно изменить, только если явно передать опцию -f или --force.
Описания параметров смотрите в [create-pool](#create-pool).
Примеры:
`vitastor-cli modify-pool pool_A --name pool_B`
`vitastor-cli modify-pool 2 --pg_size 4 -n 128`
## rm-pool
`vitastor-cli rm-pool|pool-rm [--force] <id|name>`
Удалить пул. Отказывается удалять пул, в котором ещё есть образы, без `--force`.
## ls-pools
`vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]`
Показать список пулов. Если передан(ы) шаблон(ы) `<glob>`, то только с именами,
соответствующими одному из шаблонов (стандартные ФС-шаблоны с * и ?).
| <!-- --> | <!-- --> |
|-----------------------|------------------------------------------------------------|
| `-l` или `--long` | Вывести также статистику ввода-вывода |
| `--detail` | Максимально подробный вывод в виде списка (а не таблицы) |
| `--sort FIELD` | Сортировать по заданному полю (поля см. в выводе с --json) |
| `-r` или `--reverse` | Сортировать в обратном порядке |
| `-n` или `--count N` | Выводить только первые N записей |

View File

@@ -4,42 +4,150 @@
[Читать на русском](nfs.ru.md)
# NFS
# VitastorFS and pseudo-FS
Vitastor has a simplified NFS 3.0 proxy for file-based image access emulation. It's not
suitable as a full-featured file system, at least because all file/image metadata is stored
in etcd and kept in memory all the time - thus you can't put a lot of files in it.
Vitastor has two file system implementations. Both can be used via `vitastor-nfs`.
However, NFS proxy is totally fine as a method to provide VM image access and allows to
plug Vitastor into, for example, VMWare. It's important to note that for VMWare it's a much
better access method than iSCSI, because with iSCSI we'd have to put all VM images into one
Vitastor image exported as a LUN to VMWare and formatted with VMFS. VMWare doesn't use VMFS
over NFS.
Commands:
- [mount](#mount)
- [start](#start)
NFS proxy is stateless if you use immediate_commit=all mode (for SSD with capacitors or
HDDs with disabled cache), so you can run multiple NFS proxies and use a network load
balancer or any failover method you want to in that case.
## Pseudo-FS
vitastor-nfs usage:
Simplified pseudo-FS proxy is used for file-based image access emulation. It's not
suitable as a full-featured file system: it lacks a lot of FS features, it stores
all file/image metadata in memory and in etcd. So it's fine for hundreds or thousands
of large files/images, but not for millions.
Pseudo-FS proxy is intended for environments where other block volume access methods
can't be used or impose additional restrictions - for example, VMWare. NFS is better
for VMWare than, for example, iSCSI, because with iSCSI, VMWare puts all VM images
into one large shared block image in its own VMFS file system, and with NFS, VMWare
doesn't use VMFS and puts each VM disk in a regular file which is equal to one
Vitastor block image, just as originally intended.
To use Vitastor pseudo-FS locally, run `vitastor-nfs mount --block /mnt/vita`.
Also you can start the network server:
```
vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
--subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)
--portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)
--bind <IP> bind service to <IP> address (default 0.0.0.0)
--nfspath <PATH> set NFS export path to <PATH> (default is /)
--port <PORT> use port <PORT> for NFS services (default is 2049)
--pool <POOL> use <POOL> as default pool for new files (images)
--foreground 1 stay in foreground, do not daemonize
vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
```
Example start and mount commands (etcd_address is optional):
To mount the FS exported by this server, run:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
```
mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
## VitastorFS
VitastorFS is a full-featured clustered (Read-Write-Many) file system. It supports most POSIX
features like hierarchical organization, symbolic links, hard links, quick renames and so on.
VitastorFS metadata is stored in a Parallel Optimistic B-Tree key-value database,
implemented over a regular Vitastor block volume. Directory entries and inodes
are stored in a simple human-readable JSON format in the B-Tree. `vitastor-kv` tool
can be used to inspect the database.
To use VitastorFS:
1. Create a pool or choose an existing empty pool for FS data
2. Create an image for FS metadata, preferably in a faster (SSD or replica-HDD) pool,
but you can create it in the data pool too if you want (image size doesn't matter):
`vitastor-cli create -s 10G -p fastpool testfs`
3. Mark data pool as an FS pool: `vitastor-cli modify-pool --used-for-fs testfs data-pool`
4. Either mount the FS: `vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita`
5. Or start the NFS server: `vitastor-nfs start --fs testfs --pool data-pool`
### Supported POSIX features
- Read-after-write semantics (read returns new data immediately after write)
- Linear and random read and write
- Writing outside current file size
- Hierarchical structure, immediate rename of files and directories
- File size change support (truncate)
- Permissions (chmod/chown)
- Flushing data to stable storage (if required) (fsync)
- Symbolic links
- Hard links
- Special files (devices, sockets, named pipes)
- File modification and attribute change time tracking (mtime and ctime)
- Modification time (mtime) and last access time (atime) change support (utimes)
- Correct handling of directory listing during file creation/deletion
### Limitations
POSIX features currently not implemented in VitastorFS:
- File locking is not supported
- Actually used space is not counted, so `du` always reports apparent file sizes
instead of actually allocated space
- Access times (`atime`) are not tracked (like `-o noatime`)
- Modification time (`mtime`) is updated lazily every second (like `-o lazytime`)
Other notable missing features which should be addressed in the future:
- Defragmentation of "shared" inodes. Files smaller than pool object size (block_size
multiplied by data part count if pool is EC) are internally stored in large block
volumes sequentially, one after another, and leave garbage after deleting or resizing.
Defragmentator will be implemented to collect this garbage.
- Inode ID reuse. Currently inode IDs always grow, the limit is 2^48 inodes, so
in theory you may hit it if you create and delete a very large number of files
- Compaction of the key-value B-Tree. Current implementation never merges or deletes
B-Tree blocks, so B-Tree may become bloated over time. Currently you can
use `vitastor-kv dumpjson` & `loadjson` commands to recreate the index in such
situations.
- Filesystem check tool. VitastorFS doesn't have journal because it would impose a
severe performance hit, optimistic CAS-based transactions are used instead of it.
So, again, in theory an abnormal shutdown of the FS server may leave some garbage
in the DB. The FS is implemented is such way that this garbage doesn't affect its
function, but having a tool to clean it up still seems a right thing to do.
## Horizontal scaling
Linux NFS 3.0 client doesn't support built-in scaling or failover, i.e. you can't
specify multiple server addresses when mounting the FS.
However, you can use any regular TCP load balancing over multiple NFS servers.
It's absolutely safe with `immediate_commit=all` and `client_enable_writeback=false`
settings, because Vitastor NFS proxy doesn't keep uncommitted data in memory
with these settings. But it may even work without `immediate_commit=all` because
the Linux NFS client repeats all uncommitted writes if it loses the connection.
## Commands
### mount
`vitastor-nfs (--fs <NAME> | --block) [-o <OPT>] mount <MOUNTPOINT>`
Start local filesystem server and mount file system to <MOUNTPOINT>.
Use regular `umount <MOUNTPOINT>` to unmount the FS.
The server will be automatically stopped when the FS is unmounted.
- `-o|--options <OPT>` - Pass additional NFS mount options (ex.: -o async).
### start
`vitastor-nfs (--fs <NAME> | --block) start`
Start network NFS server. Options:
| <!-- --> | <!-- --> |
|-----------------|------------------------------------------------------------|
| `--bind <IP>` | bind service to \<IP> address (default 0.0.0.0) |
| `--port <PORT>` | use port \<PORT> for NFS services (default is 2049) |
| `--portmap 0` | do not listen on port 111 (portmap/rpcbind, requires root) |
## Common options
| <!-- --> | <!-- --> |
|--------------------|----------------------------------------------------------|
| `--fs <NAME>` | use VitastorFS with metadata in image \<NAME> |
| `--block` | use pseudo-FS presenting images as files |
| `--pool <POOL>` | use \<POOL> as default pool for new files |
| `--subdir <DIR>` | export \<DIR> instead of root directory (pseudo-FS only) |
| `--nfspath <PATH>` | set NFS export path to \<PATH> (default is /) |
| `--pidfile <FILE>` | write process ID to the specified file |
| `--logfile <FILE>` | log to the specified file |
| `--foreground 1` | stay in foreground, do not daemonize |

View File

@@ -4,41 +4,156 @@
[Read in English](nfs.en.md)
# NFS
# VitastorFS и псевдо-ФС
В Vitastor реализована упрощённая NFS 3.0 прокси для эмуляции файлового доступа к образам.
Это не полноценная файловая система, т.к. метаданные всех файлов (образов) сохраняются
в etcd и всё время хранятся в оперативной памяти - то есть, положить туда много файлов
не получится.
В Vitastor есть две реализации файловой системы. Обе используются через `vitastor-nfs`.
Однако в качестве способа доступа к образам виртуальных машин NFS прокси прекрасно подходит
и позволяет подключить Vitastor, например, к VMWare.
Команды:
- [mount](#mount)
- [start](#start)
При этом, если вы используете режим immediate_commit=all (для SSD с конденсаторами или HDD
с отключённым кэшем), то NFS-сервер не имеет состояния и вы можете свободно поднять
его в нескольких экземплярах и использовать поверх них сетевой балансировщик нагрузки или
схему с отказоустойчивостью.
## Псевдо-ФС
Использование vitastor-nfs:
Упрощённая реализация псевдо-ФС используется для эмуляции файлового доступа к блочным
образам Vitastor. Это не полноценная файловая система - в ней отсутствуют многие функции
POSIX ФС, а метаданные всех файлов (образов) сохраняются в etcd и всё время хранятся в
оперативной памяти - то есть, псевдо-ФС подходит для сотен или тысяч файлов, но не миллионов.
Псевдо-ФС предназначена для доступа к образам виртуальных машин в средах, где другие
способы невозможны или неудобны - например, в VMWare. Для VMWare это лучшая опция, чем
iSCSI, так как при использовании iSCSI VMWare размещает все виртуальные машины в одном
большом блочном образе внутри собственной ФС VMFS, а с NFS VMFS не используется и каждый
диск ВМ представляется в виде одного файла, то есть, соответствует одному блочному образу
Vitastor, как это и задумано изначально.
Чтобы подключить псевдо-ФС Vitastor, выполните команду `vitastor-nfs mount --block /mnt/vita`.
Либо же запустите сетевой вариант сервера:
```
vitastor-nfs [СТАНДАРТНЫЕ ОПЦИИ] [ДРУГИЕ ОПЦИИ]
--subdir <DIR> экспортировать "поддиректорию" - образы с префиксом имени <DIR>/ (по умолчанию пусто - экспортировать все образы)
--portmap 0 отключить сервис portmap/rpcbind на порту 111 (по умолчанию включён и требует root привилегий)
--bind <IP> принимать соединения по адресу <IP> (по умолчанию 0.0.0.0 - на всех)
--nfspath <PATH> установить путь NFS-экспорта в <PATH> (по умолчанию /)
--port <PORT> использовать порт <PORT> для NFS-сервисов (по умолчанию 2049)
--pool <POOL> использовать пул <POOL> для новых образов (обязательно, если пул в кластере не один)
--foreground 1 не уходить в фон после запуска
vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
```
Пример монтирования Vitastor через NFS (etcd_address необязателен):
Примонтировать ФС, запущенную с такими опциями, можно следующей командой:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
```
mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
## VitastorFS
VitastorFS - полноценная кластерная (Read-Write-Many) файловая система. Она поддерживает
большую часть функций POSIX - иерархическую организацию, символические ссылки, жёсткие
ссылки, быстрые переименования и так далее.
Метаданные VitastorFS хранятся в собственной реализации БД формата ключ-значения,
основанной на Параллельном Оптимистичном Б-дереве поверх обычного блочного образа Vitastor.
И записи каталогов, и иноды, как обычно в Vitastor, хранятся в простом человекочитаемом
JSON-формате :-). Для инспекции содержимого БД можно использовать инструмент `vitastor-kv`.
Чтобы использовать VitastorFS:
1. Создайте пул для данных ФС или выберите существующий пустой пул
2. Создайте блочный образ для метаданных ФС, желательно, в более быстром пуле (на SSD
или по крайней мере на HDD, но без EC), но можно и в том же пуле, что данные
(размер образа значения не имеет):
`vitastor-cli create -s 10G -p fastpool testfs`
3. Пометьте пул данных как ФС-пул: `vitastor-cli modify-pool --used-for-fs testfs data-pool`
4. Либо примонтируйте ФС: `vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita`
5. Либо запустите сетевой NFS-сервер: `vitastor-nfs start --fs testfs --pool data-pool`
### Поддерживаемые функции POSIX
- Чтение актуальной версии данных сразу после записи
- Последовательное и произвольное чтение и запись
- Запись за пределами текущего размера файла
- Иерархическая организация, мгновенное переименование файлов и каталогов
- Изменение размера файла (truncate)
- Права на файлы (chmod/chown)
- Фиксация данных на диски (когда необходимо) (fsync)
- Символические ссылки
- Жёсткие ссылки
- Специальные файлы (устройства, сокеты, каналы)
- Отслеживание времён модификации (mtime), изменения атрибутов (ctime)
- Ручное изменение времён модификации (mtime), последнего доступа (atime)
- Корректная обработка изменений списка файлов во время листинга
### Ограничения
Отсутствующие на данный момент в VitastorFS функции POSIX:
- Блокировки файлов не поддерживаются
- Фактически занятое файлами место не подсчитывается и не возвращается вызовами
stat(2), так что `du` всегда показывает сумму размеров файлов, а не фактически занятое место
- Времена доступа (`atime`) не отслеживаются (как будто ФС смонтирована с `-o noatime`)
- Времена модификации (`mtime`) отслеживаются асинхронно (как будто ФС смонтирована с `-o lazytime`)
Другие недостающие функции, которые нужно добавить в будущем:
- Дефрагментация "общих инодов". На уровне реализации ФС файлы, меньшие, чем размер
объекта пула (block_size умножить на число частей данных, если пул EC),
упаковываются друг за другом в большие "общие" иноды/тома. Если такие файлы удалять
или увеличивать, они перемещаются и оставляют за собой "мусор", вот тут-то и нужен
дефрагментатор.
- Переиспользование номеров инодов. В текущей реализации номера инодов всё время
увеличиваются, так что в теории вы можете упереться в лимит, если насоздаёте
и наудаляете больше, чем 2^48 файлов.
- Очистка места в Б-дереве метаданных. Текущая реализация никогда не сливает и не
удаляет блоки Б-дерева, так что в теории дерево может разростись и стать неоптимальным.
Если вы столкнётесь с такой ситуацией сейчас, вы можете решить её с помощью
команд `vitastor-kv dumpjson` и `loadjson` (т.е. пересоздав и загрузив обратно все метаданные ФС).
- Инструмент проверки метаданных файловой системы. У VitastorFS нет журнала, так как
журнал бы сильно замедлил реализацию, вместо него используются оптимистичные
транзакции на основе CAS (сравнить-и-записать), и теоретически при нештатном
завершении сервера ФС в БД также могут оставаться неконсистентные "мусорные"
записи. ФС устроена так, что на работу они не влияют, но для порядка и их стоит
уметь подчищать.
## Горизонтальное масштабирование
Клиент Linux NFS 3.0 не поддерживает встроенное масштабирование или отказоустойчивость.
То есть, вы не можете задать несколько адресов серверов при монтировании ФС.
Однако вы можете использовать любые стандартные сетевые балансировщики нагрузки
или схемы с отказоустойчивостью. Это точно безопасно при настройках `immediate_commit=all` и
`client_enable_writeback=false`, так как с ними NFS-сервер Vitastor вообще не хранит
в памяти ещё не зафиксированные на дисках данные; и вполне вероятно безопасно
даже без `immediate_commit=all`, потому что NFS-клиент ядра Linux повторяет все
незафиксированные запросы при потере соединения.
## Команды
### mount
`vitastor-nfs (--fs <NAME> | --block) mount [-o <OPT>] <MOUNTPOINT>`
Запустить локальный сервер и примонтировать ФС в директорию <MOUNTPOINT>.
Чтобы отмонтировать ФС, используйте обычную команду `umount <MOUNTPOINT>`.
Сервер автоматически останавливается при отмонтировании ФС.
- `-o|--options <OPT>` - Передать дополнительные опции монтирования NFS (пример: -o async).
### start
`vitastor-nfs (--fs <NAME> | --block) start`
Запустить сетевой NFS-сервер. Опции:
| <!-- --> | <!-- --> |
|-----------------|-----------------------------------------------------------------------|
| `--bind <IP>` | принимать соединения по адресу \<IP> (по умолчанию 0.0.0.0 - на всех) |
| `--port <PORT>` | использовать порт \<PORT> для NFS-сервисов (по умолчанию 2049) |
| `--portmap 0` | отключить сервис portmap/rpcbind на порту 111 (по умолчанию включён и требует root привилегий) |
## Общие опции
| <!-- --> | <!-- --> |
|--------------------|---------------------------------------------------------|
| `--fs <NAME>` | использовать VitastorFS с метаданными в образе \<NAME> |
| `--block` | использовать псевдо-ФС для доступа к блочным образам |
| `--pool <POOL>` | использовать пул \<POOL> для новых файлов (обязательно, если пул в кластере не один) |
| `--subdir <DIR>` | экспортировать подкаталог \<DIR>, а не корень (только для псевдо-ФС) |
| `--nfspath <PATH>` | установить путь NFS-экспорта в \<PATH> (по умолчанию /) |
| `--pidfile <FILE>` | записать ID процесса в заданный файл |
| `--logfile <FILE>` | записывать логи в заданный файл |
| `--foreground 1` | не уходить в фон после запуска |

View File

@@ -37,7 +37,7 @@ const etcd_allow = new RegExp('^'+[
'pg/history/[1-9]\\d*/[1-9]\\d*',
'pool/stats/[1-9]\\d*',
'history/last_clean_pgs',
'inode/stats/[1-9]\\d*/[1-9]\\d*',
'inode/stats/[1-9]\\d*/\\d+',
'pool/stats/[1-9]\\d*',
'stats',
'index/image/.*',
@@ -1737,8 +1737,11 @@ class Mon
for (const inode_num in this.state.osd.space[osd_num][pool_id])
{
const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += u;
if (inode_num)
{
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += u;
}
this.state.pool.stats[pool_id].used_raw_tb += u;
}
}

View File

@@ -1,6 +1,6 @@
{
"name": "vitastor-mon",
"version": "1.4.8",
"version": "1.5.0",
"description": "Vitastor SDS monitor service",
"main": "mon-main.js",
"scripts": {

View File

@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '1.4.8'
VERSION = '1.5.0'
LOG = logging.getLogger(__name__)

View File

@@ -24,4 +24,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-1.4.8/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.4.8$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-1.5.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.5.0$(rpm --eval '%dist').tar.gz *

View File

@@ -36,7 +36,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-1.4.8.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.5.0.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 1.4.8
Version: 1.5.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-1.4.8.el7.tar.gz
Source0: vitastor-1.5.0.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -149,9 +149,12 @@ mkdir -p /etc/vitastor
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vitastor-kv
%_bindir/vitastor-kv-stress
%_bindir/vita
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_libdir/libvitastor_kv.so*
%files -n vitastor-client-devel

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-1.4.8.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.5.0.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 1.4.8
Version: 1.5.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-1.4.8.el8.tar.gz
Source0: vitastor-1.5.0.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -146,9 +146,12 @@ mkdir -p /etc/vitastor
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vitastor-kv
%_bindir/vitastor-kv-stress
%_bindir/vita
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_libdir/libvitastor_kv.so*
%files -n vitastor-client-devel

View File

@@ -18,7 +18,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-1.4.8.el9.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.5.0.el9.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 1.4.8
Version: 1.5.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-1.4.8.el9.tar.gz
Source0: vitastor-1.5.0.el9.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -139,9 +139,12 @@ mkdir -p /etc/vitastor
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vitastor-kv
%_bindir/vitastor-kv-stress
%_bindir/vita
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_libdir/libvitastor_kv.so*
%files -n vitastor-client-devel

View File

@@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="1.4.8")
add_definitions(-DVERSION="1.5.0")
add_definitions(-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
add_link_options(-fno-omit-frame-pointer)
if (${WITH_ASAN})
@@ -145,7 +145,6 @@ add_library(vitastor_client SHARED
cli_status.cpp
cli_describe.cpp
cli_fix.cpp
cli_df.cpp
cli_ls.cpp
cli_create.cpp
cli_modify.cpp
@@ -154,6 +153,11 @@ add_library(vitastor_client SHARED
cli_rm_data.cpp
cli_rm.cpp
cli_rm_osd.cpp
cli_pool_cfg.cpp
cli_pool_create.cpp
cli_pool_ls.cpp
cli_pool_modify.cpp
cli_pool_rm.cpp
)
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client
@@ -181,10 +185,48 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
# libvitastor_kv.so
add_library(vitastor_kv SHARED
kv_db.cpp
kv_db.h
)
target_link_libraries(vitastor_kv
vitastor_client
)
set_target_properties(vitastor_kv PROPERTIES VERSION ${VERSION} SOVERSION 0)
# vitastor-kv
add_executable(vitastor-kv
kv_cli.cpp
)
target_link_libraries(vitastor-kv
vitastor_kv
)
add_executable(vitastor-kv-stress
kv_stress.cpp
)
target_link_libraries(vitastor-kv-stress
vitastor_kv
)
# vitastor-nfs
add_executable(vitastor-nfs
nfs_proxy.cpp
nfs_conn.cpp
nfs_block.cpp
nfs_kv.cpp
nfs_kv_create.cpp
nfs_kv_getattr.cpp
nfs_kv_link.cpp
nfs_kv_lookup.cpp
nfs_kv_read.cpp
nfs_kv_readdir.cpp
nfs_kv_remove.cpp
nfs_kv_rename.cpp
nfs_kv_setattr.cpp
nfs_kv_write.cpp
nfs_fsstat.cpp
nfs_mount.cpp
nfs_portmap.cpp
sha256.c
nfs/xdr_impl.cpp
@@ -194,6 +236,7 @@ add_executable(vitastor-nfs
)
target_link_libraries(vitastor-nfs
vitastor_client
vitastor_kv
)
# vitastor-cli
@@ -318,12 +361,12 @@ add_test(NAME test_cluster_client COMMAND test_cluster_client)
### Install
install(TARGETS vitastor-osd vitastor-disk vitastor-nbd vitastor-nfs vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS vitastor-osd vitastor-disk vitastor-nbd vitastor-nfs vitastor-cli vitastor-kv vitastor-kv-stress RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install_symlink(vitastor-disk ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-dump-journal)
install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-rm)
install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vita)
install(
TARGETS vitastor_blk vitastor_client
TARGETS vitastor_blk vitastor_client vitastor_kv
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)

View File

@@ -82,3 +82,8 @@ uint32_t blockstore_t::get_bitmap_granularity()
{
return impl->get_bitmap_granularity();
}
void blockstore_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
{
impl->set_no_inode_stats(pool_ids);
}

View File

@@ -216,6 +216,9 @@ public:
// Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats();
// Set per-pool no_inode_stats
void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
// Print diagnostics to stdout
void dump_diagnostics();

View File

@@ -733,3 +733,86 @@ void blockstore_impl_t::disk_error_abort(const char *op, int retval, int expecte
fprintf(stderr, "Disk %s failed: result is %d, expected %d. Can't continue, sorry :-(\n", op, retval, expected);
exit(1);
}
void blockstore_impl_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
{
for (auto & np: no_inode_stats)
{
np.second = 2;
}
for (auto pool_id: pool_ids)
{
if (!no_inode_stats[pool_id])
recalc_inode_space_stats(pool_id, false);
no_inode_stats[pool_id] = 1;
}
for (auto np_it = no_inode_stats.begin(); np_it != no_inode_stats.end(); )
{
if (np_it->second == 2)
{
recalc_inode_space_stats(np_it->first, true);
no_inode_stats.erase(np_it++);
}
else
np_it++;
}
}
void blockstore_impl_t::recalc_inode_space_stats(uint64_t pool_id, bool per_inode)
{
auto sp_begin = inode_space_stats.lower_bound((pool_id << (64-POOL_ID_BITS)));
auto sp_end = inode_space_stats.lower_bound(((pool_id+1) << (64-POOL_ID_BITS)));
inode_space_stats.erase(sp_begin, sp_end);
auto sh_it = clean_db_shards.lower_bound((pool_id << (64-POOL_ID_BITS)));
while (sh_it != clean_db_shards.end() &&
(sh_it->first >> (64-POOL_ID_BITS)) == pool_id)
{
for (auto & pair: sh_it->second)
{
uint64_t space_id = per_inode ? pair.first.inode : (pool_id << (64-POOL_ID_BITS));
inode_space_stats[space_id] += dsk.data_block_size;
}
sh_it++;
}
object_id last_oid = {};
bool last_exists = false;
auto dirty_it = dirty_db.lower_bound((obj_ver_id){ .oid = { .inode = (pool_id << (64-POOL_ID_BITS)) } });
while (dirty_it != dirty_db.end() && (dirty_it->first.oid.inode >> (64-POOL_ID_BITS)) == pool_id)
{
if (IS_STABLE(dirty_it->second.state) && (IS_BIG_WRITE(dirty_it->second.state) || IS_DELETE(dirty_it->second.state)))
{
bool exists = false;
if (last_oid == dirty_it->first.oid)
{
exists = last_exists;
}
else
{
auto & clean_db = clean_db_shard(dirty_it->first.oid);
auto clean_it = clean_db.find(dirty_it->first.oid);
exists = clean_it != clean_db.end();
}
uint64_t space_id = per_inode ? dirty_it->first.oid.inode : (pool_id << (64-POOL_ID_BITS));
if (IS_BIG_WRITE(dirty_it->second.state))
{
if (!exists)
inode_space_stats[space_id] += dsk.data_block_size;
last_exists = true;
}
else
{
if (exists)
{
auto & sp = inode_space_stats[space_id];
if (sp > dsk.data_block_size)
sp -= dsk.data_block_size;
else
inode_space_stats.erase(space_id);
}
last_exists = false;
}
last_oid = dirty_it->first.oid;
}
dirty_it++;
}
}

View File

@@ -272,6 +272,7 @@ class blockstore_impl_t
std::map<pool_id_t, pool_shard_settings_t> clean_db_settings;
std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards;
std::map<uint64_t, int> no_inode_stats;
uint8_t *clean_bitmaps = NULL;
blockstore_dirty_db_t dirty_db;
std::vector<blockstore_op_t*> submit_queue;
@@ -318,6 +319,7 @@ class blockstore_impl_t
blockstore_clean_db_t& clean_db_shard(object_id oid);
void reshard_clean_db(pool_id_t pool_id, uint32_t pg_count, uint32_t pg_stripe_size);
void recalc_inode_space_stats(uint64_t pool_id, bool per_inode);
// Journaling
void prepare_journal_sector_write(int sector, blockstore_op_t *op);
@@ -428,6 +430,9 @@ public:
// Space usage statistics
std::map<uint64_t, uint64_t> inode_space_stats;
// Set per-pool no_inode_stats
void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
// Print diagnostics to stdout
void dump_diagnostics();

View File

@@ -32,7 +32,7 @@ void blockstore_init_meta::handle_event(ring_data_t *data, int buf_num)
if (data->res < 0)
{
throw std::runtime_error(
std::string("read metadata failed at offset ") + std::to_string(bufs[buf_num].offset) +
std::string("read metadata failed at offset ") + std::to_string(buf_num >= 0 ? bufs[buf_num].offset : last_read_offset) +
std::string(": ") + strerror(-data->res)
);
}
@@ -63,6 +63,7 @@ int blockstore_init_meta::loop()
throw std::runtime_error("Failed to allocate metadata read buffer");
// Read superblock
GET_SQE();
last_read_offset = 0;
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
@@ -100,6 +101,7 @@ resume_1:
{
printf("Initializing metadata area\n");
GET_SQE();
last_read_offset = 0;
data->iov = (struct iovec){ metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
@@ -236,6 +238,7 @@ resume_2:
data->iov = { bufs[i].buf, (size_t)bufs[i].size };
data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
bs->ringloop->submit();
bufs[i].state = INIT_META_WRITING;
submitted++;
}
@@ -259,9 +262,11 @@ resume_2:
next_offset = entries_to_zero[i]/entries_per_block;
for (j = i; j < entries_to_zero.size() && entries_to_zero[j]/entries_per_block == next_offset; j++) {}
GET_SQE();
last_read_offset = (1+next_offset)*bs->dsk.meta_block_size;
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
bs->ringloop->submit();
submitted++;
resume_5:
if (submitted > 0)
@@ -278,6 +283,7 @@ resume_5:
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
bs->ringloop->submit();
submitted++;
resume_6:
if (submitted > 0)
@@ -299,6 +305,7 @@ resume_6:
{
GET_SQE();
my_uring_prep_fsync(sqe, bs->dsk.meta_fd, IORING_FSYNC_DATASYNC);
last_read_offset = 0;
data->iov = { 0 };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
submitted++;

View File

@@ -23,6 +23,7 @@ class blockstore_init_meta
struct ring_data_t *data;
uint64_t md_offset = 0;
uint64_t next_offset = 0;
uint64_t last_read_offset = 0;
uint64_t entries_loaded = 0;
unsigned entries_per_block = 0;
int i = 0, j = 0;

View File

@@ -487,18 +487,24 @@ void blockstore_impl_t::mark_stable(obj_ver_id v, bool forget_dirty)
}
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
uint64_t space_id = dirty_it->first.oid.inode;
if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
inode_space_stats[space_id] += dsk.data_block_size;
used_blocks++;
}
big_to_flush++;
}
else if (IS_DELETE(dirty_it->second.state))
{
auto & sp = inode_space_stats[dirty_it->first.oid.inode];
uint64_t space_id = dirty_it->first.oid.inode;
if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
auto & sp = inode_space_stats[space_id];
if (sp > dsk.data_block_size)
sp -= dsk.data_block_size;
else
inode_space_stats.erase(dirty_it->first.oid.inode);
inode_space_stats.erase(space_id);
used_blocks--;
big_to_flush++;
}

View File

@@ -116,6 +116,55 @@ static const char* help_text =
" With --dry-run only checks if deletion is possible without data loss and\n"
" redundancy degradation.\n"
"\n"
"vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]\n"
" Create a pool. Required parameters:\n"
" -s|--pg_size R Number of replicas for replicated pools\n"
" --ec N+K Number of data (N) and parity (K) chunks for erasure-coded pools\n"
" -n|--pg_count N PG count for the new pool (start with 10*<OSD count>/pg_size rounded to a power of 2)\n"
" Optional parameters:\n"
" --pg_minsize <number> R or N+K minus number of failures to tolerate without downtime\n"
" --failure_domain host Failure domain: host, osd or a level from placement_levels. Default: host\n"
" --root_node <node> Put pool only on child OSDs of this placement tree node\n"
" --osd_tags <tag>[,<tag>]... Put pool only on OSDs tagged with all specified tags\n"
" --block_size 128k Put pool only on OSDs with this data block size\n"
" --bitmap_granularity 4k Put pool only on OSDs with this logical sector size\n"
" --immediate_commit none Put pool only on OSDs with this or larger immediate_commit (none < small < all)\n"
" --primary_affinity_tags tags Prefer to put primary copies on OSDs with all specified tags\n"
" --scrub_interval <time> Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y\n"
" --used_for_fs <name> Mark pool as used for VitastorFS with metadata in image <name>\n"
" --pg_stripe_size <number> Increase object grouping stripe\n"
" --max_osd_combinations 10000 Maximum number of random combinations for LP solver input\n"
" --wait Wait for the new pool to come online\n"
" -f|--force Do not check that cluster has enough OSDs to create the pool\n"
" Examples:\n"
" vitastor-cli create-pool test_x4 -s 4 -n 32\n"
" vitastor-cli create-pool test_ec42 --ec 4+2 -n 32\n"
"\n"
"vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]\n"
" Modify an existing pool. Modifiable parameters:\n"
" [-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]\n"
" [--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--used_for_fs <name>]\n"
" [--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]\n"
" Non-modifiable parameters (changing them WILL lead to data loss):\n"
" [--block_size <size>] [--bitmap_granularity <size>]\n"
" [--immediate_commit <all|small|none>] [--pg_stripe_size <size>]\n"
" These, however, can still be modified with -f|--force.\n"
" See create-pool for parameter descriptions.\n"
" Examples:\n"
" vitastor-cli modify-pool pool_A --name pool_B\n"
" vitastor-cli modify-pool 2 --pg_size 4 -n 128\n"
"\n"
"vitastor-cli rm-pool|pool-rm [--force] <id|name>\n"
" Remove a pool. Refuses to remove pools with images without --force.\n"
"\n"
"vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]\n"
" List pools (only matching <glob> patterns if passed).\n"
" -l|--long Also report I/O statistics\n"
" --detail Use list format (not table), show all details\n"
" --sort FIELD Sort by specified field (see fields in --json output)\n"
" -r|--reverse Sort in descending order\n"
" -n|--count N Only list first N items\n"
"\n"
"Use vitastor-cli --help <command> for command details or vitastor-cli --help --all for all details.\n"
"\n"
"GLOBAL OPTIONS:\n"
@@ -136,6 +185,7 @@ static json11::Json::object parse_args(int narg, const char *args[])
cfg["progress"] = "1";
for (int i = 1; i < narg; i++)
{
bool argHasValue = (!(i == narg-1) && (args[i+1][0] != '-'));
if (args[i][0] == '-' && args[i][1] == 'h' && args[i][2] == 0)
{
cfg["help"] = "1";
@@ -146,15 +196,15 @@ static json11::Json::object parse_args(int narg, const char *args[])
}
else if (args[i][0] == '-' && args[i][1] == 'n' && args[i][2] == 0)
{
cfg["count"] = args[++i];
cfg["count"] = argHasValue ? args[++i] : "";
}
else if (args[i][0] == '-' && args[i][1] == 'p' && args[i][2] == 0)
{
cfg["pool"] = args[++i];
cfg["pool"] = argHasValue ? args[++i] : "";
}
else if (args[i][0] == '-' && args[i][1] == 's' && args[i][2] == 0)
{
cfg["size"] = args[++i];
cfg["size"] = argHasValue ? args[++i] : "";
}
else if (args[i][0] == '-' && args[i][1] == 'r' && args[i][2] == 0)
{
@@ -167,9 +217,9 @@ static json11::Json::object parse_args(int narg, const char *args[])
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = i == narg-1 || !strcmp(opt, "json") ||
if (!strcmp(opt, "json") || !strcmp(opt, "wait") ||
!strcmp(opt, "wait-list") || !strcmp(opt, "wait_list") ||
!strcmp(opt, "long") || !strcmp(opt, "del") ||
!strcmp(opt, "long") || !strcmp(opt, "detail") || !strcmp(opt, "del") ||
!strcmp(opt, "no-color") || !strcmp(opt, "no_color") ||
!strcmp(opt, "readonly") || !strcmp(opt, "readwrite") ||
!strcmp(opt, "force") || !strcmp(opt, "reverse") ||
@@ -177,8 +227,14 @@ static json11::Json::object parse_args(int narg, const char *args[])
!strcmp(opt, "down-ok") || !strcmp(opt, "down_ok") ||
!strcmp(opt, "dry-run") || !strcmp(opt, "dry_run") ||
!strcmp(opt, "help") || !strcmp(opt, "all") ||
(!strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped")) && strcmp("1", args[i+1]) != 0
? "1" : args[++i];
!strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped"))
{
cfg[opt] = "1";
}
else
{
cfg[opt] = argHasValue ? args[++i] : "";
}
}
else
{
@@ -221,7 +277,7 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
else if (cmd[0] == "df")
{
// Show pool space stats
action_cb = p->start_df(cfg);
action_cb = p->start_pool_ls(cfg);
}
else if (cmd[0] == "ls")
{
@@ -328,6 +384,44 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
// Allocate a new OSD number
action_cb = p->start_alloc_osd(cfg);
}
else if (cmd[0] == "create-pool" || cmd[0] == "pool-create")
{
// Create a new pool
if (cmd.size() > 1 && cfg["name"].is_null())
{
cfg["name"] = cmd[1];
}
action_cb = p->start_pool_create(cfg);
}
else if (cmd[0] == "modify-pool" || cmd[0] == "pool-modify")
{
// Modify existing pool
if (cmd.size() > 1)
{
cfg["old_name"] = cmd[1];
}
action_cb = p->start_pool_modify(cfg);
}
else if (cmd[0] == "rm-pool" || cmd[0] == "pool-rm")
{
// Remove existing pool
if (cmd.size() > 1)
{
cfg["pool"] = cmd[1];
}
action_cb = p->start_pool_rm(cfg);
}
else if (cmd[0] == "ls-pool" || cmd[0] == "pool-ls" || cmd[0] == "ls-pools" || cmd[0] == "pools")
{
// Show pool list
cfg["show_recovery"] = 1;
if (cmd.size() > 1)
{
cmd.erase(cmd.begin(), cmd.begin()+1);
cfg["names"] = cmd;
}
action_cb = p->start_pool_ls(cfg);
}
else
{
result = { .err = EINVAL, .text = "unknown command: "+cmd[0].string_value() };

View File

@@ -46,6 +46,7 @@ public:
json11::Json etcd_result;
void parse_config(json11::Json::object & cfg);
json11::Json parse_tags(std::string tags);
void change_parent(inode_t cur, inode_t new_parent, cli_result_t *result);
inode_config_t* get_inode_cfg(const std::string & name);
@@ -58,7 +59,6 @@ public:
std::function<bool(cli_result_t &)> start_status(json11::Json);
std::function<bool(cli_result_t &)> start_describe(json11::Json);
std::function<bool(cli_result_t &)> start_fix(json11::Json);
std::function<bool(cli_result_t &)> start_df(json11::Json);
std::function<bool(cli_result_t &)> start_ls(json11::Json);
std::function<bool(cli_result_t &)> start_create(json11::Json);
std::function<bool(cli_result_t &)> start_modify(json11::Json);
@@ -68,6 +68,10 @@ public:
std::function<bool(cli_result_t &)> start_rm(json11::Json);
std::function<bool(cli_result_t &)> start_rm_osd(json11::Json cfg);
std::function<bool(cli_result_t &)> start_alloc_osd(json11::Json cfg);
std::function<bool(cli_result_t &)> start_pool_create(json11::Json);
std::function<bool(cli_result_t &)> start_pool_modify(json11::Json);
std::function<bool(cli_result_t &)> start_pool_rm(json11::Json);
std::function<bool(cli_result_t &)> start_pool_ls(json11::Json);
// Should be called like loop_and_wait(start_status(), <completion callback>)
void loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std::function<void(const cli_result_t &)> complete_cb);
@@ -77,8 +81,13 @@ public:
std::string print_table(json11::Json items, json11::Json header, bool use_esc);
size_t print_detail_title_len(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t prev_len);
std::string print_detail(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t title_len, bool use_esc);
std::string format_lat(uint64_t lat);
std::string format_q(double depth);
bool stupid_glob(const std::string str, const std::string glob);
std::string implode(const std::string & sep, json11::Json array);

View File

@@ -153,6 +153,7 @@ void cli_tool_t::loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std:
ringloop->unregister_consumer(&looper->consumer);
looper->loop_cb = NULL;
looper->complete_cb(looper->result);
ringloop->submit();
delete looper;
return;
}

View File

@@ -27,6 +27,7 @@ struct image_creator_t
std::string image_name, new_snap, new_parent;
json11::Json new_meta;
uint64_t size;
bool force = false;
bool force_size = false;
pool_id_t old_pool_id = 0;
@@ -45,6 +46,7 @@ struct image_creator_t
void loop()
{
auto & pools = parent->cli->st_cli.pool_config;
if (state >= 1)
goto resume_1;
if (image_name == "")
@@ -62,7 +64,6 @@ struct image_creator_t
}
if (new_pool_id)
{
auto & pools = parent->cli->st_cli.pool_config;
if (pools.find(new_pool_id) == pools.end())
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(new_pool_id)+" does not exist" };
@@ -72,7 +73,7 @@ struct image_creator_t
}
else if (new_pool_name != "")
{
for (auto & ic: parent->cli->st_cli.pool_config)
for (auto & ic: pools)
{
if (ic.second.name == new_pool_name)
{
@@ -87,10 +88,20 @@ struct image_creator_t
return;
}
}
else if (parent->cli->st_cli.pool_config.size() == 1)
else if (pools.size() == 1)
{
auto it = parent->cli->st_cli.pool_config.begin();
new_pool_id = it->first;
new_pool_id = pools.begin()->first;
}
if (new_pool_id && !pools.at(new_pool_id).used_for_fs.empty() && !force)
{
result = (cli_result_t){
.err = EINVAL,
.text = "Pool "+pools.at(new_pool_id).name+
" is used for VitastorFS "+pools.at(new_pool_id).used_for_fs+
". Use --force if you really know what you are doing",
};
state = 100;
return;
}
state = 1;
resume_1:
@@ -183,7 +194,16 @@ resume_3:
// Save into inode_config for library users to be able to take it from there immediately
new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
parent->cli->st_cli.insert_inode_config(new_cfg);
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" created" };
result = (cli_result_t){
.err = 0,
.text = "Image "+image_name+" created",
.data = json11::Json::object {
{ "name", image_name },
{ "pool", new_pool_name },
{ "parent", new_parent },
{ "size", size },
}
};
state = 100;
}
@@ -251,7 +271,16 @@ resume_4:
// Save into inode_config for library users to be able to take it from there immediately
new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
parent->cli->st_cli.insert_inode_config(new_cfg);
result = (cli_result_t){ .err = 0, .text = "Snapshot "+image_name+"@"+new_snap+" created" };
result = (cli_result_t){
.err = 0,
.text = "Snapshot "+image_name+"@"+new_snap+" created",
.data = json11::Json::object {
{ "name", image_name+"@"+new_snap },
{ "pool", (uint64_t)new_pool_id },
{ "parent", new_parent },
{ "size", size },
}
};
state = 100;
}
@@ -514,6 +543,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_create(json11::Json cfg)
image_creator->image_name = cfg["image"].string_value();
image_creator->new_pool_id = cfg["pool"].uint64_value();
image_creator->new_pool_name = cfg["pool"].string_value();
image_creator->force = cfg["force"].bool_value();
image_creator->force_size = cfg["force_size"].bool_value();
if (cfg["image_meta"].is_object())
{

View File

@@ -1,243 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
// List pools with space statistics
struct pool_lister_t
{
cli_tool_t *parent;
int state = 0;
json11::Json space_info;
cli_result_t result;
std::map<pool_id_t, json11::Json::object> pool_stats;
bool is_done()
{
return state == 100;
}
void get_stats()
{
if (state == 1)
goto resume_1;
// Space statistics - pool/stats/<pool>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats0"
) },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats0"
) },
} },
},
} },
});
state = 1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
space_info = parent->etcd_result;
std::map<pool_id_t, uint64_t> osd_free;
for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID
pool_id_t pool_id;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// pool/stats/<N>
pool_stats[pool_id] = kv.value.object_items();
}
for (auto & kv_item: space_info["responses"][1]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// osd ID
osd_num_t osd_num;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%ju%c", &osd_num, &null_byte);
if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// osd/stats/<N>::free
osd_free[osd_num] = kv.value["free"].uint64_value();
}
// Calculate max_avail for each pool
for (auto & pp: parent->cli->st_cli.pool_config)
{
auto & pool_cfg = pp.second;
uint64_t pool_avail = UINT64_MAX;
std::map<osd_num_t, uint64_t> pg_per_osd;
for (auto & pgp: pool_cfg.pg_config)
{
for (auto pg_osd: pgp.second.target_set)
{
if (pg_osd != 0)
{
pg_per_osd[pg_osd]++;
}
}
}
for (auto pg_per_pair: pg_per_osd)
{
uint64_t pg_free = osd_free[pg_per_pair.first] * pool_cfg.real_pg_count / pg_per_pair.second;
if (pool_avail > pg_free)
{
pool_avail = pg_free;
}
}
if (pool_avail == UINT64_MAX)
{
pool_avail = 0;
}
if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
{
pool_avail *= (pool_cfg.pg_size - pool_cfg.parity_chunks);
}
pool_stats[pool_cfg.id] = json11::Json::object {
{ "id", (uint64_t)pool_cfg.id },
{ "name", pool_cfg.name },
{ "pg_count", pool_cfg.pg_count },
{ "real_pg_count", pool_cfg.real_pg_count },
{ "scheme", pool_cfg.scheme == POOL_SCHEME_REPLICATED ? "replicated" : "ec" },
{ "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
: "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "max_available", pool_avail },
{ "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
{ "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
{ "pg_real_size", pool_stats[pool_cfg.id]["pg_real_size"].uint64_value() },
{ "failure_domain", pool_cfg.failure_domain },
};
}
}
json11::Json::array to_list()
{
json11::Json::array list;
for (auto & kv: pool_stats)
{
list.push_back(kv.second);
}
return list;
}
void loop()
{
get_stats();
if (parent->waiting > 0)
return;
if (state == 100)
return;
if (parent->json_output)
{
// JSON output
result.data = to_list();
state = 100;
return;
}
// Table output: name, scheme_name, pg_count, total, used, max_avail, used%, efficiency
json11::Json::array cols;
cols.push_back(json11::Json::object{
{ "key", "name" },
{ "title", "NAME" },
});
cols.push_back(json11::Json::object{
{ "key", "scheme_name" },
{ "title", "SCHEME" },
});
cols.push_back(json11::Json::object{
{ "key", "pg_count_fmt" },
{ "title", "PGS" },
});
cols.push_back(json11::Json::object{
{ "key", "total_fmt" },
{ "title", "TOTAL" },
});
cols.push_back(json11::Json::object{
{ "key", "used_fmt" },
{ "title", "USED" },
});
cols.push_back(json11::Json::object{
{ "key", "max_avail_fmt" },
{ "title", "AVAILABLE" },
});
cols.push_back(json11::Json::object{
{ "key", "used_pct" },
{ "title", "USED%" },
});
cols.push_back(json11::Json::object{
{ "key", "eff_fmt" },
{ "title", "EFFICIENCY" },
});
json11::Json::array list;
for (auto & kv: pool_stats)
{
double raw_to = kv.second["raw_to_usable"].number_value();
if (raw_to < 0.000001 && raw_to > -0.000001)
raw_to = 1;
kv.second["pg_count_fmt"] = kv.second["real_pg_count"] == kv.second["pg_count"]
? kv.second["real_pg_count"].as_string()
: kv.second["real_pg_count"].as_string()+"->"+kv.second["pg_count"].as_string();
kv.second["total_fmt"] = format_size(kv.second["total_raw"].uint64_value() / raw_to);
kv.second["used_fmt"] = format_size(kv.second["used_raw"].uint64_value() / raw_to);
kv.second["max_avail_fmt"] = format_size(kv.second["max_available"].uint64_value());
kv.second["used_pct"] = format_q(kv.second["total_raw"].uint64_value()
? (100 - 100*kv.second["max_available"].uint64_value() *
kv.second["raw_to_usable"].number_value() / kv.second["total_raw"].uint64_value())
: 100)+"%";
kv.second["eff_fmt"] = format_q(kv.second["space_efficiency"].number_value()*100)+"%";
}
result.data = to_list();
result.text = print_table(result.data, cols, parent->color);
state = 100;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_df(json11::Json cfg)
{
auto lister = new pool_lister_t();
lister->parent = this;
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}
return false;
};
}

View File

@@ -350,7 +350,11 @@ struct snap_merger_t
: "Overwriting blocks: %ju/%ju\n", to_process, to_process);
}
// Done
result = (cli_result_t){ .text = "Done, layers from "+from_name+" to "+to_name+" merged into "+target_name };
result = (cli_result_t){ .text = "Done, layers from "+from_name+" to "+to_name+" merged into "+target_name, .data = json11::Json::object {
{ "from", from_name },
{ "to", to_name },
{ "into", target_name },
}};
state = 100;
resume_100:
return;

View File

@@ -85,7 +85,10 @@ struct image_changer_t
(!new_size && !force_size || cfg.size == new_size || cfg.size >= new_size && inc_size) &&
(new_name == "" || new_name == image_name))
{
result = (cli_result_t){ .text = "No change" };
result = (cli_result_t){ .err = 0, .text = "No change", .data = json11::Json::object {
{ "error_code", 0 },
{ "error_text", "No change" },
}};
state = 100;
return;
}
@@ -222,7 +225,16 @@ resume_2:
parent->cli->st_cli.inode_by_name.erase(image_name);
}
parent->cli->st_cli.insert_inode_config(cfg);
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" modified" };
result = (cli_result_t){
.err = 0,
.text = "Image "+image_name+" modified",
.data = json11::Json::object {
{ "name", image_name },
{ "inode", INODE_NO_POOL(inode_num) },
{ "pool", (uint64_t)INODE_POOL(inode_num) },
{ "size", new_size },
}
};
state = 100;
}
};

270
src/cli_pool_cfg.cpp Normal file
View File

@@ -0,0 +1,270 @@
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include "cli_pool_cfg.h"
#include "etcd_state_client.h"
#include "str_util.h"
std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old_cfg,
uint64_t global_block_size, uint64_t global_bitmap_granularity, bool force)
{
// short option names
if (new_cfg.find("count") != new_cfg.end())
{
new_cfg["pg_count"] = new_cfg["count"];
new_cfg.erase("count");
}
if (new_cfg.find("size") != new_cfg.end())
{
new_cfg["pg_size"] = new_cfg["size"];
new_cfg.erase("size");
}
// --ec shortcut
if (new_cfg.find("ec") != new_cfg.end())
{
if (new_cfg.find("scheme") != new_cfg.end() ||
new_cfg.find("pg_size") != new_cfg.end() ||
new_cfg.find("parity_chunks") != new_cfg.end())
{
return "--ec can't be used with --pg_size, --parity_chunks or --scheme";
}
// pg_size = N+K
// parity_chunks = K
uint64_t data_chunks = 0, parity_chunks = 0;
char null_byte = 0;
int ret = sscanf(new_cfg["ec"].string_value().c_str(), "%ju+%ju%c", &data_chunks, &parity_chunks, &null_byte);
if (ret != 2 || !data_chunks || !parity_chunks)
{
return "--ec should be <N>+<K> format (<N>, <K> - numbers)";
}
new_cfg.erase("ec");
new_cfg["scheme"] = "ec";
new_cfg["pg_size"] = data_chunks+parity_chunks;
new_cfg["parity_chunks"] = parity_chunks;
}
if (old_cfg.is_null() && new_cfg["scheme"].string_value() == "")
{
// Default scheme
new_cfg["scheme"] = "replicated";
}
if (new_cfg.find("pg_minsize") == new_cfg.end() && (old_cfg.is_null() || new_cfg.find("pg_size") != new_cfg.end()))
{
// Default pg_minsize
if (new_cfg["scheme"] == "replicated")
{
// pg_minsize = (N+K > 2) ? 2 : 1
new_cfg["pg_minsize"] = new_cfg["pg_size"].uint64_value() > 2 ? 2 : 1;
}
else // ec or xor
{
// pg_minsize = (K > 1) ? N + 1 : N
new_cfg["pg_minsize"] = new_cfg["pg_size"].uint64_value() - new_cfg["parity_chunks"].uint64_value() +
(new_cfg["parity_chunks"].uint64_value() > 1 ? 1 : 0);
}
}
// Check integer values and unknown keys
for (auto kv_it = new_cfg.begin(); kv_it != new_cfg.end(); )
{
auto & key = kv_it->first;
auto & value = kv_it->second;
if (key == "pg_size" || key == "parity_chunks" || key == "pg_minsize" ||
key == "pg_count" || key == "max_osd_combinations" || key == "block_size" ||
key == "bitmap_granularity" || key == "pg_stripe_size")
{
if (value.is_number() && value.uint64_value() != value.number_value() ||
value.is_string() && !value.uint64_value() && value.string_value() != "0")
{
return key+" must be a non-negative integer";
}
value = value.uint64_value();
}
else if (key == "name" || key == "scheme" || key == "immediate_commit" ||
key == "failure_domain" || key == "root_node" || key == "scrub_interval" || key == "used_for_fs")
{
// OK
}
else if (key == "osd_tags" || key == "primary_affinity_tags")
{
if (value.is_string())
{
value = explode(",", value.string_value(), true);
}
}
else
{
// Unknown parameter
new_cfg.erase(kv_it++);
continue;
}
kv_it++;
}
// Merge with the old config
if (!old_cfg.is_null())
{
for (auto & kv: old_cfg.object_items())
{
if (new_cfg.find(kv.first) == new_cfg.end())
{
new_cfg[kv.first] = kv.second;
}
}
}
// Check after merging
if (new_cfg["scheme"] != "ec")
{
new_cfg.erase("parity_chunks");
}
if (new_cfg.find("used_for_fs") != new_cfg.end() && new_cfg["used_for_fs"].string_value() == "")
{
new_cfg.erase("used_for_fs");
}
// Prevent autovivification of object keys. Now we don't modify the config, we just check it
json11::Json cfg = new_cfg;
// Validate changes
if (!old_cfg.is_null() && !force)
{
if (old_cfg["scheme"] != cfg["scheme"])
{
return "Changing scheme for an existing pool will lead to data loss. Use --force to proceed";
}
if (etcd_state_client_t::parse_scheme(old_cfg["scheme"].string_value()) == POOL_SCHEME_EC)
{
uint64_t old_data_chunks = old_cfg["pg_size"].uint64_value() - old_cfg["parity_chunks"].uint64_value();
uint64_t new_data_chunks = cfg["pg_size"].uint64_value() - cfg["parity_chunks"].uint64_value();
if (old_data_chunks != new_data_chunks)
{
return "Changing EC data chunk count for an existing pool will lead to data loss. Use --force to proceed";
}
}
if (old_cfg["block_size"] != cfg["block_size"] ||
old_cfg["bitmap_granularity"] != cfg["bitmap_granularity"] ||
old_cfg["immediate_commit"] != cfg["immediate_commit"])
{
return "Changing block_size, bitmap_granularity or immediate_commit"
" for an existing pool will lead to incomplete PGs. Use --force to proceed";
}
if (old_cfg["pg_stripe_size"] != cfg["pg_stripe_size"])
{
return "Changing pg_stripe_size for an existing pool will lead to data loss. Use --force to proceed";
}
}
// Validate values
if (cfg["name"].string_value() == "")
{
return "Non-empty pool name is required";
}
// scheme
auto scheme = etcd_state_client_t::parse_scheme(cfg["scheme"].string_value());
if (!scheme)
{
return "Scheme must be one of \"replicated\", \"ec\" or \"xor\"";
}
// pg_size
auto pg_size = cfg["pg_size"].uint64_value();
if (!pg_size)
{
return "Non-zero PG size is required";
}
if (scheme != POOL_SCHEME_REPLICATED && pg_size < 3)
{
return "PG size can't be smaller than 3 for EC/XOR pools";
}
if (pg_size > 256)
{
return "PG size can't be greater than 256";
}
// parity_chunks
uint64_t parity_chunks = 1;
if (scheme == POOL_SCHEME_EC)
{
parity_chunks = cfg["parity_chunks"].uint64_value();
if (!parity_chunks)
{
return "Non-zero parity_chunks is required";
}
if (parity_chunks > pg_size-2)
{
return "parity_chunks can't be greater than "+std::to_string(pg_size-2)+" (PG size - 2)";
}
}
// pg_minsize
auto pg_minsize = cfg["pg_minsize"].uint64_value();
if (!pg_minsize)
{
return "Non-zero pg_minsize is required";
}
else if (pg_minsize > pg_size)
{
return "pg_minsize can't be greater than "+std::to_string(pg_size)+" (PG size)";
}
else if (scheme != POOL_SCHEME_REPLICATED && pg_minsize < pg_size-parity_chunks)
{
return "pg_minsize can't be smaller than "+std::to_string(pg_size-parity_chunks)+
" (pg_size - parity_chunks) for XOR/EC pool";
}
// pg_count
if (!cfg["pg_count"].uint64_value())
{
return "Non-zero pg_count is required";
}
// max_osd_combinations
if (!cfg["max_osd_combinations"].is_null() && cfg["max_osd_combinations"].uint64_value() < 100)
{
return "max_osd_combinations must be at least 100, but it is "+cfg["max_osd_combinations"].as_string();
}
// block_size
auto block_size = cfg["block_size"].uint64_value();
if (!cfg["block_size"].is_null() && ((block_size & (block_size-1)) ||
block_size < MIN_DATA_BLOCK_SIZE || block_size > MAX_DATA_BLOCK_SIZE))
{
return "block_size must be a power of two between "+std::to_string(MIN_DATA_BLOCK_SIZE)+
" and "+std::to_string(MAX_DATA_BLOCK_SIZE)+", but it is "+std::to_string(block_size);
}
block_size = (block_size ? block_size : global_block_size);
// bitmap_granularity
auto bitmap_granularity = cfg["bitmap_granularity"].uint64_value();
if (!cfg["bitmap_granularity"].is_null() && (!bitmap_granularity || (bitmap_granularity % 512)))
{
return "bitmap_granularity must be a multiple of 512, but it is "+std::to_string(bitmap_granularity);
}
bitmap_granularity = (bitmap_granularity ? bitmap_granularity : global_bitmap_granularity);
if (block_size % bitmap_granularity)
{
return "bitmap_granularity must divide data block size ("+std::to_string(block_size)+"), but it is "+std::to_string(bitmap_granularity);
}
// immediate_commit
if (!cfg["immediate_commit"].is_null() && !etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value()))
{
return "immediate_commit must be one of \"all\", \"small\", or \"none\", but it is "+cfg["immediate_commit"].as_string();
}
// scrub_interval
if (!cfg["scrub_interval"].is_null())
{
bool ok;
parse_time(cfg["scrub_interval"].string_value(), &ok);
if (!ok)
{
return "scrub_interval must be a time interval (number + unit s/m/h/d/M/y), but it is "+cfg["scrub_interval"].as_string();
}
}
return "";
}

10
src/cli_pool_cfg.h Normal file
View File

@@ -0,0 +1,10 @@
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#pragma once
#include "json11/json11.hpp"
#include <stdint.h>
std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old_cfg,
uint64_t global_block_size, uint64_t global_bitmap_granularity, bool force);

622
src/cli_pool_create.cpp Normal file
View File

@@ -0,0 +1,622 @@
// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
// I accept Vitastor CLA: see CLA-en.md for details
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include <ctype.h>
#include "cli.h"
#include "cli_pool_cfg.h"
#include "cluster_client.h"
#include "epoll_manager.h"
#include "pg_states.h"
#include "str_util.h"
struct pool_creator_t
{
cli_tool_t *parent;
json11::Json::object cfg;
bool force = false;
bool wait = false;
int state = 0;
cli_result_t result;
struct {
uint32_t retries = 5;
uint32_t interval = 0;
bool passed = false;
} create_check;
uint64_t new_id = 1;
uint64_t new_pools_mod_rev;
json11::Json state_node_tree;
json11::Json new_pools;
bool is_done() { return state == 100; }
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
else if (state == 3)
goto resume_3;
else if (state == 4)
goto resume_4;
else if (state == 5)
goto resume_5;
else if (state == 6)
goto resume_6;
else if (state == 7)
goto resume_7;
else if (state == 8)
goto resume_8;
// Validate pool parameters
result.text = validate_pool_config(cfg, json11::Json(), parent->cli->st_cli.global_block_size,
parent->cli->st_cli.global_bitmap_granularity, force);
if (result.text != "")
{
result.err = EINVAL;
state = 100;
return;
}
state = 1;
resume_1:
// If not forced, check that we have enough osds for pg_size
if (!force)
{
// Get node_placement configuration from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") },
} }
},
} },
});
state = 2;
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Get state_node_tree based on node_placement and osd peer states
{
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
state_node_tree = get_state_node_tree(kv.value.object_items());
}
// Skip tag checks, if pool has none
if (cfg["osd_tags"].array_items().size())
{
// Get osd configs (for tags) of osds in state_node_tree
{
json11::Json::array osd_configs;
for (auto osd_num: state_node_tree["osds"].array_items())
{
osd_configs.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/osd/"+osd_num.as_string()) },
} }
});
}
parent->etcd_txn(json11::Json::object { { "success", osd_configs, }, });
}
state = 3;
resume_3:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Filter out osds from state_node_tree based on pool/osd tags
{
std::vector<json11::Json> osd_configs;
for (auto & ocr: parent->etcd_result["responses"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
osd_configs.push_back(kv.value);
}
state_node_tree = filter_state_node_tree_by_tags(state_node_tree, osd_configs);
}
}
// Get stats (for block_size, bitmap_granularity, ...) of osds in state_node_tree
{
json11::Json::array osd_stats;
for (auto osd_num: state_node_tree["osds"].array_items())
{
osd_stats.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/"+osd_num.as_string()) },
} }
});
}
parent->etcd_txn(json11::Json::object { { "success", osd_stats, }, });
}
state = 4;
resume_4:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Filter osds from state_node_tree based on pool parameters and osd stats
{
std::vector<json11::Json> osd_stats;
for (auto & ocr: parent->etcd_result["responses"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
osd_stats.push_back(kv.value);
}
state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats);
}
// Check that pg_size <= max_pg_size
{
auto failure_domain = cfg["failure_domain"].string_value() == ""
? "host" : cfg["failure_domain"].string_value();
uint64_t max_pg_size = get_max_pg_size(state_node_tree["nodes"].object_items(),
failure_domain, cfg["root_node"].string_value());
if (cfg["pg_size"].uint64_value() > max_pg_size)
{
result = (cli_result_t){
.err = EINVAL,
.text =
"There are "+std::to_string(max_pg_size)+" \""+failure_domain+"\" failure domains with OSDs matching tags and"
" block_size/bitmap_granularity/immediate_commit parameters, but you want to create a"
" pool with "+cfg["pg_size"].as_string()+" OSDs from different failure domains in a PG."
" Change parameters or add --force if you want to create a degraded pool and add OSDs later."
};
state = 100;
return;
}
}
}
// Create pool
state = 5;
resume_5:
// Get pools from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
} }
},
} },
});
state = 6;
resume_6:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
{
// Add new pool
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
new_pools = create_pool(kv);
if (new_pools.is_string())
{
result = (cli_result_t){ .err = EEXIST, .text = new_pools.string_value() };
state = 100;
return;
}
new_pools_mod_rev = kv.mod_revision;
}
// Update pools in etcd
parent->etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "result", "LESS" },
{ "mod_revision", new_pools_mod_rev+1 },
}
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "value", base64_encode(new_pools.dump()) },
} },
},
} },
});
state = 7;
resume_7:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Perform final create-check
create_check.interval = parent->cli->config["mon_change_timeout"].uint64_value();
if (!create_check.interval)
create_check.interval = 1000;
state = 8;
resume_8:
if (parent->waiting > 0)
return;
// Unless forced, check that pool was created and is active
if (!wait)
{
create_check.passed = true;
}
else if (create_check.retries)
{
create_check.retries--;
parent->waiting++;
parent->epmgr->tfd->set_timer(create_check.interval, false, [this](int timer_id)
{
if (parent->cli->st_cli.pool_config.find(new_id) != parent->cli->st_cli.pool_config.end())
{
auto & pool_cfg = parent->cli->st_cli.pool_config[new_id];
create_check.passed = pool_cfg.real_pg_count > 0;
for (auto pg_it = pool_cfg.pg_config.begin(); pg_it != pool_cfg.pg_config.end(); pg_it++)
{
if (!(pg_it->second.cur_state & PG_ACTIVE))
{
create_check.passed = false;
break;
}
}
if (create_check.passed)
create_check.retries = 0;
}
parent->waiting--;
parent->ringloop->wakeup();
});
return;
}
if (!create_check.passed)
{
result = (cli_result_t) {
.err = EAGAIN,
.text = "Pool "+cfg["name"].string_value()+" was created, but failed to become active."
" This may indicate that cluster state has changed while the pool was being created."
" Please check the current state and adjust the pool configuration if necessary.",
};
}
else
{
result = (cli_result_t){
.err = 0,
.text = "Pool "+cfg["name"].string_value()+" created",
.data = new_pools[std::to_string(new_id)],
};
}
state = 100;
}
// Returns a JSON object of form {"nodes": {...}, "osds": [...]} that
// contains: all nodes (osds, hosts, ...) based on node_placement config
// and current peer state, and a list of active peer osds.
json11::Json get_state_node_tree(json11::Json::object node_placement)
{
// Erase non-peer osd nodes from node_placement
for (auto np_it = node_placement.begin(); np_it != node_placement.end();)
{
// Numeric nodes are osds
osd_num_t osd_num = stoull_full(np_it->first);
// If node is osd and it is not in peer states, erase it
if (osd_num > 0 &&
parent->cli->st_cli.peer_states.find(osd_num) == parent->cli->st_cli.peer_states.end())
{
node_placement.erase(np_it++);
}
else
np_it++;
}
// List of peer osds
std::vector<std::string> peer_osds;
// Record peer osds and add missing osds/hosts to np
for (auto & ps: parent->cli->st_cli.peer_states)
{
std::string osd_num = std::to_string(ps.first);
// Record peer osd
peer_osds.push_back(osd_num);
// Add osd, if necessary
if (node_placement.find(osd_num) == node_placement.end())
{
std::string osd_host = ps.second["host"].as_string();
// Add host, if necessary
if (node_placement.find(osd_host) == node_placement.end())
{
node_placement[osd_host] = json11::Json::object {
{ "level", "host" }
};
}
node_placement[osd_num] = json11::Json::object {
{ "parent", osd_host }
};
}
}
return json11::Json::object { { "osds", peer_osds }, { "nodes", node_placement } };
}
// Returns new state_node_tree based on given state_node_tree with osds
// filtered out by tags in given osd_configs and current pool config.
// Requires: state_node_tree["osds"] must match osd_configs 1-1
json11::Json filter_state_node_tree_by_tags(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_configs)
{
auto & osds = state_node_tree["osds"].array_items();
// Accepted state_node_tree nodes
auto accepted_nodes = state_node_tree["nodes"].object_items();
// List of accepted osds
std::vector<std::string> accepted_osds;
for (size_t i = 0; i < osd_configs.size(); i++)
{
auto & oc = osd_configs[i].object_items();
// Get osd number
auto osd_num = osds[i].as_string();
// We need tags in config to check against pool tags
if (oc.find("tags") == oc.end())
{
// Exclude osd from state_node_tree nodes
accepted_nodes.erase(osd_num);
continue;
}
else
{
// If all pool tags are in osd tags, accept osd
if (all_in_tags(osd_configs[i]["tags"], cfg["osd_tags"]))
{
accepted_osds.push_back(osd_num);
}
// Otherwise, exclude osd
else
{
// Exclude osd from state_node_tree nodes
accepted_nodes.erase(osd_num);
}
}
}
return json11::Json::object { { "osds", accepted_osds }, { "nodes", accepted_nodes } };
}
// Returns new state_node_tree based on given state_node_tree with osds
// filtered out by stats parameters (block_size, bitmap_granularity) in
// given osd_stats and current pool config.
// Requires: state_node_tree["osds"] must match osd_stats 1-1
json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_stats)
{
auto & osds = state_node_tree["osds"].array_items();
// Accepted state_node_tree nodes
auto accepted_nodes = state_node_tree["nodes"].object_items();
// List of accepted osds
std::vector<std::string> accepted_osds;
uint64_t p_block_size = cfg["block_size"].uint64_value()
? cfg["block_size"].uint64_value()
: parent->cli->st_cli.global_block_size;
uint64_t p_bitmap_granularity = cfg["bitmap_granularity"].uint64_value()
? cfg["bitmap_granularity"].uint64_value()
: parent->cli->st_cli.global_bitmap_granularity;
uint32_t p_immediate_commit = cfg["immediate_commit"].is_string()
? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value())
: parent->cli->st_cli.global_immediate_commit;
for (size_t i = 0; i < osd_stats.size(); i++)
{
auto & os = osd_stats[i];
// Get osd number
auto osd_num = osds[i].as_string();
if (!os["data_block_size"].is_null() && os["data_block_size"] != p_block_size ||
!os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != p_bitmap_granularity ||
!os["immediate_commit"].is_null() &&
etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value()) < p_immediate_commit)
{
accepted_nodes.erase(osd_num);
}
else
{
accepted_osds.push_back(osd_num);
}
}
return json11::Json::object { { "osds", accepted_osds }, { "nodes", accepted_nodes } };
}
// Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node
uint64_t get_max_pg_size(json11::Json::object node_tree, const std::string & level, const std::string & parent_node)
{
uint64_t max_pg_sz = 0;
std::vector<std::string> nodes;
// Check if parent node is an osd (numeric)
if (parent_node != "" && stoull_full(parent_node))
{
// Add it to node list if osd is in node tree
if (node_tree.find(parent_node) != node_tree.end())
nodes.push_back(parent_node);
}
// If parent node given, ...
else if (parent_node != "")
{
// ... look for children nodes of this parent
for (auto & sn: node_tree)
{
auto & props = sn.second.object_items();
auto parent_prop = props.find("parent");
if (parent_prop != props.end() && (parent_prop->second.as_string() == parent_node))
{
nodes.push_back(sn.first);
// If we're not looking for all osds, we only need a single
// child osd node
if (level != "osd" && stoull_full(sn.first))
break;
}
}
}
// No parent node given, and we're not looking for all osds
else if (level != "osd")
{
// ... look for all level nodes
for (auto & sn: node_tree)
{
auto & props = sn.second.object_items();
auto level_prop = props.find("level");
if (level_prop != props.end() && (level_prop->second.as_string() == level))
{
nodes.push_back(sn.first);
}
}
}
// Otherwise, ...
else
{
// ... we're looking for osd nodes only
for (auto & sn: node_tree)
{
if (stoull_full(sn.first))
{
nodes.push_back(sn.first);
}
}
}
// Process gathered nodes
for (auto & node: nodes)
{
// Check for osd node, return constant max size
if (stoull_full(node))
{
max_pg_sz += 1;
}
// Otherwise, ...
else
{
// ... exclude parent node from tree, and ...
node_tree.erase(parent_node);
// ... descend onto the resulting tree
max_pg_sz += get_max_pg_size(node_tree, level, node);
}
}
return max_pg_sz;
}
json11::Json create_pool(const etcd_kv_t & kv)
{
for (auto & p: kv.value.object_items())
{
// ID
uint64_t pool_id = stoull_full(p.first);
new_id = std::max(pool_id+1, new_id);
// Name
if (p.second["name"].string_value() == cfg["name"].string_value())
{
return "Pool with name \""+cfg["name"].string_value()+"\" already exists (ID "+std::to_string(pool_id)+")";
}
}
auto res = kv.value.object_items();
res[std::to_string(new_id)] = cfg;
return res;
}
// Checks whether tags2 tags are all in tags1 tags
bool all_in_tags(json11::Json tags1, json11::Json tags2)
{
if (!tags2.is_array())
{
tags2 = json11::Json::array{ tags2.string_value() };
}
if (!tags1.is_array())
{
tags1 = json11::Json::array{ tags1.string_value() };
}
for (auto & tag2: tags2.array_items())
{
bool found = false;
for (auto & tag1: tags1.array_items())
{
if (tag1 == tag2)
{
found = true;
break;
}
}
if (!found)
{
return false;
}
}
return true;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_create(json11::Json cfg)
{
auto pool_creator = new pool_creator_t();
pool_creator->parent = this;
pool_creator->cfg = cfg.object_items();
pool_creator->force = cfg["force"].bool_value();
pool_creator->wait = cfg["wait"].bool_value();
return [pool_creator](cli_result_t & result)
{
pool_creator->loop();
if (pool_creator->is_done())
{
result = pool_creator->result;
delete pool_creator;
return true;
}
return false;
};
}

723
src/cli_pool_ls.cpp Normal file
View File

@@ -0,0 +1,723 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <algorithm>
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
#include "pg_states.h"
// List pools with space statistics
// - df - minimal list with % used space
// - pool-ls - same but with PG state and recovery %
// - pool-ls -l - same but also include I/O statistics
// - pool-ls --detail - use list format, include PG states, I/O stats and all pool parameters
struct pool_lister_t
{
cli_tool_t *parent;
std::string sort_field;
std::set<std::string> only_names;
bool reverse = false;
int max_count = 0;
bool show_recovery = false;
bool show_stats = false;
bool detailed = false;
int state = 0;
cli_result_t result;
std::map<pool_id_t, json11::Json::object> pool_stats;
struct io_stats_t
{
uint64_t count = 0;
uint64_t read_iops = 0;
uint64_t read_bps = 0;
uint64_t read_lat = 0;
uint64_t write_iops = 0;
uint64_t write_bps = 0;
uint64_t write_lat = 0;
uint64_t delete_iops = 0;
uint64_t delete_bps = 0;
uint64_t delete_lat = 0;
};
struct object_counts_t
{
uint64_t object_count = 0;
uint64_t misplaced_count = 0;
uint64_t degraded_count = 0;
uint64_t incomplete_count = 0;
};
bool is_done()
{
return state == 100;
}
void get_pool_stats(int base_state)
{
if (state == base_state+1)
goto resume_1;
// Space statistics - pool/stats/<pool>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats0"
) },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats0"
) },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/config/pools"
) },
} },
},
} },
});
state = base_state+1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
auto space_info = parent->etcd_result;
auto config_pools = space_info["responses"][2]["response_range"]["kvs"][0];
if (!config_pools.is_null())
{
config_pools = parent->cli->st_cli.parse_etcd_kv(config_pools).value;
}
for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID
pool_id_t pool_id;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// pool/stats/<N>
pool_stats[pool_id] = kv.value.object_items();
}
std::map<pool_id_t, uint64_t> osd_free;
for (auto & kv_item: space_info["responses"][1]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// osd ID
osd_num_t osd_num;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%ju%c", &osd_num, &null_byte);
if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// osd/stats/<N>::free
osd_free[osd_num] = kv.value["free"].uint64_value();
}
// Calculate max_avail for each pool
for (auto & pp: parent->cli->st_cli.pool_config)
{
auto & pool_cfg = pp.second;
uint64_t pool_avail = UINT64_MAX;
std::map<osd_num_t, uint64_t> pg_per_osd;
bool active = pool_cfg.real_pg_count > 0;
uint64_t pg_states = 0;
for (auto & pgp: pool_cfg.pg_config)
{
if (!(pgp.second.cur_state & PG_ACTIVE))
{
active = false;
}
pg_states |= pgp.second.cur_state;
for (auto pg_osd: pgp.second.target_set)
{
if (pg_osd != 0)
{
pg_per_osd[pg_osd]++;
}
}
}
for (auto pg_per_pair: pg_per_osd)
{
uint64_t pg_free = osd_free[pg_per_pair.first] * pool_cfg.real_pg_count / pg_per_pair.second;
if (pool_avail > pg_free)
{
pool_avail = pg_free;
}
}
if (pool_avail == UINT64_MAX)
{
pool_avail = 0;
}
if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
{
pool_avail *= (pool_cfg.pg_size - pool_cfg.parity_chunks);
}
// incomplete > has_incomplete > degraded > has_degraded > has_misplaced
std::string status;
if (!active)
status = "inactive";
else if (pg_states & PG_INCOMPLETE)
status = "incomplete";
else if (pg_states & PG_HAS_INCOMPLETE)
status = "has_incomplete";
else if (pg_states & PG_DEGRADED)
status = "degraded";
else if (pg_states & PG_HAS_DEGRADED)
status = "has_degraded";
else if (pg_states & PG_HAS_MISPLACED)
status = "has_misplaced";
else
status = "active";
pool_stats[pool_cfg.id] = json11::Json::object {
{ "id", (uint64_t)pool_cfg.id },
{ "name", pool_cfg.name },
{ "status", status },
{ "pg_count", pool_cfg.pg_count },
{ "real_pg_count", pool_cfg.real_pg_count },
{ "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
: "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "max_available", pool_avail },
{ "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
{ "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
{ "pg_real_size", pool_stats[pool_cfg.id]["pg_real_size"].uint64_value() },
{ "osd_count", pg_per_osd.size() },
};
}
// Include full pool config
for (auto & pp: config_pools.object_items())
{
if (!pp.second.is_object())
{
continue;
}
auto pool_id = stoull_full(pp.first);
auto & st = pool_stats[pool_id];
for (auto & kv: pp.second.object_items())
{
if (st.find(kv.first) == st.end())
st[kv.first] = kv.second;
}
}
}
void get_pg_stats(int base_state)
{
if (state == base_state+1)
goto resume_1;
// Space statistics - pool/stats/<pool>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pg/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pg/stats0"
) },
} },
},
} },
});
state = base_state+1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
auto pg_stats = parent->etcd_result["responses"][0]["response_range"]["kvs"];
// Calculate recovery percent
std::map<pool_id_t, object_counts_t> counts;
for (auto & kv_item: pg_stats.array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID & pg number
pool_id_t pool_id;
pg_num_t pg_num = 0;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
"/pg/stats/%u/%u%c", &pool_id, &pg_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
auto & cnt = counts[pool_id];
cnt.object_count += kv.value["object_count"].uint64_value();
cnt.misplaced_count += kv.value["misplaced_count"].uint64_value();
cnt.degraded_count += kv.value["degraded_count"].uint64_value();
cnt.incomplete_count += kv.value["incomplete_count"].uint64_value();
}
for (auto & pp: pool_stats)
{
auto & cnt = counts[pp.first];
auto & st = pp.second;
st["object_count"] = cnt.object_count;
st["misplaced_count"] = cnt.misplaced_count;
st["degraded_count"] = cnt.degraded_count;
st["incomplete_count"] = cnt.incomplete_count;
}
}
void get_inode_stats(int base_state)
{
if (state == base_state+1)
goto resume_1;
// Space statistics - pool/stats/<pool>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/inode/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/inode/stats0"
) },
} },
},
} },
});
state = base_state+1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
auto inode_stats = parent->etcd_result["responses"][0]["response_range"]["kvs"];
// Performance statistics
std::map<pool_id_t, io_stats_t> pool_io;
for (auto & kv_item: inode_stats.array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID & inode number
pool_id_t pool_id;
inode_t only_inode_num;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
"/inode/stats/%u/%ju%c", &pool_id, &only_inode_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
auto & io = pool_io[pool_id];
io.read_iops += kv.value["read"]["iops"].uint64_value();
io.read_bps += kv.value["read"]["bps"].uint64_value();
io.read_lat += kv.value["read"]["lat"].uint64_value();
io.write_iops += kv.value["write"]["iops"].uint64_value();
io.write_bps += kv.value["write"]["bps"].uint64_value();
io.write_lat += kv.value["write"]["lat"].uint64_value();
io.delete_iops += kv.value["delete"]["iops"].uint64_value();
io.delete_bps += kv.value["delete"]["bps"].uint64_value();
io.delete_lat += kv.value["delete"]["lat"].uint64_value();
io.count++;
}
for (auto & pp: pool_stats)
{
auto & io = pool_io[pp.first];
if (io.count > 0)
{
io.read_lat /= io.count;
io.write_lat /= io.count;
io.delete_lat /= io.count;
}
auto & st = pp.second;
st["read_iops"] = io.read_iops;
st["read_bps"] = io.read_bps;
st["read_lat"] = io.read_lat;
st["write_iops"] = io.write_iops;
st["write_bps"] = io.write_bps;
st["write_lat"] = io.write_lat;
st["delete_iops"] = io.delete_iops;
st["delete_bps"] = io.delete_bps;
st["delete_lat"] = io.delete_lat;
}
}
json11::Json::array to_list()
{
json11::Json::array list;
for (auto & kv: pool_stats)
{
if (!only_names.size())
{
list.push_back(kv.second);
}
else
{
for (auto glob: only_names)
{
if (stupid_glob(kv.second["name"].string_value(), glob))
{
list.push_back(kv.second);
break;
}
}
}
}
if (sort_field == "name" || sort_field == "scheme" ||
sort_field == "scheme_name" || sort_field == "status")
{
std::sort(list.begin(), list.end(), [this](json11::Json a, json11::Json b)
{
auto av = a[sort_field].as_string();
auto bv = b[sort_field].as_string();
return reverse ? av > bv : av < bv;
});
}
else
{
std::sort(list.begin(), list.end(), [this](json11::Json a, json11::Json b)
{
auto av = a[sort_field].number_value();
auto bv = b[sort_field].number_value();
return reverse ? av > bv : av < bv;
});
}
if (max_count > 0 && list.size() > max_count)
{
list.resize(max_count);
}
return list;
}
void loop()
{
if (state == 1)
goto resume_1;
if (state == 2)
goto resume_2;
if (state == 3)
goto resume_3;
if (state == 100)
return;
show_stats = show_stats || detailed;
show_recovery = show_recovery || detailed;
resume_1:
get_pool_stats(0);
if (parent->waiting > 0)
return;
if (show_stats)
{
resume_2:
get_inode_stats(1);
if (parent->waiting > 0)
return;
}
if (show_recovery)
{
resume_3:
get_pg_stats(2);
if (parent->waiting > 0)
return;
}
if (parent->json_output)
{
// JSON output
result.data = to_list();
state = 100;
return;
}
json11::Json::array list;
for (auto & kv: pool_stats)
{
auto & st = kv.second;
double raw_to = st["raw_to_usable"].number_value();
if (raw_to < 0.000001 && raw_to > -0.000001)
raw_to = 1;
st["pg_count_fmt"] = st["real_pg_count"] == st["pg_count"]
? st["real_pg_count"].as_string()
: st["real_pg_count"].as_string()+"->"+st["pg_count"].as_string();
st["total_fmt"] = format_size(st["total_raw"].uint64_value() / raw_to);
st["used_fmt"] = format_size(st["used_raw"].uint64_value() / raw_to);
st["max_avail_fmt"] = format_size(st["max_available"].uint64_value());
st["used_pct"] = format_q(st["total_raw"].uint64_value()
? (100 - 100*st["max_available"].uint64_value() *
st["raw_to_usable"].number_value() / st["total_raw"].uint64_value())
: 100)+"%";
st["eff_fmt"] = format_q(st["space_efficiency"].number_value()*100)+"%";
if (show_stats)
{
st["read_bw"] = format_size(st["read_bps"].uint64_value())+"/s";
st["write_bw"] = format_size(st["write_bps"].uint64_value())+"/s";
st["delete_bw"] = format_size(st["delete_bps"].uint64_value())+"/s";
st["read_iops"] = format_q(st["read_iops"].number_value());
st["write_iops"] = format_q(st["write_iops"].number_value());
st["delete_iops"] = format_q(st["delete_iops"].number_value());
st["read_lat_f"] = format_lat(st["read_lat"].uint64_value());
st["write_lat_f"] = format_lat(st["write_lat"].uint64_value());
st["delete_lat_f"] = format_lat(st["delete_lat"].uint64_value());
}
if (show_recovery)
{
auto object_count = st["object_count"].uint64_value();
auto recovery_pct = 100.0 * (object_count - (st["misplaced_count"].uint64_value() +
st["degraded_count"].uint64_value() + st["incomplete_count"].uint64_value())) /
(object_count ? object_count : 1);
st["recovery_fmt"] = format_q(recovery_pct)+"%";
}
}
if (detailed)
{
for (auto & kv: pool_stats)
{
auto & st = kv.second;
auto total = st["object_count"].uint64_value();
auto obj_size = st["block_size"].uint64_value();
if (!obj_size)
obj_size = parent->cli->st_cli.global_block_size;
if (st["scheme"] == "ec")
obj_size *= st["pg_size"].uint64_value() - st["parity_chunks"].uint64_value();
else if (st["scheme"] == "xor")
obj_size *= st["pg_size"].uint64_value() - 1;
auto n = st["misplaced_count"].uint64_value();
if (n > 0)
st["misplaced_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
n = st["degraded_count"].uint64_value();
if (n > 0)
st["degraded_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
n = st["incomplete_count"].uint64_value();
if (n > 0)
st["incomplete_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
st["read_fmt"] = st["read_bw"].string_value()+", "+st["read_iops"].string_value()+" op/s, "+
st["read_lat_f"].string_value()+" lat";
st["write_fmt"] = st["write_bw"].string_value()+", "+st["write_iops"].string_value()+" op/s, "+
st["write_lat_f"].string_value()+" lat";
st["delete_fmt"] = st["delete_bw"].string_value()+", "+st["delete_iops"].string_value()+" op/s, "+
st["delete_lat_f"].string_value()+" lat";
if (st["scheme"] == "replicated")
st["scheme_name"] = "x"+st["pg_size"].as_string();
if (st["failure_domain"].string_value() == "")
st["failure_domain"] = "host";
st["osd_tags_fmt"] = implode(", ", st["osd_tags"]);
st["primary_affinity_tags_fmt"] = implode(", ", st["primary_affinity_tags"]);
if (st["block_size"].uint64_value())
st["block_size_fmt"] = format_size(st["block_size"].uint64_value());
if (st["bitmap_granularity"].uint64_value())
st["bitmap_granularity_fmt"] = format_size(st["bitmap_granularity"].uint64_value());
}
// All pool parameters are only displayed in the "detailed" mode
// because there's too many of them to show them in table
auto cols = std::vector<std::pair<std::string, std::string>>{
{ "name", "Name" },
{ "id", "ID" },
{ "scheme_name", "Scheme" },
{ "used_for_fs", "Used for VitastorFS" },
{ "status", "Status" },
{ "pg_count_fmt", "PGs" },
{ "pg_minsize", "PG minsize" },
{ "failure_domain", "Failure domain" },
{ "root_node", "Root node" },
{ "osd_tags_fmt", "OSD tags" },
{ "primary_affinity_tags_fmt", "Primary affinity" },
{ "block_size_fmt", "Block size" },
{ "bitmap_granularity_fmt", "Bitmap granularity" },
{ "immediate_commit", "Immediate commit" },
{ "scrub_interval", "Scrub interval" },
{ "inode_stats_fmt", "Per-inode stats" },
{ "pg_stripe_size", "PG stripe size" },
{ "max_osd_combinations", "Max OSD combinations" },
{ "total_fmt", "Total" },
{ "used_fmt", "Used" },
{ "max_avail_fmt", "Available" },
{ "used_pct", "Used%" },
{ "eff_fmt", "Efficiency" },
{ "osd_count", "OSD count" },
{ "misplaced_fmt", "Misplaced" },
{ "degraded_fmt", "Degraded" },
{ "incomplete_fmt", "Incomplete" },
{ "read_fmt", "Read" },
{ "write_fmt", "Write" },
{ "delete_fmt", "Delete" },
};
auto list = to_list();
size_t title_len = 0;
for (auto & item: list)
{
title_len = print_detail_title_len(item, cols, title_len);
}
for (auto & item: list)
{
if (result.text != "")
result.text += "\n";
result.text += print_detail(item, cols, title_len, parent->color);
}
state = 100;
return;
}
// Table output: name, scheme_name, pg_count, total, used, max_avail, used%, efficiency
json11::Json::array cols;
cols.push_back(json11::Json::object{
{ "key", "name" },
{ "title", "NAME" },
});
cols.push_back(json11::Json::object{
{ "key", "scheme_name" },
{ "title", "SCHEME" },
});
cols.push_back(json11::Json::object{
{ "key", "status" },
{ "title", "STATUS" },
});
cols.push_back(json11::Json::object{
{ "key", "pg_count_fmt" },
{ "title", "PGS" },
});
cols.push_back(json11::Json::object{
{ "key", "total_fmt" },
{ "title", "TOTAL" },
});
cols.push_back(json11::Json::object{
{ "key", "used_fmt" },
{ "title", "USED" },
});
cols.push_back(json11::Json::object{
{ "key", "max_avail_fmt" },
{ "title", "AVAILABLE" },
});
cols.push_back(json11::Json::object{
{ "key", "used_pct" },
{ "title", "USED%" },
});
cols.push_back(json11::Json::object{
{ "key", "eff_fmt" },
{ "title", "EFFICIENCY" },
});
if (show_recovery)
{
cols.push_back(json11::Json::object{ { "key", "recovery_fmt" }, { "title", "RECOVERY" } });
}
if (show_stats)
{
cols.push_back(json11::Json::object{ { "key", "read_bw" }, { "title", "READ" } });
cols.push_back(json11::Json::object{ { "key", "read_iops" }, { "title", "IOPS" } });
cols.push_back(json11::Json::object{ { "key", "read_lat_f" }, { "title", "LAT" } });
cols.push_back(json11::Json::object{ { "key", "write_bw" }, { "title", "WRITE" } });
cols.push_back(json11::Json::object{ { "key", "write_iops" }, { "title", "IOPS" } });
cols.push_back(json11::Json::object{ { "key", "write_lat_f" }, { "title", "LAT" } });
cols.push_back(json11::Json::object{ { "key", "delete_bw" }, { "title", "DELETE" } });
cols.push_back(json11::Json::object{ { "key", "delete_iops" }, { "title", "IOPS" } });
cols.push_back(json11::Json::object{ { "key", "delete_lat_f" }, { "title", "LAT" } });
}
result.data = to_list();
result.text = print_table(result.data, cols, parent->color);
state = 100;
}
};
size_t print_detail_title_len(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t prev_len)
{
size_t title_len = prev_len;
for (auto & kv: names)
{
if (!item[kv.first].is_null() && (!item[kv.first].is_string() || item[kv.first].string_value() != ""))
{
size_t len = utf8_length(kv.second);
title_len = title_len < len ? len : title_len;
}
}
return title_len;
}
std::string print_detail(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t title_len, bool use_esc)
{
std::string str;
for (auto & kv: names)
{
if (!item[kv.first].is_null() && (!item[kv.first].is_string() || item[kv.first].string_value() != ""))
{
str += kv.second;
str += ": ";
size_t len = utf8_length(kv.second);
for (int j = 0; j < title_len-len; j++)
str += ' ';
if (use_esc)
str += "\033[1m";
str += item[kv.first].as_string();
if (use_esc)
str += "\033[0m";
str += "\n";
}
}
return str;
}
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_ls(json11::Json cfg)
{
auto lister = new pool_lister_t();
lister->parent = this;
lister->show_recovery = cfg["show_recovery"].bool_value();
lister->show_stats = cfg["long"].bool_value();
lister->detailed = cfg["detail"].bool_value();
lister->sort_field = cfg["sort"].string_value();
if ((lister->sort_field == "osd_tags") ||
(lister->sort_field == "primary_affinity_tags" ))
lister->sort_field = lister->sort_field + "_fmt";
lister->reverse = cfg["reverse"].bool_value();
lister->max_count = cfg["count"].uint64_value();
for (auto & item: cfg["names"].array_items())
{
lister->only_names.insert(item.string_value());
}
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}
return false;
};
}
std::string implode(const std::string & sep, json11::Json array)
{
if (array.is_number() || array.is_bool() || array.is_string())
{
return array.as_string();
}
std::string res;
bool first = true;
for (auto & item: array.array_items())
{
res += (first ? item.as_string() : sep+item.as_string());
first = false;
}
return res;
}

203
src/cli_pool_modify.cpp Normal file
View File

@@ -0,0 +1,203 @@
// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
// I accept Vitastor CLA: see CLA-en.md for details
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include <ctype.h>
#include "cli.h"
#include "cli_pool_cfg.h"
#include "cluster_client.h"
#include "str_util.h"
struct pool_changer_t
{
cli_tool_t *parent;
// Required parameters (id/name)
pool_id_t pool_id = 0;
std::string pool_name;
json11::Json::object cfg;
json11::Json::object new_cfg;
bool force = false;
json11::Json old_cfg;
int state = 0;
cli_result_t result;
// Updated pools
json11::Json new_pools;
// Expected pools mod revision
uint64_t pools_mod_rev;
bool is_done() { return state == 100; }
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
pool_id = stoull_full(cfg["old_name"].string_value());
if (!pool_id)
{
pool_name = cfg["old_name"].string_value();
if (pool_name == "")
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool ID or name is required to modify it" };
state = 100;
return;
}
}
resume_0:
// Get pools from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
} }
},
} },
});
state = 1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
{
// Parse received pools from etcd
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
// Get pool by name or ID
old_cfg = json11::Json();
if (pool_name != "")
{
for (auto & pce: kv.value.object_items())
{
if (pce.second["name"] == pool_name)
{
pool_id = stoull_full(pce.first);
old_cfg = pce.second;
break;
}
}
}
else
{
pool_name = std::to_string(pool_id);
old_cfg = kv.value[pool_name];
}
if (!old_cfg.is_object())
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" does not exist" };
state = 100;
return;
}
// Update pool
new_cfg = cfg;
result.text = validate_pool_config(new_cfg, old_cfg, parent->cli->st_cli.global_block_size,
parent->cli->st_cli.global_bitmap_granularity, force);
if (result.text != "")
{
result.err = EINVAL;
state = 100;
return;
}
if (new_cfg.find("used_for_fs") != new_cfg.end() && !force)
{
// Check that pool doesn't have images
auto img_it = parent->cli->st_cli.inode_config.lower_bound(INODE_WITH_POOL(pool_id, 0));
if (img_it != parent->cli->st_cli.inode_config.end() && INODE_POOL(img_it->first) == pool_id &&
img_it->second.name == new_cfg["used_for_fs"].string_value())
{
// Only allow metadata image to exist in the FS pool
img_it++;
}
if (img_it != parent->cli->st_cli.inode_config.end() && INODE_POOL(img_it->first) == pool_id)
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" has block images, delete them before using it for VitastorFS" };
state = 100;
return;
}
}
// Update pool
auto pls = kv.value.object_items();
pls[std::to_string(pool_id)] = new_cfg;
new_pools = pls;
// Expected pools mod revision
pools_mod_rev = kv.mod_revision;
}
// Update pools in etcd
parent->etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "result", "LESS" },
{ "mod_revision", pools_mod_rev+1 },
}
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "value", base64_encode(new_pools.dump()) },
} },
},
} },
});
state = 2;
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
if (!parent->etcd_result["succeeded"].bool_value())
{
// CAS failure - retry
fprintf(stderr, "Warning: pool configuration was modified in the meantime by someone else\n");
goto resume_0;
}
// Successfully updated pool
result = (cli_result_t){
.err = 0,
.text = "Pool "+pool_name+" updated",
.data = new_pools,
};
state = 100;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_modify(json11::Json cfg)
{
auto pool_changer = new pool_changer_t();
pool_changer->parent = this;
pool_changer->cfg = cfg.object_items();
pool_changer->force = cfg["force"].bool_value();
return [pool_changer](cli_result_t & result)
{
pool_changer->loop();
if (pool_changer->is_done())
{
result = pool_changer->result;
delete pool_changer;
return true;
}
return false;
};
}

226
src/cli_pool_rm.cpp Normal file
View File

@@ -0,0 +1,226 @@
// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
// I accept Vitastor CLA: see CLA-en.md for details
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include <ctype.h>
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
struct pool_remover_t
{
cli_tool_t *parent;
// Required parameters (id/name)
pool_id_t pool_id = 0;
std::string pool_name;
// Force removal
bool force;
int state = 0;
cli_result_t result;
// Is pool valid?
bool pool_valid = false;
// Updated pools
json11::Json new_pools;
// Expected pools mod revision
uint64_t pools_mod_rev;
bool is_done() { return state == 100; }
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
else if (state == 3)
goto resume_3;
// Pool name (or id) required
if (!pool_id && pool_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Pool name or id must be given" };
state = 100;
return;
}
// Validate pool name/id
// Get pool id by name (if name given)
if (pool_name != "")
{
for (auto & ic: parent->cli->st_cli.pool_config)
{
if (ic.second.name == pool_name)
{
pool_id = ic.first;
pool_valid = 1;
break;
}
}
}
// Otherwise, check if given pool id is valid
else
{
// Set pool name from id (for easier logging)
pool_name = "id " + std::to_string(pool_id);
// Look-up pool id in pool_config
if (parent->cli->st_cli.pool_config.find(pool_id) != parent->cli->st_cli.pool_config.end())
{
pool_valid = 1;
}
}
// Need a valid pool to proceed
if (!pool_valid)
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" does not exist" };
state = 100;
return;
}
// Unless forced, check if pool has associated Images/Snapshots
if (!force)
{
std::string images;
for (auto & ic: parent->cli->st_cli.inode_config)
{
if (pool_id && INODE_POOL(ic.second.num) != pool_id)
{
continue;
}
images += ((images != "") ? ", " : "") + ic.second.name;
}
if (images != "")
{
result = (cli_result_t){
.err = ENOTEMPTY,
.text =
"Pool "+pool_name+" cannot be removed as it still has the following "
"images/snapshots associated with it: "+images
};
state = 100;
return;
}
}
// Proceed to deleting the pool
state = 1;
do
{
resume_1:
// Get pools from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
} }
},
} },
});
state = 2;
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
{
// Parse received pools from etcd
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
// Remove pool
auto p = kv.value.object_items();
if (p.erase(std::to_string(pool_id)) != 1)
{
result = (cli_result_t){
.err = ENOENT,
.text = "Failed to erase pool "+pool_name+" from: "+kv.value.string_value()
};
state = 100;
return;
}
// Record updated pools
new_pools = p;
// Expected pools mod revision
pools_mod_rev = kv.mod_revision;
}
// Update pools in etcd
parent->etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "result", "LESS" },
{ "mod_revision", pools_mod_rev+1 },
}
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "value", base64_encode(new_pools.dump()) },
} },
},
} },
});
state = 3;
resume_3:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
} while (!parent->etcd_result["succeeded"].bool_value());
// Successfully deleted pool
result = (cli_result_t){
.err = 0,
.text = "Pool "+pool_name+" deleted",
.data = new_pools
};
state = 100;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_rm(json11::Json cfg)
{
auto pool_remover = new pool_remover_t();
pool_remover->parent = this;
pool_remover->pool_id = cfg["pool"].uint64_value();
pool_remover->pool_name = pool_remover->pool_id ? "" : cfg["pool"].as_string();
pool_remover->force = !cfg["force"].is_null();
return [pool_remover](cli_result_t & result)
{
pool_remover->loop();
if (pool_remover->is_done())
{
result = pool_remover->result;
delete pool_remover;
return true;
}
return false;
};
}

View File

@@ -247,6 +247,7 @@ resume_8:
}
state = 100;
result = (cli_result_t){
.err = 0,
.text = "",
.data = my_result(result.data),
};

View File

@@ -6,7 +6,7 @@
#include "cluster_client_impl.h"
#include "http_client.h" // json_is_true
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config)
{
wb = new writeback_cache_t();
@@ -238,7 +238,8 @@ void cluster_client_t::erase_op(cluster_op_t *op)
// which may continue following SYNCs, but these SYNCs
// should know about the changed buffer state
// This is ugly but this is the way we do it
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
}
if (!(flags & OP_IMMEDIATE_COMMIT) || enable_writeback)
{
@@ -248,7 +249,8 @@ void cluster_client_t::erase_op(cluster_op_t *op)
{
// Call callback at the end to avoid inconsistencies in prev_wait
// if the callback adds more operations itself
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
}
if (flags & OP_FLUSH_BUFFER)
{
@@ -548,7 +550,8 @@ void cluster_client_t::execute(cluster_op_t *op)
op->opcode != OSD_OP_READ_BITMAP && op->opcode != OSD_OP_READ_CHAIN_BITMAP && op->opcode != OSD_OP_WRITE)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return;
}
if (!pgs_loaded)
@@ -570,7 +573,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
return;
}
if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) &&
!op->version /* FIXME no CAS writeback */)
!op->version /* no CAS writeback */)
{
if (wb->writebacks_active >= client_max_writeback_iodepth)
{
@@ -586,12 +589,13 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
wb->start_writebacks(this, 1);
}
op->retval = op->len;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return;
}
if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
{
if (!(op->flags & OP_FLUSH_BUFFER))
if (!(op->flags & OP_FLUSH_BUFFER) && !op->version /* no CAS write-repeat */)
{
wb->copy_write(op, CACHE_WRITTEN);
}
@@ -655,7 +659,8 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
if (!pool_id)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
auto pool_it = st_cli.pool_config.find(pool_id);
@@ -663,15 +668,17 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
{
// Pools are loaded, but this one is unknown
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
// Check alignment
if (!op->len && (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
if (!op->len && (op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
op->offset % pool_it->second.bitmap_granularity || op->len % pool_it->second.bitmap_granularity)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
if (pool_it->second.immediate_commit == IMMEDIATE_ALL)
@@ -684,7 +691,8 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
if (ino_it != st_cli.inode_config.end() && ino_it->second.readonly)
{
op->retval = -EROFS;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
}
@@ -1166,7 +1174,6 @@ static inline void mem_or(void *res, const void *r2, unsigned int len)
void cluster_client_t::handle_op_part(cluster_op_part_t *part)
{
cluster_op_t *op = part->parent;
op->inflight_count--;
int expected = part->op.req.hdr.opcode == OSD_OP_SYNC ? 0 : part->op.req.rw.len;
if (part->op.reply.hdr.retval != expected)
{
@@ -1189,7 +1196,7 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
);
}
}
else
else if (log_level > 0)
{
fprintf(
stderr, "%s operation failed on OSD %ju: retval=%jd (expected %d)\n",
@@ -1205,6 +1212,11 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
op->retry_after = op->retval == -EIO ? client_eio_retry_interval : client_retry_interval;
}
reset_retry_timer(op->retry_after);
if (stop_fd >= 0)
{
msgr.stop_client(stop_fd);
}
op->inflight_count--;
if (op->inflight_count == 0)
{
if (op->opcode == OSD_OP_SYNC)
@@ -1212,14 +1224,11 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
else
continue_rw(op);
}
if (stop_fd >= 0)
{
msgr.stop_client(stop_fd);
}
}
else
{
// OK
op->inflight_count--;
if ((op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE) && !(op->flags & OP_IMMEDIATE_COMMIT))
dirty_osds.insert(part->osd_num);
part->flags |= PART_DONE;

View File

@@ -123,7 +123,7 @@ public:
json11::Json::object cli_config, file_config, etcd_global_config;
json11::Json::object config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config);
~cluster_client_t();
void execute(cluster_op_t *op);
void execute_raw(osd_num_t osd_num, osd_op_t *op);

View File

@@ -16,11 +16,6 @@
void disk_tool_simple_offsets(json11::Json cfg, bool json_output)
{
std::string device = cfg["device"].string_value();
if (device == "")
{
fprintf(stderr, "Device path is missing\n");
exit(1);
}
uint64_t data_block_size = parse_size(cfg["object_size"].string_value());
uint64_t bitmap_granularity = parse_size(cfg["bitmap_granularity"].string_value());
uint64_t journal_size = parse_size(cfg["journal_size"].string_value());
@@ -57,6 +52,11 @@ void disk_tool_simple_offsets(json11::Json cfg, bool json_output)
uint64_t orig_device_size = device_size;
if (!device_size)
{
if (device == "")
{
fprintf(stderr, "Device path is missing\n");
exit(1);
}
struct stat st;
if (stat(device.c_str(), &st) < 0)
{

View File

@@ -132,9 +132,6 @@ void disk_tool_simple_offsets(json11::Json cfg, bool json_output);
uint64_t sscanf_json(const char *fmt, const json11::Json & str);
void fromhexstr(const std::string & from, int bytes, uint8_t *to);
std::string realpath_str(std::string path, bool nofail = true);
std::string read_all_fd(int fd);
std::string read_file(std::string file, bool allow_enoent = false);
int disable_cache(std::string dev);
std::string get_parent_device(std::string dev);
bool json_is_true(const json11::Json & val);

View File

@@ -42,36 +42,6 @@ void fromhexstr(const std::string & from, int bytes, uint8_t *to)
}
}
std::string realpath_str(std::string path, bool nofail)
{
char *p = realpath((char*)path.c_str(), NULL);
if (!p)
{
fprintf(stderr, "Failed to resolve %s: %s\n", path.c_str(), strerror(errno));
return nofail ? path : "";
}
std::string rp(p);
free(p);
return rp;
}
std::string read_file(std::string file, bool allow_enoent)
{
std::string res;
int fd = open(file.c_str(), O_RDONLY);
if (fd < 0 || (res = read_all_fd(fd)) == "")
{
int err = errno;
if (fd >= 0)
close(fd);
if (!allow_enoent || err != ENOENT)
fprintf(stderr, "Can't read %s: %s\n", file.c_str(), strerror(err));
return "";
}
close(fd);
return res;
}
// returns 1 = check error, 0 = write through, -1 = write back
// (similar to 1 = warning, -1 = error, 0 = success in disable_cache)
static int check_queue_cache(std::string dev, std::string parent_dev)

View File

@@ -101,7 +101,7 @@ void epoll_manager_t::handle_uring_event()
my_uring_prep_poll_add(sqe, epoll_fd, POLLIN);
data->callback = [this](ring_data_t *data)
{
if (data->res < 0)
if (data->res < 0 && data->res != -ECANCELED)
{
throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
}

View File

@@ -573,8 +573,7 @@ void etcd_state_client_t::load_global_config()
{
global_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
}
global_immediate_commit = global_config["immediate_commit"].string_value() == "all"
? IMMEDIATE_ALL : (global_config["immediate_commit"].string_value() == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE);
global_immediate_commit = parse_immediate_commit(global_config["immediate_commit"].string_value());
on_load_config_hook(global_config);
});
}
@@ -782,13 +781,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
// Failure Domain
pc.failure_domain = pool_item.second["failure_domain"].string_value();
// Coding Scheme
if (pool_item.second["scheme"] == "replicated")
pc.scheme = POOL_SCHEME_REPLICATED;
else if (pool_item.second["scheme"] == "xor")
pc.scheme = POOL_SCHEME_XOR;
else if (pool_item.second["scheme"] == "ec" || pool_item.second["scheme"] == "jerasure")
pc.scheme = POOL_SCHEME_EC;
else
pc.scheme = parse_scheme(pool_item.second["scheme"].string_value());
if (!pc.scheme)
{
fprintf(stderr, "Pool %u has invalid coding scheme (one of \"xor\", \"replicated\", \"ec\" or \"jerasure\" required), skipping pool\n", pool_id);
continue;
@@ -869,11 +863,11 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.scrub_interval = parse_time(pool_item.second["scrub_interval"].string_value());
if (!pc.scrub_interval)
pc.scrub_interval = 0;
// Mark pool as VitastorFS pool (disable per-inode stats and block volume creation)
pc.used_for_fs = pool_item.second["used_for_fs"].as_string();
// Immediate Commit Mode
pc.immediate_commit = pool_item.second["immediate_commit"].is_string()
? (pool_item.second["immediate_commit"].string_value() == "all"
? IMMEDIATE_ALL : (pool_item.second["immediate_commit"].string_value() == "small"
? IMMEDIATE_SMALL : IMMEDIATE_NONE))
? parse_immediate_commit(pool_item.second["immediate_commit"].string_value())
: global_immediate_commit;
// PG Stripe Size
pc.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
@@ -1167,6 +1161,23 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
}
uint32_t etcd_state_client_t::parse_immediate_commit(const std::string & immediate_commit_str)
{
return immediate_commit_str == "all" ? IMMEDIATE_ALL :
(immediate_commit_str == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE);
}
uint32_t etcd_state_client_t::parse_scheme(const std::string & scheme)
{
if (scheme == "replicated")
return POOL_SCHEME_REPLICATED;
else if (scheme == "xor")
return POOL_SCHEME_XOR;
else if (scheme == "ec" || scheme == "jerasure")
return POOL_SCHEME_EC;
return 0;
}
void etcd_state_client_t::insert_inode_config(const inode_config_t & cfg)
{
this->inode_config[cfg.num] = cfg;

View File

@@ -60,6 +60,7 @@ struct pool_config_t
uint64_t pg_stripe_size;
std::map<pg_num_t, pg_config_t> pg_config;
uint64_t scrub_interval;
std::string used_for_fs;
};
struct inode_config_t
@@ -151,4 +152,7 @@ public:
void close_watch(inode_watch_t* watch);
int address_count();
~etcd_state_client_t();
static uint32_t parse_immediate_commit(const std::string & immediate_commit_str);
static uint32_t parse_scheme(const std::string & scheme_str);
};

673
src/kv_cli.cpp Normal file
View File

@@ -0,0 +1,673 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database test CLI
#define _XOPEN_SOURCE
#include <limits.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "epoll_manager.h"
#include "str_util.h"
#include "kv_db.h"
const char *exe_name = NULL;
class kv_cli_t
{
public:
json11::Json::object cfg;
std::vector<std::string> cli_cmd;
kv_dbw_t *db = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
int load_parallelism = 16;
bool opened = false;
bool interactive = false, is_file = false;
int in_progress = 0;
char *cur_cmd = NULL;
int cur_cmd_size = 0, cur_cmd_alloc = 0;
bool finished = false, eof = false;
std::function<void(int)> load_cb;
bool loading_json = false, in_loadjson = false;
int load_state = 0;
std::string load_key;
~kv_cli_t();
void parse_args(int narg, const char *args[]);
void run();
void read_cmd();
void next_cmd();
std::vector<std::string> parse_cmd(const std::string & cmdstr);
void handle_cmd(const std::vector<std::string> & cmd, std::function<void(int)> cb);
void loadjson();
};
kv_cli_t::~kv_cli_t()
{
if (cur_cmd)
{
free(cur_cmd);
cur_cmd = NULL;
}
cur_cmd_alloc = 0;
if (db)
delete db;
if (cli)
{
cli->flush();
delete cli;
}
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
void kv_cli_t::parse_args(int narg, const char *args[])
{
bool db = false;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor Key/Value CLI\n"
"(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
"\n"
"USAGE: %s [OPTIONS] [<IMAGE> [<COMMAND>]]\n"
"\n"
"COMMANDS:\n"
" get <key>\n"
" set <key> <value>\n"
" del <key>\n"
" list [<start> [end]]\n"
" dump [<start> [end]]\n"
" dumpjson [<start> [end]]\n"
" loadjson\n"
"\n"
"<IMAGE> should be the name of Vitastor image with the DB.\n"
"Without <COMMAND>, you get an interactive DB shell.\n"
"\n"
"OPTIONS:\n"
" --kv_block_size 4k\n"
" Key-value B-Tree block size\n"
" --kv_memory_limit 128M\n"
" Maximum memory to use for vitastor-kv index cache\n"
" --kv_allocate_blocks 4\n"
" Number of PG blocks used for new tree block allocation in parallel\n"
" --kv_evict_max_misses 10\n"
" Eviction algorithm parameter: retry eviction from another random spot\n"
" if this number of keys is used currently or was used recently\n"
" --kv_evict_attempts_per_level 3\n"
" Retry eviction at most this number of times per tree level, starting\n"
" with bottom-most levels\n"
" --kv_evict_unused_age 1000\n"
" Evict only keys unused during this number of last operations\n"
" --kv_log_level 1\n"
" Log level. 0 = errors, 1 = warnings, 10 = trace operations\n"
,
exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
else if (!db)
{
cfg["db"] = args[i];
db = true;
}
else
{
cli_cmd.push_back(args[i]);
}
}
}
void kv_cli_t::run()
{
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
db = new kv_dbw_t(cli);
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Open if DB is set in options
if (cfg.find("db") != cfg.end())
{
bool done = false;
handle_cmd({ "open", cfg.at("db").string_value() }, [&done](int res) { if (res != 0) exit(1); done = true; });
while (!done)
{
ringloop->loop();
if (done)
break;
ringloop->wait();
}
}
// Run single command from CLI
if (cli_cmd.size())
{
bool done = false;
handle_cmd(cli_cmd, [&done](int res) { if (res != 0) exit(1); done = true; });
while (!done)
{
ringloop->loop();
if (done)
break;
ringloop->wait();
}
}
else
{
// Run interactive shell
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
try
{
epmgr->tfd->set_fd_handler(0, false, [this](int fd, int events)
{
if (events & EPOLLIN)
{
read_cmd();
}
if (events & EPOLLRDHUP)
{
epmgr->tfd->set_fd_handler(0, false, NULL);
finished = true;
}
});
interactive = isatty(0);
if (interactive)
printf("> ");
}
catch (std::exception & e)
{
// Can't add to epoll, STDIN is probably a file
is_file = true;
read_cmd();
}
while (!finished)
{
ringloop->loop();
if (!finished)
ringloop->wait();
}
}
// Destroy the client
delete db;
db = NULL;
cli->flush();
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
void kv_cli_t::read_cmd()
{
if (!cur_cmd_alloc)
{
cur_cmd_alloc = 65536;
cur_cmd = (char*)malloc_or_die(cur_cmd_alloc);
}
while (cur_cmd_size < cur_cmd_alloc)
{
int r = read(0, cur_cmd+cur_cmd_size, cur_cmd_alloc-cur_cmd_size);
if (r < 0 && errno != EAGAIN)
fprintf(stderr, "Error reading from stdin: %s\n", strerror(errno));
if (r > 0)
cur_cmd_size += r;
if (r == 0)
eof = true;
if (r <= 0)
break;
}
next_cmd();
}
void kv_cli_t::next_cmd()
{
if (loading_json)
{
loadjson();
return;
}
if (in_progress > 0)
{
return;
}
int pos = 0;
for (; pos < cur_cmd_size; pos++)
{
if (cur_cmd[pos] == '\n' || cur_cmd[pos] == '\r')
{
auto cmd = trim(std::string(cur_cmd, pos));
pos++;
memmove(cur_cmd, cur_cmd+pos, cur_cmd_size-pos);
cur_cmd_size -= pos;
in_progress++;
handle_cmd(parse_cmd(cmd), [this](int res)
{
in_progress--;
if (interactive)
printf("> ");
next_cmd();
if (!in_progress)
read_cmd();
});
break;
}
}
if (eof && !in_progress)
{
finished = true;
}
}
struct kv_cli_list_t
{
kv_dbw_t *db = NULL;
void *handle = NULL;
int format = 0;
int n = 0;
std::function<void(int)> cb;
};
std::vector<std::string> kv_cli_t::parse_cmd(const std::string & str)
{
std::vector<std::string> res;
size_t pos = 0;
auto cmd = scan_escaped(str, pos);
if (cmd.empty())
return res;
res.push_back(cmd);
int max_args = (cmd == "set" || cmd == "config" ||
cmd == "list" || cmd == "dump" || cmd == "dumpjson" ? 3 :
(cmd == "open" || cmd == "get" || cmd == "del" ? 2 : 1));
while (pos < str.size() && res.size() < max_args)
{
if (res.size() == max_args-1)
{
// Allow unquoted last argument
pos = str.find_first_not_of(" \t\r\n", pos);
if (pos == std::string::npos)
break;
if (str[pos] != '"' && str[pos] != '\'')
{
res.push_back(trim(str.substr(pos)));
break;
}
}
auto arg = scan_escaped(str, pos);
if (arg.size())
res.push_back(arg);
}
return res;
}
void kv_cli_t::loadjson()
{
// simple streaming json parser
if (in_progress >= load_parallelism || in_loadjson)
{
return;
}
in_loadjson = true;
if (load_state == 5)
{
st_5:
if (!in_progress)
{
loading_json = false;
auto cb = std::move(load_cb);
cb(0);
}
in_loadjson = false;
return;
}
do
{
read_cmd();
size_t pos = 0;
while (true)
{
while (pos < cur_cmd_size && is_white(cur_cmd[pos]))
{
pos++;
}
if (pos >= cur_cmd_size)
{
break;
}
if (load_state == 0 || load_state == 2)
{
char expected = "{ :"[load_state];
if (cur_cmd[pos] != expected)
{
fprintf(stderr, "Unexpected %c, expected %c\n", cur_cmd[pos], expected);
exit(1);
}
pos++;
load_state++;
}
else if (load_state == 1 || load_state == 3)
{
if (cur_cmd[pos] != '"')
{
fprintf(stderr, "Unexpected %c, expected \"\n", cur_cmd[pos]);
exit(1);
}
size_t prev = pos;
auto str = scan_escaped(cur_cmd, cur_cmd_size, pos, false);
if (pos == prev)
{
break;
}
load_state++;
if (load_state == 2)
{
load_key = str;
}
else
{
in_progress++;
handle_cmd({ "set", load_key, str }, [this](int res)
{
in_progress--;
next_cmd();
});
if (in_progress >= load_parallelism)
{
break;
}
}
}
else if (load_state == 4)
{
if (cur_cmd[pos] == ',')
{
pos++;
load_state = 1;
}
else if (cur_cmd[pos] == '}')
{
pos++;
load_state = 5;
goto st_5;
}
else
{
fprintf(stderr, "Unexpected %c, expected , or }\n", cur_cmd[pos]);
exit(1);
}
}
}
if (pos < cur_cmd_size)
{
memmove(cur_cmd, cur_cmd+pos, cur_cmd_size-pos);
}
cur_cmd_size -= pos;
} while (loading_json && is_file);
in_loadjson = false;
}
void kv_cli_t::handle_cmd(const std::vector<std::string> & cmd, std::function<void(int)> cb)
{
if (!cmd.size())
{
cb(-EINVAL);
return;
}
auto & opname = cmd[0];
if (!opened && opname != "open" && opname != "config" && opname != "quit" && opname != "q")
{
fprintf(stderr, "Error: database not opened\n");
cb(-EINVAL);
return;
}
if (opname == "open")
{
auto name = cmd.size() > 1 ? cmd[1] : "";
uint64_t pool_id = 0;
inode_t inode_id = 0;
int scanned = sscanf(name.c_str(), "%lu %lu", &pool_id, &inode_id);
if (scanned < 2 || !pool_id || !inode_id)
{
inode_id = 0;
name = trim(name);
for (auto & ic: cli->st_cli.inode_config)
{
if (ic.second.name == name)
{
inode_id = ic.first;
break;
}
}
if (!inode_id)
{
fprintf(stderr, "Usage: open <image> OR open <pool_id> <inode_id>\n");
cb(-EINVAL);
return;
}
}
else
inode_id = INODE_WITH_POOL(pool_id, inode_id);
db->open(inode_id, cfg, [=](int res)
{
if (res < 0)
{
fprintf(stderr, "Error opening index: %s (code %d)\n", strerror(-res), res);
}
else
{
opened = true;
fprintf(interactive ? stdout : stderr, "Index opened. Current size: %lu bytes\n", db->get_size());
}
cb(res);
});
}
else if (opname == "config")
{
if (cmd.size() < 3)
{
fprintf(stderr, "Usage: config <property> <value>\n");
cb(-EINVAL);
return;
}
auto & key = cmd[1];
auto & value = cmd[2];
if (key != "kv_memory_limit" &&
key != "kv_allocate_blocks" &&
key != "kv_evict_max_misses" &&
key != "kv_evict_attempts_per_level" &&
key != "kv_evict_unused_age" &&
key != "kv_log_level" &&
key != "kv_block_size")
{
fprintf(
stderr, "Allowed properties: kv_block_size, kv_memory_limit, kv_allocate_blocks,"
" kv_evict_max_misses, kv_evict_attempts_per_level, kv_evict_unused_age, kv_log_level\n"
);
cb(-EINVAL);
}
else if (key == "kv_block_size")
{
if (opened)
{
fprintf(stderr, "kv_block_size can't be set after opening DB\n");
cb(-EINVAL);
}
else
{
cfg[key] = value;
cb(0);
}
}
else
{
cfg[key] = value;
db->set_config(cfg);
cb(0);
}
}
else if (opname == "get" || opname == "set" || opname == "del")
{
if (opname == "get" || opname == "del")
{
if (cmd.size() < 2)
{
fprintf(stderr, "Usage: %s <key>\n", opname.c_str());
cb(-EINVAL);
return;
}
auto & key = cmd[1];
if (opname == "get")
{
db->get(key, [this, cb](int res, const std::string & value)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
{
if (write(1, value.c_str(), value.size()) < 0 || write(1, "\n", 1) < 0)
exit(1);
}
cb(res);
});
}
else
{
db->del(key, [this, cb](int res)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
fprintf(interactive ? stdout : stderr, "OK\n");
cb(res);
});
}
}
else
{
if (cmd.size() < 3)
{
fprintf(stderr, "Usage: set <key> <value>\n");
cb(-EINVAL);
return;
}
auto & key = cmd[1];
auto & value = cmd[2];
db->set(key, value, [this, cb, l = loading_json](int res)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else if (!l)
fprintf(interactive ? stdout : stderr, "OK\n");
cb(res);
});
}
}
else if (opname == "list" || opname == "dump" || opname == "dumpjson")
{
kv_cli_list_t *lst = new kv_cli_list_t;
std::string start = cmd.size() >= 2 ? cmd[1] : "";
std::string end = cmd.size() >= 3 ? cmd[2] : "";
lst->handle = db->list_start(start);
lst->db = db;
lst->format = opname == "dump" ? 1 : (opname == "dumpjson" ? 2 : 0);
lst->cb = std::move(cb);
db->list_next(lst->handle, [lst](int res, const std::string & key, const std::string & value)
{
if (res < 0)
{
if (res != -ENOENT)
{
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
}
if (lst->format == 2)
printf("\n}\n");
lst->db->list_close(lst->handle);
lst->cb(res == -ENOENT ? 0 : res);
delete lst;
}
else
{
if (lst->format == 2)
printf(lst->n ? ",\n %s: %s" : "{\n %s: %s", addslashes(key).c_str(), addslashes(value).c_str());
else if (lst->format == 1)
printf("set %s %s\n", auto_addslashes(key).c_str(), value.c_str());
else
printf("%s = %s\n", key.c_str(), value.c_str());
lst->n++;
lst->db->list_next(lst->handle, NULL);
}
});
}
else if (opname == "loadjson")
{
loading_json = true;
load_state = 0;
load_cb = cb;
loadjson();
}
else if (opname == "close")
{
db->close([=]()
{
fprintf(interactive ? stdout : stderr, "Index closed\n");
opened = false;
cb(0);
});
}
else if (opname == "quit" || opname == "q")
{
::close(0);
finished = true;
}
else
{
fprintf(
stderr, "Unknown operation: %s. Supported operations:\n"
"open <image>\nopen <pool_id> <inode_id>\n"
"config <property> <value>\n"
"get <key>\nset <key> <value>\ndel <key>\n"
"list [<start> [end]]\ndump [<start> [end]]\ndumpjson [<start> [end]]\nloadjson\n"
"close\nquit\n", opname.c_str()
);
cb(-EINVAL);
}
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
kv_cli_t *p = new kv_cli_t();
p->parse_args(narg, args);
p->run();
delete p;
return 0;
}

2075
src/kv_db.cpp Normal file

File diff suppressed because it is too large Load Diff

36
src/kv_db.h Normal file
View File

@@ -0,0 +1,36 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database
// Parallel optimistic B-Tree O:-)
#pragma once
#include "cluster_client.h"
struct kv_db_t;
struct kv_dbw_t
{
kv_dbw_t(cluster_client_t *cli);
~kv_dbw_t();
void open(inode_t inode_id, json11::Json cfg, std::function<void(int)> cb);
void set_config(json11::Json cfg);
void close(std::function<void()> cb);
uint64_t get_size();
void get(const std::string & key, std::function<void(int res, const std::string & value)> cb,
bool allow_old_cached = false);
void set(const std::string & key, const std::string & value, std::function<void(int res)> cb,
std::function<bool(int res, const std::string & value)> cas_compare = NULL);
void del(const std::string & key, std::function<void(int res)> cb,
std::function<bool(int res, const std::string & value)> cas_compare = NULL);
void* list_start(const std::string & start);
void list_next(void *handle, std::function<void(int res, const std::string & key, const std::string & value)> cb);
void list_close(void *handle);
kv_db_t *db;
};

701
src/kv_stress.cpp Normal file
View File

@@ -0,0 +1,701 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database stress tester / benchmark
#define _XOPEN_SOURCE
#include <limits.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "epoll_manager.h"
#include "str_util.h"
#include "kv_db.h"
const char *exe_name = NULL;
struct kv_test_listing_t
{
uint64_t count = 0, done = 0;
void *handle = NULL;
std::string next_after;
std::set<std::string> inflights;
timespec tv_begin;
bool error = false;
};
struct kv_test_lat_t
{
const char *name = NULL;
uint64_t usec = 0, count = 0;
};
struct kv_test_stat_t
{
kv_test_lat_t get, add, update, del, list;
uint64_t list_keys = 0;
};
class kv_test_t
{
public:
// Config
json11::Json::object kv_cfg;
std::string key_prefix, key_suffix;
uint64_t inode_id = 0;
uint64_t op_count = 1000000;
uint64_t runtime_sec = 0;
uint64_t parallelism = 4;
uint64_t reopen_prob = 1;
uint64_t get_prob = 30000;
uint64_t add_prob = 20000;
uint64_t update_prob = 20000;
uint64_t del_prob = 5000;
uint64_t list_prob = 300;
uint64_t min_key_len = 10;
uint64_t max_key_len = 70;
uint64_t min_value_len = 50;
uint64_t max_value_len = 300;
uint64_t min_list_count = 10;
uint64_t max_list_count = 1000;
uint64_t print_stats_interval = 1;
bool json_output = false;
uint64_t log_level = 1;
bool trace = false;
bool stop_on_error = false;
// FIXME: Multiple clients
kv_test_stat_t stat, prev_stat;
timespec prev_stat_time, start_stat_time;
// State
kv_dbw_t *db = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
ring_consumer_t consumer;
bool finished = false;
uint64_t total_prob = 0;
uint64_t ops_sent = 0, ops_done = 0;
int stat_timer_id = -1;
int in_progress = 0;
bool reopening = false;
std::set<kv_test_listing_t*> listings;
std::set<std::string> changing_keys;
std::map<std::string, std::string> values;
~kv_test_t();
static json11::Json::object parse_args(int narg, const char *args[]);
void parse_config(json11::Json cfg);
void run(json11::Json cfg);
void loop();
void print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time);
void print_total_stats();
void start_change(const std::string & key);
void stop_change(const std::string & key);
void add_stat(kv_test_lat_t & stat, timespec tv_begin);
};
kv_test_t::~kv_test_t()
{
if (db)
delete db;
if (cli)
{
cli->flush();
delete cli;
}
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
json11::Json::object kv_test_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor Key/Value DB stress tester / benchmark\n"
"(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
"\n"
"USAGE: %s --pool_id POOL_ID --inode_id INODE_ID [OPTIONS]\n"
" --op_count 1000000\n"
" Total operations to run during test. 0 means unlimited\n"
" --key_prefix \"\"\n"
" Prefix for all keys read or written (to avoid collisions)\n"
" --key_suffix \"\"\n"
" Suffix for all keys read or written (to avoid collisions, but scan all DB)\n"
" --runtime 0\n"
" Run for this number of seconds. 0 means unlimited\n"
" --parallelism 4\n"
" Run this number of operations in parallel\n"
" --get_prob 30000\n"
" Fraction of key retrieve operations\n"
" --add_prob 20000\n"
" Fraction of key addition operations\n"
" --update_prob 20000\n"
" Fraction of key update operations\n"
" --del_prob 30000\n"
" Fraction of key delete operations\n"
" --list_prob 300\n"
" Fraction of listing operations\n"
" --reopen_prob 1\n"
" Fraction of database reopens\n"
" --min_key_len 10\n"
" Minimum key size in bytes\n"
" --max_key_len 70\n"
" Maximum key size in bytes\n"
" --min_value_len 50\n"
" Minimum value size in bytes\n"
" --max_value_len 300\n"
" Maximum value size in bytes\n"
" --min_list_count 10\n"
" Minimum number of keys read in listing (0 = all keys)\n"
" --max_list_count 1000\n"
" Maximum number of keys read in listing\n"
" --print_stats 1\n"
" Print operation statistics every this number of seconds\n"
" --json\n"
" JSON output\n"
" --stop_on_error 0\n"
" Stop on first execution error, mismatch, lost key or extra key during listing\n"
" --kv_block_size 4k\n"
" Key-value B-Tree block size\n"
" --kv_memory_limit 128M\n"
" Maximum memory to use for vitastor-kv index cache\n"
" --kv_allocate_blocks 4\n"
" Number of PG blocks used for new tree block allocation in parallel\n"
" --kv_evict_max_misses 10\n"
" Eviction algorithm parameter: retry eviction from another random spot\n"
" if this number of keys is used currently or was used recently\n"
" --kv_evict_attempts_per_level 3\n"
" Retry eviction at most this number of times per tree level, starting\n"
" with bottom-most levels\n"
" --kv_evict_unused_age 1000\n"
" Evict only keys unused during this number of last operations\n"
" --kv_log_level 1\n"
" Log level. 0 = errors, 1 = warnings, 10 = trace operations\n",
exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
}
return cfg;
}
void kv_test_t::parse_config(json11::Json cfg)
{
inode_id = INODE_WITH_POOL(cfg["pool_id"].uint64_value(), cfg["inode_id"].uint64_value());
if (cfg["op_count"].uint64_value() > 0)
op_count = cfg["op_count"].uint64_value();
key_prefix = cfg["key_prefix"].string_value();
key_suffix = cfg["key_suffix"].string_value();
if (cfg["runtime"].uint64_value() > 0)
runtime_sec = cfg["runtime"].uint64_value();
if (cfg["parallelism"].uint64_value() > 0)
parallelism = cfg["parallelism"].uint64_value();
if (!cfg["reopen_prob"].is_null())
reopen_prob = cfg["reopen_prob"].uint64_value();
if (!cfg["get_prob"].is_null())
get_prob = cfg["get_prob"].uint64_value();
if (!cfg["add_prob"].is_null())
add_prob = cfg["add_prob"].uint64_value();
if (!cfg["update_prob"].is_null())
update_prob = cfg["update_prob"].uint64_value();
if (!cfg["del_prob"].is_null())
del_prob = cfg["del_prob"].uint64_value();
if (!cfg["list_prob"].is_null())
list_prob = cfg["list_prob"].uint64_value();
if (!cfg["min_key_len"].is_null())
min_key_len = cfg["min_key_len"].uint64_value();
if (cfg["max_key_len"].uint64_value() > 0)
max_key_len = cfg["max_key_len"].uint64_value();
if (!cfg["min_value_len"].is_null())
min_value_len = cfg["min_value_len"].uint64_value();
if (cfg["max_value_len"].uint64_value() > 0)
max_value_len = cfg["max_value_len"].uint64_value();
if (!cfg["min_list_count"].is_null())
min_list_count = cfg["min_list_count"].uint64_value();
if (!cfg["max_list_count"].is_null())
max_list_count = cfg["max_list_count"].uint64_value();
if (!cfg["print_stats"].is_null())
print_stats_interval = cfg["print_stats"].uint64_value();
if (!cfg["json"].is_null())
json_output = true;
if (!cfg["stop_on_error"].is_null())
stop_on_error = cfg["stop_on_error"].bool_value();
if (!cfg["kv_block_size"].is_null())
kv_cfg["kv_block_size"] = cfg["kv_block_size"];
if (!cfg["kv_memory_limit"].is_null())
kv_cfg["kv_memory_limit"] = cfg["kv_memory_limit"];
if (!cfg["kv_allocate_blocks"].is_null())
kv_cfg["kv_allocate_blocks"] = cfg["kv_allocate_blocks"];
if (!cfg["kv_evict_max_misses"].is_null())
kv_cfg["kv_evict_max_misses"] = cfg["kv_evict_max_misses"];
if (!cfg["kv_evict_attempts_per_level"].is_null())
kv_cfg["kv_evict_attempts_per_level"] = cfg["kv_evict_attempts_per_level"];
if (!cfg["kv_evict_unused_age"].is_null())
kv_cfg["kv_evict_unused_age"] = cfg["kv_evict_unused_age"];
if (!cfg["kv_log_level"].is_null())
{
log_level = cfg["kv_log_level"].uint64_value();
trace = log_level >= 10;
kv_cfg["kv_log_level"] = cfg["kv_log_level"];
}
total_prob = reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob;
stat.get.name = "get";
stat.add.name = "add";
stat.update.name = "update";
stat.del.name = "del";
stat.list.name = "list";
}
void kv_test_t::run(json11::Json cfg)
{
srand48(time(NULL));
parse_config(cfg);
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
db = new kv_dbw_t(cli);
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Run
reopening = true;
db->open(inode_id, kv_cfg, [this](int res)
{
reopening = false;
if (res < 0)
{
fprintf(stderr, "ERROR: Open index: %d (%s)\n", res, strerror(-res));
exit(1);
}
if (trace)
printf("Index opened\n");
ringloop->wakeup();
});
consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&consumer);
if (print_stats_interval)
stat_timer_id = epmgr->tfd->set_timer(print_stats_interval*1000, true, [this](int) { print_stats(prev_stat, prev_stat_time); });
clock_gettime(CLOCK_REALTIME, &start_stat_time);
prev_stat_time = start_stat_time;
while (!finished)
{
ringloop->loop();
if (!finished)
ringloop->wait();
}
if (stat_timer_id >= 0)
epmgr->tfd->clear_timer(stat_timer_id);
ringloop->unregister_consumer(&consumer);
// Print total stats
print_total_stats();
// Destroy the client
delete db;
db = NULL;
cli->flush();
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
static const char *base64_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@+/";
std::string random_str(int len)
{
std::string str;
str.resize(len);
for (int i = 0; i < len; i++)
{
str[i] = base64_chars[lrand48() % 64];
}
return str;
}
void kv_test_t::loop()
{
if (reopening)
{
return;
}
if (ops_done >= op_count)
{
finished = true;
}
while (!finished && ops_sent < op_count && in_progress < parallelism)
{
uint64_t dice = (lrand48() % total_prob);
if (dice < reopen_prob)
{
reopening = true;
db->close([this]()
{
if (trace)
printf("Index closed\n");
db->open(inode_id, kv_cfg, [this](int res)
{
reopening = false;
if (res < 0)
{
fprintf(stderr, "ERROR: Reopen index: %d (%s)\n", res, strerror(-res));
finished = true;
return;
}
if (trace)
printf("Index reopened\n");
ringloop->wakeup();
});
});
return;
}
else if (dice < reopen_prob+get_prob)
{
// get existing
auto key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
if (changing_keys.find(key) != changing_keys.end())
continue;
in_progress++;
ops_sent++;
if (trace)
printf("get %s\n", key.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->get(key, [this, key, tv_begin](int res, const std::string & value)
{
add_stat(stat.get, tv_begin);
ops_done++;
in_progress--;
auto it = values.find(key);
if (res != (it == values.end() ? -ENOENT : 0))
{
fprintf(stderr, "ERROR: get %s: %d (%s)\n", key.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else if (it != values.end() && value != it->second)
{
fprintf(stderr, "ERROR: get %s: mismatch: %s vs %s\n", key.c_str(), value.c_str(), it->second.c_str());
if (stop_on_error)
exit(1);
}
ringloop->wakeup();
});
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob)
{
bool is_add = false;
std::string key;
if (dice < reopen_prob+get_prob+add_prob)
{
// add
is_add = true;
uint64_t key_len = min_key_len + (max_key_len > min_key_len ? lrand48() % (max_key_len-min_key_len) : 0);
key = key_prefix + random_str(key_len) + key_suffix;
}
else
{
// update
key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
}
if (changing_keys.find(key) != changing_keys.end())
continue;
uint64_t value_len = min_value_len + (max_value_len > min_value_len ? lrand48() % (max_value_len-min_value_len) : 0);
auto value = random_str(value_len);
start_change(key);
ops_sent++;
in_progress++;
if (trace)
printf("set %s = %s\n", key.c_str(), value.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->set(key, value, [this, key, value, tv_begin, is_add](int res)
{
add_stat(is_add ? stat.add : stat.update, tv_begin);
stop_change(key);
ops_done++;
in_progress--;
if (res != 0)
{
fprintf(stderr, "ERROR: set %s = %s: %d (%s)\n", key.c_str(), value.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else
{
values[key] = value;
}
ringloop->wakeup();
}, NULL);
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob)
{
// delete
auto key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
if (changing_keys.find(key) != changing_keys.end())
continue;
start_change(key);
ops_sent++;
in_progress++;
if (trace)
printf("del %s\n", key.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->del(key, [this, key, tv_begin](int res)
{
add_stat(stat.del, tv_begin);
stop_change(key);
ops_done++;
in_progress--;
if (res != 0)
{
fprintf(stderr, "ERROR: del %s: %d (%s)\n", key.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else
{
values.erase(key);
}
ringloop->wakeup();
}, NULL);
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob)
{
// list
ops_sent++;
in_progress++;
auto key = random_str(max_key_len);
auto lst = new kv_test_listing_t;
auto k_it = values.lower_bound(key);
lst->count = min_list_count + (max_list_count > min_list_count ? lrand48() % (max_list_count-min_list_count) : 0);
lst->handle = db->list_start(k_it == values.begin() ? key_prefix : key);
lst->next_after = k_it == values.begin() ? key_prefix : key;
lst->inflights = changing_keys;
listings.insert(lst);
if (trace)
printf("list from %s\n", key.c_str());
clock_gettime(CLOCK_REALTIME, &lst->tv_begin);
db->list_next(lst->handle, [this, lst](int res, const std::string & key, const std::string & value)
{
if (log_level >= 11)
printf("list: %s = %s\n", key.c_str(), value.c_str());
if (res >= 0 && key_prefix.size() && (key.size() < key_prefix.size() ||
key.substr(0, key_prefix.size()) != key_prefix))
{
// stop at this key
res = -ENOENT;
}
if (res < 0 || (lst->count > 0 && lst->done >= lst->count))
{
add_stat(stat.list, lst->tv_begin);
if (res == 0)
{
// ok (done >= count)
}
else if (res != -ENOENT)
{
fprintf(stderr, "ERROR: list: %d (%s)\n", res, strerror(-res));
lst->error = true;
}
else
{
auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
while (k_it != values.end())
{
while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
k_it++;
if (k_it != values.end())
{
fprintf(stderr, "ERROR: list: missing key %s\n", (k_it++)->first.c_str());
lst->error = true;
}
}
}
if (lst->error && stop_on_error)
exit(1);
ops_done++;
in_progress--;
db->list_close(lst->handle);
delete lst;
listings.erase(lst);
ringloop->wakeup();
}
else
{
stat.list_keys++;
// Do not check modified keys in listing
// Listing may return their old or new state
if ((!key_suffix.size() || key.size() >= key_suffix.size() &&
key.substr(key.size()-key_suffix.size()) == key_suffix) &&
lst->inflights.find(key) == lst->inflights.end())
{
lst->done++;
auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
while (true)
{
while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
{
k_it++;
}
if (k_it == values.end() || k_it->first > key)
{
fprintf(stderr, "ERROR: list: extra key %s\n", key.c_str());
lst->error = true;
break;
}
else if (k_it->first < key)
{
fprintf(stderr, "ERROR: list: missing key %s\n", k_it->first.c_str());
lst->error = true;
lst->next_after = k_it->first;
k_it++;
}
else
{
if (k_it->second != value)
{
fprintf(stderr, "ERROR: list: mismatch: %s = %s but should be %s\n",
key.c_str(), value.c_str(), k_it->second.c_str());
lst->error = true;
}
lst->next_after = k_it->first;
break;
}
}
}
db->list_next(lst->handle, NULL);
}
});
}
}
}
void kv_test_t::add_stat(kv_test_lat_t & stat, timespec tv_begin)
{
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
int64_t usec = (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
if (usec > 0)
stat.usec += usec;
stat.count++;
}
void kv_test_t::print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time)
{
timespec cur_stat_time;
clock_gettime(CLOCK_REALTIME, &cur_stat_time);
int64_t usec = (cur_stat_time.tv_sec - prev_stat_time.tv_sec)*1000000 +
(cur_stat_time.tv_nsec - prev_stat_time.tv_nsec)/1000;
if (usec > 0)
{
kv_test_lat_t *lats[] = { &stat.get, &stat.add, &stat.update, &stat.del, &stat.list };
kv_test_lat_t *prev[] = { &prev_stat.get, &prev_stat.add, &prev_stat.update, &prev_stat.del, &prev_stat.list };
if (!json_output)
{
char buf[128] = { 0 };
for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
{
snprintf(buf, sizeof(buf)-1, "%.1f %s/s (%lu us)", (lats[i]->count-prev[i]->count)*1000000.0/usec,
lats[i]->name, (lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count > 0 ? lats[i]->count-prev[i]->count : 1));
int k;
for (k = strlen(buf); k < strlen(lats[i]->name)+21; k++)
buf[k] = ' ';
buf[k] = 0;
printf("%s", buf);
}
printf("\n");
}
else
{
int64_t runtime = (cur_stat_time.tv_sec - start_stat_time.tv_sec)*1000000 +
(cur_stat_time.tv_nsec - start_stat_time.tv_nsec)/1000;
printf("{\"runtime\":%.1f", (double)runtime/1000000.0);
for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
{
if (lats[i]->count > prev[i]->count)
{
printf(
",\"%s\":{\"avg\":{\"iops\":%.1f,\"usec\":%lu},\"total\":{\"count\":%lu,\"usec\":%lu}}",
lats[i]->name, (lats[i]->count-prev[i]->count)*1000000.0/usec,
(lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count),
lats[i]->count, lats[i]->usec
);
}
}
printf("}\n");
}
}
prev_stat = stat;
prev_stat_time = cur_stat_time;
}
void kv_test_t::print_total_stats()
{
if (!json_output)
printf("Total:\n");
kv_test_stat_t start_stats;
timespec start_stat_time = this->start_stat_time;
print_stats(start_stats, start_stat_time);
}
void kv_test_t::start_change(const std::string & key)
{
changing_keys.insert(key);
for (auto lst: listings)
{
lst->inflights.insert(key);
}
}
void kv_test_t::stop_change(const std::string & key)
{
changing_keys.erase(key);
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
kv_test_t *p = new kv_test_t();
p->run(kv_test_t::parse_args(narg, args));
delete p;
return 0;
}

View File

@@ -146,7 +146,7 @@ public:
" Note that nbd_timeout, nbd_max_devices and nbd_max_part options may also be specified\n"
" in /etc/vitastor/vitastor.conf or in other configuration file specified with --config_file.\n"
" --logfile /path/to/log/file.txt\n"
" Wite log messages to the specified file instead of dropping them (in background mode)\n"
" Write log messages to the specified file instead of dropping them (in background mode)\n"
" or printing them to the standard output (in foreground mode).\n"
" --dev_num N\n"
" Use the specified device /dev/nbdN instead of automatic selection.\n"
@@ -298,7 +298,7 @@ public:
}
}
}
if (cfg["logfile"].is_string())
if (cfg["logfile"].string_value() != "")
{
logfile = cfg["logfile"].string_value();
}

View File

@@ -1,23 +1,18 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS connection handler for NFS proxy
// NFS proxy over Vitastor block images
#include <sys/time.h>
#include "str_util.h"
#include "nfs_proxy.h"
#include "nfs_common.h"
#include "nfs_block.h"
#include "nfs/nfs.h"
#include "cli.h"
#define TRUE 1
#define FALSE 0
#define MAX_REQUEST_SIZE 128*1024*1024
static unsigned len_pad4(unsigned len)
{
return len + (len&3 ? 4-(len&3) : 0);
@@ -28,10 +23,10 @@ static std::string get_inode_name(nfs_client_t *self, diropargs3 & what)
// Get name
std::string dirhash = what.dir;
std::string dir;
if (dirhash != "roothandle")
if (dirhash != NFS_ROOT_HANDLE)
{
auto dir_it = self->parent->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end())
auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
if (dir_it != self->parent->blockfs->dir_by_hash.end())
dir = dir_it->second;
else
return "";
@@ -39,27 +34,12 @@ static std::string get_inode_name(nfs_client_t *self, diropargs3 & what)
std::string name = what.name;
return (dir.size()
? dir+"/"+name
: self->parent->name_prefix+name);
}
static nfsstat3 vitastor_nfs_map_err(int err)
{
return (err == EINVAL ? NFS3ERR_INVAL
: (err == ENOENT ? NFS3ERR_NOENT
: (err == ENOSPC ? NFS3ERR_NOSPC
: (err == EEXIST ? NFS3ERR_EXIST
: (err == EIO ? NFS3ERR_IO : (err ? NFS3ERR_IO : NFS3_OK))))));
}
static int nfs3_null_proc(void *opaque, rpc_op_t *rop)
{
rpc_queue_reply(rop);
return 0;
: self->parent->blockfs->name_prefix+name);
}
static fattr3 get_dir_attributes(nfs_client_t *self, std::string dir)
{
auto & dinf = self->parent->dir_info.at(dir);
auto & dinf = self->parent->blockfs->dir_info.at(dir);
return (fattr3){
.type = NF3DIR,
.mode = 0755,
@@ -108,7 +88,7 @@ static fattr3 get_file_attributes(nfs_client_t *self, inode_t inode_num)
};
}
static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
GETATTR3args *args = (GETATTR3args*)rop->request;
@@ -116,12 +96,12 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
bool is_dir = false;
std::string dirhash = args->object;
std::string dir;
if (args->object == "roothandle")
if (args->object == NFS_ROOT_HANDLE)
is_dir = true;
else
{
auto dir_it = self->parent->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end())
auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
if (dir_it != self->parent->blockfs->dir_by_hash.end())
{
is_dir = true;
dir = dir_it->second;
@@ -140,8 +120,8 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
else
{
uint64_t inode_num = 0;
auto inode_num_it = self->parent->inode_by_hash.find(dirhash);
if (inode_num_it != self->parent->inode_by_hash.end())
auto inode_num_it = self->parent->blockfs->inode_by_hash.find(dirhash);
if (inode_num_it != self->parent->blockfs->inode_by_hash.end())
inode_num = inode_num_it->second;
auto inode_it = self->parent->cli->st_cli.inode_config.find(inode_num);
if (inode_num && inode_it != self->parent->cli->st_cli.inode_config.end())
@@ -179,16 +159,16 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
SETATTR3args *args = (SETATTR3args*)rop->request;
SETATTR3res *reply = (SETATTR3res*)rop->reply;
std::string handle = args->object;
auto ino_it = self->parent->inode_by_hash.find(handle);
if (ino_it == self->parent->inode_by_hash.end())
auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
if (ino_it == self->parent->blockfs->inode_by_hash.end())
{
if (handle == "roothandle" || self->parent->dir_by_hash.find(handle) != self->parent->dir_by_hash.end())
if (handle == NFS_ROOT_HANDLE || self->parent->blockfs->dir_by_hash.find(handle) != self->parent->blockfs->dir_by_hash.end())
{
if (args->new_attributes.size.set_it)
{
@@ -228,7 +208,7 @@ static int nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
LOOKUP3args *args = (LOOKUP3args*)rop->request;
@@ -255,8 +235,8 @@ static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
return 0;
}
}
auto dir_it = self->parent->dir_info.find(full_name);
if (dir_it != self->parent->dir_info.end())
auto dir_it = self->parent->blockfs->dir_info.find(full_name);
if (dir_it != self->parent->blockfs->dir_info.end())
{
*reply = (LOOKUP3res){
.status = NFS3_OK,
@@ -277,7 +257,7 @@ static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_access_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_access_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
ACCESS3args *args = (ACCESS3args*)rop->request;
@@ -292,7 +272,7 @@ static int nfs3_access_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//READLINK3args *args = (READLINK3args*)rop->request;
@@ -303,14 +283,14 @@ static int nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_read_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_read_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
READ3args *args = (READ3args*)rop->request;
READ3res *reply = (READ3res*)rop->reply;
std::string handle = args->file;
auto ino_it = self->parent->inode_by_hash.find(handle);
if (ino_it == self->parent->inode_by_hash.end())
auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
if (ino_it == self->parent->blockfs->inode_by_hash.end())
{
*reply = (READ3res){ .status = NFS3ERR_NOENT };
rpc_queue_reply(rop);
@@ -367,14 +347,14 @@ static int nfs3_read_proc(void *opaque, rpc_op_t *rop)
static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t new_size, uint64_t offset, uint64_t count, void *buf);
static int nfs3_write_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_write_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
WRITE3args *args = (WRITE3args*)rop->request;
WRITE3res *reply = (WRITE3res*)rop->reply;
std::string handle = args->file;
auto ino_it = self->parent->inode_by_hash.find(handle);
if (ino_it == self->parent->inode_by_hash.end())
auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
if (ino_it == self->parent->blockfs->inode_by_hash.end())
{
*reply = (WRITE3res){ .status = NFS3ERR_NOENT };
rpc_queue_reply(rop);
@@ -480,8 +460,8 @@ static void complete_extend_write(nfs_client_t *self, rpc_op_t *rop, inode_t ino
static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size, int err)
{
auto ext_it = self->extend_writes.lower_bound((extend_size_t){ .inode = inode, .new_size = 0 });
while (ext_it != self->extend_writes.end() &&
auto ext_it = self->parent->blockfs->extend_writes.lower_bound((extend_size_t){ .inode = inode, .new_size = 0 });
while (ext_it != self->parent->blockfs->extend_writes.end() &&
ext_it->first.inode == inode &&
ext_it->first.new_size <= new_size)
{
@@ -490,7 +470,7 @@ static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t n
{
complete_extend_write(self, ext_it->second.rop, inode, ext_it->second.write_res < 0
? ext_it->second.write_res : ext_it->second.resize_res);
self->extend_writes.erase(ext_it++);
self->parent->blockfs->extend_writes.erase(ext_it++);
}
else
ext_it++;
@@ -500,7 +480,7 @@ static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t n
static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size)
{
// Send an extend request
auto & ext = self->extends[inode];
auto & ext = self->parent->blockfs->extends[inode];
ext.cur_extend = new_size;
auto inode_it = self->parent->cli->st_cli.inode_config.find(inode);
if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
@@ -514,10 +494,10 @@ static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size)
{ "force_size", true },
}), [=](const cli_result_t & r)
{
auto & ext = self->extends[inode];
auto & ext = self->parent->blockfs->extends[inode];
if (r.err)
{
fprintf(stderr, "Error extending inode %ju to %ju bytes: %s\n", inode, new_size, r.text.c_str());
fprintf(stderr, "Error extending inode %lu to %lu bytes: %s\n", inode, new_size, r.text.c_str());
}
if (r.err == EAGAIN || ext.next_extend > ext.cur_extend)
{
@@ -548,7 +528,7 @@ static void nfs_do_write(nfs_client_t *self, std::multimap<extend_size_t, extend
{
auto inode = op->inode;
int write_res = op->retval < 0 ? op->retval : (op->retval != op->len ? -ERANGE : 0);
if (ewr_it == self->extend_writes.end())
if (ewr_it == self->parent->blockfs->extend_writes.end())
{
complete_extend_write(self, rop, inode, write_res);
}
@@ -558,7 +538,7 @@ static void nfs_do_write(nfs_client_t *self, std::multimap<extend_size_t, extend
if (ewr_it->second.resize_res <= 0)
{
complete_extend_write(self, rop, inode, write_res < 0 ? write_res : ewr_it->second.resize_res);
self->extend_writes.erase(ewr_it);
self->parent->blockfs->extend_writes.erase(ewr_it);
}
}
};
@@ -572,7 +552,7 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
inode_it->second.size < new_size)
{
auto ewr_it = self->extend_writes.emplace((extend_size_t){
auto ewr_it = self->parent->blockfs->extend_writes.emplace((extend_size_t){
.inode = inode,
.new_size = new_size,
}, (extend_write_t){
@@ -580,7 +560,7 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
.resize_res = 1,
.write_res = 1,
});
auto & ext = self->extends[inode];
auto & ext = self->parent->blockfs->extends[inode];
if (ext.cur_extend > 0)
{
// Already resizing, just wait
@@ -595,11 +575,11 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
}
else
{
nfs_do_write(self, self->extend_writes.end(), rop, inode, offset, count, buf);
nfs_do_write(self, self->parent->blockfs->extend_writes.end(), rop, inode, offset, count, buf);
}
}
static int nfs3_create_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_create_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
CREATE3args *args = (CREATE3args*)rop->request;
@@ -650,7 +630,7 @@ static int nfs3_create_proc(void *opaque, rpc_op_t *rop)
return 1;
}
static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
MKDIR3args *args = (MKDIR3args*)rop->request;
@@ -669,19 +649,19 @@ static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
rpc_queue_reply(rop);
return 0;
}
auto dir_id_it = self->parent->dir_info.find(full_name);
if (dir_id_it != self->parent->dir_info.end())
auto dir_id_it = self->parent->blockfs->dir_info.find(full_name);
if (dir_id_it != self->parent->blockfs->dir_info.end())
{
*reply = (MKDIR3res){ .status = NFS3ERR_EXIST };
rpc_queue_reply(rop);
return 0;
}
// FIXME: Persist empty directories in some etcd keys, like /vitastor/dir/...
self->parent->dir_info[full_name] = (nfs_dir_t){
.id = self->parent->next_dir_id++,
self->parent->blockfs->dir_info[full_name] = (nfs_dir_t){
.id = self->parent->blockfs->next_dir_id++,
.mod_rev = 0,
};
self->parent->dir_by_hash["S"+base64_encode(sha256(full_name))] = full_name;
self->parent->blockfs->dir_by_hash["S"+base64_encode(sha256(full_name))] = full_name;
*reply = (MKDIR3res){
.status = NFS3_OK,
.resok = (MKDIR3resok){
@@ -700,7 +680,7 @@ static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
{
// nfs_client_t *self = (nfs_client_t*)opaque;
// SYMLINK3args *args = (SYMLINK3args*)rop->request;
@@ -711,7 +691,7 @@ static int nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
{
// nfs_client_t *self = (nfs_client_t*)opaque;
// MKNOD3args *args = (MKNOD3args*)rop->request;
@@ -722,7 +702,7 @@ static int nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static int nfs3_remove_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_remove_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
REMOVE3res *reply = (REMOVE3res*)rop->reply;
@@ -752,7 +732,7 @@ static int nfs3_remove_proc(void *opaque, rpc_op_t *rop)
return 1;
}
static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
RMDIR3args *args = (RMDIR3args*)rop->request;
@@ -764,8 +744,8 @@ static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
rpc_queue_reply(rop);
return 0;
}
auto dir_it = self->parent->dir_info.find(full_name);
if (dir_it == self->parent->dir_info.end())
auto dir_it = self->parent->blockfs->dir_info.find(full_name);
if (dir_it == self->parent->blockfs->dir_info.end())
{
*reply = (RMDIR3res){ .status = NFS3ERR_NOENT };
rpc_queue_reply(rop);
@@ -781,8 +761,8 @@ static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
return 0;
}
}
self->parent->dir_by_hash.erase("S"+base64_encode(sha256(full_name)));
self->parent->dir_info.erase(dir_it);
self->parent->blockfs->dir_by_hash.erase("S"+base64_encode(sha256(full_name)));
self->parent->blockfs->dir_info.erase(dir_it);
*reply = (RMDIR3res){ .status = NFS3_OK };
rpc_queue_reply(rop);
return 0;
@@ -811,12 +791,12 @@ static int continue_dir_rename(nfs_dir_rename_state *rename_st)
if (!rename_st->items.size())
{
// old dir
auto old_info = self->parent->dir_info.at(rename_st->old_name);
self->parent->dir_info.erase(rename_st->old_name);
self->parent->dir_by_hash.erase("S"+base64_encode(sha256(rename_st->old_name)));
auto old_info = self->parent->blockfs->dir_info.at(rename_st->old_name);
self->parent->blockfs->dir_info.erase(rename_st->old_name);
self->parent->blockfs->dir_by_hash.erase("S"+base64_encode(sha256(rename_st->old_name)));
// new dir
self->parent->dir_info[rename_st->new_name] = old_info;
self->parent->dir_by_hash["S"+base64_encode(sha256(rename_st->new_name))] = rename_st->new_name;
self->parent->blockfs->dir_info[rename_st->new_name] = old_info;
self->parent->blockfs->dir_by_hash["S"+base64_encode(sha256(rename_st->new_name))] = rename_st->new_name;
RENAME3res *reply = (RENAME3res*)rename_st->rop->reply;
*reply = (RENAME3res){
.status = NFS3_OK,
@@ -853,7 +833,7 @@ static int continue_dir_rename(nfs_dir_rename_state *rename_st)
static void nfs_do_rename(nfs_client_t *self, rpc_op_t *rop, std::string old_name, std::string new_name);
static int nfs3_rename_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_rename_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
RENAME3args *args = (RENAME3args*)rop->request;
@@ -866,8 +846,8 @@ static int nfs3_rename_proc(void *opaque, rpc_op_t *rop)
rpc_queue_reply(rop);
return 0;
}
bool old_is_dir = self->parent->dir_info.find(old_name) != self->parent->dir_info.end();
bool new_is_dir = self->parent->dir_info.find(new_name) != self->parent->dir_info.end();
bool old_is_dir = self->parent->blockfs->dir_info.find(old_name) != self->parent->blockfs->dir_info.end();
bool new_is_dir = self->parent->blockfs->dir_info.find(new_name) != self->parent->blockfs->dir_info.end();
bool old_is_file = false, new_is_file = false;
for (auto & ic: self->parent->cli->st_cli.inode_config)
{
@@ -948,7 +928,7 @@ static void nfs_do_rename(nfs_client_t *self, rpc_op_t *rop, std::string old_nam
});
}
static int nfs3_link_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_link_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//LINK3args *args = (LINK3args*)rop->request;
@@ -962,7 +942,7 @@ static int nfs3_link_proc(void *opaque, rpc_op_t *rop)
static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop,
std::map<std::string, nfs_dir_t>::iterator dir_id_it, struct entryplus3 *entry, bool is_plus)
{
if (dir_id_it == self->parent->dir_info.end())
if (dir_id_it == self->parent->blockfs->dir_info.end())
{
return;
}
@@ -980,7 +960,7 @@ static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop,
}
}
static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
static void block_nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{
nfs_client_t *self = (nfs_client_t*)opaque;
READDIRPLUS3args plus_args;
@@ -999,13 +979,13 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
}
std::string dirhash = args->dir;
std::string dir;
if (dirhash != "roothandle")
if (dirhash != NFS_ROOT_HANDLE)
{
auto dir_it = self->parent->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end())
auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
if (dir_it != self->parent->blockfs->dir_by_hash.end())
dir = dir_it->second;
}
std::string prefix = dir.size() ? dir+"/" : self->parent->name_prefix;
std::string prefix = dir.size() ? dir+"/" : self->parent->blockfs->name_prefix;
std::map<std::string, struct entryplus3> entries;
for (auto & ic: self->parent->cli->st_cli.inode_config)
{
@@ -1043,12 +1023,12 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
}
else
{
// skip directories, they will be added from dir_info
// skip directories, they will be added from blockfs->dir_info
}
}
// Add directories from dir_info
for (auto dir_id_it = self->parent->dir_info.lower_bound(prefix);
dir_id_it != self->parent->dir_info.end(); dir_id_it++)
// Add directories from blockfs->dir_info
for (auto dir_id_it = self->parent->blockfs->dir_info.lower_bound(prefix);
dir_id_it != self->parent->blockfs->dir_info.end(); dir_id_it++)
{
if (prefix != "" && dir_id_it->first.substr(0, prefix.size()) != prefix)
break;
@@ -1061,12 +1041,12 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
}
// Add . and ..
{
auto dir_id_it = self->parent->dir_info.find(dir);
auto dir_id_it = self->parent->blockfs->dir_info.find(dir);
fill_dir_entry(self, rop, dir_id_it, &entries["."], is_plus);
auto sl = dir.rfind("/");
if (sl != std::string::npos)
{
auto dir_id_it = self->parent->dir_info.find(dir.substr(0, sl));
auto dir_id_it = self->parent->blockfs->dir_info.find(dir.substr(0, sl));
fill_dir_entry(self, rop, dir_id_it, &entries[".."], is_plus);
}
}
@@ -1147,7 +1127,7 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{
READDIRPLUS3res *reply = (READDIRPLUS3res*)rop->reply;
*reply = { .status = NFS3_OK };
*(uint64_t*)(reply->resok.cookieverf) = self->parent->dir_info.at(dir).mod_rev;
*(uint64_t*)(reply->resok.cookieverf) = self->parent->blockfs->dir_info.at(dir).mod_rev;
reply->resok.reply.entries = entries.size() ? &entries.begin()->second : NULL;
reply->resok.reply.eof = eof;
}
@@ -1155,250 +1135,135 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{
READDIR3res *reply = (READDIR3res*)rop->reply;
*reply = { .status = NFS3_OK };
*(uint64_t*)(reply->resok.cookieverf) = self->parent->dir_info.at(dir).mod_rev;
*(uint64_t*)(reply->resok.cookieverf) = self->parent->blockfs->dir_info.at(dir).mod_rev;
reply->resok.reply.entries = entries.size() ? (entry3*)&entries.begin()->second : NULL;
reply->resok.reply.eof = eof;
}
rpc_queue_reply(rop);
}
static int nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
{
nfs3_readdir_common(opaque, rop, false);
block_nfs3_readdir_common(opaque, rop, false);
return 0;
}
static int nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
static int block_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
{
nfs3_readdir_common(opaque, rop, true);
block_nfs3_readdir_common(opaque, rop, true);
return 0;
}
// Get file system statistics
static int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
void block_fs_state_t::init(nfs_proxy_t *proxy, json11::Json cfg)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//FSSTAT3args *args = (FSSTAT3args*)rop->request;
FSSTAT3res *reply = (FSSTAT3res*)rop->reply;
uint64_t tbytes = 0, fbytes = 0;
auto pst_it = self->parent->pool_stats.find(self->parent->default_pool_id);
if (pst_it != self->parent->pool_stats.end())
name_prefix = cfg["subdir"].string_value();
{
auto ttb = pst_it->second["total_raw_tb"].number_value();
auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
int e = name_prefix.size();
while (e > 0 && name_prefix[e-1] == '/')
e--;
int s = 0;
while (s < e && name_prefix[s] == '/')
s++;
name_prefix = name_prefix.substr(s, e-s);
if (name_prefix.size())
name_prefix += "/";
}
*reply = (FSSTAT3res){
.status = NFS3_OK,
.resok = (FSSTAT3resok){
.obj_attributes = {
.attributes_follow = 1,
.attributes = get_dir_attributes(self, ""),
},
.tbytes = tbytes, // total bytes
.fbytes = fbytes, // free bytes
.abytes = fbytes, // available bytes
.tfiles = (size3)(1 << 31), // maximum total files
.ffiles = (size3)(1 << 31), // free files
.afiles = (size3)(1 << 31), // available files
.invarsec = 0,
},
// We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
dir_info[""] = (nfs_dir_t){
.id = 1,
.mod_rev = 0,
};
rpc_queue_reply(rop);
return 0;
}
static int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
FSINFO3args *args = (FSINFO3args*)rop->request;
FSINFO3res *reply = (FSINFO3res*)rop->reply;
if (args->fsroot != "roothandle")
clock_gettime(CLOCK_REALTIME, &dir_info[""].mtime);
assert(proxy->cli->st_cli.on_inode_change_hook == NULL);
proxy->cli->st_cli.on_inode_change_hook = [this, proxy](inode_t changed_inode, bool removed)
{
// Example error
*reply = (FSINFO3res){ .status = NFS3ERR_INVAL };
}
else
{
// Fill info
*reply = (FSINFO3res){
.status = NFS3_OK,
.resok = (FSINFO3resok){
.obj_attributes = {
.attributes_follow = 1,
.attributes = get_dir_attributes(self, ""),
},
.rtmax = 128*1024*1024,
.rtpref = 128*1024*1024,
.rtmult = 4096,
.wtmax = 128*1024*1024,
.wtpref = 128*1024*1024,
.wtmult = 4096,
.dtpref = 128,
.maxfilesize = 0x7fffffffffffffff,
.time_delta = {
.seconds = 1,
.nseconds = 0,
},
.properties = FSF3_SYMLINK | FSF3_HOMOGENEOUS,
},
};
}
rpc_queue_reply(rop);
return 0;
}
static int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
PATHCONF3args *args = (PATHCONF3args*)rop->request;
PATHCONF3res *reply = (PATHCONF3res*)rop->reply;
if (args->object != "roothandle")
{
// Example error
*reply = (PATHCONF3res){ .status = NFS3ERR_INVAL };
}
else
{
// Fill info
bool_t x = FALSE;
*reply = (PATHCONF3res){
.status = NFS3_OK,
.resok = (PATHCONF3resok){
.obj_attributes = {
// Without at least one reference to a non-constant value (local variable or something else),
// with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
// FIXME: get rid of this after raising compiler requirement
.attributes_follow = x,
},
.linkmax = 0,
.name_max = 255,
.no_trunc = TRUE,
.chown_restricted = FALSE,
.case_insensitive = FALSE,
.case_preserving = TRUE,
},
};
}
rpc_queue_reply(rop);
return 0;
}
static int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//COMMIT3args *args = (COMMIT3args*)rop->request;
cluster_op_t *op = new cluster_op_t;
// fsync. we don't know how to fsync a single inode, so just fsync everything
op->opcode = OSD_OP_SYNC;
op->callback = [self, rop](cluster_op_t *op)
{
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
*(uint64_t*)reply->resok.verf = self->parent->server_id;
rpc_queue_reply(rop);
auto inode_cfg_it = proxy->cli->st_cli.inode_config.find(changed_inode);
if (inode_cfg_it == proxy->cli->st_cli.inode_config.end())
{
return;
}
auto & inode_cfg = inode_cfg_it->second;
std::string full_name = inode_cfg.name;
if (proxy->blockfs->name_prefix != "" && full_name.substr(0, proxy->blockfs->name_prefix.size()) != proxy->blockfs->name_prefix)
{
return;
}
// Calculate directory modification time and revision (used as "cookie verifier")
timespec now;
clock_gettime(CLOCK_REALTIME, &now);
dir_info[""].mod_rev = dir_info[""].mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_info[""].mod_rev;
dir_info[""].mtime = now;
int pos = full_name.find('/', proxy->blockfs->name_prefix.size());
while (pos >= 0)
{
std::string dir = full_name.substr(0, pos);
auto & dinf = dir_info[dir];
if (!dinf.id)
dinf.id = next_dir_id++;
dinf.mod_rev = dinf.mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dinf.mod_rev;
dinf.mtime = now;
dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
pos = full_name.find('/', pos+1);
}
// Alter inode_by_hash
if (removed)
{
auto ino_it = hash_by_inode.find(changed_inode);
if (ino_it != hash_by_inode.end())
{
inode_by_hash.erase(ino_it->second);
hash_by_inode.erase(ino_it);
}
}
else
{
std::string hash = "S"+base64_encode(sha256(full_name));
auto hbi_it = hash_by_inode.find(changed_inode);
if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
{
// inode had a different name, remove old hash=>inode pointer
inode_by_hash.erase(hbi_it->second);
}
inode_by_hash[hash] = changed_inode;
hash_by_inode[changed_inode] = hash;
}
};
self->parent->cli->execute(op);
return 1;
}
static int mount3_mnt_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *args = (nfs_dirpath*)rop->request;
nfs_mountres3 *reply = (nfs_mountres3*)rop->reply;
u_int flavor = RPC_AUTH_NONE;
reply->fhs_status = MNT3_OK;
reply->mountinfo.fhandle = xdr_copy_string(rop->xdrs, "roothandle");
reply->mountinfo.auth_flavors.auth_flavors_len = 1;
reply->mountinfo.auth_flavors.auth_flavors_val = (u_int*)xdr_copy_string(rop->xdrs, (char*)&flavor, sizeof(u_int)).data;
rpc_queue_reply(rop);
return 0;
}
static int mount3_dump_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
nfs_mountlist *reply = (nfs_mountlist*)rop->reply;
*reply = (struct nfs_mountbody*)malloc_or_die(sizeof(struct nfs_mountbody));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ml_hostname = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ml_directory = xdr_copy_string(rop->xdrs, self->parent->export_root);
(*reply)->ml_next = NULL;
rpc_queue_reply(rop);
return 0;
}
static int mount3_umnt_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *arg = (nfs_dirpath*)rop->request;
// do nothing
rpc_queue_reply(rop);
return 0;
}
static int mount3_umntall_proc(void *opaque, rpc_op_t *rop)
{
// do nothing
rpc_queue_reply(rop);
return 0;
}
static int mount3_export_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
nfs_exports *reply = (nfs_exports*)rop->reply;
*reply = (struct nfs_exportnode*)calloc_or_die(1, sizeof(struct nfs_exportnode) + sizeof(struct nfs_groupnode));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ex_dir = xdr_copy_string(rop->xdrs, self->parent->export_root);
(*reply)->ex_groups = (struct nfs_groupnode*)(reply+1);
(*reply)->ex_groups->gr_name = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ex_groups->gr_next = NULL;
(*reply)->ex_next = NULL;
rpc_queue_reply(rop);
return 0;
}
nfs_client_t::nfs_client_t()
void nfs_block_procs(nfs_client_t *self)
{
struct rpc_service_proc_t pt[] = {
{NFS_PROGRAM, NFS_V3, NFS3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, this},
{NFS_PROGRAM, NFS_V3, NFS3_GETATTR, nfs3_getattr_proc, (xdrproc_t)xdr_GETATTR3args, sizeof(GETATTR3args), (xdrproc_t)xdr_GETATTR3res, sizeof(GETATTR3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_SETATTR, nfs3_setattr_proc, (xdrproc_t)xdr_SETATTR3args, sizeof(SETATTR3args), (xdrproc_t)xdr_SETATTR3res, sizeof(SETATTR3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_LOOKUP, nfs3_lookup_proc, (xdrproc_t)xdr_LOOKUP3args, sizeof(LOOKUP3args), (xdrproc_t)xdr_LOOKUP3res, sizeof(LOOKUP3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_ACCESS, nfs3_access_proc, (xdrproc_t)xdr_ACCESS3args, sizeof(ACCESS3args), (xdrproc_t)xdr_ACCESS3res, sizeof(ACCESS3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_READLINK, nfs3_readlink_proc, (xdrproc_t)xdr_READLINK3args, sizeof(READLINK3args), (xdrproc_t)xdr_READLINK3res, sizeof(READLINK3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_READ, nfs3_read_proc, (xdrproc_t)xdr_READ3args, sizeof(READ3args), (xdrproc_t)xdr_READ3res, sizeof(READ3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_WRITE, nfs3_write_proc, (xdrproc_t)xdr_WRITE3args, sizeof(WRITE3args), (xdrproc_t)xdr_WRITE3res, sizeof(WRITE3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_CREATE, nfs3_create_proc, (xdrproc_t)xdr_CREATE3args, sizeof(CREATE3args), (xdrproc_t)xdr_CREATE3res, sizeof(CREATE3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_MKDIR, nfs3_mkdir_proc, (xdrproc_t)xdr_MKDIR3args, sizeof(MKDIR3args), (xdrproc_t)xdr_MKDIR3res, sizeof(MKDIR3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_SYMLINK, nfs3_symlink_proc, (xdrproc_t)xdr_SYMLINK3args, sizeof(SYMLINK3args), (xdrproc_t)xdr_SYMLINK3res, sizeof(SYMLINK3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_MKNOD, nfs3_mknod_proc, (xdrproc_t)xdr_MKNOD3args, sizeof(MKNOD3args), (xdrproc_t)xdr_MKNOD3res, sizeof(MKNOD3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_REMOVE, nfs3_remove_proc, (xdrproc_t)xdr_REMOVE3args, sizeof(REMOVE3args), (xdrproc_t)xdr_REMOVE3res, sizeof(REMOVE3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_RMDIR, nfs3_rmdir_proc, (xdrproc_t)xdr_RMDIR3args, sizeof(RMDIR3args), (xdrproc_t)xdr_RMDIR3res, sizeof(RMDIR3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_RENAME, nfs3_rename_proc, (xdrproc_t)xdr_RENAME3args, sizeof(RENAME3args), (xdrproc_t)xdr_RENAME3res, sizeof(RENAME3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_LINK, nfs3_link_proc, (xdrproc_t)xdr_LINK3args, sizeof(LINK3args), (xdrproc_t)xdr_LINK3res, sizeof(LINK3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_READDIR, nfs3_readdir_proc, (xdrproc_t)xdr_READDIR3args, sizeof(READDIR3args), (xdrproc_t)xdr_READDIR3res, sizeof(READDIR3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_FSSTAT, nfs3_fsstat_proc, (xdrproc_t)xdr_FSSTAT3args, sizeof(FSSTAT3args), (xdrproc_t)xdr_FSSTAT3res, sizeof(FSSTAT3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_FSINFO, nfs3_fsinfo_proc, (xdrproc_t)xdr_FSINFO3args, sizeof(FSINFO3args), (xdrproc_t)xdr_FSINFO3res, sizeof(FSINFO3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_PATHCONF, nfs3_pathconf_proc, (xdrproc_t)xdr_PATHCONF3args, sizeof(PATHCONF3args), (xdrproc_t)xdr_PATHCONF3res, sizeof(PATHCONF3res), this},
{NFS_PROGRAM, NFS_V3, NFS3_COMMIT, nfs3_commit_proc, (xdrproc_t)xdr_COMMIT3args, sizeof(COMMIT3args), (xdrproc_t)xdr_COMMIT3res, sizeof(COMMIT3res), this},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, this},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT, mount3_mnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3), this},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP, mount3_dump_proc, NULL, 0, (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist), this},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT, mount3_umnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), NULL, 0, this},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc, NULL, 0, NULL, 0, this},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT, mount3_export_proc, NULL, 0, (xdrproc_t)xdr_nfs_exports, sizeof(nfs_exports), this},
{NFS_PROGRAM, NFS_V3, NFS3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{NFS_PROGRAM, NFS_V3, NFS3_GETATTR, block_nfs3_getattr_proc, (xdrproc_t)xdr_GETATTR3args, sizeof(GETATTR3args), (xdrproc_t)xdr_GETATTR3res, sizeof(GETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SETATTR, block_nfs3_setattr_proc, (xdrproc_t)xdr_SETATTR3args, sizeof(SETATTR3args), (xdrproc_t)xdr_SETATTR3res, sizeof(SETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LOOKUP, block_nfs3_lookup_proc, (xdrproc_t)xdr_LOOKUP3args, sizeof(LOOKUP3args), (xdrproc_t)xdr_LOOKUP3res, sizeof(LOOKUP3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_ACCESS, block_nfs3_access_proc, (xdrproc_t)xdr_ACCESS3args, sizeof(ACCESS3args), (xdrproc_t)xdr_ACCESS3res, sizeof(ACCESS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READLINK, block_nfs3_readlink_proc, (xdrproc_t)xdr_READLINK3args, sizeof(READLINK3args), (xdrproc_t)xdr_READLINK3res, sizeof(READLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READ, block_nfs3_read_proc, (xdrproc_t)xdr_READ3args, sizeof(READ3args), (xdrproc_t)xdr_READ3res, sizeof(READ3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_WRITE, block_nfs3_write_proc, (xdrproc_t)xdr_WRITE3args, sizeof(WRITE3args), (xdrproc_t)xdr_WRITE3res, sizeof(WRITE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_CREATE, block_nfs3_create_proc, (xdrproc_t)xdr_CREATE3args, sizeof(CREATE3args), (xdrproc_t)xdr_CREATE3res, sizeof(CREATE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKDIR, block_nfs3_mkdir_proc, (xdrproc_t)xdr_MKDIR3args, sizeof(MKDIR3args), (xdrproc_t)xdr_MKDIR3res, sizeof(MKDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SYMLINK, block_nfs3_symlink_proc, (xdrproc_t)xdr_SYMLINK3args, sizeof(SYMLINK3args), (xdrproc_t)xdr_SYMLINK3res, sizeof(SYMLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKNOD, block_nfs3_mknod_proc, (xdrproc_t)xdr_MKNOD3args, sizeof(MKNOD3args), (xdrproc_t)xdr_MKNOD3res, sizeof(MKNOD3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_REMOVE, block_nfs3_remove_proc, (xdrproc_t)xdr_REMOVE3args, sizeof(REMOVE3args), (xdrproc_t)xdr_REMOVE3res, sizeof(REMOVE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RMDIR, block_nfs3_rmdir_proc, (xdrproc_t)xdr_RMDIR3args, sizeof(RMDIR3args), (xdrproc_t)xdr_RMDIR3res, sizeof(RMDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RENAME, block_nfs3_rename_proc, (xdrproc_t)xdr_RENAME3args, sizeof(RENAME3args), (xdrproc_t)xdr_RENAME3res, sizeof(RENAME3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LINK, block_nfs3_link_proc, (xdrproc_t)xdr_LINK3args, sizeof(LINK3args), (xdrproc_t)xdr_LINK3res, sizeof(LINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIR, block_nfs3_readdir_proc, (xdrproc_t)xdr_READDIR3args, sizeof(READDIR3args), (xdrproc_t)xdr_READDIR3res, sizeof(READDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, block_nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSSTAT, nfs3_fsstat_proc, (xdrproc_t)xdr_FSSTAT3args, sizeof(FSSTAT3args), (xdrproc_t)xdr_FSSTAT3res, sizeof(FSSTAT3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSINFO, nfs3_fsinfo_proc, (xdrproc_t)xdr_FSINFO3args, sizeof(FSINFO3args), (xdrproc_t)xdr_FSINFO3res, sizeof(FSINFO3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_PATHCONF, nfs3_pathconf_proc, (xdrproc_t)xdr_PATHCONF3args, sizeof(PATHCONF3args), (xdrproc_t)xdr_PATHCONF3res, sizeof(PATHCONF3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_COMMIT, nfs3_commit_proc, (xdrproc_t)xdr_COMMIT3args, sizeof(COMMIT3args), (xdrproc_t)xdr_COMMIT3res, sizeof(COMMIT3res), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT, mount3_mnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP, mount3_dump_proc, NULL, 0, (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT, mount3_umnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT, mount3_export_proc, NULL, 0, (xdrproc_t)xdr_nfs_exports, sizeof(nfs_exports), self},
};
for (int i = 0; i < sizeof(pt)/sizeof(pt[0]); i++)
{
proc_table.insert(pt[i]);
self->proc_table.insert(pt[i]);
}
}
nfs_client_t::~nfs_client_t()
{
}

59
src/nfs_block.h Normal file
View File

@@ -0,0 +1,59 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over Vitastor block images - header
#pragma once
struct nfs_dir_t
{
uint64_t id;
uint64_t mod_rev;
timespec mtime;
};
struct extend_size_t
{
inode_t inode;
uint64_t new_size;
};
inline bool operator < (const extend_size_t &a, const extend_size_t &b)
{
return a.inode < b.inode || a.inode == b.inode && a.new_size < b.new_size;
}
struct extend_write_t
{
rpc_op_t *rop;
int resize_res, write_res; // 1 = started, 0 = completed OK, -errno = completed with error
};
struct extend_inode_t
{
uint64_t cur_extend = 0, next_extend = 0;
};
struct block_fs_state_t
{
std::string name_prefix;
// filehandle = "S"+base64(sha256(full name with prefix)) or "roothandle" for mount root)
uint64_t next_dir_id = 2;
// filehandle => dir with name_prefix
std::map<std::string, std::string> dir_by_hash;
// dir with name_prefix => dir info
std::map<std::string, nfs_dir_t> dir_info;
// filehandle => inode ID
std::map<std::string, inode_t> inode_by_hash;
// inode ID => filehandle
std::map<inode_t, std::string> hash_by_inode;
// inode extend requests in progress
std::map<inode_t, extend_inode_t> extends;
std::multimap<extend_size_t, extend_write_t> extend_writes;
void init(nfs_proxy_t *proxy, json11::Json cfg);
};
nfsstat3 vitastor_nfs_map_err(int err);

22
src/nfs_common.h Normal file
View File

@@ -0,0 +1,22 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy - common functions
#pragma once
#include "nfs/nfs.h"
void nfs_block_procs(nfs_client_t *self);
void nfs_kv_procs(nfs_client_t *self);
int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop);
int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop);
int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop);
int nfs3_access_proc(void *opaque, rpc_op_t *rop);
int nfs3_null_proc(void *opaque, rpc_op_t *rop);
int nfs3_commit_proc(void *opaque, rpc_op_t *rop);
int mount3_mnt_proc(void *opaque, rpc_op_t *rop);
int mount3_dump_proc(void *opaque, rpc_op_t *rop);
int mount3_umnt_proc(void *opaque, rpc_op_t *rop);
int mount3_umntall_proc(void *opaque, rpc_op_t *rop);
int mount3_export_proc(void *opaque, rpc_op_t *rop);

129
src/nfs_fsstat.cpp Normal file
View File

@@ -0,0 +1,129 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy - common FSSTAT, FSINFO, PATHCONF
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
// Get file system statistics
int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//FSSTAT3args *args = (FSSTAT3args*)rop->request;
if (self->parent->trace)
fprintf(stderr, "[%d] FSSTAT\n", self->nfs_fd);
FSSTAT3res *reply = (FSSTAT3res*)rop->reply;
uint64_t tbytes = 0, fbytes = 0;
auto pst_it = self->parent->pool_stats.find(self->parent->default_pool_id);
if (pst_it != self->parent->pool_stats.end())
{
auto ttb = pst_it->second["total_raw_tb"].number_value();
auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
}
*reply = (FSSTAT3res){
.status = NFS3_OK,
.resok = (FSSTAT3resok){
.obj_attributes = {
.attributes_follow = 0,
//.attributes = get_root_attributes(self),
},
.tbytes = tbytes, // total bytes
.fbytes = fbytes, // free bytes
.abytes = fbytes, // available bytes
.tfiles = (size3)1 << (63-POOL_ID_BITS), // maximum total files
.ffiles = (size3)1 << (63-POOL_ID_BITS), // free files
.afiles = (size3)1 << (63-POOL_ID_BITS), // available files
.invarsec = 0,
},
};
rpc_queue_reply(rop);
return 0;
}
int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
FSINFO3args *args = (FSINFO3args*)rop->request;
FSINFO3res *reply = (FSINFO3res*)rop->reply;
if (self->parent->trace)
fprintf(stderr, "[%d] FSINFO %s\n", self->nfs_fd, std::string(args->fsroot).c_str());
if (args->fsroot != NFS_ROOT_HANDLE)
{
*reply = (FSINFO3res){ .status = NFS3ERR_INVAL };
}
else
{
// Fill info
bool_t x = FALSE;
*reply = (FSINFO3res){
.status = NFS3_OK,
.resok = (FSINFO3resok){
.obj_attributes = {
// Without at least one reference to a non-constant value (local variable or something else),
// with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
// FIXME: get rid of this after raising compiler requirement
.attributes_follow = x,
//.attributes = get_root_attributes(self),
},
.rtmax = 128*1024*1024,
.rtpref = 128*1024*1024,
.rtmult = 4096,
.wtmax = 128*1024*1024,
.wtpref = 128*1024*1024,
.wtmult = 4096,
.dtpref = 128,
.maxfilesize = 0x7fffffffffffffff,
.time_delta = {
.seconds = 1,
.nseconds = 0,
},
.properties = FSF3_SYMLINK | FSF3_HOMOGENEOUS,
},
};
}
rpc_queue_reply(rop);
return 0;
}
int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
PATHCONF3args *args = (PATHCONF3args*)rop->request;
PATHCONF3res *reply = (PATHCONF3res*)rop->reply;
if (self->parent->trace)
fprintf(stderr, "[%d] PATHCONF %s\n", self->nfs_fd, std::string(args->object).c_str());
if (args->object != NFS_ROOT_HANDLE)
{
*reply = (PATHCONF3res){ .status = NFS3ERR_INVAL };
}
else
{
// Fill info
bool_t x = FALSE;
*reply = (PATHCONF3res){
.status = NFS3_OK,
.resok = (PATHCONF3resok){
.obj_attributes = {
// Without at least one reference to a non-constant value (local variable or something else),
// with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
// FIXME: get rid of this after raising compiler requirement
.attributes_follow = x,
//.attributes = get_root_attributes(self),
},
.linkmax = 0,
.name_max = 255,
.no_trunc = TRUE,
.chown_restricted = FALSE,
.case_insensitive = FALSE,
.case_preserving = TRUE,
},
};
}
rpc_queue_reply(rop);
return 0;
}

332
src/nfs_kv.cpp Normal file
View File

@@ -0,0 +1,332 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - common functions
#include <sys/time.h>
#include "str_util.h"
#include "nfs_proxy.h"
#include "nfs_common.h"
#include "nfs_kv.h"
nfstime3 nfstime_from_str(const std::string & s)
{
nfstime3 t;
auto p = s.find(".");
if (p != std::string::npos)
{
t.seconds = stoull_full(s.substr(0, p), 10);
t.nseconds = stoull_full(s.substr(p+1), 10);
p = s.size()-p-1;
for (; p < 9; p++)
t.nseconds *= 10;
for (; p > 9; p--)
t.nseconds /= 10;
}
else
t.seconds = stoull_full(s, 10);
return t;
}
static std::string timespec_to_str(timespec t)
{
char buf[64];
snprintf(buf, sizeof(buf), "%ju.%09ju", t.tv_sec, t.tv_nsec);
int l = strlen(buf);
while (l > 0 && buf[l-1] == '0')
l--;
if (l > 0 && buf[l-1] == '.')
l--;
buf[l] = 0;
return buf;
}
std::string nfstime_to_str(nfstime3 t)
{
return timespec_to_str((timespec){ .tv_sec = t.seconds, .tv_nsec = t.nseconds });
}
std::string nfstime_now_str()
{
timespec t;
clock_gettime(CLOCK_REALTIME, &t);
return timespec_to_str(t);
}
int kv_map_type(const std::string & type)
{
return (type == "" || type == "file" ? NF3REG :
(type == "dir" ? NF3DIR :
(type == "blk" ? NF3BLK :
(type == "chr" ? NF3CHR :
(type == "link" ? NF3LNK :
(type == "sock" ? NF3SOCK :
(type == "fifo" ? NF3FIFO : -1)))))));
}
fattr3 get_kv_attributes(nfs_client_t *self, uint64_t ino, json11::Json attrs)
{
auto type = kv_map_type(attrs["type"].string_value());
auto mode = attrs["mode"].uint64_value();
auto nlink = attrs["nlink"].uint64_value();
nfstime3 mtime = nfstime_from_str(attrs["mtime"].string_value());
nfstime3 atime = attrs["atime"].is_null() ? mtime : nfstime_from_str(attrs["atime"].string_value());
nfstime3 ctime = attrs["ctime"].is_null() ? mtime : nfstime_from_str(attrs["ctime"].string_value());
// In theory we could store the binary structure itself, but JSON is simpler :-)
return (fattr3){
.type = (type == 0 ? NF3REG : (ftype3)type),
.mode = (attrs["mode"].is_null() ? (type == NF3DIR ? 0755 : 0644) : (uint32_t)mode),
.nlink = (nlink == 0 ? 1 : (uint32_t)nlink),
.uid = (uint32_t)attrs["uid"].uint64_value(),
.gid = (uint32_t)attrs["gid"].uint64_value(),
.size = (type == NF3DIR ? 4096 : attrs["size"].uint64_value()),
// FIXME Counting actual used file size would require reworking statistics
.used = (type == NF3DIR ? 4096 : attrs["size"].uint64_value()),
.rdev = (type == NF3BLK || type == NF3CHR
? (specdata3){ (uint32_t)attrs["major"].uint64_value(), (uint32_t)attrs["minor"].uint64_value() }
: (specdata3){}),
.fsid = self->parent->fsid,
.fileid = ino,
.atime = atime,
.mtime = mtime,
.ctime = ctime,
};
}
std::string kv_direntry_key(uint64_t dir_ino, const std::string & filename)
{
// encode as: d <length> <hex dir_ino> / <filename>
char key[24] = { 0 };
snprintf(key, sizeof(key), "d-%jx/", dir_ino);
int n = strnlen(key, sizeof(key)-1) - 3;
if (n < 10)
key[1] = '0'+n;
else
key[1] = 'A'+(n-10);
return (char*)key + filename;
}
std::string kv_direntry_filename(const std::string & key)
{
// decode as: d <length> <hex dir_ino> / <filename>
auto pos = key.find("/");
if (pos != std::string::npos)
return key.substr(pos+1);
return key;
}
std::string kv_inode_key(uint64_t ino)
{
char key[32] = { 0 };
snprintf(key, sizeof(key), "i%x", INODE_POOL(ino));
int n = strnlen(key, sizeof(key)-1);
snprintf(key+n+1, sizeof(key)-n-1, "%jx", INODE_NO_POOL(ino));
int m = strnlen(key+n+1, sizeof(key)-n-2);
key[n] = 'G'+m;
return std::string(key);
}
std::string kv_fh(uint64_t ino)
{
char key[32] = { 0 };
snprintf(key, sizeof(key), "S%jx", ino);
return key;
}
uint64_t kv_fh_inode(const std::string & fh)
{
if (fh == NFS_ROOT_HANDLE)
{
return 1;
}
else if (fh[0] == 'S')
{
uint64_t ino = 0;
int r = sscanf(fh.c_str()+1, "%jx", &ino);
if (r == 1)
return ino;
}
return 0;
}
bool kv_fh_valid(const std::string & fh)
{
return fh == NFS_ROOT_HANDLE || fh[0] == 'S';
}
void nfs_kv_procs(nfs_client_t *self)
{
struct rpc_service_proc_t pt[] = {
{NFS_PROGRAM, NFS_V3, NFS3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{NFS_PROGRAM, NFS_V3, NFS3_GETATTR, kv_nfs3_getattr_proc, (xdrproc_t)xdr_GETATTR3args, sizeof(GETATTR3args), (xdrproc_t)xdr_GETATTR3res, sizeof(GETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SETATTR, kv_nfs3_setattr_proc, (xdrproc_t)xdr_SETATTR3args, sizeof(SETATTR3args), (xdrproc_t)xdr_SETATTR3res, sizeof(SETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LOOKUP, kv_nfs3_lookup_proc, (xdrproc_t)xdr_LOOKUP3args, sizeof(LOOKUP3args), (xdrproc_t)xdr_LOOKUP3res, sizeof(LOOKUP3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_ACCESS, nfs3_access_proc, (xdrproc_t)xdr_ACCESS3args, sizeof(ACCESS3args), (xdrproc_t)xdr_ACCESS3res, sizeof(ACCESS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READLINK, kv_nfs3_readlink_proc, (xdrproc_t)xdr_READLINK3args, sizeof(READLINK3args), (xdrproc_t)xdr_READLINK3res, sizeof(READLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READ, kv_nfs3_read_proc, (xdrproc_t)xdr_READ3args, sizeof(READ3args), (xdrproc_t)xdr_READ3res, sizeof(READ3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_WRITE, kv_nfs3_write_proc, (xdrproc_t)xdr_WRITE3args, sizeof(WRITE3args), (xdrproc_t)xdr_WRITE3res, sizeof(WRITE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_CREATE, kv_nfs3_create_proc, (xdrproc_t)xdr_CREATE3args, sizeof(CREATE3args), (xdrproc_t)xdr_CREATE3res, sizeof(CREATE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKDIR, kv_nfs3_mkdir_proc, (xdrproc_t)xdr_MKDIR3args, sizeof(MKDIR3args), (xdrproc_t)xdr_MKDIR3res, sizeof(MKDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SYMLINK, kv_nfs3_symlink_proc, (xdrproc_t)xdr_SYMLINK3args, sizeof(SYMLINK3args), (xdrproc_t)xdr_SYMLINK3res, sizeof(SYMLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKNOD, kv_nfs3_mknod_proc, (xdrproc_t)xdr_MKNOD3args, sizeof(MKNOD3args), (xdrproc_t)xdr_MKNOD3res, sizeof(MKNOD3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_REMOVE, kv_nfs3_remove_proc, (xdrproc_t)xdr_REMOVE3args, sizeof(REMOVE3args), (xdrproc_t)xdr_REMOVE3res, sizeof(REMOVE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RMDIR, kv_nfs3_rmdir_proc, (xdrproc_t)xdr_RMDIR3args, sizeof(RMDIR3args), (xdrproc_t)xdr_RMDIR3res, sizeof(RMDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RENAME, kv_nfs3_rename_proc, (xdrproc_t)xdr_RENAME3args, sizeof(RENAME3args), (xdrproc_t)xdr_RENAME3res, sizeof(RENAME3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LINK, kv_nfs3_link_proc, (xdrproc_t)xdr_LINK3args, sizeof(LINK3args), (xdrproc_t)xdr_LINK3res, sizeof(LINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIR, kv_nfs3_readdir_proc, (xdrproc_t)xdr_READDIR3args, sizeof(READDIR3args), (xdrproc_t)xdr_READDIR3res, sizeof(READDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, kv_nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSSTAT, nfs3_fsstat_proc, (xdrproc_t)xdr_FSSTAT3args, sizeof(FSSTAT3args), (xdrproc_t)xdr_FSSTAT3res, sizeof(FSSTAT3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSINFO, nfs3_fsinfo_proc, (xdrproc_t)xdr_FSINFO3args, sizeof(FSINFO3args), (xdrproc_t)xdr_FSINFO3res, sizeof(FSINFO3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_PATHCONF, nfs3_pathconf_proc, (xdrproc_t)xdr_PATHCONF3args, sizeof(PATHCONF3args), (xdrproc_t)xdr_PATHCONF3res, sizeof(PATHCONF3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_COMMIT, nfs3_commit_proc, (xdrproc_t)xdr_COMMIT3args, sizeof(COMMIT3args), (xdrproc_t)xdr_COMMIT3res, sizeof(COMMIT3res), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT, mount3_mnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP, mount3_dump_proc, NULL, 0, (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT, mount3_umnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT, mount3_export_proc, NULL, 0, (xdrproc_t)xdr_nfs_exports, sizeof(nfs_exports), self},
};
for (int i = 0; i < sizeof(pt)/sizeof(pt[0]); i++)
{
self->proc_table.insert(pt[i]);
}
}
void kv_fs_state_t::init(nfs_proxy_t *proxy, json11::Json cfg)
{
this->proxy = proxy;
auto & pool_cfg = proxy->cli->st_cli.pool_config.at(proxy->default_pool_id);
fs_kv_inode = cfg["fs"].uint64_value();
if (fs_kv_inode)
{
if (!INODE_POOL(fs_kv_inode))
{
fprintf(stderr, "FS metadata inode number must include pool\n");
exit(1);
}
}
else
{
for (auto & ic: proxy->cli->st_cli.inode_config)
{
if (ic.second.name == cfg["fs"].string_value())
{
fs_kv_inode = ic.first;
break;
}
}
if (!fs_kv_inode)
{
fprintf(stderr, "FS metadata image \"%s\" does not exist\n", cfg["fs"].string_value().c_str());
exit(1);
}
}
if (proxy->cli->st_cli.inode_config.find(fs_kv_inode) != proxy->cli->st_cli.inode_config.end())
{
auto & name = proxy->cli->st_cli.inode_config.at(fs_kv_inode).name;
if (pool_cfg.used_for_fs != name)
{
fprintf(stderr, "Please mark pool as used for this file system with `vitastor-cli modify-pool --used-for-fs %s %s`\n",
name.c_str(), cfg["fs"].string_value().c_str());
exit(1);
}
}
auto img_it = proxy->cli->st_cli.inode_config.lower_bound(INODE_WITH_POOL(proxy->default_pool_id+1, 0));
if (img_it != proxy->cli->st_cli.inode_config.begin())
{
img_it--;
if (img_it != proxy->cli->st_cli.inode_config.begin() && INODE_POOL(img_it->first) == proxy->default_pool_id)
{
idgen[proxy->default_pool_id].min_id = INODE_NO_POOL(img_it->first) + 1;
}
}
readdir_getattr_parallel = cfg["readdir_getattr_parallel"].uint64_value();
if (!readdir_getattr_parallel)
readdir_getattr_parallel = 8;
id_alloc_batch_size = cfg["id_alloc_batch_size"].uint64_value();
if (!id_alloc_batch_size)
id_alloc_batch_size = 200;
touch_interval = cfg["touch_interval"].uint64_value();
if (touch_interval < 100) // ms
touch_interval = 100;
pool_block_size = pool_cfg.pg_stripe_size;
pool_alignment = pool_cfg.bitmap_granularity;
// Open DB and wait
int open_res = 0;
bool open_done = false;
proxy->db = new kv_dbw_t(proxy->cli);
proxy->db->open(fs_kv_inode, cfg, [&](int res)
{
open_done = true;
open_res = res;
});
while (!open_done)
{
proxy->ringloop->loop();
if (open_done)
break;
proxy->ringloop->wait();
}
if (open_res < 0)
{
fprintf(stderr, "Failed to open key/value filesystem metadata index: %s (code %d)\n",
strerror(-open_res), open_res);
exit(1);
}
fs_inode_count = ((uint64_t)1 << (64-POOL_ID_BITS)) - 1;
shared_inode_threshold = pool_block_size;
if (!cfg["shared_inode_threshold"].is_null())
{
shared_inode_threshold = cfg["shared_inode_threshold"].uint64_value();
}
zero_block.resize(pool_block_size < 1048576 ? 1048576 : pool_block_size);
scrap_block.resize(pool_block_size < 1048576 ? 1048576 : pool_block_size);
touch_timer_id = proxy->epmgr->tfd->set_timer(touch_interval, true, [this](int){ touch_inodes(); });
}
kv_fs_state_t::~kv_fs_state_t()
{
if (proxy && touch_timer_id >= 0)
{
proxy->epmgr->tfd->clear_timer(touch_timer_id);
touch_timer_id = -1;
}
}
static void touch_inode(nfs_proxy_t *proxy, inode_t ino, bool allow_cache)
{
kv_read_inode(proxy, ino, [proxy, ino](int res, const std::string & value, json11::Json attrs)
{
if (!res)
{
auto ientry = attrs.object_items();
ientry["mtime"] = ientry["ctime"] = nfstime_now_str();
ientry.erase("verf");
// FIXME: Use "update" query
bool *found = new bool;
*found = true;
proxy->db->set(kv_inode_key(ino), json11::Json(ientry).dump(), [proxy, ino, found](int res)
{
if (!*found)
res = -ENOENT;
delete found;
if (res == -EAGAIN)
touch_inode(proxy, ino, false);
}, [value, found](int res, const std::string & old_value)
{
*found = res == 0;
return res == 0 && old_value == value;
});
}
}, allow_cache);
}
void kv_fs_state_t::touch_inodes()
{
std::set<inode_t> q = std::move(touch_queue);
for (auto ino: q)
{
touch_inode(proxy, ino, true);
}
}

134
src/nfs_kv.h Normal file
View File

@@ -0,0 +1,134 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - header
#pragma once
#include "nfs/nfs.h"
#define KV_ROOT_INODE 1
#define SHARED_FILE_MAGIC_V1 0x711A5158A6EDF17E
struct nfs_kv_write_state;
struct list_cookie_t
{
uint64_t dir_ino, cookieverf, cookie;
};
inline bool operator < (const list_cookie_t & a, const list_cookie_t & b)
{
return a.dir_ino < b.dir_ino || a.dir_ino == b.dir_ino &&
(a.cookieverf < b.cookieverf || a.cookieverf == b.cookieverf && a.cookie < b.cookie);
};
struct list_cookie_val_t
{
std::string key;
};
struct shared_alloc_queue_t
{
nfs_kv_write_state *st;
int state;
};
struct kv_inode_extend_t
{
int refcnt = 0;
uint64_t cur_extend = 0, next_extend = 0, done_extend = 0;
std::vector<std::function<void()>> waiters;
};
struct kv_idgen_t
{
uint64_t next_id = 1, allocated_id = 0;
uint64_t min_id = 1;
std::vector<uint64_t> unallocated_ids;
};
struct kv_fs_state_t
{
nfs_proxy_t *proxy = NULL;
int touch_timer_id = -1;
uint64_t fs_kv_inode = 0;
uint64_t fs_inode_count = 0;
int readdir_getattr_parallel = 8, id_alloc_batch_size = 200;
uint64_t pool_block_size = 0;
uint64_t pool_alignment = 0;
uint64_t shared_inode_threshold = 0;
uint64_t touch_interval = 1000;
std::map<list_cookie_t, list_cookie_val_t> list_cookies;
std::map<pool_id_t, kv_idgen_t> idgen;
std::vector<shared_alloc_queue_t> allocating_shared;
uint64_t cur_shared_inode = 0, cur_shared_offset = 0;
std::map<inode_t, kv_inode_extend_t> extends;
std::set<inode_t> touch_queue;
std::vector<uint8_t> zero_block;
std::vector<uint8_t> scrap_block;
void init(nfs_proxy_t *proxy, json11::Json cfg);
void touch_inodes();
~kv_fs_state_t();
};
struct shared_file_header_t
{
uint64_t magic = 0;
uint64_t inode = 0;
uint64_t alloc = 0;
};
struct nfs_rmw_t
{
nfs_proxy_t *parent = NULL;
uint64_t ino = 0;
uint64_t offset = 0;
uint8_t *buf = NULL;
uint64_t size = 0;
uint8_t *part_buf = NULL;
uint64_t version = 0;
nfs_rmw_t *other = NULL;
std::function<void(nfs_rmw_t *)> cb;
int res = 0;
};
nfsstat3 vitastor_nfs_map_err(int err);
nfstime3 nfstime_from_str(const std::string & s);
std::string nfstime_to_str(nfstime3 t);
std::string nfstime_now_str();
int kv_map_type(const std::string & type);
fattr3 get_kv_attributes(nfs_client_t *self, uint64_t ino, json11::Json attrs);
std::string kv_direntry_key(uint64_t dir_ino, const std::string & filename);
std::string kv_direntry_filename(const std::string & key);
std::string kv_inode_key(uint64_t ino);
std::string kv_fh(uint64_t ino);
uint64_t kv_fh_inode(const std::string & fh);
bool kv_fh_valid(const std::string & fh);
void allocate_new_id(nfs_client_t *self, pool_id_t pool_id, std::function<void(int res, uint64_t new_id)> cb);
void kv_read_inode(nfs_proxy_t *proxy, uint64_t ino,
std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
bool allow_cache = false);
uint64_t align_shared_size(nfs_client_t *self, uint64_t size);
void nfs_do_rmw(nfs_rmw_t *rmw);
int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_setattr_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_readlink_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_read_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_write_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_create_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_symlink_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_mknod_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_remove_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_rename_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_link_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_readdir_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop);

365
src/nfs_kv_create.cpp Normal file
View File

@@ -0,0 +1,365 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - CREATE, MKDIR, SYMLINK, MKNOD
#include <sys/time.h>
#include "str_util.h"
#include "nfs_proxy.h"
#include "nfs_kv.h"
void allocate_new_id(nfs_client_t *self, pool_id_t pool_id, std::function<void(int res, uint64_t new_id)> cb)
{
auto & idgen = self->parent->kvfs->idgen[pool_id];
if (idgen.unallocated_ids.size())
{
auto new_id = idgen.unallocated_ids.back();
idgen.unallocated_ids.pop_back();
cb(0, INODE_WITH_POOL(pool_id, new_id));
return;
}
else if (idgen.next_id <= idgen.allocated_id)
{
idgen.next_id++;
cb(0, INODE_WITH_POOL(pool_id, idgen.next_id-1));
return;
}
// FIXME: Maybe allow FS and block volumes to cohabitate in the same pool, but with different ID ranges
else if (idgen.next_id >= ((uint64_t)1 << (64-POOL_ID_BITS)))
{
cb(-ENOSPC, 0);
return;
}
self->parent->db->get((pool_id ? "id"+std::to_string(pool_id) : "id"), [=](int res, const std::string & prev_str)
{
auto & idgen = self->parent->kvfs->idgen[pool_id];
if (res < 0 && res != -ENOENT)
{
cb(res, 0);
return;
}
uint64_t prev_val = stoull_full(prev_str);
if (prev_val >= ((uint64_t)1 << (64-POOL_ID_BITS)))
{
cb(-ENOSPC, 0);
return;
}
if (prev_val < idgen.min_id)
{
prev_val = idgen.min_id;
}
uint64_t new_val = prev_val + self->parent->kvfs->id_alloc_batch_size;
if (new_val >= self->parent->kvfs->fs_inode_count)
{
new_val = self->parent->kvfs->fs_inode_count;
}
self->parent->db->set((pool_id ? "id"+std::to_string(pool_id) : "id"), std::to_string(new_val), [=](int res)
{
if (res == -EAGAIN)
{
// CAS failure - retry
allocate_new_id(self, pool_id, cb);
}
else if (res < 0)
{
cb(res, 0);
}
else
{
auto & idgen = self->parent->kvfs->idgen[pool_id];
idgen.next_id = prev_val+2;
idgen.allocated_id = new_val;
cb(0, INODE_WITH_POOL(pool_id, prev_val+1));
}
}, [prev_val](int res, const std::string & value)
{
// FIXME: Allow to modify value from CAS callback? ("update" query)
return res < 0 || stoull_full(value) == prev_val;
});
});
}
struct kv_create_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
bool exclusive = false;
uint64_t verf = 0;
uint64_t dir_ino = 0;
std::string filename;
// state
int res = 0;
pool_id_t pool_id = 0;
uint64_t new_id = 0;
json11::Json::object attrobj;
json11::Json attrs;
std::string direntry_text;
uint64_t dup_ino = 0;
std::function<void(int res)> cb;
};
static void kv_continue_create(kv_create_state *st, int state)
{
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else if (state == 5) goto resume_5;
if (st->self->parent->trace)
fprintf(stderr, "[%d] CREATE %ju/%s ATTRS %s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str(), json11::Json(st->attrobj).dump().c_str());
if (st->filename == "" || st->filename.find("/") != std::string::npos)
{
auto cb = std::move(st->cb);
cb(-EINVAL);
return;
}
st->attrobj["ctime"] = nfstime_now_str();
if (st->attrobj.find("mtime") == st->attrobj.end())
st->attrobj["mtime"] = st->attrobj["ctime"];
st->attrs = std::move(st->attrobj);
resume_1:
// Generate inode ID
// Directories and special files don't need pool
st->pool_id = kv_map_type(st->attrs["type"].string_value()) == NF3REG
? st->self->parent->default_pool_id
: 0;
allocate_new_id(st->self, st->pool_id, [st](int res, uint64_t new_id)
{
st->res = res;
st->new_id = new_id;
kv_continue_create(st, 2);
});
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
st->self->parent->db->set(kv_inode_key(st->new_id), st->attrs.dump().c_str(), [st](int res)
{
st->res = res;
kv_continue_create(st, 3);
}, [st](int res, const std::string & value)
{
return res == -ENOENT;
});
return;
resume_3:
if (st->res == -EAGAIN)
{
// Inode ID generator failure - retry
goto resume_1;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
auto direntry = json11::Json::object{ { "ino", st->new_id } };
if (st->attrs["type"].string_value() == "dir")
{
direntry["type"] = "dir";
}
st->direntry_text = json11::Json(direntry).dump().c_str();
}
// Set direntry
st->dup_ino = 0;
st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename), st->direntry_text, [st](int res)
{
st->res = res;
kv_continue_create(st, 4);
}, [st](int res, const std::string & value)
{
// CAS compare - check that the key doesn't exist
if (res == 0)
{
std::string err;
auto direntry = json11::Json::parse(value, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s, overwriting\n",
kv_direntry_key(st->dir_ino, st->filename).c_str(), value.c_str(), err.c_str());
return true;
}
if (st->exclusive && direntry["verf"].uint64_value() == st->verf)
{
st->dup_ino = direntry["ino"].uint64_value();
return false;
}
return false;
}
return true;
});
return;
resume_4:
if (st->res == -EAGAIN)
{
// Direntry already exists
st->self->parent->db->del(kv_inode_key(st->new_id), [st](int res)
{
st->res = res;
kv_continue_create(st, 5);
});
resume_5:
if (st->res < 0)
{
fprintf(stderr, "failed to delete duplicate inode %ju left from create %s (code %d)\n", st->new_id, strerror(-st->res), st->res);
}
else
{
auto & idgen = st->self->parent->kvfs->idgen[INODE_POOL(st->new_id)];
idgen.unallocated_ids.push_back(INODE_NO_POOL(st->new_id));
}
if (st->dup_ino)
{
// Successfully created by the previous "exclusive" request
st->new_id = st->dup_ino;
}
st->res = st->dup_ino ? 0 : -EEXIST;
}
if (!st->res)
{
st->self->parent->kvfs->touch_queue.insert(st->dir_ino);
}
auto cb = std::move(st->cb);
cb(st->res);
}
static void kv_create_setattr(json11::Json::object & attrobj, sattr3 & sattr)
{
if (sattr.mode.set_it)
attrobj["mode"] = (uint64_t)sattr.mode.mode;
if (sattr.uid.set_it)
attrobj["uid"] = (uint64_t)sattr.uid.uid;
if (sattr.gid.set_it)
attrobj["gid"] = (uint64_t)sattr.gid.gid;
if (sattr.atime.set_it)
attrobj["atime"] = nfstime_to_str(sattr.atime.atime);
if (sattr.mtime.set_it)
attrobj["mtime"] = nfstime_to_str(sattr.mtime.mtime);
}
template<class T, class Tok> static void kv_create_reply(kv_create_state *st, int res)
{
T *reply = (T*)st->rop->reply;
if (res < 0)
{
*reply = (T){ .status = vitastor_nfs_map_err(-res) };
}
else
{
*reply = (T){
.status = NFS3_OK,
.resok = (Tok){
.obj = {
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, kv_fh(st->new_id)),
},
.obj_attributes = {
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->new_id, st->attrs),
},
},
};
}
rpc_queue_reply(st->rop);
delete st;
}
int kv_nfs3_create_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (CREATE3args*)rop->request;
st->exclusive = args->how.mode == NFS_EXCLUSIVE;
st->verf = st->exclusive ? *(uint64_t*)&args->how.verf : 0;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
if (args->how.mode == NFS_EXCLUSIVE)
{
st->attrobj["verf"] = *(uint64_t*)&args->how.verf;
}
else if (args->how.mode == NFS_UNCHECKED)
{
kv_create_setattr(st->attrobj, args->how.obj_attributes);
if (args->how.obj_attributes.size.set_it)
{
st->attrobj["size"] = (uint64_t)args->how.obj_attributes.size.size;
st->attrobj["empty"] = true;
}
}
st->cb = [st](int res) { kv_create_reply<CREATE3res, CREATE3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}
int kv_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (MKDIR3args*)rop->request;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
st->attrobj["type"] = "dir";
st->attrobj["parent_ino"] = st->dir_ino;
kv_create_setattr(st->attrobj, args->attributes);
st->cb = [st](int res) { kv_create_reply<MKDIR3res, MKDIR3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}
int kv_nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (SYMLINK3args*)rop->request;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
st->attrobj["type"] = "link";
st->attrobj["symlink"] = (std::string)args->symlink.symlink_data;
kv_create_setattr(st->attrobj, args->symlink.symlink_attributes);
st->cb = [st](int res) { kv_create_reply<SYMLINK3res, SYMLINK3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}
int kv_nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (MKNOD3args*)rop->request;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
if (args->what.type == NF3CHR || args->what.type == NF3BLK)
{
st->attrobj["type"] = (args->what.type == NF3CHR ? "chr" : "blk");
st->attrobj["major"] = (uint64_t)args->what.chr_device.spec.specdata1;
st->attrobj["minor"] = (uint64_t)args->what.chr_device.spec.specdata2;
kv_create_setattr(st->attrobj, args->what.chr_device.dev_attributes);
}
else if (args->what.type == NF3SOCK || args->what.type == NF3FIFO)
{
st->attrobj["type"] = (args->what.type == NF3SOCK ? "sock" : "fifo");
kv_create_setattr(st->attrobj, args->what.sock_attributes);
}
else
{
*(MKNOD3res*)rop->reply = (MKNOD3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res) { kv_create_reply<MKNOD3res, MKNOD3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}

78
src/nfs_kv_getattr.cpp Normal file
View File

@@ -0,0 +1,78 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - GETATTR
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
// Attributes are always stored in the inode
void kv_read_inode(nfs_proxy_t *proxy, uint64_t ino,
std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
bool allow_cache)
{
auto key = kv_inode_key(ino);
proxy->db->get(key, [=](int res, const std::string & value)
{
if (ino == KV_ROOT_INODE && res == -ENOENT)
{
// Allow root inode to not exist
cb(0, "", json11::Json(json11::Json::object{ { "type", "dir" } }));
return;
}
if (res < 0)
{
if (res != -ENOENT)
fprintf(stderr, "Error reading inode %s: %s (code %d)\n", kv_inode_key(ino).c_str(), strerror(-res), res);
cb(res, "", json11::Json());
return;
}
std::string err;
auto attrs = json11::Json::parse(value, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in inode %s = %s: %s\n", kv_inode_key(ino).c_str(), value.c_str(), err.c_str());
res = -EIO;
}
cb(res, value, attrs);
}, allow_cache);
}
int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
GETATTR3args *args = (GETATTR3args*)rop->request;
GETATTR3res *reply = (GETATTR3res*)rop->reply;
std::string fh = args->object;
auto ino = kv_fh_inode(fh);
if (self->parent->trace)
fprintf(stderr, "[%d] GETATTR %ju\n", self->nfs_fd, ino);
if (!kv_fh_valid(fh))
{
*reply = (GETATTR3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
kv_read_inode(self->parent, ino, [=](int res, const std::string & value, json11::Json attrs)
{
if (self->parent->trace)
fprintf(stderr, "[%d] GETATTR %ju -> %s\n", self->nfs_fd, ino, value.c_str());
if (res < 0)
{
*reply = (GETATTR3res){ .status = vitastor_nfs_map_err(-res) };
}
else
{
*reply = (GETATTR3res){
.status = NFS3_OK,
.resok = (GETATTR3resok){
.obj_attributes = get_kv_attributes(self, ino, attrs),
},
};
}
rpc_queue_reply(rop);
});
return 1;
}

193
src/nfs_kv_link.cpp Normal file
View File

@@ -0,0 +1,193 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - LINK
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
struct nfs_kv_link_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
uint64_t ino = 0;
uint64_t dir_ino = 0;
std::string filename;
std::string ientry_text;
json11::Json ientry;
bool retrying = false;
int wait = 0;
int res = 0, res2 = 0;
std::function<void(int)> cb;
};
static void nfs_kv_continue_link(nfs_kv_link_state *st, int state)
{
// 1) Read the source inode
// 2) If it's a directory - fail with -EISDIR
// 3) Create the new direntry with the same inode reference
// 4) Update the inode entry with refcount++
// 5) Retry update if CAS failed but the inode exists
// 6) Otherwise fail and remove the new direntry
// Yeah we may leave a bad direntry if we crash
// But the other option is to possibly leave an inode with too big refcount
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_link()");
abort();
}
resume_0:
// Check that the source inode exists and is not a directory
st->wait = st->retrying ? 1 : 2;
st->res2 = 0;
kv_read_inode(st->self->parent, st->ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res == 0 ? (attrs["type"].string_value() == "dir" ? -EISDIR : 0) : res;
st->ientry_text = value;
st->ientry = attrs;
if (!--st->wait)
nfs_kv_continue_link(st, 1);
});
if (!st->retrying)
{
// Check that the new directory exists
kv_read_inode(st->self->parent, st->dir_ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res2 = res == 0 ? (attrs["type"].string_value() == "dir" ? 0 : -ENOTDIR) : res;
if (!--st->wait)
nfs_kv_continue_link(st, 1);
});
}
return;
resume_1:
if (st->res < 0 || st->res2 < 0)
{
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : st->res2);
return;
}
// Write the new direntry
if (!st->retrying)
{
st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename),
json11::Json(json11::Json::object{ { "ino", st->ino } }).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_link(st, 2);
}, [st](int res, const std::string & old_value)
{
return res == -ENOENT;
});
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
// Increase inode refcount
{
auto new_ientry = st->ientry.object_items();
auto nlink = new_ientry["nlink"].uint64_value();
new_ientry["nlink"] = nlink ? nlink+1 : 2;
new_ientry["ctime"] = nfstime_now_str();
st->ientry = new_ientry;
}
st->self->parent->db->set(kv_inode_key(st->ino), st->ientry.dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_link(st, 3);
}, [st](int res, const std::string & old_value)
{
st->res2 = res;
return res == 0 && old_value == st->ientry_text;
});
return;
resume_3:
if (st->res2 == -ENOENT)
{
st->res = -ENOENT;
}
if (st->res == -EAGAIN)
{
// Re-read inode and retry
st->retrying = true;
goto resume_0;
}
if (st->res < 0)
{
// Maybe inode was deleted in the meantime, delete our direntry
st->self->parent->db->del(kv_direntry_key(st->dir_ino, st->filename), [st](int res)
{
st->res2 = res;
nfs_kv_continue_link(st, 4);
});
return;
resume_4:
if (st->res2 < 0)
{
fprintf(stderr, "Warning: failed to delete new linked direntry %ju/%s: %s (code %d)\n",
st->dir_ino, st->filename.c_str(), strerror(-st->res2), st->res2);
}
}
if (!st->res)
{
st->self->parent->kvfs->touch_queue.insert(st->dir_ino);
}
auto cb = std::move(st->cb);
cb(st->res);
}
int kv_nfs3_link_proc(void *opaque, rpc_op_t *rop)
{
auto st = new nfs_kv_link_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
LINK3args *args = (LINK3args*)rop->request;
st->ino = kv_fh_inode(args->file);
st->dir_ino = kv_fh_inode(args->link.dir);
st->filename = args->link.name;
if (st->self->parent->trace)
fprintf(stderr, "[%d] LINK %ju -> %ju/%s\n", st->self->nfs_fd, st->ino, st->dir_ino, st->filename.c_str());
if (!st->ino || !st->dir_ino || st->filename == "")
{
LINK3res *reply = (LINK3res*)rop->reply;
*reply = (LINK3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
LINK3res *reply = (LINK3res*)st->rop->reply;
if (res < 0)
{
*reply = (LINK3res){ .status = vitastor_nfs_map_err(res) };
}
else
{
*reply = (LINK3res){
.status = NFS3_OK,
.resok = (LINK3resok){
.file_attributes = (post_op_attr){
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->ino, st->ientry),
},
},
};
}
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_link(st, 0);
return 1;
}

104
src/nfs_kv_lookup.cpp Normal file
View File

@@ -0,0 +1,104 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - LOOKUP, READLINK
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
LOOKUP3args *args = (LOOKUP3args*)rop->request;
LOOKUP3res *reply = (LOOKUP3res*)rop->reply;
inode_t dir_ino = kv_fh_inode(args->what.dir);
std::string filename = args->what.name;
if (self->parent->trace)
fprintf(stderr, "[%d] LOOKUP %ju/%s\n", self->nfs_fd, dir_ino, filename.c_str());
if (!dir_ino || filename == "")
{
*reply = (LOOKUP3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
self->parent->db->get(kv_direntry_key(dir_ino, filename), [=](int res, const std::string & value)
{
if (res < 0)
{
*reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(-res) };
rpc_queue_reply(rop);
return;
}
std::string err;
auto direntry = json11::Json::parse(value, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n", kv_direntry_key(dir_ino, filename).c_str(), value.c_str(), err.c_str());
*reply = (LOOKUP3res){ .status = NFS3ERR_IO };
rpc_queue_reply(rop);
return;
}
uint64_t ino = direntry["ino"].uint64_value();
kv_read_inode(self->parent, ino, [=](int res, const std::string & value, json11::Json ientry)
{
if (res < 0)
{
*reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(res == -ENOENT ? -EIO : res) };
rpc_queue_reply(rop);
return;
}
*reply = (LOOKUP3res){
.status = NFS3_OK,
.resok = (LOOKUP3resok){
.object = xdr_copy_string(rop->xdrs, kv_fh(ino)),
.obj_attributes = {
.attributes_follow = 1,
.attributes = get_kv_attributes(self, ino, ientry),
},
},
};
rpc_queue_reply(rop);
});
});
return 1;
}
int kv_nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
READLINK3args *args = (READLINK3args*)rop->request;
if (self->parent->trace)
fprintf(stderr, "[%d] READLINK %ju\n", self->nfs_fd, kv_fh_inode(args->symlink));
READLINK3res *reply = (READLINK3res*)rop->reply;
if (!kv_fh_valid(args->symlink) || args->symlink == NFS_ROOT_HANDLE)
{
// Invalid filehandle or trying to read symlink from root entry
*reply = (READLINK3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
kv_read_inode(self->parent, kv_fh_inode(args->symlink), [=](int res, const std::string & value, json11::Json attrs)
{
if (res < 0)
{
*reply = (READLINK3res){ .status = vitastor_nfs_map_err(-res) };
}
else if (attrs["type"] != "link")
{
*reply = (READLINK3res){ .status = NFS3ERR_INVAL };
}
else
{
*reply = (READLINK3res){
.status = NFS3_OK,
.resok = (READLINK3resok){
.data = xdr_copy_string(rop->xdrs, attrs["symlink"].string_value()),
},
};
}
rpc_queue_reply(rop);
});
return 1;
}

198
src/nfs_kv_read.cpp Normal file
View File

@@ -0,0 +1,198 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - READ
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
struct nfs_kv_read_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
bool allow_cache = true;
inode_t ino = 0;
uint64_t offset = 0, size = 0;
std::function<void(int)> cb;
// state
int res = 0;
int eof = 0;
json11::Json ientry;
uint64_t aligned_size = 0, aligned_offset = 0;
uint8_t *aligned_buf = NULL;
cluster_op_t *op = NULL;
uint8_t *buf = NULL;
};
#define align_down(size) ((size) & ~(st->self->parent->kvfs->pool_alignment-1))
#define align_up(size) (((size) + st->self->parent->kvfs->pool_alignment-1) & ~(st->self->parent->kvfs->pool_alignment-1))
static void nfs_kv_continue_read(nfs_kv_read_state *st, int state)
{
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_read()");
abort();
}
resume_0:
if (st->offset + sizeof(shared_file_header_t) < st->self->parent->kvfs->shared_inode_threshold)
{
kv_read_inode(st->self->parent, st->ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res;
st->ientry = attrs;
nfs_kv_continue_read(st, 1);
}, st->allow_cache);
return;
resume_1:
if (st->res < 0 || kv_map_type(st->ientry["type"].string_value()) != NF3REG)
{
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : -EINVAL);
return;
}
if (st->ientry["shared_ino"].uint64_value() != 0)
{
if (st->offset >= st->ientry["size"].uint64_value())
{
st->size = 0;
st->eof = 1;
auto cb = std::move(st->cb);
cb(0);
return;
}
st->op = new cluster_op_t;
{
st->op->opcode = OSD_OP_READ;
st->op->inode = st->ientry["shared_ino"].uint64_value();
// Always read including header to react if the file was possibly moved away
auto read_offset = st->ientry["shared_offset"].uint64_value();
st->op->offset = align_down(read_offset);
if (st->op->offset < read_offset)
{
st->op->iov.push_back(st->self->parent->kvfs->scrap_block.data(),
read_offset-st->op->offset);
}
auto read_size = st->offset+st->size;
if (read_size > st->ientry["size"].uint64_value())
{
st->eof = 1;
st->size = st->ientry["size"].uint64_value()-st->offset;
read_size = st->ientry["size"].uint64_value();
}
read_size += sizeof(shared_file_header_t);
assert(!st->aligned_buf);
st->aligned_buf = (uint8_t*)malloc_or_die(read_size);
st->buf = st->aligned_buf + sizeof(shared_file_header_t) + st->offset;
st->op->iov.push_back(st->aligned_buf, read_size);
st->op->len = align_up(read_offset+read_size) - st->op->offset;
if (read_offset+read_size < st->op->offset+st->op->len)
{
st->op->iov.push_back(st->self->parent->kvfs->scrap_block.data(),
st->op->offset+st->op->len - (read_offset+read_size));
}
}
st->op->callback = [st, state](cluster_op_t *op)
{
st->res = op->retval == op->len ? 0 : op->retval;
delete op;
nfs_kv_continue_read(st, 2);
};
st->self->parent->cli->execute(st->op);
return;
resume_2:
if (st->res < 0)
{
free(st->aligned_buf);
st->aligned_buf = NULL;
auto cb = std::move(st->cb);
cb(st->res);
return;
}
auto hdr = ((shared_file_header_t*)st->aligned_buf);
if (hdr->magic != SHARED_FILE_MAGIC_V1 || hdr->inode != st->ino)
{
// Got unrelated data - retry from the beginning
free(st->aligned_buf);
st->aligned_buf = NULL;
st->allow_cache = false;
goto resume_0;
}
auto cb = std::move(st->cb);
cb(0);
return;
}
}
st->aligned_offset = align_down(st->offset);
st->aligned_size = align_up(st->offset+st->size) - st->aligned_offset;
assert(!st->aligned_buf);
st->aligned_buf = (uint8_t*)malloc_or_die(st->aligned_size);
st->buf = st->aligned_buf + st->offset - st->aligned_offset;
st->op = new cluster_op_t;
st->op->opcode = OSD_OP_READ;
st->op->inode = st->ino;
st->op->offset = st->aligned_offset;
st->op->len = st->aligned_size;
st->op->iov.push_back(st->aligned_buf, st->aligned_size);
st->op->callback = [st](cluster_op_t *op)
{
st->res = op->retval;
delete op;
nfs_kv_continue_read(st, 3);
};
st->self->parent->cli->execute(st->op);
return;
resume_3:
if (st->res < 0)
{
free(st->aligned_buf);
st->aligned_buf = NULL;
}
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : 0);
return;
}
int kv_nfs3_read_proc(void *opaque, rpc_op_t *rop)
{
READ3args *args = (READ3args*)rop->request;
READ3res *reply = (READ3res*)rop->reply;
auto ino = kv_fh_inode(args->file);
if (args->count > MAX_REQUEST_SIZE || !ino)
{
*reply = (READ3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
auto st = new nfs_kv_read_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
st->ino = ino;
st->offset = args->offset;
st->size = args->count;
st->cb = [st](int res)
{
READ3res *reply = (READ3res*)st->rop->reply;
*reply = (READ3res){ .status = vitastor_nfs_map_err(res) };
if (res == 0)
{
xdr_add_malloc(st->rop->xdrs, st->aligned_buf);
reply->resok.data.data = (char*)st->buf;
reply->resok.data.size = st->size;
reply->resok.count = st->size;
reply->resok.eof = st->eof;
}
rpc_queue_reply(st->rop);
delete st;
};
if (st->self->parent->trace)
fprintf(stderr, "[%d] READ %ju %ju+%ju\n", st->self->nfs_fd, st->ino, st->offset, st->size);
nfs_kv_continue_read(st, 0);
return 1;
}

375
src/nfs_kv_readdir.cpp Normal file
View File

@@ -0,0 +1,375 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - READDIR, READDIRPLUS
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
static unsigned len_pad4(unsigned len)
{
return len + (len&3 ? 4-(len&3) : 0);
}
struct nfs_kv_readdir_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
// Request:
bool is_plus = false;
uint64_t cookie = 0;
uint64_t cookieverf = 0;
uint64_t dir_ino = 0;
uint64_t maxcount = 0;
std::function<void(int)> cb;
// State:
int res = 0;
std::string prefix, start;
void *list_handle;
uint64_t parent_ino = 0;
std::string ientry_text, parent_ientry_text;
json11::Json ientry, parent_ientry;
std::string cur_key, cur_value;
int reply_size = 0;
int to_skip = 0;
uint64_t offset = 0;
int getattr_running = 0, getattr_cur = 0;
// Result:
bool eof = false;
//uint64_t cookieverf = 0; // same field
std::vector<entryplus3> entries;
};
static void nfs_kv_continue_readdir(nfs_kv_readdir_state *st, int state);
static void kv_getattr_next(nfs_kv_readdir_state *st)
{
while (st->is_plus && st->getattr_cur < st->entries.size() && st->getattr_running < st->self->parent->kvfs->readdir_getattr_parallel)
{
auto idx = st->getattr_cur++;
st->getattr_running++;
kv_read_inode(st->self->parent, st->entries[idx].fileid, [st, idx](int res, const std::string & value, json11::Json ientry)
{
if (res == 0)
{
st->entries[idx].name_attributes = (post_op_attr){
// FIXME: maybe do not read parent attributes and leave them to a GETATTR?
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->entries[idx].fileid, ientry),
};
}
st->getattr_running--;
kv_getattr_next(st);
if (st->getattr_running == 0 && !st->list_handle)
{
nfs_kv_continue_readdir(st, 4);
}
});
}
}
static void nfs_kv_continue_readdir(nfs_kv_readdir_state *st, int state)
{
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_readdir()");
abort();
}
// Limit results based on maximum reply size
// Sadly we have to calculate reply size by hand
// reply without entries is 4+4+(dir_attributes ? sizeof(fattr3) : 0)+8+4 bytes
st->reply_size = 20;
if (st->reply_size > st->maxcount)
{
// Error, too small max reply size
auto cb = std::move(st->cb);
cb(-NFS3ERR_TOOSMALL);
return;
}
// Add . and ..
if (st->cookie <= 1)
{
kv_read_inode(st->self->parent, st->dir_ino, [st](int res, const std::string & value, json11::Json ientry)
{
st->res = res;
st->ientry_text = value;
st->ientry = ientry;
nfs_kv_continue_readdir(st, 1);
});
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (st->cookie == 0)
{
auto fh = kv_fh(st->dir_ino);
auto entry_size = 20 + 4/*len_pad4(".")*/ + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
if (st->reply_size + entry_size > st->maxcount)
{
auto cb = std::move(st->cb);
cb(-NFS3ERR_TOOSMALL);
return;
}
entryplus3 dot = {};
dot.name = xdr_copy_string(st->rop->xdrs, ".");
dot.fileid = st->dir_ino;
dot.name_attributes = (post_op_attr){
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->dir_ino, st->ientry),
};
dot.name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, fh),
};
st->entries.push_back(dot);
st->reply_size += entry_size;
}
st->parent_ino = st->ientry["parent_ino"].uint64_value();
if (st->parent_ino)
{
kv_read_inode(st->self->parent, st->ientry["parent_ino"].uint64_value(), [st](int res, const std::string & value, json11::Json ientry)
{
st->res = res;
st->parent_ientry_text = value;
st->parent_ientry = ientry;
nfs_kv_continue_readdir(st, 2);
});
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
auto fh = kv_fh(st->parent_ino);
auto entry_size = 20 + 4/*len_pad4("..")*/ + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
if (st->reply_size + entry_size > st->maxcount)
{
st->eof = false;
auto cb = std::move(st->cb);
cb(0);
return;
}
entryplus3 dotdot = {};
dotdot.name = xdr_copy_string(st->rop->xdrs, "..");
dotdot.fileid = st->dir_ino;
dotdot.name_attributes = (post_op_attr){
// FIXME: maybe do not read parent attributes and leave them to a GETATTR?
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self,
st->parent_ino ? st->parent_ino : st->dir_ino,
st->parent_ino ? st->parent_ientry : st->ientry),
};
dotdot.name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, fh),
};
st->entries.push_back(dotdot);
st->reply_size += entry_size;
}
st->prefix = kv_direntry_key(st->dir_ino, "");
st->eof = true;
st->start = st->prefix;
if (st->cookie > 1)
{
auto lc_it = st->self->parent->kvfs->list_cookies.find((list_cookie_t){ st->dir_ino, st->cookieverf, st->cookie });
if (lc_it != st->self->parent->kvfs->list_cookies.end())
{
st->start = st->prefix+lc_it->second.key;
st->to_skip = 1;
st->offset = st->cookie;
}
else
{
st->to_skip = st->cookie-2;
st->offset = 2;
st->cookieverf = ((uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62));
}
}
else
{
st->to_skip = 0;
st->offset = 2;
st->cookieverf = ((uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62));
}
{
auto lc_it = st->self->parent->kvfs->list_cookies.lower_bound((list_cookie_t){ st->dir_ino, st->cookieverf, 0 });
if (lc_it != st->self->parent->kvfs->list_cookies.end() &&
lc_it->first.dir_ino == st->dir_ino &&
lc_it->first.cookieverf == st->cookieverf &&
lc_it->first.cookie < st->cookie)
{
auto lc_start = lc_it;
while (lc_it != st->self->parent->kvfs->list_cookies.end() && lc_it->first.cookieverf == st->cookieverf)
{
lc_it++;
}
st->self->parent->kvfs->list_cookies.erase(lc_start, lc_it);
}
}
st->getattr_cur = st->entries.size();
st->list_handle = st->self->parent->db->list_start(st->start);
st->self->parent->db->list_next(st->list_handle, [=](int res, const std::string & key, const std::string & value)
{
st->res = res;
st->cur_key = key;
st->cur_value = value;
nfs_kv_continue_readdir(st, 3);
});
return;
while (st->list_handle)
{
st->self->parent->db->list_next(st->list_handle, NULL);
return;
resume_3:
if (st->res == -ENOENT || st->cur_key.size() < st->prefix.size() || st->cur_key.substr(0, st->prefix.size()) != st->prefix)
{
st->self->parent->db->list_close(st->list_handle);
st->list_handle = NULL;
break;
}
if (st->to_skip > 0)
{
st->to_skip--;
continue;
}
std::string err;
auto direntry = json11::Json::parse(st->cur_value, err);
if (err != "")
{
fprintf(stderr, "readdir: direntry %s contains invalid JSON: %s, skipping\n",
st->cur_key.c_str(), st->cur_value.c_str());
continue;
}
auto ino = direntry["ino"].uint64_value();
auto name = kv_direntry_filename(st->cur_key);
if (st->self->parent->trace)
{
fprintf(stderr, "[%d] READDIR %ju %lu %s\n",
st->self->nfs_fd, st->dir_ino, st->offset, name.c_str());
}
auto fh = kv_fh(ino);
// 1 entry3 is (8+4+(filename_len+3)/4*4+8) bytes
// 1 entryplus3 is (8+4+(filename_len+3)/4*4+8
// + 4+(name_attributes ? (sizeof(fattr3) = 84) : 0)
// + 4+(name_handle ? 4+(handle_len+3)/4*4 : 0)) bytes
auto entry_size = 20 + len_pad4(name.size()) + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
if (st->reply_size + entry_size > st->maxcount)
{
st->eof = false;
st->self->parent->db->list_close(st->list_handle);
st->list_handle = NULL;
break;
}
st->reply_size += entry_size;
auto idx = st->entries.size();
st->entries.push_back((entryplus3){});
auto entry = &st->entries[idx];
entry->name = xdr_copy_string(st->rop->xdrs, name);
entry->fileid = ino;
entry->cookie = st->offset++;
st->self->parent->kvfs->list_cookies[(list_cookie_t){ st->dir_ino, st->cookieverf, entry->cookie }] = { .key = name };
if (st->is_plus)
{
entry->name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, fh),
};
kv_getattr_next(st);
}
}
resume_4:
while (st->getattr_running > 0)
{
return;
}
void *prev = NULL;
for (int i = 0; i < st->entries.size(); i++)
{
entryplus3 *entry = &st->entries[i];
if (prev)
{
if (st->is_plus)
((entryplus3*)prev)->nextentry = entry;
else
((entry3*)prev)->nextentry = (entry3*)entry;
}
prev = entry;
}
// Send reply
auto cb = std::move(st->cb);
cb(0);
}
static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{
auto st = new nfs_kv_readdir_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
st->is_plus = is_plus;
if (st->is_plus)
{
READDIRPLUS3args *args = (READDIRPLUS3args*)rop->request;
st->dir_ino = kv_fh_inode(args->dir);
st->cookie = args->cookie;
st->cookieverf = *((uint64_t*)args->cookieverf);
st->maxcount = args->maxcount;
}
else
{
READDIR3args *args = ((READDIR3args*)rop->request);
st->dir_ino = kv_fh_inode(args->dir);
st->cookie = args->cookie;
st->cookieverf = *((uint64_t*)args->cookieverf);
st->maxcount = args->count;
}
if (st->self->parent->trace)
fprintf(stderr, "[%d] READDIR %ju VERF %jx OFFSET %ju LIMIT %ju\n", st->self->nfs_fd, st->dir_ino, st->cookieverf, st->cookie, st->maxcount);
st->cb = [st](int res)
{
if (st->is_plus)
{
READDIRPLUS3res *reply = (READDIRPLUS3res*)st->rop->reply;
*reply = (READDIRPLUS3res){ .status = vitastor_nfs_map_err(res) };
*(uint64_t*)(reply->resok.cookieverf) = st->cookieverf;
reply->resok.reply.entries = st->entries.size() ? &st->entries[0] : NULL;
reply->resok.reply.eof = st->eof;
}
else
{
READDIR3res *reply = (READDIR3res*)st->rop->reply;
*reply = (READDIR3res){ .status = vitastor_nfs_map_err(res) };
*(uint64_t*)(reply->resok.cookieverf) = st->cookieverf;
reply->resok.reply.entries = st->entries.size() ? (entry3*)&st->entries[0] : NULL;
reply->resok.reply.eof = st->eof;
}
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_readdir(st, 0);
}
int kv_nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
{
nfs3_readdir_common(opaque, rop, false);
return 0;
}
int kv_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
{
nfs3_readdir_common(opaque, rop, true);
return 0;
}

321
src/nfs_kv_remove.cpp Normal file
View File

@@ -0,0 +1,321 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - REMOVE, RMDIR
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
#include "cli.h"
struct kv_del_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
uint64_t dir_ino = 0;
std::string filename;
uint64_t ino = 0;
void *list_handle = NULL;
std::string prefix, list_key, direntry_text, ientry_text;
json11::Json direntry, ientry;
int type = 0;
bool is_rmdir = false;
bool rm_data = false;
bool allow_cache = true;
int res = 0, res2 = 0;
std::function<void(int)> cb;
};
static void nfs_kv_continue_delete(kv_del_state *st, int state)
{
// Overall algorithm:
// 1) Get inode attributes and check that it's not a directory (REMOVE)
// 2) Get inode attributes and check that it is a directory (RMDIR)
// 3) Delete direntry with CAS
// 4) Check that the directory didn't contain files (RMDIR) and restore it if it did
// 5) Reduce inode refcount by 1 or delete inode
// 6) If regular file and inode is deleted: delete data
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else if (state == 5) goto resume_5;
else if (state == 6) goto resume_6;
else if (state == 7) goto resume_7;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_delete()");
abort();
}
resume_0:
st->self->parent->db->get(kv_direntry_key(st->dir_ino, st->filename), [st](int res, const std::string & value)
{
st->res = res;
st->direntry_text = value;
nfs_kv_continue_delete(st, 1);
}, st->allow_cache);
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
std::string err;
st->direntry = json11::Json::parse(st->direntry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s, deleting\n",
kv_direntry_key(st->dir_ino, st->filename).c_str(), st->direntry_text.c_str(), err.c_str());
// Just delete direntry and skip inode
}
else
{
st->ino = st->direntry["ino"].uint64_value();
}
}
// Get inode
st->self->parent->db->get(kv_inode_key(st->ino), [st](int res, const std::string & value)
{
st->res = res;
st->ientry_text = value;
nfs_kv_continue_delete(st, 2);
}, st->allow_cache);
return;
resume_2:
if (st->res < 0)
{
fprintf(stderr, "error reading inode %s: %s (code %d)\n",
kv_inode_key(st->ino).c_str(), strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
std::string err;
st->ientry = json11::Json::parse(st->ientry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in inode %s = %s: %s, treating as a regular file\n",
kv_inode_key(st->ino).c_str(), st->ientry_text.c_str(), err.c_str());
}
}
// (1-2) Check type
st->type = kv_map_type(st->ientry["type"].string_value());
if (st->type == -1 || st->is_rmdir != (st->type == NF3DIR))
{
auto cb = std::move(st->cb);
cb(st->is_rmdir ? -ENOTDIR : -EISDIR);
return;
}
// (3) Delete direntry with CAS
st->self->parent->db->del(kv_direntry_key(st->dir_ino, st->filename), [st](int res)
{
st->res = res;
nfs_kv_continue_delete(st, 3);
}, [st](int res, const std::string & value)
{
return value == st->direntry_text;
});
return;
resume_3:
if (st->res == -EAGAIN)
{
// CAS failure, restart from the beginning
st->allow_cache = false;
goto resume_0;
}
else if (st->res < 0 && st->res != -ENOENT)
{
fprintf(stderr, "failed to remove direntry %s: %s (code %d)\n",
kv_direntry_key(st->dir_ino, st->filename).c_str(), strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (!st->ino)
{
// direntry contained invalid JSON and was deleted, finish
auto cb = std::move(st->cb);
cb(0);
return;
}
if (st->is_rmdir)
{
// (4) Check if directory actually is not empty
st->list_handle = st->self->parent->db->list_start(kv_direntry_key(st->ino, ""));
st->self->parent->db->list_next(st->list_handle, [st](int res, const std::string & key, const std::string & value)
{
st->res = res;
st->list_key = key;
st->self->parent->db->list_close(st->list_handle);
nfs_kv_continue_delete(st, 4);
});
return;
resume_4:
st->prefix = kv_direntry_key(st->ino, "");
if (st->res == -ENOENT || st->list_key.size() < st->prefix.size() || st->list_key.substr(0, st->prefix.size()) != st->prefix)
{
// OK, directory is empty
}
else
{
// Not OK, restore direntry
st->self->parent->db->del(kv_direntry_key(st->dir_ino, st->filename), [st](int res)
{
st->res2 = res;
nfs_kv_continue_delete(st, 5);
}, [st](int res, const std::string & value)
{
return res == -ENOENT;
});
return;
resume_5:
if (st->res2 < 0)
{
fprintf(stderr, "failed to restore direntry %s (%s): %s (code %d)",
kv_direntry_key(st->dir_ino, st->filename).c_str(), st->direntry_text.c_str(), strerror(-st->res2), st->res2);
fprintf(stderr, " - inode %ju may be left as garbage\n", st->ino);
}
if (st->res < 0)
{
fprintf(stderr, "failed to list entries from %s: %s (code %d)\n",
kv_direntry_key(st->ino, "").c_str(), strerror(-st->res), st->res);
}
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : -ENOTEMPTY);
return;
}
}
// (5) Reduce inode refcount by 1 or delete inode
if (st->ientry["nlink"].uint64_value() > 1)
{
auto copy = st->ientry.object_items();
copy["nlink"] = st->ientry["nlink"].uint64_value()-1;
copy["ctime"] = nfstime_now_str();
st->self->parent->db->set(kv_inode_key(st->ino), json11::Json(copy).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_delete(st, 6);
}, [st](int res, const std::string & old_value)
{
return old_value == st->ientry_text;
});
}
else
{
st->self->parent->kvfs->touch_queue.erase(st->ino);
st->self->parent->db->del(kv_inode_key(st->ino), [st](int res)
{
st->res = res;
nfs_kv_continue_delete(st, 6);
}, [st](int res, const std::string & old_value)
{
return old_value == st->ientry_text;
});
}
return;
resume_6:
if (st->res < 0)
{
// Assume EAGAIN is OK, maybe someone created a hard link in the meantime
auto cb = std::move(st->cb);
cb(st->res == -EAGAIN ? 0 : st->res);
return;
}
// (6) If regular file and inode is deleted: delete data
if ((!st->type || st->type == NF3REG) && st->ientry["nlink"].uint64_value() <= 1 &&
!st->ientry["shared_ino"].uint64_value())
{
// Remove data
st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
{ "inode", INODE_NO_POOL(st->ino) },
{ "pool", (uint64_t)INODE_POOL(st->ino) },
}), [st](const cli_result_t & r)
{
if (r.err)
{
fprintf(stderr, "Failed to remove inode %jx data: %s (code %d)\n",
st->ino, r.text.c_str(), r.err);
}
st->res = r.err;
nfs_kv_continue_delete(st, 7);
});
return;
resume_7:
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (!st->res)
{
st->self->parent->kvfs->touch_queue.insert(st->dir_ino);
}
auto cb = std::move(st->cb);
cb(0);
}
int kv_nfs3_remove_proc(void *opaque, rpc_op_t *rop)
{
kv_del_state *st = new kv_del_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
REMOVE3res *reply = (REMOVE3res*)rop->reply;
REMOVE3args *args = (REMOVE3args*)rop->request;
st->dir_ino = kv_fh_inode(args->object.dir);
st->filename = args->object.name;
if (st->self->parent->trace)
fprintf(stderr, "[%d] REMOVE %ju/%s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str());
if (!st->dir_ino)
{
*reply = (REMOVE3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
*((REMOVE3res*)st->rop->reply) = (REMOVE3res){
.status = vitastor_nfs_map_err(res),
};
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_delete(st, 0);
return 1;
}
int kv_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
{
kv_del_state *st = new kv_del_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
RMDIR3args *args = (RMDIR3args*)rop->request;
RMDIR3res *reply = (RMDIR3res*)rop->reply;
st->dir_ino = kv_fh_inode(args->object.dir);
st->filename = args->object.name;
st->is_rmdir = true;
if (st->self->parent->trace)
fprintf(stderr, "[%d] RMDIR %ju/%s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str());
if (!st->dir_ino)
{
*reply = (RMDIR3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
*((RMDIR3res*)st->rop->reply) = (RMDIR3res){
.status = vitastor_nfs_map_err(res),
};
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_delete(st, 0);
return 1;
}

401
src/nfs_kv_rename.cpp Normal file
View File

@@ -0,0 +1,401 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - RENAME
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
#include "cli.h"
struct nfs_kv_rename_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
// params:
uint64_t old_dir_ino = 0, new_dir_ino = 0;
std::string old_name, new_name;
// state:
bool allow_cache = true;
std::string old_direntry_text, old_ientry_text, new_direntry_text, new_ientry_text;
json11::Json old_direntry, old_ientry, new_direntry, new_ientry;
std::string new_dir_prefix;
void *list_handle = NULL;
bool new_exists = false;
bool rm_dest_data = false;
int res = 0, res2 = 0;
std::function<void(int)> cb;
};
static void nfs_kv_continue_rename(nfs_kv_rename_state *st, int state)
{
// Algorithm (non-atomic of course):
// 1) Read source direntry
// 2) Read destination direntry
// 3) If destination exists:
// 3.1) Check file/folder compatibility (EISDIR/ENOTDIR)
// 3.2) Check if destination is empty if it's a folder
// 4) If not:
// 4.1) Check that the destination directory is actually a directory
// 5) Overwrite destination direntry, restart from beginning if CAS failure
// 6) Delete source direntry, restart from beginning if CAS failure
// 7) If the moved direntry was a regular file:
// 7.1) Read inode
// 7.2) Delete inode if its link count <= 1
// 7.3) Delete inode data if its link count <= 1 and it's a regular non-shared file
// 7.4) Reduce link count by 1 if it's > 1
// 8) If the moved direntry is a directory:
// 8.1) Change parent_ino reference in its inode
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else if (state == 5) goto resume_5;
else if (state == 6) goto resume_6;
else if (state == 7) goto resume_7;
else if (state == 8) goto resume_8;
else if (state == 9) goto resume_9;
else if (state == 10) goto resume_10;
else if (state == 11) goto resume_11;
else if (state == 12) goto resume_12;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_rename()");
abort();
}
resume_0:
// Read the old direntry
st->self->parent->db->get(kv_direntry_key(st->old_dir_ino, st->old_name), [=](int res, const std::string & value)
{
st->res = res;
st->old_direntry_text = value;
nfs_kv_continue_rename(st, 1);
}, st->allow_cache);
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
std::string err;
st->old_direntry = json11::Json::parse(st->old_direntry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n",
kv_direntry_key(st->old_dir_ino, st->old_name).c_str(),
st->old_direntry_text.c_str(), err.c_str());
auto cb = std::move(st->cb);
cb(-EIO);
return;
}
}
// Read the new direntry
st->self->parent->db->get(kv_direntry_key(st->new_dir_ino, st->new_name), [=](int res, const std::string & value)
{
st->res = res;
st->new_direntry_text = value;
nfs_kv_continue_rename(st, 2);
}, st->allow_cache);
return;
resume_2:
if (st->res < 0 && st->res != -ENOENT)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (st->res == 0)
{
std::string err;
st->new_direntry = json11::Json::parse(st->new_direntry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n",
kv_direntry_key(st->new_dir_ino, st->new_name).c_str(),
st->new_direntry_text.c_str(), err.c_str());
auto cb = std::move(st->cb);
cb(-EIO);
return;
}
}
st->new_exists = st->res == 0;
if (st->new_exists)
{
// Check file/folder compatibility (EISDIR/ENOTDIR)
if ((st->old_direntry["type"] == "dir") != (st->new_direntry["type"] == "dir"))
{
auto cb = std::move(st->cb);
cb((st->new_direntry["type"] == "dir") ? -ENOTDIR : -EISDIR);
return;
}
if (st->new_direntry["type"] == "dir")
{
// Check that the destination directory is empty
st->new_dir_prefix = kv_direntry_key(st->new_direntry["ino"].uint64_value(), "");
st->list_handle = st->self->parent->db->list_start(st->new_dir_prefix);
st->self->parent->db->list_next(st->list_handle, [st](int res, const std::string & key, const std::string & value)
{
st->res = res;
nfs_kv_continue_rename(st, 3);
});
return;
resume_3:
st->self->parent->db->list_close(st->list_handle);
if (st->res != -ENOENT)
{
auto cb = std::move(st->cb);
cb(-ENOTEMPTY);
return;
}
}
}
else
{
// Check that the new directory is actually a directory
kv_read_inode(st->self->parent, st->new_dir_ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res == 0 ? (attrs["type"].string_value() == "dir" ? 0 : -ENOTDIR) : res;
nfs_kv_continue_rename(st, 4);
});
return;
resume_4:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
// Write the new direntry
st->self->parent->db->set(kv_direntry_key(st->new_dir_ino, st->new_name), st->old_direntry_text, [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 5);
}, [st](int res, const std::string & old_value)
{
return st->new_exists ? (old_value == st->new_direntry_text) : (res == -ENOENT);
});
return;
resume_5:
if (st->res == -EAGAIN)
{
// CAS failure
st->allow_cache = false;
goto resume_0;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
// Delete the old direntry
st->self->parent->db->del(kv_direntry_key(st->old_dir_ino, st->old_name), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 6);
}, [=](int res, const std::string & old_value)
{
return res == 0 && old_value == st->old_direntry_text;
});
return;
resume_6:
if (st->res == -EAGAIN)
{
// CAS failure
st->allow_cache = false;
goto resume_0;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
st->allow_cache = true;
resume_7again:
if (st->new_exists && st->new_direntry["type"].string_value() != "dir")
{
// (Maybe) delete old destination file data
kv_read_inode(st->self->parent, st->new_direntry["ino"].uint64_value(), [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res;
st->new_ientry_text = value;
st->new_ientry = attrs;
nfs_kv_continue_rename(st, 7);
}, st->allow_cache);
return;
resume_7:
if (st->res == 0)
{
// (5) Reduce inode refcount by 1 or delete inode
if (st->new_ientry["nlink"].uint64_value() > 1)
{
auto copy = st->new_ientry.object_items();
copy["nlink"] = st->new_ientry["nlink"].uint64_value()-1;
copy["ctime"] = nfstime_now_str();
copy.erase("verf");
st->self->parent->db->set(kv_inode_key(st->new_direntry["ino"].uint64_value()), json11::Json(copy).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 8);
}, [st](int res, const std::string & old_value)
{
return old_value == st->new_ientry_text;
});
}
else
{
st->rm_dest_data = kv_map_type(st->new_ientry["type"].string_value()) == NF3REG
&& !st->new_ientry["shared_ino"].uint64_value();
st->self->parent->db->del(kv_inode_key(st->new_direntry["ino"].uint64_value()), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 8);
}, [st](int res, const std::string & old_value)
{
return old_value == st->new_ientry_text;
});
}
return;
resume_8:
if (st->res == -EAGAIN)
{
// CAS failure - re-read inode
st->allow_cache = false;
goto resume_7again;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
// Delete inode data if required
if (st->rm_dest_data)
{
st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
{ "inode", INODE_NO_POOL(st->new_direntry["ino"].uint64_value()) },
{ "pool", (uint64_t)INODE_POOL(st->new_direntry["ino"].uint64_value()) },
}), [st](const cli_result_t & r)
{
if (r.err)
{
fprintf(stderr, "Failed to remove inode %jx data: %s (code %d)\n",
st->new_direntry["ino"].uint64_value(), r.text.c_str(), r.err);
}
st->res = r.err;
nfs_kv_continue_rename(st, 9);
});
return;
resume_9:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
}
}
if (st->old_direntry["type"].string_value() == "dir" && st->new_dir_ino != st->old_dir_ino)
{
// Change parent_ino in old ientry
st->allow_cache = true;
resume_10:
kv_read_inode(st->self->parent, st->old_direntry["ino"].uint64_value(), [st](int res, const std::string & value, json11::Json ientry)
{
st->res = res;
st->old_ientry_text = value;
st->old_ientry = ientry;
nfs_kv_continue_rename(st, 11);
}, st->allow_cache);
return;
resume_11:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
auto ientry_new = st->old_ientry.object_items();
ientry_new["parent_ino"] = st->new_dir_ino;
ientry_new["ctime"] = nfstime_now_str();
ientry_new.erase("verf");
st->self->parent->db->set(kv_inode_key(st->old_direntry["ino"].uint64_value()), json11::Json(ientry_new).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 12);
}, [st](int res, const std::string & old_value)
{
return old_value == st->old_ientry_text;
});
}
return;
resume_12:
if (st->res == -EAGAIN)
{
// CAS failure - try again
st->allow_cache = false;
goto resume_10;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
if (!st->res)
{
st->self->parent->kvfs->touch_queue.insert(st->old_dir_ino);
st->self->parent->kvfs->touch_queue.insert(st->new_dir_ino);
}
auto cb = std::move(st->cb);
cb(st->res);
}
int kv_nfs3_rename_proc(void *opaque, rpc_op_t *rop)
{
auto st = new nfs_kv_rename_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
RENAME3args *args = (RENAME3args*)rop->request;
st->old_dir_ino = kv_fh_inode(args->from.dir);
st->new_dir_ino = kv_fh_inode(args->to.dir);
st->old_name = args->from.name;
st->new_name = args->to.name;
if (st->self->parent->trace)
fprintf(stderr, "[%d] RENAME %ju/%s -> %ju/%s\n", st->self->nfs_fd, st->old_dir_ino, st->old_name.c_str(), st->new_dir_ino, st->new_name.c_str());
if (!st->old_dir_ino || !st->new_dir_ino || st->old_name == "" || st->new_name == "")
{
RENAME3res *reply = (RENAME3res*)rop->reply;
*reply = (RENAME3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
if (st->old_dir_ino == st->new_dir_ino && st->old_name == st->new_name)
{
RENAME3res *reply = (RENAME3res*)rop->reply;
*reply = (RENAME3res){ .status = NFS3_OK };
rpc_queue_reply(st->rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
RENAME3res *reply = (RENAME3res*)st->rop->reply;
*reply = (RENAME3res){ .status = vitastor_nfs_map_err(res) };
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_rename(st, 0);
return 1;
}

204
src/nfs_kv_setattr.cpp Normal file
View File

@@ -0,0 +1,204 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - SETATTR
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
#include "cli.h"
struct nfs_kv_setattr_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
uint64_t ino = 0;
uint64_t old_size = 0, new_size = 0;
std::string expected_ctime;
json11::Json::object set_attrs;
int res = 0, cas_res = 0;
std::string ientry_text;
json11::Json ientry;
json11::Json::object new_attrs;
std::function<void(int)> cb;
};
static void nfs_kv_continue_setattr(nfs_kv_setattr_state *st, int state)
{
// FIXME: NFS client does a lot of setattr calls, so maybe process them asynchronously
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_setattr()");
abort();
}
st->self->parent->kvfs->touch_queue.erase(st->ino);
resume_0:
kv_read_inode(st->self->parent, st->ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res;
st->ientry_text = value;
st->ientry = attrs;
nfs_kv_continue_setattr(st, 1);
});
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (st->ientry["type"].string_value() != "file" &&
st->ientry["type"].string_value() != "" &&
!st->set_attrs["size"].is_null())
{
auto cb = std::move(st->cb);
cb(-EINVAL);
return;
}
if (st->expected_ctime != "")
{
auto actual_ctime = (st->ientry["ctime"].is_null() ? st->ientry["mtime"] : st->ientry["ctime"]);
if (actual_ctime != st->expected_ctime)
{
auto cb = std::move(st->cb);
cb(NFS3ERR_NOT_SYNC);
return;
}
}
// Now we can update it
st->new_attrs = st->ientry.object_items();
st->old_size = st->ientry["size"].uint64_value();
for (auto & kv: st->set_attrs)
{
if (kv.first == "size")
{
st->new_size = kv.second.uint64_value();
}
st->new_attrs[kv.first] = kv.second;
}
st->new_attrs.erase("verf");
st->new_attrs["ctime"] = nfstime_now_str();
st->self->parent->db->set(kv_inode_key(st->ino), json11::Json(st->new_attrs).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_setattr(st, 2);
}, [st](int res, const std::string & cas_value)
{
st->cas_res = res;
return (res == 0 || res == -ENOENT && st->ino == KV_ROOT_INODE) && cas_value == st->ientry_text;
});
return;
resume_2:
if (st->cas_res == -ENOENT)
{
st->res = -ENOENT;
}
if (st->res == -EAGAIN)
{
// Retry
goto resume_0;
}
if (st->res < 0)
{
fprintf(stderr, "Failed to update inode %ju: %s (code %d)\n", st->ino, strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (!st->set_attrs["size"].is_null() &&
st->ientry["size"].uint64_value() > st->set_attrs["size"].uint64_value() &&
!st->ientry["shared_ino"].uint64_value())
{
// Delete extra data when downsizing
st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
{ "inode", INODE_NO_POOL(st->ino) },
{ "pool", (uint64_t)INODE_POOL(st->ino) },
{ "min_offset", st->set_attrs["size"].uint64_value() },
}), [st](const cli_result_t & r)
{
if (r.err)
{
fprintf(stderr, "Failed to truncate inode %ju: %s (code %d)\n",
st->ino, r.text.c_str(), r.err);
}
st->res = r.err;
nfs_kv_continue_setattr(st, 3);
});
return;
}
resume_3:
auto cb = std::move(st->cb);
cb(st->res);
}
int kv_nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
{
nfs_kv_setattr_state *st = new nfs_kv_setattr_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (SETATTR3args*)rop->request;
auto reply = (SETATTR3res*)rop->reply;
std::string fh = args->object;
if (!kv_fh_valid(fh))
{
*reply = (SETATTR3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->ino = kv_fh_inode(fh);
if (args->guard.check)
st->expected_ctime = nfstime_to_str(args->guard.obj_ctime);
if (args->new_attributes.size.set_it)
st->set_attrs["size"] = args->new_attributes.size.size;
if (args->new_attributes.mode.set_it)
st->set_attrs["mode"] = (uint64_t)args->new_attributes.mode.mode;
if (args->new_attributes.uid.set_it)
st->set_attrs["uid"] = (uint64_t)args->new_attributes.uid.uid;
if (args->new_attributes.gid.set_it)
st->set_attrs["gid"] = (uint64_t)args->new_attributes.gid.gid;
if (args->new_attributes.atime.set_it == SET_TO_SERVER_TIME)
st->set_attrs["atime"] = nfstime_now_str();
else if (args->new_attributes.atime.set_it == SET_TO_CLIENT_TIME)
st->set_attrs["atime"] = nfstime_to_str(args->new_attributes.atime.atime);
if (args->new_attributes.mtime.set_it == SET_TO_SERVER_TIME)
st->set_attrs["mtime"] = nfstime_now_str();
else if (args->new_attributes.mtime.set_it == SET_TO_CLIENT_TIME)
st->set_attrs["mtime"] = nfstime_to_str(args->new_attributes.mtime.mtime);
if (st->self->parent->trace)
fprintf(stderr, "[%d] SETATTR %ju ATTRS %s\n", st->self->nfs_fd, st->ino, json11::Json(st->set_attrs).dump().c_str());
st->cb = [st](int res)
{
auto reply = (SETATTR3res*)st->rop->reply;
if (res < 0)
{
*reply = (SETATTR3res){
.status = vitastor_nfs_map_err(res),
};
}
else
{
*reply = (SETATTR3res){
.status = NFS3_OK,
.resok = (SETATTR3resok){
.obj_wcc = (wcc_data){
.after = (post_op_attr){
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->ino, st->new_attrs),
},
},
},
};
}
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_setattr(st, 0);
return 1;
}

1012
src/nfs_kv_write.cpp Normal file

File diff suppressed because it is too large Load Diff

126
src/nfs_mount.cpp Normal file
View File

@@ -0,0 +1,126 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy - common NULL, ACCESS, COMMIT, DUMP, EXPORT, MNT, UMNT, UMNTALL
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs/nfs.h"
nfsstat3 vitastor_nfs_map_err(int err)
{
if (err < 0)
{
err = -err;
}
return (err == EINVAL ? NFS3ERR_INVAL
: (err == ENOENT ? NFS3ERR_NOENT
: (err == ENOSPC ? NFS3ERR_NOSPC
: (err == EEXIST ? NFS3ERR_EXIST
: (err == EISDIR ? NFS3ERR_ISDIR
: (err == ENOTDIR ? NFS3ERR_NOTDIR
: (err == ENOTEMPTY ? NFS3ERR_NOTEMPTY
: (err == EIO ? NFS3ERR_IO : (err ? NFS3ERR_IO : NFS3_OK)))))))));
}
int nfs3_null_proc(void *opaque, rpc_op_t *rop)
{
rpc_queue_reply(rop);
return 0;
}
int nfs3_access_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
ACCESS3args *args = (ACCESS3args*)rop->request;
ACCESS3res *reply = (ACCESS3res*)rop->reply;
*reply = (ACCESS3res){
.status = NFS3_OK,
.resok = (ACCESS3resok){
.access = args->access,
},
};
rpc_queue_reply(rop);
return 0;
}
int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//COMMIT3args *args = (COMMIT3args*)rop->request;
cluster_op_t *op = new cluster_op_t;
// fsync. we don't know how to fsync a single inode, so just fsync everything
op->opcode = OSD_OP_SYNC;
op->callback = [self, rop](cluster_op_t *op)
{
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
*(uint64_t*)reply->resok.verf = self->parent->server_id;
rpc_queue_reply(rop);
};
self->parent->cli->execute(op);
return 1;
}
int mount3_mnt_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *args = (nfs_dirpath*)rop->request;
if (self->parent->trace)
fprintf(stderr, "[%d] MNT\n", self->nfs_fd);
nfs_mountres3 *reply = (nfs_mountres3*)rop->reply;
u_int flavor = RPC_AUTH_NONE;
reply->fhs_status = MNT3_OK;
reply->mountinfo.fhandle = xdr_copy_string(rop->xdrs, NFS_ROOT_HANDLE);
reply->mountinfo.auth_flavors.auth_flavors_len = 1;
reply->mountinfo.auth_flavors.auth_flavors_val = (u_int*)xdr_copy_string(rop->xdrs, (char*)&flavor, sizeof(u_int)).data;
rpc_queue_reply(rop);
return 0;
}
int mount3_dump_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
if (self->parent->trace)
fprintf(stderr, "[%d] DUMP\n", self->nfs_fd);
nfs_mountlist *reply = (nfs_mountlist*)rop->reply;
*reply = (struct nfs_mountbody*)malloc_or_die(sizeof(struct nfs_mountbody));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ml_hostname = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ml_directory = xdr_copy_string(rop->xdrs, self->parent->export_root);
(*reply)->ml_next = NULL;
rpc_queue_reply(rop);
return 0;
}
int mount3_umnt_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *arg = (nfs_dirpath*)rop->request;
// do nothing
rpc_queue_reply(rop);
return 0;
}
int mount3_umntall_proc(void *opaque, rpc_op_t *rop)
{
// do nothing
rpc_queue_reply(rop);
return 0;
}
int mount3_export_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
nfs_exports *reply = (nfs_exports*)rop->reply;
*reply = (struct nfs_exportnode*)calloc_or_die(1, sizeof(struct nfs_exportnode) + sizeof(struct nfs_groupnode));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ex_dir = xdr_copy_string(rop->xdrs, self->parent->export_root);
(*reply)->ex_groups = (struct nfs_groupnode*)(reply+1);
(*reply)->ex_groups->gr_name = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ex_groups->gr_next = NULL;
(*reply)->ex_next = NULL;
rpc_queue_reply(rop);
return 0;
}

View File

@@ -10,9 +10,10 @@
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include <signal.h>
#include "nfs/nfs.h"
#include "nfs/rpc.h"
@@ -21,6 +22,9 @@
#include "addr_util.h"
#include "str_util.h"
#include "nfs_proxy.h"
#include "nfs_kv.h"
#include "nfs_block.h"
#include "nfs_common.h"
#include "http_client.h"
#include "cli.h"
@@ -31,6 +35,12 @@ const char *exe_name = NULL;
nfs_proxy_t::~nfs_proxy_t()
{
if (kvfs)
delete kvfs;
if (blockfs)
delete blockfs;
if (db)
delete db;
if (cmd)
delete cmd;
if (cli)
@@ -44,43 +54,90 @@ nfs_proxy_t::~nfs_proxy_t()
delete ringloop;
}
static const char* help_text =
"Vitastor NFS 3.0 proxy " VERSION "\n"
"(c) Vitaliy Filippov, 2021+ (VNPL-1.1)\n"
"\n"
"vitastor-nfs (--fs <NAME> | --block) [-o <OPT>] mount <MOUNTPOINT>\n"
" Start local filesystem server and mount file system to <MOUNTPOINT>.\n"
" Use regular `umount <MOUNTPOINT>` to unmount the FS.\n"
" The server will be automatically stopped when the FS is unmounted.\n"
" -o|--options <OPT> Pass additional NFS mount options (ex.: -o async).\n"
"\n"
"vitastor-nfs (--fs <NAME> | --block) start\n"
" Start network NFS server. Options:\n"
" --bind <IP> bind service to <IP> address (default 0.0.0.0)\n"
" --port <PORT> use port <PORT> for NFS services (default is 2049)\n"
" --portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)\n"
"\n"
"OPTIONS:\n"
" --fs <NAME> use VitastorFS with metadata in image <NAME>\n"
" --block use pseudo-FS presenting images as files\n"
" --pool <POOL> use <POOL> as default pool for new files\n"
" --subdir <DIR> export <DIR> instead of root directory (pseudo-FS only)\n"
" --nfspath <PATH> set NFS export path to <PATH> (default is /)\n"
" --pidfile <FILE> write process ID to the specified file\n"
" --logfile <FILE> log to the specified file\n"
" --foreground 1 stay in foreground, do not daemonize\n"
"\n"
"NFS proxy is stateless if you use immediate_commit=all in your cluster and if\n"
"you do not use client_enable_writeback=true, so you can freely use multiple\n"
"NFS proxies with L3 load balancing in this case.\n"
"\n"
"Example start and mount commands for a custom NFS port:\n"
" vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool\n"
" mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp\n"
"Or just:\n"
" vitastor-nfs mount --block --pool testpool /mnt/\n"
;
json11::Json::object nfs_proxy_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
std::vector<std::string> cmd;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor NFS 3.0 proxy\n"
"(c) Vitaliy Filippov, 2021-2022 (VNPL-1.1)\n"
"\n"
"USAGE:\n"
" %s [STANDARD OPTIONS] [OTHER OPTIONS]\n"
" --subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)\n"
" --portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)\n"
" --bind <IP> bind service to <IP> address (default 0.0.0.0)\n"
" --nfspath <PATH> set NFS export path to <PATH> (default is /)\n"
" --port <PORT> use port <PORT> for NFS services (default is 2049)\n"
" --pool <POOL> use <POOL> as default pool for new files (images)\n"
" --foreground 1 stay in foreground, do not daemonize\n"
"\n"
"NFS proxy is stateless if you use immediate_commit=all in your cluster and if\n"
"you do not use client_enable_writeback=true, so you can freely use multiple\n"
"NFS proxies with L3 load balancing in this case.\n"
"\n"
"Example start and mount commands for a custom NFS port:\n"
" %s --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool\n"
" mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp\n",
exe_name, exe_name
);
printf("%s", help_text);
exit(0);
}
else if (!strcmp(args[i], "-o") || !strcmp(args[i], "--options"))
{
if (i >= narg-1)
{
printf("%s", help_text);
exit(0);
}
const std::string & old = cfg["options"].string_value();
cfg["options"] = old != "" ? old+","+args[i+1] : args[i+1];
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
cfg[opt] = !strcmp(opt, "json") || !strcmp(opt, "block") || i == narg-1 ? "1" : args[++i];
}
else
{
cmd.push_back(args[i]);
}
}
if (cfg.find("block") == cfg.end() && cfg.find("fs") == cfg.end())
{
fprintf(stderr, "Specify one of --block or --fs NAME. Use vitastor-nfs --help for details\n");
exit(1);
}
if (cmd.size() >= 2 && cmd[0] == "mount")
{
cfg["mount"] = cmd[1];
}
else if (cmd.size() >= 1 && cmd[0] == "start")
{
}
else
{
printf("%s", help_text);
exit(1);
}
return cfg;
}
@@ -92,6 +149,10 @@ void nfs_proxy_t::run(json11::Json cfg)
srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
server_id = (uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62);
// Parse options
if (cfg["logfile"].string_value() != "")
logfile = cfg["logfile"].string_value();
pidfile = cfg["pidfile"].string_value();
trace = cfg["log_level"].uint64_value() > 5 || cfg["trace"].uint64_value() > 0;
bind_address = cfg["bind"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
@@ -103,18 +164,6 @@ void nfs_proxy_t::run(json11::Json cfg)
export_root = cfg["nfspath"].string_value();
if (!export_root.size())
export_root = "/";
name_prefix = cfg["subdir"].string_value();
{
int e = name_prefix.size();
while (e > 0 && name_prefix[e-1] == '/')
e--;
int s = 0;
while (s < e && name_prefix[s] == '/')
s++;
name_prefix = name_prefix.substr(s, e-s);
if (name_prefix.size())
name_prefix += "/";
}
if (cfg["client_writeback_allowed"].is_null())
{
// NFS is always aware of fsync, so we allow write-back cache
@@ -123,6 +172,16 @@ void nfs_proxy_t::run(json11::Json cfg)
obj["client_writeback_allowed"] = true;
cfg = obj;
}
mountpoint = cfg["mount"].string_value();
if (mountpoint != "")
{
bind_address = "127.0.0.1";
nfs_port = 0;
portmap_enabled = false;
exit_on_umount = true;
}
mountopts = cfg["options"].string_value();
fsname = cfg["fs"].string_value();
// Create client
ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
epmgr = new epoll_manager_t(ringloop);
@@ -131,67 +190,7 @@ void nfs_proxy_t::run(json11::Json cfg)
cmd->ringloop = ringloop;
cmd->epmgr = epmgr;
cmd->cli = cli;
// We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
dir_info[""] = (nfs_dir_t){
.id = 1,
.mod_rev = 0,
};
clock_gettime(CLOCK_REALTIME, &dir_info[""].mtime);
watch_stats();
assert(cli->st_cli.on_inode_change_hook == NULL);
cli->st_cli.on_inode_change_hook = [this](inode_t changed_inode, bool removed)
{
auto inode_cfg_it = cli->st_cli.inode_config.find(changed_inode);
if (inode_cfg_it == cli->st_cli.inode_config.end())
{
return;
}
auto & inode_cfg = inode_cfg_it->second;
std::string full_name = inode_cfg.name;
if (name_prefix != "" && full_name.substr(0, name_prefix.size()) != name_prefix)
{
return;
}
// Calculate directory modification time and revision (used as "cookie verifier")
timespec now;
clock_gettime(CLOCK_REALTIME, &now);
dir_info[""].mod_rev = dir_info[""].mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_info[""].mod_rev;
dir_info[""].mtime = now;
int pos = full_name.find('/', name_prefix.size());
while (pos >= 0)
{
std::string dir = full_name.substr(0, pos);
auto & dinf = dir_info[dir];
if (!dinf.id)
dinf.id = next_dir_id++;
dinf.mod_rev = dinf.mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dinf.mod_rev;
dinf.mtime = now;
dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
pos = full_name.find('/', pos+1);
}
// Alter inode_by_hash
if (removed)
{
auto ino_it = hash_by_inode.find(changed_inode);
if (ino_it != hash_by_inode.end())
{
inode_by_hash.erase(ino_it->second);
hash_by_inode.erase(ino_it);
}
}
else
{
std::string hash = "S"+base64_encode(sha256(full_name));
auto hbi_it = hash_by_inode.find(changed_inode);
if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
{
// inode had a different name, remove old hash=>inode pointer
inode_by_hash.erase(hbi_it->second);
}
inode_by_hash[hash] = changed_inode;
hash_by_inode[changed_inode] = hash;
}
};
// Load image metadata
while (!cli->is_ready())
{
@@ -202,6 +201,17 @@ void nfs_proxy_t::run(json11::Json cfg)
}
// Check default pool
check_default_pool();
// Check if we're using VitastorFS
if (fsname == "")
{
blockfs = new block_fs_state_t();
blockfs->init(this, cfg);
}
else
{
kvfs = new kv_fs_state_t();
kvfs->init(this, cfg);
}
// Self-register portmap and NFS
pmap.reg_ports.insert((portmap_id_t){
.prog = PMAP_PROGRAM,
@@ -232,7 +242,7 @@ void nfs_proxy_t::run(json11::Json cfg)
.addr = "0.0.0.0.0."+std::to_string(nfs_port),
});
// Create NFS socket and add it to epoll
int nfs_socket = create_and_bind_socket(bind_address, nfs_port, 128, NULL);
int nfs_socket = create_and_bind_socket(bind_address, nfs_port, 128, &listening_port);
fcntl(nfs_socket, F_SETFL, fcntl(nfs_socket, F_GETFL, 0) | O_NONBLOCK);
epmgr->tfd->set_fd_handler(nfs_socket, false, [this](int nfs_socket, int epoll_events)
{
@@ -264,17 +274,40 @@ void nfs_proxy_t::run(json11::Json cfg)
}
});
}
if (mountpoint != "")
{
mount_fs();
}
if (cfg["foreground"].is_null())
{
daemonize();
}
while (true)
if (pidfile != "")
{
write_pid();
}
while (!finished)
{
ringloop->loop();
ringloop->wait();
}
// Destroy the client
cli->flush();
if (kvfs)
{
delete kvfs;
kvfs = NULL;
}
if (blockfs)
{
delete blockfs;
blockfs = NULL;
}
if (db)
{
delete db;
db = NULL;
}
delete cli;
delete epmgr;
delete ringloop;
@@ -351,7 +384,7 @@ void nfs_proxy_t::parse_stats(etcd_kv_t & kv)
inode_t inode_num = 0;
char null_byte = 0;
int scanned = sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+13, "%u/%ju%c", &pool_id, &inode_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !inode_num)
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
@@ -382,8 +415,9 @@ void nfs_proxy_t::check_default_pool()
{
if (cli->st_cli.pool_config.size() == 1)
{
default_pool = cli->st_cli.pool_config.begin()->second.name;
default_pool_id = cli->st_cli.pool_config.begin()->first;
auto pool_it = cli->st_cli.pool_config.begin();
default_pool_id = pool_it->first;
default_pool = pool_it->second.name;
}
else
{
@@ -416,11 +450,17 @@ void nfs_proxy_t::do_accept(int listen_fd)
int nfs_fd = 0;
while ((nfs_fd = accept(listen_fd, (struct sockaddr *)&addr, &addr_size)) >= 0)
{
fprintf(stderr, "New client %d: connection from %s\n", nfs_fd, addr_to_string(addr).c_str());
if (trace)
fprintf(stderr, "New client %d: connection from %s\n", nfs_fd, addr_to_string(addr).c_str());
active_connections++;
fcntl(nfs_fd, F_SETFL, fcntl(nfs_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(nfs_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
auto cli = new nfs_client_t();
if (kvfs)
nfs_kv_procs(cli);
else
nfs_block_procs(cli);
cli->parent = this;
cli->nfs_fd = nfs_fd;
for (auto & fn: pmap.proc_table)
@@ -432,8 +472,12 @@ void nfs_proxy_t::do_accept(int listen_fd)
// Handle incoming event
if (epoll_events & EPOLLRDHUP)
{
fprintf(stderr, "Client %d disconnected\n", nfs_fd);
auto parent = cli->parent;
if (parent->trace)
fprintf(stderr, "Client %d disconnected\n", nfs_fd);
cli->stop();
parent->active_connections--;
parent->check_exit();
return;
}
cli->epoll_events |= epoll_events;
@@ -544,7 +588,7 @@ void nfs_client_t::handle_read(int result)
read_msg.msg_iovlen = 0;
if (deref())
return;
if (result <= 0 && result != -EAGAIN && result != -EINTR)
if (result <= 0 && result != -EAGAIN && result != -EINTR && result != -ECANCELED)
{
printf("Failed read from client %d: %d (%s)\n", nfs_fd, result, strerror(-result));
stop();
@@ -639,8 +683,8 @@ void nfs_client_t::handle_read(int result)
return;
}
}
submit_read(0);
}
submit_read(0);
}
void nfs_client_t::submit_send()
@@ -968,8 +1012,164 @@ void nfs_proxy_t::daemonize()
close(1);
close(2);
open("/dev/null", O_RDONLY);
open("/dev/null", O_WRONLY);
open("/dev/null", O_WRONLY);
open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
}
void nfs_proxy_t::write_pid()
{
int fd = open(pidfile.c_str(), O_WRONLY|O_CREAT|O_TRUNC, 0666);
if (fd < 0)
{
fprintf(stderr, "Failed to create pid file %s: %s (code %d)\n", pidfile.c_str(), strerror(errno), errno);
return;
}
auto pid = std::to_string(getpid());
if (write(fd, pid.c_str(), pid.size()) < 0)
{
fprintf(stderr, "Failed to write pid to %s: %s (code %d)\n", pidfile.c_str(), strerror(errno), errno);
}
close(fd);
}
static pid_t wanted_pid = 0;
static bool child_finished = false;
static int child_status = -1;
void single_child_handler(int signal)
{
child_finished = true;
waitpid(wanted_pid, &child_status, WNOHANG);
}
void nfs_proxy_t::mount_fs()
{
check_already_mounted();
signal(SIGCHLD, single_child_handler);
auto pid = fork();
if (pid < 0)
{
fprintf(stderr, "Failed to fork: %s (code %d)\n", strerror(errno), errno);
exit(1);
}
if (pid > 0)
{
// Parent - loop and wait until child finishes
wanted_pid = pid;
exit_on_umount = false;
while (!child_finished)
{
ringloop->loop();
ringloop->wait();
}
if (!WIFEXITED(child_status) || WEXITSTATUS(child_status) != 0)
{
// Mounting failed
exit(1);
}
if (fsname != "")
fprintf(stderr, "Successfully mounted VitastorFS %s at %s\n", fsname.c_str(), mountpoint.c_str());
else
fprintf(stderr, "Successfully mounted Vitastor pseudo-FS at %s\n", mountpoint.c_str());
finished = false;
exit_on_umount = true;
}
else
{
// Child
std::string src = ("localhost:"+export_root);
std::string opts = ("port="+std::to_string(listening_port)+",mountport="+std::to_string(listening_port)+",nfsvers=3,nolock,tcp");
bool hard = false, async = false;
for (auto & opt: explode(",", mountopts, true))
{
if (opt == "hard")
hard = true;
else if (opt == "async")
async = true;
else if (opt.substr(0, 4) != "port" && opt.substr(0, 9) != "mountport" &&
opt.substr(0, 7) != "nfsvers" && opt.substr(0, 5) != "proto" &&
opt != "udp" && opt != "tcp" && opt != "rdma")
{
opts += ","+opt;
}
}
if (!hard)
opts += ",soft";
if (!async)
opts += ",sync";
const char *args[] = { "mount", src.c_str(), mountpoint.c_str(), "-o", opts.c_str(), NULL };
execvp("mount", (char* const*)args);
fprintf(stderr, "Failed to run mount %s %s -o %s: %s (code %d)\n",
src.c_str(), mountpoint.c_str(), opts.c_str(), strerror(errno), errno);
exit(1);
}
}
void nfs_proxy_t::check_already_mounted()
{
std::string realpoint = realpath_str(mountpoint, false);
if (realpoint == "")
{
return;
}
std::string mountstr = read_file("/proc/mounts");
if (mountstr == "")
{
return;
}
auto mounts = explode("\n", mountstr, true);
for (auto & str: mounts)
{
auto mnt = explode(" ", str, true);
if (mnt.size() >= 2 && mnt[1] == realpoint)
{
fprintf(stderr, "%s is already mounted\n", mountpoint.c_str());
exit(1);
}
}
}
void nfs_proxy_t::check_exit()
{
if (active_connections || !exit_on_umount)
{
return;
}
fprintf(stderr, "All active NFS connections are closed, checking /proc/mounts\n");
std::string mountstr = read_file("/proc/mounts");
if (mountstr == "")
{
return;
}
auto port_opt = "port="+std::to_string(listening_port);
auto mountport_opt = "mountport="+std::to_string(listening_port);
auto mounts = explode("\n", mountstr, true);
for (auto & str: mounts)
{
auto opts = explode(" ", str, true);
if (opts[2].size() >= 3 && opts[2].substr(0, 3) == "nfs" && opts.size() >= 4)
{
opts = explode(",", opts[3], true);
bool port_found = false;
bool addr_found = false;
for (auto & opt: opts)
{
if (opt == port_opt || opt == mountport_opt)
port_found = true;
if (opt == "addr=127.0.0.1" || opt == "mountaddr=127.0.0.1")
addr_found = true;
}
if (port_found && addr_found)
{
// OK, do not unmount
fprintf(stderr, "NFS mount to 127.0.0.1:%d still active, leaving server active\n", listening_port);
return;
}
}
}
fprintf(stderr, "NFS mount to 127.0.0.1:%d not found, exiting\n", listening_port);
// Not found, unmount
finished = true;
}
int main(int narg, const char *args[])

View File

@@ -4,51 +4,54 @@
#include "epoll_manager.h"
#include "nfs_portmap.h"
#include "nfs/xdr_impl.h"
#include "kv_db.h"
#define NFS_ROOT_HANDLE "R"
#define RPC_INIT_BUF_SIZE 32768
#define MAX_REQUEST_SIZE 128*1024*1024
#define TRUE 1
#define FALSE 0
class cli_tool_t;
struct nfs_dir_t
{
uint64_t id;
uint64_t mod_rev;
timespec mtime;
};
struct kv_fs_state_t;
struct block_fs_state_t;
class nfs_proxy_t
{
public:
std::string bind_address;
std::string name_prefix;
uint64_t fsid = 1;
uint64_t server_id = 0;
// FIXME: Maybe allow to create files in different pools?
std::string default_pool;
std::string export_root;
bool portmap_enabled;
unsigned nfs_port;
int trace = 0;
std::string logfile = "/dev/null";
std::string pidfile;
bool exit_on_umount = false;
std::string mountpoint;
std::string mountopts;
std::string fsname;
pool_id_t default_pool_id;
int active_connections = 0;
bool finished = false;
int listening_port = 0;
pool_id_t default_pool_id = 0;
portmap_service_t pmap;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
cli_tool_t *cmd = NULL;
kv_dbw_t *db = NULL;
kv_fs_state_t *kvfs = NULL;
block_fs_state_t *blockfs = NULL;
std::vector<XDR*> xdr_pool;
// filehandle = "S"+base64(sha256(full name with prefix)) or "roothandle" for mount root)
uint64_t next_dir_id = 2;
// filehandle => dir with name_prefix
std::map<std::string, std::string> dir_by_hash;
// dir with name_prefix => dir info
std::map<std::string, nfs_dir_t> dir_info;
// filehandle => inode ID
std::map<std::string, inode_t> inode_by_hash;
// inode ID => filehandle
std::map<inode_t, std::string> hash_by_inode;
// inode ID => statistics
std::map<inode_t, json11::Json> inode_stats;
// pool ID => statistics
@@ -63,6 +66,10 @@ public:
void check_default_pool();
void do_accept(int listen_fd);
void daemonize();
void write_pid();
void mount_fs();
void check_already_mounted();
void check_exit();
};
struct rpc_cur_buffer_t
@@ -86,28 +93,6 @@ struct rpc_free_buffer_t
unsigned size;
};
struct extend_size_t
{
inode_t inode;
uint64_t new_size;
};
inline bool operator < (const extend_size_t &a, const extend_size_t &b)
{
return a.inode < b.inode || a.inode == b.inode && a.new_size < b.new_size;
}
struct extend_write_t
{
rpc_op_t *rop;
int resize_res, write_res; // 1 = started, 0 = completed OK, -errno = completed with error
};
struct extend_inode_t
{
uint64_t cur_extend = 0, next_extend = 0;
};
class nfs_client_t
{
public:
@@ -122,8 +107,6 @@ public:
rpc_cur_buffer_t cur_buffer = { 0 };
std::map<uint8_t*, rpc_used_buffer_t> used_buffers;
std::vector<rpc_free_buffer_t> free_buffers;
std::map<inode_t, extend_inode_t> extends;
std::multimap<extend_size_t, extend_write_t> extend_writes;
iovec read_iov;
msghdr read_msg = { 0 };
@@ -133,9 +116,6 @@ public:
std::vector<iovec> send_list, next_send_list;
std::vector<rpc_op_t*> outbox, next_outbox;
nfs_client_t();
~nfs_client_t();
void select_read_buffer(unsigned wanted_size);
void submit_read(unsigned wanted_size);
void handle_read(int result);

View File

@@ -239,6 +239,7 @@ class osd_t
void report_statistics();
void report_pg_state(pg_t & pg);
void report_pg_states();
void apply_no_inode_stats();
void apply_pg_count();
void apply_pg_config();

View File

@@ -388,9 +388,18 @@ void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes
etcd_global_config = changes[st_cli.etcd_prefix+"/config/global"].value.object_items();
parse_config(false);
}
bool pools = changes.find(st_cli.etcd_prefix+"/config/pools") != changes.end();
if (pools)
{
apply_no_inode_stats();
}
if (run_primary)
{
apply_pg_count();
bool pgs = changes.find(st_cli.etcd_prefix+"/config/pgs") != changes.end();
if (pools || pgs)
{
apply_pg_count();
}
apply_pg_config();
}
}
@@ -414,6 +423,8 @@ void osd_t::on_reload_config_hook(json11::Json::object & global_config)
// Acquire lease
void osd_t::acquire_lease()
{
// Apply no_inode_stats before the first statistics report
apply_no_inode_stats();
// Maximum lease TTL is (report interval) + retries * (timeout + repeat interval)
st_cli.etcd_call("/lease/grant", json11::Json::object {
{ "TTL", etcd_report_interval+(st_cli.max_etcd_attempts*(2*st_cli.etcd_quick_timeout)+999)/1000 }
@@ -602,11 +613,32 @@ void osd_t::on_load_pgs_hook(bool success)
else
{
peering_state &= ~OSD_LOADING_PGS;
apply_pg_count();
apply_pg_config();
apply_no_inode_stats();
if (run_primary)
{
apply_pg_count();
apply_pg_config();
}
}
}
void osd_t::apply_no_inode_stats()
{
if (!bs)
{
return;
}
std::vector<uint64_t> no_inode_stats;
for (auto & pool_item: st_cli.pool_config)
{
if (!pool_item.second.used_for_fs.empty())
{
no_inode_stats.push_back(pool_item.first);
}
}
bs->set_no_inode_stats(no_inode_stats);
}
void osd_t::apply_pg_count()
{
for (auto & pool_item: st_cli.pool_config)

View File

@@ -3,13 +3,15 @@
#pragma once
#include "object_id.h"
#define POOL_SCHEME_REPLICATED 1
#define POOL_SCHEME_XOR 2
#define POOL_SCHEME_EC 3
#define POOL_ID_MAX 0x10000
#define POOL_ID_BITS 16
#define INODE_POOL(inode) (pool_id_t)((inode) >> (64 - POOL_ID_BITS))
#define INODE_NO_POOL(inode) (inode_t)(inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1))
#define INODE_NO_POOL(inode) (inode_t)((inode) & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1))
#define INODE_WITH_POOL(pool_id, inode) (((inode_t)(pool_id) << (64-POOL_ID_BITS)) | INODE_NO_POOL(inode))
// Pool ID is 16 bits long

View File

@@ -1,9 +1,10 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include "str_util.h"
std::string base64_encode(const std::string &in)
@@ -214,7 +215,10 @@ void print_help(const char *help_text, std::string exe_name, std::string cmd, bo
else if (*next_line && isspace(*next_line))
started = true;
else if (cmd_start && matched)
{
filtered_text += std::string(cmd_start, next_line-cmd_start);
matched = started = false;
}
}
while (filtered_text.size() > 1 &&
filtered_text[filtered_text.size()-1] == '\n' &&
@@ -301,6 +305,23 @@ std::string read_all_fd(int fd)
return res;
}
std::string read_file(std::string file, bool allow_enoent)
{
std::string res;
int fd = open(file.c_str(), O_RDONLY);
if (fd < 0 || (res = read_all_fd(fd)) == "")
{
int err = errno;
if (fd >= 0)
close(fd);
if (!allow_enoent || err != ENOENT)
fprintf(stderr, "Failed to read %s: %s (code %d)\n", file.c_str(), strerror(err), err);
return "";
}
close(fd);
return res;
}
std::string str_repeat(const std::string & str, int times)
{
std::string r;
@@ -324,3 +345,121 @@ size_t utf8_length(const char *s)
len += (*s & 0xC0) != 0x80;
return len;
}
std::vector<std::string> explode(const std::string & sep, const std::string & value, bool trim)
{
std::vector<std::string> res;
size_t prev = 0;
while (prev < value.size())
{
while (trim && prev < value.size() && isspace(value[prev]))
prev++;
size_t pos = value.find(sep, prev);
if (pos == std::string::npos)
pos = value.size();
size_t next = pos+sep.size();
while (trim && pos > prev && isspace(value[pos-1]))
pos--;
if (!trim || pos > prev)
res.push_back(value.substr(prev, pos-prev));
prev = next;
}
return res;
}
std::string scan_escaped(const std::string & cmd, size_t & pos, bool allow_unquoted)
{
return scan_escaped(cmd.data(), cmd.size(), pos, allow_unquoted);
}
// extract possibly single- or double-quoted part of string with escape characters
std::string scan_escaped(const char *cmd, size_t size, size_t & pos, bool allow_unquoted)
{
auto orig = pos;
while (pos < size && is_white(cmd[pos]))
pos++;
if (pos >= size)
{
pos = orig;
return "";
}
if (cmd[pos] != '"' && cmd[pos] != '\'')
{
if (!allow_unquoted)
{
pos = orig;
return "";
}
auto pos2 = pos;
while (pos2 < size && !is_white(cmd[pos2]))
pos2++;
auto key = std::string(cmd+pos, pos2-pos);
pos = pos2;
return key;
}
char quot = cmd[pos];
pos++;
std::string key;
while (true)
{
auto pos2 = pos;
while (pos2 < size && cmd[pos2] != '\\' && cmd[pos2] != quot)
pos2++;
if (pos2 >= size || pos2 == size-1 && cmd[pos2] == '\\')
{
// Unfinished string literal
pos = orig;
return "";
}
if (pos2 > pos)
key += std::string(cmd+pos, pos2-pos);
pos = pos2;
if (cmd[pos] == quot)
{
pos++;
break;
}
else /* if (cmd[pos] == '\\') */
{
key += cmd[++pos];
pos++;
}
}
return key;
}
std::string auto_addslashes(const std::string & str, const char *toescape)
{
auto pos = str.find_first_of(toescape);
if (pos == std::string::npos)
return str;
return addslashes(str, toescape);
}
std::string addslashes(const std::string & str, const char *toescape)
{
std::string res = "\"";
auto pos = 0;
while (pos < str.size())
{
auto pos2 = str.find_first_of(toescape, pos);
if (pos2 == std::string::npos)
return res + str.substr(pos) + "\"";
res += str.substr(pos, pos2-pos)+"\\"+str[pos2];
pos = pos2+1;
}
return res+"\"";
}
std::string realpath_str(std::string path, bool nofail)
{
char *p = realpath((char*)path.c_str(), NULL);
if (!p)
{
fprintf(stderr, "Failed to resolve %s: %s\n", path.c_str(), strerror(errno));
return nofail ? path : "";
}
std::string rp(p);
free(p);
return rp;
}

View File

@@ -1,9 +1,12 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <stdint.h>
#include <string>
#include <vector>
#define is_white(a) ((a) == ' ' || (a) == '\t' || (a) == '\r' || (a) == '\n')
std::string base64_encode(const std::string &in);
std::string base64_decode(const std::string &in);
@@ -17,6 +20,13 @@ std::string format_size(uint64_t size, bool nobytes = false);
void print_help(const char *help_text, std::string exe_name, std::string cmd, bool all);
uint64_t parse_time(std::string time_str, bool *ok = NULL);
std::string read_all_fd(int fd);
std::string read_file(std::string file, bool allow_enoent = false);
std::string str_repeat(const std::string & str, int times);
size_t utf8_length(const std::string & s);
size_t utf8_length(const char *s);
std::vector<std::string> explode(const std::string & sep, const std::string & value, bool trim);
std::string scan_escaped(const char *cmd, size_t size, size_t & pos, bool allow_unquoted = true);
std::string scan_escaped(const std::string & cmd, size_t & pos, bool allow_unquoted = true);
std::string auto_addslashes(const std::string & str, const char *toescape = "\\\"");
std::string addslashes(const std::string & str, const char *toescape = "\\\"");
std::string realpath_str(std::string path, bool nofail = true);

View File

@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor
Description: Vitastor client library
Version: 1.4.8
Version: 1.5.0
Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir}

View File

@@ -68,3 +68,5 @@ SCHEME=xor ./test_scrub.sh
PG_SIZE=3 ./test_scrub.sh
PG_SIZE=6 PG_MINSIZE=4 OSD_COUNT=6 SCHEME=ec ./test_scrub.sh
SCHEME=ec ./test_scrub.sh
./test_nfs.sh

View File

@@ -13,7 +13,14 @@ $ETCDCTL put /vitastor/osd/stats/5 '{"host":"host3","size":1073741824,"time":"'$
$ETCDCTL put /vitastor/osd/stats/6 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'
build/src/vitastor-cli --etcd_address $ETCD_URL create-pool testpool --ec 3+2 -n 32 --failure_domain rack --force
$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{"1": {"failure_domain": "rack", "name": "testpool", "parity_chunks": 2, "pg_count": 32, "pg_minsize": 4, "pg_size": 5, "scheme": "ec"}}]'
build/src/vitastor-cli --etcd_address $ETCD_URL modify-pool testpool --ec 3+3 --failure_domain host
$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{"1": {"failure_domain": "host", "name": "testpool", "parity_chunks": 3, "pg_count": 32, "pg_minsize": 4, "pg_size": 6, "scheme": "ec"}}]'
build/src/vitastor-cli --etcd_address $ETCD_URL rm-pool testpool
$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{}]'
build/src/vitastor-cli --etcd_address $ETCD_URL create-pool testpool -s 2 -n 4 --failure_domain rack --force
$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}]'
node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
MON_PID=$!

168
tests/test_nfs.sh Executable file
View File

@@ -0,0 +1,168 @@
#!/bin/bash -ex
PG_COUNT=16
. `dirname $0`/run_3osds.sh
build/src/vitastor-cli --etcd_address $ETCD_URL create -s 10G fsmeta
build/src/vitastor-cli --etcd_address $ETCD_URL modify-pool --used-for-fs fsmeta testpool
build/src/vitastor-nfs start --fs fsmeta --etcd_address $ETCD_URL --portmap 0 --port 2050 --foreground 1 --trace 1 >>./testdata/nfs.log 2>&1 &
NFS_PID=$!
mkdir -p testdata/nfs
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
MNT=$(pwd)/testdata/nfs
trap "sudo umount -f $MNT"' || true; kill -9 $(jobs -p)' EXIT
# write small file
ls -l ./testdata/nfs
dd if=/dev/urandom of=./testdata/f1 bs=100k count=1
cp testdata/f1 ./testdata/nfs/
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
diff ./testdata/f1 ./testdata/nfs/f1
format_green "100K file ok"
# overwrite it inplace
dd if=/dev/urandom of=./testdata/f1_90k bs=90k count=1
cp testdata/f1_90k ./testdata/nfs/f1
sudo umount ./testdata/nfs/
format_green "inplace overwrite 90K ok"
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
# create another copy
dd if=./testdata/f1_90k of=./testdata/nfs/f1_nfs bs=1M
diff ./testdata/f1_90k ./testdata/nfs/f1_nfs
sudo umount ./testdata/nfs/
format_green "another copy 90K ok"
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
cp ./testdata/nfs/f1 ./testdata/f1_nfs
diff ./testdata/f1_90k ./testdata/nfs/f1
format_green "90K data ok"
# test partial shared overwrite
dd if=/dev/urandom of=./testdata/f1_90k bs=9317 count=1 seek=5 conv=notrunc
dd if=./testdata/f1_90k of=./testdata/nfs/f1 bs=9317 count=1 skip=5 seek=5 conv=notrunc
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
diff ./testdata/f1_90k ./testdata/nfs/f1
format_green "partial inplace shared overwrite ok"
# move it to a larger shared space
dd if=/dev/urandom of=./testdata/f1_110k bs=110k count=1
cp testdata/f1_110k ./testdata/nfs/f1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
diff ./testdata/f1_110k ./testdata/nfs/f1
format_green "move shared 90K -> 110K ok"
# extend it to large file + rm
dd if=/dev/urandom of=./testdata/f1_2M bs=2M count=1
cp ./testdata/f1_2M ./testdata/nfs/f1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
cp ./testdata/nfs/f1 ./testdata/f1_nfs
diff ./testdata/f1_2M ./testdata/nfs/f1
rm ./testdata/nfs/f1
format_green "extend to 2M + rm ok"
# mkdir
mkdir -p ./testdata/nfs/dir1/dir2
echo abcdef > ./testdata/nfs/dir1/dir2/hnpfls
# rename dir
mv ./testdata/nfs/dir1 ./testdata/nfs/dir3
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep dir3
ls -l ./testdata/nfs/dir3 | grep dir2
ls -l ./testdata/nfs/dir3/dir2 | grep hnpfls
echo abcdef > ./testdata/hnpfls
diff ./testdata/hnpfls ./testdata/nfs/dir3/dir2/hnpfls
format_green "rename dir with file ok"
# touch
touch -t 202401011404 ./testdata/nfs/dir3/dir2/hnpfls
sudo chown 65534:65534 ./testdata/nfs/dir3/dir2/hnpfls
sudo chmod 755 ./testdata/nfs/dir3/dir2/hnpfls
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
T=`stat -c '%a %u %g %y' ./testdata/nfs/dir3/dir2/hnpfls | perl -pe 's/(:\d+)(.*)/$1/'`
[[ "$T" = "755 65534 65534 2024-01-01 14:04" ]]
format_green "set attrs ok"
# move dir
mv ./testdata/nfs/dir3/dir2 ./testdata/nfs/
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep dir3
ls -l ./testdata/nfs | grep dir2
format_green "move dir ok"
# symlink, readlink
ln -s dir2 ./testdata/nfs/sym2
[[ "`stat -c '%A' ./testdata/nfs/sym2`" = "lrwxrwxrwx" ]]
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
[[ "`stat -c '%A' ./testdata/nfs/sym2`" = "lrwxrwxrwx" ]]
[[ "`readlink ./testdata/nfs/sym2`" = "dir2" ]]
format_green "symlink, readlink ok"
# mknod: chr, blk, sock, fifo + remove
sudo mknod ./testdata/nfs/nod_chr c 1 5
sudo mknod ./testdata/nfs/nod_blk b 2 6
mkfifo ./testdata/nfs/nod_fifo
perl -e 'use Socket; socket($sock, PF_UNIX, SOCK_STREAM, undef) || die $!; bind($sock, sockaddr_un("./testdata/nfs/nod_sock")) || die $!;'
chmod 777 ./testdata/nfs/nod_*
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
[[ "`ls testdata|wc -l`" -ge 4 ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_blk`" = "brwxrwxrwx" ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_chr`" = "crwxrwxrwx" ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_fifo`" = "prwxrwxrwx" ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_sock`" = "srwxrwxrwx" ]]
sudo rm ./testdata/nfs/nod_*
format_green "mknod + rm ok"
# hardlink
echo ABCDEF > ./testdata/nfs/linked1
i=`stat -c '%i' ./testdata/nfs/linked1`
ln ./testdata/nfs/linked1 ./testdata/nfs/linked2
[[ "`stat -c '%i' ./testdata/nfs/linked2`" -eq $i ]]
echo BABABA > ./testdata/nfs/linked2
diff ./testdata/nfs/linked2 ./testdata/nfs/linked1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
diff ./testdata/nfs/linked2 ./testdata/nfs/linked1
[[ "`cat ./testdata/nfs/linked2`" = "BABABA" ]]
rm ./testdata/nfs/linked2
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
[[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
format_green "hardlink ok"
# rm small
ls -l ./testdata/nfs
dd if=/dev/urandom of=./testdata/nfs/smallfile bs=100k count=1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
rm ./testdata/nfs/smallfile
if ls ./testdata/nfs | grep smallfile; then false; fi
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
if ls ./testdata/nfs | grep smallfile; then false; fi
format_green "rm small ok"
# rename over existing
echo ZXCVBN > ./testdata/nfs/over1
mv ./testdata/nfs/over1 ./testdata/nfs/linked2
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
if ls ./testdata/nfs | grep over1; then false; fi
[[ "`cat ./testdata/nfs/linked2`" = "ZXCVBN" ]]
[[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
format_green "rename over existing file ok"
format_green OK