Compare commits

..

81 Commits

Author SHA1 Message Date
fc83e3821c Fix write-over-delete failing for the very first entry in dirty_db 2023-10-21 17:08:32 +03:00
0d707fc83b Fix possible segfault in vitastor-cli ls -l 2023-10-21 17:08:32 +03:00
4f99f78430 Fix possible OSD crash during sync due to missing min_flushed_journal_sector reset 2023-09-17 00:36:46 +03:00
f926f8c2e0 Remove unused bs_sync fields 2023-09-17 00:36:46 +03:00
d73ad12c56 Fix fio_sec_osd attr_len 2023-09-17 00:36:46 +03:00
d68cec10e2 Remove erroneous block_size mismatch warnings on pools without matching PGs 2023-09-17 00:36:46 +03:00
9afa200a33 Flush STDOUT and STDERR before exiting from cli to fix Proxmox "Unexpected result" 2023-09-17 00:36:46 +03:00
aff6f3e970 Fix sscanf validation usage (field count instead of null_byte == 0) 2023-09-07 11:34:42 +03:00
49fca80f1c Add supported_truncate_flags 2023-09-07 11:34:42 +03:00
a028b4fa4c Make QEMU driver compatible with QEMU 8.1 2023-08-24 19:09:00 +03:00
da2bfd0b1e Fix co_truncate size division by BDRV_SECTOR_SIZE 2023-08-24 19:02:24 +03:00
cec5ceab77 Fix buffer insert in cluster_client 2023-08-24 19:02:24 +03:00
dc90faec7e Fix incorrect marking op parts as done with snapshots (could probably lead to client hangs) 2023-08-24 19:02:24 +03:00
20c62a4244 Fix monitor retrying failed etcd connection in an infinite loop without pauses 2023-08-24 19:02:24 +03:00
6acf562e01 Release 1.0.0
New features:

- Data and metadata checksums!
  - Metadata checksums are always used with new disk format
  - Data checksums can be turned on with --data_csum_type crc32c for new OSDs
  - Checksum block size can be configured
  - inmemory_metadata now also affects keeping checksums in memory
- Linux page cache I/O caching support which can be enabled separately for
  data, metadata (including checksums) and journal (O_SYNC instead of O_DIRECT)
- Details [here](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/docs/config/layout-osd.en.md#data_csum_type)
- Backwards compatibility is preserved, you can use new OSDs with old disks

Release also includes bug fixes from [0.9.6](https://git.yourcmc.ru/vitalif/vitastor/releases/tag/v0.9.6).

0.9.6 is moved to "-oldstable" repositories and will be available for some additional time.
2023-07-29 18:57:19 +03:00
6f797f429e Add a note about -oldstable 2023-07-29 18:11:13 +03:00
b8a1734465 Reword checksum docs one more time 2023-07-29 14:42:56 +03:00
c752b68167 Remove "without checksums" from docs :) 2023-07-29 12:19:03 +03:00
564df2eb5d Support using buffered I/O with O_SYNC instead of direct I/O 2023-07-29 12:17:18 +03:00
9a427dd70a Allow to override OSD devices in tests 2023-07-29 12:17:18 +03:00
1a4ceb420d Track used blocks, not object versions 2023-07-29 12:17:18 +03:00
21b5124a4b Document data_csum_type and csum_block_size parameters 2023-07-29 12:17:18 +03:00
4181add1f4 Remove creepy "metadata copying" during overwrite
Instead of it, just do not verify checksums of currently mutated objects.
When clean data modification during flush runs in parallel to a read request,
that request may read a mix of old and new data. It may even read a mix of
multiple flushed versions if it lasts too long... And attempts to verify it
using temporary copies of metadata make the algorithm too complex and creepy.
2023-07-29 12:17:18 +03:00
a8464c19af Support keeping checksums on disk (not in memory)
Definitely beneficial for SSD+HDD setups
2023-07-29 12:17:18 +03:00
819cb70cdd Check for "Checksum mismatch" and "BUG" messages during test_heal 2023-07-29 12:17:18 +03:00
3c8e4c6b72 Use clean_dyn_size for space check 2023-07-29 12:17:18 +03:00
8ef4cf89dc Log more details about checksum mismatch in big_writes 2023-07-29 12:17:18 +03:00
7bfb1639ea Use find_holes() in flusher for unification 2023-07-29 12:17:18 +03:00
628e481c32 Fill journal header to know checksum type & size when dumping journal with --all 2023-07-29 12:17:18 +03:00
af6f2046fc Fix journal read checksum verification with inmemory_journal=false 2023-07-29 12:17:18 +03:00
9357e5293e Call fill_partial_checksum_blocks() correctly in regard to COPY_BUF_CSUM_FILL 2023-07-29 12:17:18 +03:00
12851dc07d Wait for journal reads before checking them in clear_incomplete_csum_block_bits 2023-07-29 12:17:18 +03:00
a5753e35a3 Check for checksum mismatch absence in test_heal 2023-07-29 12:17:18 +03:00
d6ee1ca17c Use zero checksum size for zero-length writes 2023-07-29 12:17:18 +03:00
71674d00cf Fix journal data checksum mangling on corrupted block overwrite 2023-07-29 12:17:18 +03:00
ddb078d5a7 Check journal entry size when checking block checksums 2023-07-29 12:17:18 +03:00
d22d56f90a Fix journal data checksum verification on start 2023-07-29 12:17:18 +03:00
eb1331a079 Add more details to "journal entry data is corrupt" messages 2023-07-29 12:17:18 +03:00
c5274f655b ...and partially remove the perversion with bitmap inlining 2023-07-29 12:17:18 +03:00
45e07d6294 Sadly we have to refcount dyn_data... 2023-07-29 12:17:18 +03:00
a8ee391e05 Fix clean block checksum read 2023-07-29 12:17:18 +03:00
de48fa3fd2 Allow to forcibly set meta_format 2023-07-29 12:17:18 +03:00
874a766b62 Rename meta_version to meta_format 2023-07-29 12:17:18 +03:00
384bd8e28f Support old metadata format in vitastor-disk dump-meta 2023-07-29 12:17:18 +03:00
430994f48a Fix journal big_write simple reads after checksum changes 2023-07-29 12:17:18 +03:00
3d7f838c59 Verify checksums in test_heal in different combinations 2023-07-29 12:17:18 +03:00
b909d81f41 Fix bitmap-granular checksums 2023-07-29 12:17:18 +03:00
e42975ffd1 Fix wait_journal_count not being zeroed 2023-07-29 12:17:18 +03:00
93778324e5 Rewrite and fix find_holes into a more obvious version 2023-07-29 12:17:18 +03:00
eeb6727170 Fix missing checksum read offset 2023-07-29 12:17:18 +03:00
7fe82c692e Add a test for checksums 2023-07-29 12:17:18 +03:00
92c6e16eba Fix checksum verification in big_write journal reads 2023-07-29 12:17:18 +03:00
213a9ccb4d Verify checksums during journal reads 2023-07-29 12:17:18 +03:00
a166147110 Add backwards compatibility with non-checksum metadata and journal formats 2023-07-29 12:17:18 +03:00
7d532880c3 Implement large csum_block_size support (more than 4k) + refactor blockstore_flush 2023-07-29 12:17:18 +03:00
0b0405d115 Implement bitmap-granular (4k) metadata & data checksums 2023-07-29 12:17:18 +03:00
e651c93a90 Release 0.9.6
- Fix vitastor-disk partition zeroing (sometimes it was writing garbage instead of zeroes)
- Fix incorrect EC space statistics in `vitastor-cli status`
- Several bug fixes for NFS:
  - Add . and .. in NFS directory listings
  - Return FILE_SYNC from NFS writes if immediate_commit is enabled
  - Return the same "verifier" in NFS COMMIT as in NFS WRITE
  - Make parallel NFS extending writes work correctly, without conflicts
  - Handle parallel NFS extending writes without imposing extra load on etcd
- Support UTF-8 in vitastor-cli table output
- Also allow "0" and "no" as false for inmemory_metadata and inmemory_journal
- Use HDD defaults for HDD-only in automatic `vitastor-disk prepare` mode
2023-07-29 10:54:00 +03:00
988e90be69 Fix vitastor-disk partition zeroing (it was writing random garbage instead of zeroes :D) 2023-07-28 12:29:07 +03:00
272a45ad63 Fix modprobe command in docs 2023-07-27 23:57:02 +03:00
25a15d24cf Fix incorrect EC space statistics in vitastor-cli status 2023-07-27 02:26:17 +00:00
700e0e9bff Handle parallel NFS extending writes without imposing extra load on etcd 2023-07-27 02:26:17 +00:00
ab0ca7c00f Return FILE_SYNC from NFS writes if immediate_commit is enabled 2023-07-26 02:09:47 +03:00
f153bc950b Return the same "verifier" in NFS COMMIT as in NFS WRITE
This fixes buffered (not O_DIRECT) NFS writes in Linux - previously they were
hanging in an infinite loop because COMMIT didn't return the same verifier as
previous WRITEs, and NFS kernel client was infinitely retrying the same writes.

Also this probably allows for correct NFS failover, at least for the same
buffered writes, because NFS clients repeat all write requests until a COMMIT
confirms them.
2023-07-26 02:09:47 +03:00
425ff8818d Add . and .. in NFS directory listings
MC, for example, hangs with infinite listing retries without them
2023-07-26 02:09:47 +03:00
9e287a7778 Handle extending writes correctly in NFS proxy
Previously, multiple parallel writes extending file size through NFS were
racing with each other and triggering deletions of part of the written data

I.e. if you mounted vitastor-nfs and just copied a file into it in MC then
you could end up with only a part of the file actually written
2023-07-26 02:09:43 +03:00
f52f58b9e9 Support UTF-8 in vitastor-cli table output 2023-07-25 01:48:57 +00:00
1fe6b0c0e2 Also allow "0" and "no" as false for inmemory_metadata and inmemory_journal 2023-07-25 01:48:57 +00:00
e4237e9ed8 Enable HDD defaults for HDD-only in automatic vitastor-disk prepare mode 2023-07-23 02:33:22 +03:00
10a5fd6abb Release 0.9.5
A hotfix to 0.9.4 containing only one bugfix: 100% CPU usage in the new QEMU
driver caused by the lack of eventfd reset on io_uring event handling :)
2023-07-21 00:04:41 +03:00
1c316ef350 Reset eventfd on every ringloop::loop() 2023-07-21 00:04:41 +03:00
0b2d12eef1 Remove has_work, it was unnecessary 2023-07-21 00:04:37 +03:00
1c10430ae1 Release 0.9.4
- Improve QEMU driver performance by integrating io_uring in it (up to 1.5x total iops improvement)
- Fix QEMU driver deadlocks which started to reproduce in qemu-img after iothread fixes
- Fix `vitastor-cli status` reporting more etcds than actually exists (fix etcd address duplication in config on reload)
- Fix `vitastor-cli ls` crashing on inodes in non-existing pools
- Delete old garbage /pool/stats/ keys for non-existing (deleted) pools
- Reduce memory usage of etcds initialized by make-etcd script
- Fix OSDs almost always crashing on etcd restart due to "revisions were compacted" (support reloading state from etcd)
- Fix a crash and a stall possible mostly in HDD setups with small journal and big (512k, 900k) random writes
- Add notes about HDDs to documentation. You are officially allowed to use HDD-only Vitastor with HGST/Toshiba/EXOS :)
2023-07-19 02:50:30 +03:00
dfce91d168 Change git url in docs, correct block/vitastor.c path 2023-07-19 01:02:12 +03:00
332a13ba30 Build patched QEMU against local packages 2023-07-19 00:05:02 +03:00
d0e257ee81 Fix non-existing pool handling in vitastor-cli ls 2023-07-18 23:52:02 +03:00
004912aac0 Add RPM spec patches for 6.2-el8 and 7.2-el9 2023-07-18 23:38:14 +03:00
c18e92273e Copy qemu 5.1 -> 5.2 patch for convenience 2023-07-18 23:37:53 +03:00
9815d70ffc It is impossible to use io_uring with older vitastor-client because it does not have vitastor_c_uring_has_work() 2023-07-18 23:37:53 +03:00
4a4627dcab Do not use bool in C library 2023-07-18 23:37:53 +03:00
b963f2fd93 Add QEMU 2.12 patch (basically the same as 3.1) 2023-07-18 23:37:06 +03:00
ba7427020e Fix deadlocks possible in qemu-img after fixing iothread
Deadlock was caused by switching QEMU coroutines directly inside
vitastor_co_read_bitmap_cb() callback. The correct way is to schedule a BH
/BH is a QEMU term for setImmediate() :)/, same as in read and write callbacks.
2023-07-18 23:32:16 +03:00
80 changed files with 1144 additions and 446 deletions

View File

@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)
project(vitastor)
set(VERSION "0.9.3")
set(VERSION "1.0.0")
add_subdirectory(src)

View File

@@ -1,4 +1,4 @@
VERSION ?= v0.9.3
VERSION ?= v1.0.0
all: build push

View File

@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.9.3
image: vitalif/vitastor-csi:v1.0.0
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -116,7 +116,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.9.3
image: vitalif/vitastor-csi:v1.0.0
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.9.3"
vitastorCSIDriverVersion = "1.0.0"
)
// Config struct fills the parameters of request or user input

4
debian/changelog vendored
View File

@@ -1,10 +1,10 @@
vitastor (0.9.3-1) unstable; urgency=medium
vitastor (1.0.0-1) unstable; urgency=medium
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300
vitastor (0.9.3-1) unstable; urgency=medium
vitastor (1.0.0-1) unstable; urgency=medium
* Implement NFS proxy
* Add documentation

View File

@@ -28,13 +28,19 @@ RUN apt-get --download-only source qemu
ADD patches /root/vitastor/patches
ADD src/qemu_driver.c /root/vitastor/src/qemu_driver.c
#RUN set -e; \
# apt-get install -y wget; \
# wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg; \
# (echo deb http://vitastor.io/debian $REL main > /etc/apt/sources.list.d/vitastor.list); \
# (echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
# apt-get update; \
# apt-get install -y vitastor-client vitastor-client-dev quilt
RUN set -e; \
apt-get install -y wget; \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg; \
(echo deb http://vitastor.io/debian $REL main > /etc/apt/sources.list.d/vitastor.list); \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
dpkg -i /root/packages/vitastor-$REL/vitastor-client_*.deb /root/packages/vitastor-$REL/vitastor-client-dev_*.deb; \
apt-get update; \
apt-get install -y vitastor-client vitastor-client-dev quilt; \
apt-get install -y quilt; \
mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \

View File

@@ -35,8 +35,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.9.3; \
cd vitastor-0.9.3; \
cp -r /root/vitastor vitastor-1.0.0; \
cd vitastor-1.0.0; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -49,8 +49,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.9.3.orig.tar.xz vitastor-0.9.3; \
cd vitastor-0.9.3; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.0.0.orig.tar.xz vitastor-1.0.0; \
cd vitastor-1.0.0; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -197,21 +197,22 @@ Must be equal or a multiple of [bitmap_granularity](layout-cluster.en.md#bitmap_
Checksums increase metadata size by 4 bytes per each csum_block_size of data.
Checksums are always a compromise:
Checksums are always a tradeoff:
1. You either sacrifice +1 GB RAM per 1 TB of data
2. Or you raise csum_block_size, for example, to 32k and sacrifice
50% random write iops due to checksum read-modify-write
3. Or you turn off [inmemory_metadata](osd.en.md#inmemory_metadata) and
sacrifice 50% random read iops due to checksum reads
Option 1 (default) is recommended for all-flash setups because these usually
have enough RAM.
All-flash clusters usually have enough RAM to use default csum_block_size,
which uses 1 GB RAM per 1 TB of data. HDD clusters usually don't.
Option 2 is recommended for HDD-only setups. HDD-only setups usually do NOT
have enough RAM for the default 4 KB csum_block_size.
Thus, recommended setups are:
1. All-flash, 1 GB RAM per 1 TB data: default (csum_block_size=4k)
2. All-flash, less RAM: csum_block_size=4k + inmemory_metadata=false
3. Hybrid HDD+SSD: csum_block_size=4k + inmemory_metadata=false
4. HDD-only, faster random read: csum_block_size=32k
5. HDD-only, faster random write: csum_block_size=4k +
inmemory_metadata=false + cached_io_meta=true
Option 3 is recommended for SSD+HDD setups (because metadata SSDs will handle
extra reads without any performance drop) and also *maybe* for NVMe all-flash
setups when you don't have enough RAM (because NVMe drives have plenty
of read iops to spare). You may also consider enabling
[cached_read_meta](osd.en.md#cached_read_meta) in this case.
See also [cached_io_meta](osd.en.md#cached_io_meta).

View File

@@ -220,17 +220,12 @@ csum_block_size данных.
жертвуете 50% скорости случайного чтения из-за чтения контрольных сумм
с диска
Вариант 1 (при настройках по умолчанию) рекомендуется для SSD (All-Flash)
кластеров, потому что памяти в них обычно хватает.
Таким образом, рекомендуются следующие варианты настроек:
1. All-flash, 1 ГБ памяти на 1 ТБ данных: по умолчанию (csum_block_size=4k)
2. All-flash, меньше памяти: csum_block_size=4k + inmemory_metadata=false
3. Гибридные HDD+SSD: csum_block_size=4k + inmemory_metadata=false
4. Только HDD, быстрее случайное чтение: csum_block_size=32k
5. Только HDD, быстрее случайная запись: csum_block_size=4k +
inmemory_metadata=false + cached_io_meta=true
Вариант 2 рекомендуется для кластеров на одних жёстких дисках (без SSD
под метаданные). На 4 кб блок контрольной суммы памяти в таких кластерах
обычно НЕ хватает.
Вариант 3 рекомендуется для гибридных кластеров (SSD+HDD), потому что
скорости SSD под метаданными хватит, чтобы обработать дополнительные чтения
без снижения производительности. Также вариант 3 *может* рекомендоваться
для All-Flash кластеров на основе NVMe-дисков, когда памяти НЕ достаточно,
потому что NVMe-диски имеют огромный запас производительности по чтению.
В таких случаях, возможно, также имеет смысл включать параметр
[cached_read_meta](osd.ru.md#cached_read_meta).
Смотрите также [cached_io_meta](osd.ru.md#cached_io_meta).

View File

@@ -31,9 +31,9 @@ them, even without restarting by updating configuration in etcd.
- [max_flusher_count](#max_flusher_count)
- [inmemory_metadata](#inmemory_metadata)
- [inmemory_journal](#inmemory_journal)
- [cached_read_data](#cached_read_data)
- [cached_read_meta](#cached_read_meta)
- [cached_read_journal](#cached_read_journal)
- [cached_io_data](#cached_io_data)
- [cached_io_meta](#cached_io_meta)
- [cached_io_journal](#cached_io_journal)
- [journal_sector_buffer_count](#journal_sector_buffer_count)
- [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
- [throttle_small_writes](#throttle_small_writes)
@@ -258,44 +258,46 @@ is typically very small because it's sufficient to have 16-32 MB journal
for SSD OSDs. However, in theory it's possible that you'll want to turn it
off for hybrid (HDD+SSD) OSDs with large journals on quick devices.
## cached_read_data
## cached_io_data
- Type: boolean
- Default: false
Read data through Linux page cache, i.e. use a file descriptor opened without
O_DIRECT for data reads. May improve read performance for frequently accessed
data if it fits in RAM. Memory in page cache is shared by all processes and
not accounted in OSD memory consumption.
Read and write *data* through Linux page cache, i.e. use a file descriptor
opened with O_SYNC, but without O_DIRECT for I/O. May improve read
performance for hot data and slower disks - HDDs and maybe SATA SSDs.
Not recommended for desktop SSDs without capacitors because O_SYNC flushes
disk cache on every write.
## cached_read_meta
## cached_io_meta
- Type: boolean
- Default: false
Read metadata through Linux page cache. May be beneficial when checksums
are enabled and [inmemory_metadata](#inmemory_metadata) is disabled, because
in this case metadata blocks are read from disk to verify checksums on every
read request and caching them may reduce this extra read load.
Read and write *metadata* through Linux page cache. May improve read
performance only if your drives are relatively slow (HDD, SATA SSD), and
only if checksums are enabled and [inmemory_metadata](#inmemory_metadata)
is disabled, because in this case metadata blocks are read from disk
on every read request to verify checksums and caching them may reduce this
extra read load.
Absolutely pointless to enable with enabled inmemory_metadata because all
metadata is kept in memory anyway, and likely pointless without checksums,
because in that case, metadata blocks are read from disk only during journal
flushing.
If the same device is used for data and metadata, enabling [cached_read_data](#cached_read_data)
If the same device is used for data and metadata, enabling [cached_io_data](#cached_io_data)
also enables this parameter, given that it isn't turned off explicitly.
## cached_read_journal
## cached_io_journal
- Type: boolean
- Default: false
Read buffered data from journal through Linux page cache. Does not have sense
without disabling [inmemory_journal](#inmemory_journal), which, again, is
enabled by default.
Read and write *journal* through Linux page cache. May improve read
performance if [inmemory_journal](#inmemory_journal) is turned off.
If the same device is used for metadata and journal, enabling [cached_read_meta](#cached_read_meta)
If the same device is used for metadata and journal, enabling [cached_io_meta](#cached_io_meta)
also enables this parameter, given that it isn't turned off explicitly.
## journal_sector_buffer_count

View File

@@ -32,9 +32,9 @@
- [max_flusher_count](#max_flusher_count)
- [inmemory_metadata](#inmemory_metadata)
- [inmemory_journal](#inmemory_journal)
- [cached_read_data](#cached_read_data)
- [cached_read_meta](#cached_read_meta)
- [cached_read_journal](#cached_read_journal)
- [cached_io_data](#cached_io_data)
- [cached_io_meta](#cached_io_meta)
- [cached_io_journal](#cached_io_journal)
- [journal_sector_buffer_count](#journal_sector_buffer_count)
- [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
- [throttle_small_writes](#throttle_small_writes)
@@ -266,27 +266,28 @@ Flusher - это микро-поток (корутина), которая коп
параметра может оказаться полезным для гибридных OSD (HDD+SSD) с большими
журналами, расположенными на быстром по сравнению с HDD устройстве.
## cached_read_data
## cached_io_data
- Тип: булево (да/нет)
- Значение по умолчанию: false
Читать данные через системный кэш Linux (page cache), то есть, использовать
для чтения данных файловый дескриптор, открытый без флага O_DIRECT. Может
улучшить производительность чтения для часто используемых данных, если они
помещаются в память. Память кэша разделяется между всеми процессами в
системе и не учитывается в потреблении памяти процессом OSD.
Читать и записывать *данные* через системный кэш Linux (page cache), то есть,
использовать для данных файловый дескриптор, открытый без флага O_DIRECT, но
с флагом O_SYNC. Может улучшить скорость чтения для относительно медленных
дисков - HDD и, возможно, SATA SSD. Не рекомендуется для потребительских
SSD без конденсаторов, так как O_SYNC сбрасывает кэш диска при каждой записи.
## cached_read_meta
## cached_io_meta
- Тип: булево (да/нет)
- Значение по умолчанию: false
Читать метаданные через системный кэш Linux. Может быть полезно, когда
включены контрольные суммы, а параметр [inmemory_metadata](#inmemory_metadata)
отключён, так как в этом случае блоки метаданных читаются с диска при каждом
запросе чтения для проверки контрольных сумм и их кэширование может снизить
дополнительную нагрузку на диск.
Читать и записывать *метаданные* через системный кэш Linux. Может улучшить
скорость чтения, если у вас медленные диски, и только если контрольные суммы
включены, а параметр [inmemory_metadata](#inmemory_metadata) отключён, так
как в этом случае блоки метаданных читаются с диска при каждом запросе чтения
для проверки контрольных сумм и их кэширование может снизить дополнительную
нагрузку на диск.
Абсолютно бессмысленно включать данный параметр, если параметр
inmemory_metadata включён (по умолчанию это так), и также вероятно
@@ -295,20 +296,20 @@ inmemory_metadata включён (по умолчанию это так), и т
журнала.
Если одно и то же устройство используется для данных и метаданных, включение
[cached_read_data](#cached_read_data) также включает данный параметр, при
[cached_io_data](#cached_io_data) также включает данный параметр, при
условии, что он не отключён явным образом.
## cached_read_journal
## cached_io_journal
- Тип: булево (да/нет)
- Значение по умолчанию: false
Читать буферизованные в журнале данные через системный кэш Linux. Не имеет
смысла без отключения параметра [inmemory_journal](#inmemory_journal),
который, опять же, по умолчанию включён.
Читать и записывать *журнал* через системный кэш Linux. Может улучшить
скорость чтения, если параметр [inmemory_journal](#inmemory_journal)
отключён.
Если одно и то же устройство используется для метаданных и журнала,
включение [cached_read_meta](#cached_read_meta) также включает данный
включение [cached_io_meta](#cached_io_meta) также включает данный
параметр, при условии, что он не отключён явным образом.
## journal_sector_buffer_count

View File

@@ -228,24 +228,25 @@
Checksums increase metadata size by 4 bytes per each csum_block_size of data.
Checksums are always a compromise:
Checksums are always a tradeoff:
1. You either sacrifice +1 GB RAM per 1 TB of data
2. Or you raise csum_block_size, for example, to 32k and sacrifice
50% random write iops due to checksum read-modify-write
3. Or you turn off [inmemory_metadata](osd.en.md#inmemory_metadata) and
sacrifice 50% random read iops due to checksum reads
Option 1 (default) is recommended for all-flash setups because these usually
have enough RAM.
All-flash clusters usually have enough RAM to use default csum_block_size,
which uses 1 GB RAM per 1 TB of data. HDD clusters usually don't.
Option 2 is recommended for HDD-only setups. HDD-only setups usually do NOT
have enough RAM for the default 4 KB csum_block_size.
Thus, recommended setups are:
1. All-flash, 1 GB RAM per 1 TB data: default (csum_block_size=4k)
2. All-flash, less RAM: csum_block_size=4k + inmemory_metadata=false
3. Hybrid HDD+SSD: csum_block_size=4k + inmemory_metadata=false
4. HDD-only, faster random read: csum_block_size=32k
5. HDD-only, faster random write: csum_block_size=4k +
inmemory_metadata=false + cached_io_meta=true
Option 3 is recommended for SSD+HDD setups (because metadata SSDs will handle
extra reads without any performance drop) and also *maybe* for NVMe all-flash
setups when you don't have enough RAM (because NVMe drives have plenty
of read iops to spare). You may also consider enabling
[cached_read_meta](osd.en.md#cached_read_meta) in this case.
See also [cached_io_meta](osd.en.md#cached_io_meta).
info_ru: |
Размер блока расчёта контрольных сумм.
@@ -264,17 +265,12 @@
жертвуете 50% скорости случайного чтения из-за чтения контрольных сумм
с диска
Вариант 1 (при настройках по умолчанию) рекомендуется для SSD (All-Flash)
кластеров, потому что памяти в них обычно хватает.
Таким образом, рекомендуются следующие варианты настроек:
1. All-flash, 1 ГБ памяти на 1 ТБ данных: по умолчанию (csum_block_size=4k)
2. All-flash, меньше памяти: csum_block_size=4k + inmemory_metadata=false
3. Гибридные HDD+SSD: csum_block_size=4k + inmemory_metadata=false
4. Только HDD, быстрее случайное чтение: csum_block_size=32k
5. Только HDD, быстрее случайная запись: csum_block_size=4k +
inmemory_metadata=false + cached_io_meta=true
Вариант 2 рекомендуется для кластеров на одних жёстких дисках (без SSD
под метаданные). На 4 кб блок контрольной суммы памяти в таких кластерах
обычно НЕ хватает.
Вариант 3 рекомендуется для гибридных кластеров (SSD+HDD), потому что
скорости SSD под метаданными хватит, чтобы обработать дополнительные чтения
без снижения производительности. Также вариант 3 *может* рекомендоваться
для All-Flash кластеров на основе NVMe-дисков, когда памяти НЕ достаточно,
потому что NVMe-диски имеют огромный запас производительности по чтению.
В таких случаях, возможно, также имеет смысл включать параметр
[cached_read_meta](osd.ru.md#cached_read_meta).
Смотрите также [cached_io_meta](osd.ru.md#cached_io_meta).

View File

@@ -260,42 +260,46 @@
достаточно 16- или 32-мегабайтного журнала. Однако в теории отключение
параметра может оказаться полезным для гибридных OSD (HDD+SSD) с большими
журналами, расположенными на быстром по сравнению с HDD устройстве.
- name: cached_read_data
- name: cached_io_data
type: bool
default: false
info: |
Read data through Linux page cache, i.e. use a file descriptor opened without
O_DIRECT for data reads. May improve read performance for frequently accessed
data if it fits in RAM. Memory in page cache is shared by all processes and
not accounted in OSD memory consumption.
Read and write *data* through Linux page cache, i.e. use a file descriptor
opened with O_SYNC, but without O_DIRECT for I/O. May improve read
performance for hot data and slower disks - HDDs and maybe SATA SSDs.
Not recommended for desktop SSDs without capacitors because O_SYNC flushes
disk cache on every write.
info_ru: |
Читать данные через системный кэш Linux (page cache), то есть, использовать
для чтения данных файловый дескриптор, открытый без флага O_DIRECT. Может
улучшить производительность чтения для часто используемых данных, если они
помещаются в память. Память кэша разделяется между всеми процессами в
системе и не учитывается в потреблении памяти процессом OSD.
- name: cached_read_meta
Читать и записывать *данные* через системный кэш Linux (page cache), то есть,
использовать для данных файловый дескриптор, открытый без флага O_DIRECT, но
с флагом O_SYNC. Может улучшить скорость чтения для относительно медленных
дисков - HDD и, возможно, SATA SSD. Не рекомендуется для потребительских
SSD без конденсаторов, так как O_SYNC сбрасывает кэш диска при каждой записи.
- name: cached_io_meta
type: bool
default: false
info: |
Read metadata through Linux page cache. May be beneficial when checksums
are enabled and [inmemory_metadata](#inmemory_metadata) is disabled, because
in this case metadata blocks are read from disk to verify checksums on every
read request and caching them may reduce this extra read load.
Read and write *metadata* through Linux page cache. May improve read
performance only if your drives are relatively slow (HDD, SATA SSD), and
only if checksums are enabled and [inmemory_metadata](#inmemory_metadata)
is disabled, because in this case metadata blocks are read from disk
on every read request to verify checksums and caching them may reduce this
extra read load.
Absolutely pointless to enable with enabled inmemory_metadata because all
metadata is kept in memory anyway, and likely pointless without checksums,
because in that case, metadata blocks are read from disk only during journal
flushing.
If the same device is used for data and metadata, enabling [cached_read_data](#cached_read_data)
If the same device is used for data and metadata, enabling [cached_io_data](#cached_io_data)
also enables this parameter, given that it isn't turned off explicitly.
info_ru: |
Читать метаданные через системный кэш Linux. Может быть полезно, когда
включены контрольные суммы, а параметр [inmemory_metadata](#inmemory_metadata)
отключён, так как в этом случае блоки метаданных читаются с диска при каждом
запросе чтения для проверки контрольных сумм и их кэширование может снизить
дополнительную нагрузку на диск.
Читать и записывать *метаданные* через системный кэш Linux. Может улучшить
скорость чтения, если у вас медленные диски, и только если контрольные суммы
включены, а параметр [inmemory_metadata](#inmemory_metadata) отключён, так
как в этом случае блоки метаданных читаются с диска при каждом запросе чтения
для проверки контрольных сумм и их кэширование может снизить дополнительную
нагрузку на диск.
Абсолютно бессмысленно включать данный параметр, если параметр
inmemory_metadata включён (по умолчанию это так), и также вероятно
@@ -304,25 +308,24 @@
журнала.
Если одно и то же устройство используется для данных и метаданных, включение
[cached_read_data](#cached_read_data) также включает данный параметр, при
[cached_io_data](#cached_io_data) также включает данный параметр, при
условии, что он не отключён явным образом.
- name: cached_read_journal
- name: cached_io_journal
type: bool
default: false
info: |
Read buffered data from journal through Linux page cache. Does not have sense
without disabling [inmemory_journal](#inmemory_journal), which, again, is
enabled by default.
Read and write *journal* through Linux page cache. May improve read
performance if [inmemory_journal](#inmemory_journal) is turned off.
If the same device is used for metadata and journal, enabling [cached_read_meta](#cached_read_meta)
If the same device is used for metadata and journal, enabling [cached_io_meta](#cached_io_meta)
also enables this parameter, given that it isn't turned off explicitly.
info_ru: |
Читать буферизованные в журнале данные через системный кэш Linux. Не имеет
смысла без отключения параметра [inmemory_journal](#inmemory_journal),
который, опять же, по умолчанию включён.
Читать и записывать *журнал* через системный кэш Linux. Может улучшить
скорость чтения, если параметр [inmemory_journal](#inmemory_journal)
отключён.
Если одно и то же устройство используется для метаданных и журнала,
включение [cached_read_meta](#cached_read_meta) также включает данный
включение [cached_io_meta](#cached_io_meta) также включает данный
параметр, при условии, что он не отключён явным образом.
- name: journal_sector_buffer_count
type: int

View File

@@ -14,6 +14,8 @@
- Debian 12 (Bookworm/Sid): `deb https://vitastor.io/debian bookworm main`
- Debian 11 (Bullseye): `deb https://vitastor.io/debian bullseye main`
- Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- Add `-oldstable` to bookworm/bullseye/buster in this line to install the last
stable version from 0.9.x branch instead of 1.x
- For Debian 10 (Buster) also enable backports repository:
`deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`

View File

@@ -14,6 +14,8 @@
- Debian 12 (Bookworm/Sid): `deb https://vitastor.io/debian bookworm main`
- Debian 11 (Bullseye): `deb https://vitastor.io/debian bullseye main`
- Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- Добавьте `-oldstable` к слову bookworm/bullseye/buster в этой строке, чтобы
установить последнюю стабильную версию из ветки 0.9.x вместо 1.x
- Для Debian 10 (Buster) также включите репозиторий backports:
`deb http://deb.debian.org/debian buster-backports main`
- Установите пакеты: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`

View File

@@ -21,7 +21,7 @@
## Basic instructions
Download source, for example using git: `git clone --recurse-submodules https://yourcmc.ru/git/vitalif/vitastor/`
Download source, for example using git: `git clone --recurse-submodules https://git.yourcmc.ru/vitalif/vitastor/`
Get `fio` source and symlink it into `<vitastor>/fio`. If you don't want to build fio engine,
you can disable it by passing `-DWITH_FIO=no` to cmake.
@@ -41,7 +41,7 @@ It's recommended to build the QEMU driver (qemu_driver.c) in-tree, as a part of
QEMU build process. To do that:
- Install vitastor client library headers (from source or from vitastor-client-dev package)
- Take a corresponding patch from `patches/qemu-*-vitastor.patch` and apply it to QEMU source
- Copy `src/qemu_driver.c` to QEMU source directory as `block/block-vitastor.c`
- Copy `src/qemu_driver.c` to QEMU source directory as `block/vitastor.c`
- Build QEMU as usual
But it is also possible to build it out-of-tree. To do that:

View File

@@ -21,7 +21,7 @@
## Базовая инструкция
Скачайте исходные коды, например, из git: `git clone --recurse-submodules https://yourcmc.ru/git/vitalif/vitastor/`
Скачайте исходные коды, например, из git: `git clone --recurse-submodules https://git.yourcmc.ru/vitalif/vitastor/`
Скачайте исходные коды пакета `fio`, распакуйте их и создайте символическую ссылку на них
в директории исходников Vitastor: `<vitastor>/fio`. Либо, если вы не хотите собирать плагин fio,
@@ -41,7 +41,7 @@ cmake .. && make -j8 install
Драйвер QEMU (qemu_driver.c) рекомендуется собирать вместе с самим QEMU. Для этого:
- Установите заголовки клиентской библиотеки Vitastor (из исходников или из пакета vitastor-client-dev)
- Возьмите соответствующий патч из `patches/qemu-*-vitastor.patch` и примените его к исходникам QEMU
- Скопируйте [src/qemu_driver.c](../../src/qemu_driver.c) в директорию исходников QEMU как `block/block-vitastor.c`
- Скопируйте [src/qemu_driver.c](../../src/qemu_driver.c) в директорию исходников QEMU как `block/vitastor.c`
- Соберите QEMU как обычно
Однако в целях отладки драйвер также можно собирать отдельно от QEMU. Для этого:
@@ -60,7 +60,7 @@ cmake .. && make -j8 install
* Для QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` и `qapi` нужны, т.к. в них содержатся автогенерируемые заголовки
- Сконфигурируйте cmake Vitastor с `WITH_QEMU=yes` (`cmake .. -DWITH_QEMU=yes`) и, если вы
используете RHEL-подобый дистрибутив, также с `QEMU_PLUGINDIR=qemu-kvm`.
используете RHEL-подобный дистрибутив, также с `QEMU_PLUGINDIR=qemu-kvm`.
- После этого в процессе сборки Vitastor также будет собираться подходящий для вашей
версии QEMU `block-vitastor.so`.
- Таким образом можно использовать драйвер даже с немодифицированным QEMU, но в этом случае

View File

@@ -29,7 +29,7 @@
- Snapshots and copy-on-write image clones
- [Write throttling to smooth random write workloads in SSD+HDD configurations](../config/osd.en.md#throttle_small_writes)
- [RDMA/RoCEv2 support via libibverbs](../config/network.en.md#rdma_device)
- [Scrubbing without checksums](../config/osd.en.md#auto_scrub) (verification of copies)
- [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.en.md#data_csum_type)
## Plugins and tools

View File

@@ -31,7 +31,7 @@
- Снапшоты и copy-on-write клоны
- [Сглаживание производительности случайной записи в SSD+HDD конфигурациях](../config/osd.ru.md#throttle_small_writes)
- [Поддержка RDMA/RoCEv2 через libibverbs](../config/network.ru.md#rdma_device)
- [Фоновая проверка целостности без контрольных сумм](../config/osd.ru.md#auto_scrub) (сверка копий)
- [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
- [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
## Драйверы и инструменты

View File

@@ -102,7 +102,7 @@ checks the device cache status on start and tries to disable cache for SATA/SAS
If it doesn't succeed it issues a warning in the system log.
You can also pass other OSD options here as arguments and they'll be persisted
in the superblock: cached_read_data, cached_read_meta, cached_read_journal,
in the superblock: cached_io_data, cached_io_meta, cached_io_journal,
inmemory_metadata, inmemory_journal, max_write_iodepth,
min_flusher_count, max_flusher_count, journal_sector_buffer_count,
journal_no_same_sector_overwrites, throttle_small_writes, throttle_target_iops,

View File

@@ -103,8 +103,8 @@ vitastor-disk - инструмент командной строки для уп
это не удаётся, в системный журнал выводится предупреждение.
Вы можете передать данной команде и некоторые другие опции OSD в качестве аргументов
и они тоже будут сохранены в суперблок: cached_read_data, cached_read_meta,
cached_read_journal, inmemory_metadata, inmemory_journal, max_write_iodepth,
и они тоже будут сохранены в суперблок: cached_io_data, cached_io_meta,
cached_io_journal, inmemory_metadata, inmemory_journal, max_write_iodepth,
min_flusher_count, max_flusher_count, journal_sector_buffer_count,
journal_no_same_sector_overwrites, throttle_small_writes, throttle_target_iops,
throttle_target_mbs, throttle_target_parallelism, throttle_threshold_us.

View File

@@ -107,7 +107,8 @@ disabled by now, so if you want to try it on Debian, use a kernel from Ubuntu
Commands to attach Vitastor image as a VDUSE device:
```
modprobe vduse virtio-vdpa
modprobe vduse
modprobe virtio-vdpa
qemu-storage-daemon --daemonize --blockdev '{"node-name":"test1","driver":"vitastor",\
"etcd-host":"192.168.7.2:2379/v3","image":"testosd1","cache":{"direct":true,"no-flush":false},"discard":"unmap"}' \
--export vduse-blk,id=test1,node-name=test1,name=test1,num-queues=16,queue-size=128,writable=true

View File

@@ -111,7 +111,8 @@ VDUSE (CONFIG_VIRTIO_VDPA=m и CONFIG_VDPA_USER=m). В ядрах в Debian Linu
Команды для подключения виртуального диска через VDUSE:
```
modprobe vduse virtio-vdpa
modprobe vduse
modprobe virtio-vdpa
qemu-storage-daemon --daemonize --blockdev '{"node-name":"test1","driver":"vitastor",\
"etcd-host":"192.168.7.2:2379/v3","image":"testosd1","cache":{"direct":true,"no-flush":false},"discard":"unmap"}' \
--export vduse-blk,id=test1,node-name=test1,name=test1,num-queues=16,queue-size=128,writable=true

View File

@@ -539,10 +539,18 @@ class Mon
{
retries = 1;
}
const tried = {};
while (retries < 0 || retry < retries)
{
const cur_addr = this.pick_next_etcd();
const base = 'ws'+cur_addr.substr(4);
let now = Date.now();
if (tried[base] && now-tried[base] < timeout)
{
await new Promise(ok => setTimeout(ok, timeout-(now-tried[base])));
now = Date.now();
}
tried[base] = now;
const ok = await new Promise((ok, no) =>
{
const timer_id = setTimeout(() =>
@@ -1497,10 +1505,14 @@ class Mon
break;
}
}
const pool_cfg = (this.state.config.pools[pool_id]||{});
if (!object_size)
{
object_size = (this.state.config.pools[pool_id]||{}).block_size ||
this.config.block_size || 131072;
object_size = pool_cfg.block_size || this.config.block_size || 131072;
}
if (pool_cfg.scheme !== 'replicated')
{
object_size *= ((pool_cfg.pg_size||0) - (pool_cfg.parity_chunks||0));
}
object_size = BigInt(object_size);
for (const pg_num in this.state.pg.stats[pool_id])
@@ -1784,10 +1796,18 @@ class Mon
{
retries = 1;
}
const tried = {};
while (retries < 0 || retry < retries)
{
retry++;
const base = this.pick_next_etcd();
let now = Date.now();
if (tried[base] && now-tried[base] < timeout)
{
await new Promise(ok => setTimeout(ok, timeout-(now-tried[base])));
now = Date.now();
}
tried[base] = now;
const res = await POST(base+path, body, timeout);
if (res.error)
{

View File

@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.9.3'
VERSION = '1.0.0'
LOG = logging.getLogger(__name__)

View File

@@ -0,0 +1,176 @@
diff --git a/block/Makefile.objs b/block/Makefile.objs
index d644bac60a..e404236291 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -19,6 +19,7 @@ block-obj-$(if $(CONFIG_LIBISCSI),y,n) += iscsi-opts.o
block-obj-$(CONFIG_LIBNFS) += nfs.o
block-obj-$(CONFIG_CURL) += curl.o
block-obj-$(CONFIG_RBD) += rbd.o
+block-obj-$(CONFIG_VITASTOR) += vitastor.o
block-obj-$(CONFIG_GLUSTERFS) += gluster.o
block-obj-$(CONFIG_VXHS) += vxhs.o
block-obj-$(CONFIG_LIBSSH2) += ssh.o
@@ -39,6 +40,8 @@ curl.o-cflags := $(CURL_CFLAGS)
curl.o-libs := $(CURL_LIBS)
rbd.o-cflags := $(RBD_CFLAGS)
rbd.o-libs := $(RBD_LIBS)
+vitastor.o-cflags := $(VITASTOR_CFLAGS)
+vitastor.o-libs := $(VITASTOR_LIBS)
gluster.o-cflags := $(GLUSTERFS_CFLAGS)
gluster.o-libs := $(GLUSTERFS_LIBS)
vxhs.o-libs := $(VXHS_LIBS)
diff --git a/configure b/configure
index 0a19b033bc..58b7fbf24c 100755
--- a/configure
+++ b/configure
@@ -398,6 +398,7 @@ trace_backends="log"
trace_file="trace"
spice=""
rbd=""
+vitastor=""
smartcard=""
libusb=""
usb_redir=""
@@ -1213,6 +1214,10 @@ for opt do
;;
--enable-rbd) rbd="yes"
;;
+ --disable-vitastor) vitastor="no"
+ ;;
+ --enable-vitastor) vitastor="yes"
+ ;;
--disable-xfsctl) xfs="no"
;;
--enable-xfsctl) xfs="yes"
@@ -1601,6 +1606,7 @@ disabled with --disable-FEATURE, default is enabled if available:
vhost-crypto vhost-crypto acceleration support
spice spice
rbd rados block device (rbd)
+ vitastor vitastor block device
libiscsi iscsi support
libnfs nfs support
smartcard smartcard support (libcacard)
@@ -3594,6 +3600,27 @@ EOF
fi
fi
+##########################################
+# vitastor probe
+if test "$vitastor" != "no" ; then
+ cat > $TMPC <<EOF
+#include <vitastor_c.h>
+int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+}
+EOF
+ vitastor_libs="-lvitastor_client"
+ if compile_prog "" "$vitastor_libs" ; then
+ vitastor=yes
+ else
+ if test "$vitastor" = "yes" ; then
+ feature_not_found "vitastor block device" "Install vitastor-client-dev"
+ fi
+ vitastor=no
+ fi
+fi
+
##########################################
# libssh2 probe
min_libssh2_version=1.2.8
@@ -5837,6 +5864,7 @@ echo "Trace output file $trace_file-<pid>"
fi
echo "spice support $spice $(echo_version $spice $spice_protocol_version/$spice_server_version)"
echo "rbd support $rbd"
+echo "vitastor support $vitastor"
echo "xfsctl support $xfs"
echo "smartcard support $smartcard"
echo "libusb $libusb"
@@ -6416,6 +6444,11 @@ if test "$rbd" = "yes" ; then
echo "RBD_CFLAGS=$rbd_cflags" >> $config_host_mak
echo "RBD_LIBS=$rbd_libs" >> $config_host_mak
fi
+if test "$vitastor" = "yes" ; then
+ echo "CONFIG_VITASTOR=m" >> $config_host_mak
+ echo "VITASTOR_CFLAGS=$vitastor_cflags" >> $config_host_mak
+ echo "VITASTOR_LIBS=$vitastor_libs" >> $config_host_mak
+fi
echo "CONFIG_COROUTINE_BACKEND=$coroutine" >> $config_host_mak
if test "$coroutine_pool" = "yes" ; then
diff --git a/qapi/block-core.json b/qapi/block-core.json
index c50517bff3..c780bb2c1c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2514,7 +2514,7 @@
'dmg', 'file', 'ftp', 'ftps', 'gluster', 'host_cdrom',
'host_device', 'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs',
'null-aio', 'null-co', 'nvme', 'parallels', 'qcow', 'qcow2', 'qed',
- 'quorum', 'raw', 'rbd', 'replication', 'sheepdog', 'ssh',
+ 'quorum', 'raw', 'rbd', 'vitastor', 'replication', 'sheepdog', 'ssh',
'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3217,6 +3217,28 @@
'*snap-id': 'uint32',
'*tag': 'str' } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -3547,6 +3569,7 @@
'rbd': 'BlockdevOptionsRbd',
'replication':'BlockdevOptionsReplication',
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -3991,6 +4014,17 @@
'*subformat': 'BlockdevVhdxSubformat',
'*block-state-zero': 'bool' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVpcSubformat:
#
@@ -4074,6 +4108,7 @@
'rbd': 'BlockdevCreateOptionsRbd',
'replication': 'BlockdevCreateNotSupported',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'throttle': 'BlockdevCreateNotSupported',
'vdi': 'BlockdevCreateOptionsVdi',

View File

@@ -0,0 +1,181 @@
Index: qemu-5.2+dfsg/qapi/block-core.json
===================================================================
--- qemu-5.2+dfsg.orig/qapi/block-core.json
+++ qemu-5.2+dfsg/qapi/block-core.json
@@ -2831,7 +2831,7 @@
'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
- 'sheepdog',
+ 'sheepdog', 'vitastor',
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
##
@@ -3668,6 +3668,28 @@
'*tag': 'str' } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4015,6 +4037,7 @@
'replication': { 'type': 'BlockdevOptionsReplication',
'if': 'defined(CONFIG_REPLICATION)' },
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -4404,6 +4427,17 @@
'*cluster-size' : 'size' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -4665,6 +4699,7 @@
'qed': 'BlockdevCreateOptionsQed',
'rbd': 'BlockdevCreateOptionsRbd',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
Index: qemu-5.2+dfsg/block/meson.build
===================================================================
--- qemu-5.2+dfsg.orig/block/meson.build
+++ qemu-5.2+dfsg/block/meson.build
@@ -76,6 +76,7 @@ foreach m : [
['CONFIG_LIBNFS', 'nfs', libnfs, 'nfs.c'],
['CONFIG_LIBSSH', 'ssh', libssh, 'ssh.c'],
['CONFIG_RBD', 'rbd', rbd, 'rbd.c'],
+ ['CONFIG_VITASTOR', 'vitastor', vitastor, 'vitastor.c'],
]
if config_host.has_key(m[0])
if enable_modules
Index: qemu-5.2+dfsg/configure
===================================================================
--- qemu-5.2+dfsg.orig/configure
+++ qemu-5.2+dfsg/configure
@@ -372,6 +372,7 @@ trace_backends="log"
trace_file="trace"
spice=""
rbd=""
+vitastor=""
smartcard=""
u2f="auto"
libusb=""
@@ -1263,6 +1264,10 @@ for opt do
;;
--enable-rbd) rbd="yes"
;;
+ --disable-vitastor) vitastor="no"
+ ;;
+ --enable-vitastor) vitastor="yes"
+ ;;
--disable-xfsctl) xfs="no"
;;
--enable-xfsctl) xfs="yes"
@@ -1827,6 +1832,7 @@ disabled with --disable-FEATURE, default
vhost-vdpa vhost-vdpa kernel backend support
spice spice
rbd rados block device (rbd)
+ vitastor vitastor block device
libiscsi iscsi support
libnfs nfs support
smartcard smartcard support (libcacard)
@@ -3719,6 +3725,27 @@ EOF
fi
##########################################
+# vitastor probe
+if test "$vitastor" != "no" ; then
+ cat > $TMPC <<EOF
+#include <vitastor_c.h>
+int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+}
+EOF
+ vitastor_libs="-lvitastor_client"
+ if compile_prog "" "$vitastor_libs" ; then
+ vitastor=yes
+ else
+ if test "$vitastor" = "yes" ; then
+ feature_not_found "vitastor block device" "Install vitastor-client-dev"
+ fi
+ vitastor=no
+ fi
+fi
+
+##########################################
# libssh probe
if test "$libssh" != "no" ; then
if $pkg_config --exists libssh; then
@@ -6456,6 +6483,10 @@ if test "$rbd" = "yes" ; then
echo "CONFIG_RBD=y" >> $config_host_mak
echo "RBD_LIBS=$rbd_libs" >> $config_host_mak
fi
+if test "$vitastor" = "yes" ; then
+ echo "CONFIG_VITASTOR=y" >> $config_host_mak
+ echo "VITASTOR_LIBS=$vitastor_libs" >> $config_host_mak
+fi
echo "CONFIG_COROUTINE_BACKEND=$coroutine" >> $config_host_mak
if test "$coroutine_pool" = "yes" ; then
Index: qemu-5.2+dfsg/meson.build
===================================================================
--- qemu-5.2+dfsg.orig/meson.build
+++ qemu-5.2+dfsg/meson.build
@@ -596,6 +596,10 @@ rbd = not_found
if 'CONFIG_RBD' in config_host
rbd = declare_dependency(link_args: config_host['RBD_LIBS'].split())
endif
+vitastor = not_found
+if 'CONFIG_VITASTOR' in config_host
+ vitastor = declare_dependency(link_args: config_host['VITASTOR_LIBS'].split())
+endif
glusterfs = not_found
if 'CONFIG_GLUSTERFS' in config_host
glusterfs = declare_dependency(compile_args: config_host['GLUSTERFS_CFLAGS'].split(),
@@ -2145,6 +2149,7 @@ endif
# TODO: add back protocol and server version
summary_info += {'spice support': config_host.has_key('CONFIG_SPICE')}
summary_info += {'rbd support': config_host.has_key('CONFIG_RBD')}
+summary_info += {'vitastor support': config_host.has_key('CONFIG_VITASTOR')}
summary_info += {'xfsctl support': config_host.has_key('CONFIG_XFS')}
summary_info += {'smartcard support': config_host.has_key('CONFIG_SMARTCARD')}
summary_info += {'U2F support': u2f.found()}

View File

@@ -24,4 +24,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.9.3/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.9.3$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-1.0.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.0.0$(rpm --eval '%dist').tar.gz *

View File

@@ -22,7 +22,7 @@
Name: qemu-kvm
Version: 4.2.0
-Release: 29.vitastor%{?dist}.6
+Release: 32.vitastor%{?dist}.6
+Release: 34.vitastor%{?dist}.6
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY

View File

@@ -13,7 +13,7 @@
Name: qemu-kvm
Version: 4.2.0
-Release: 29%{?dist}.6
+Release: 32.vitastor%{?dist}.6
+Release: 33.vitastor%{?dist}.6
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY

View File

@@ -0,0 +1,103 @@
--- qemu-kvm-6.2.spec.orig 2023-07-18 13:52:57.636625440 +0000
+++ qemu-kvm-6.2.spec 2023-07-18 13:52:19.011683886 +0000
@@ -73,6 +73,7 @@ Requires: %{name}-hw-usbredir = %{epoch}
%endif \
Requires: %{name}-block-iscsi = %{epoch}:%{version}-%{release} \
Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
+Requires: %{name}-block-vitastor = %{epoch}:%{version}-%{release}\
Requires: %{name}-block-ssh = %{epoch}:%{version}-%{release}
# Macro to properly setup RHEL/RHEV conflict handling
@@ -83,7 +84,7 @@ Obsoletes: %1-rhev <= %{epoch}:%{version
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 6.2.0
-Release: 32%{?rcrel}%{?dist}
+Release: 32.vitastor%{?rcrel}%{?dist}
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY
@@ -122,6 +123,7 @@ Source37: tests_data_acpi_pc_SSDT.dimmpx
Source38: tests_data_acpi_q35_FACP.slic
Source39: tests_data_acpi_q35_SSDT.dimmpxm
Source40: tests_data_acpi_virt_SSDT.memhp
+Source41: qemu-vitastor.c
Patch0001: 0001-redhat-Adding-slirp-to-the-exploded-tree.patch
Patch0005: 0005-Initial-redhat-build.patch
@@ -652,6 +654,7 @@ Patch255: kvm-scsi-protect-req-aiocb-wit
Patch256: kvm-dma-helpers-prevent-dma_blk_cb-vs-dma_aio_cancel-rac.patch
# For bz#2090990 - qemu crash with error scsi_req_unref(SCSIRequest *): Assertion `req->refcount > 0' failed or scsi_dma_complete(void *, int): Assertion `r->req.aiocb != NULL' failed [8.7.0]
Patch257: kvm-virtio-scsi-reset-SCSI-devices-from-main-loop-thread.patch
+Patch258: qemu-6.2-vitastor.patch
BuildRequires: wget
BuildRequires: rpm-build
@@ -689,6 +692,7 @@ BuildRequires: libcurl-devel
BuildRequires: libssh-devel
BuildRequires: librados-devel
BuildRequires: librbd-devel
+BuildRequires: vitastor-client-devel
%if %{have_gluster}
# For gluster block driver
BuildRequires: glusterfs-api-devel
@@ -926,6 +930,14 @@ Install this package if you want to acce
using the rbd protocol.
+%package block-vitastor
+Summary: QEMU Vitastor block driver
+Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
+
+%description block-vitastor
+This package provides the additional Vitastor block driver for QEMU.
+
+
%package block-ssh
Summary: QEMU SSH block driver
Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
@@ -979,6 +991,7 @@ This package provides usbredir support.
rm -fr slirp
mkdir slirp
%autopatch -p1
+cp %{SOURCE41} ./block/vitastor.c
%global qemu_kvm_build qemu_kvm_build
mkdir -p %{qemu_kvm_build}
@@ -994,7 +1007,7 @@ cp -f %{SOURCE40} tests/data/acpi/virt/S
# --build-id option is used for giving info to the debug packages.
buildldflags="VL_LDFLAGS=-Wl,--build-id"
-%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
+%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle
%if 0%{have_gluster}
%global block_drivers_list %{block_drivers_list},gluster
@@ -1149,9 +1162,7 @@ pushd %{qemu_kvm_build}
--firmwarepath=%{_prefix}/share/qemu-firmware \
--meson="git" \
--target-list="%{buildarch}" \
- --block-drv-rw-whitelist=%{block_drivers_list} \
--audio-drv-list= \
- --block-drv-ro-whitelist=vmdk,vhdx,vpc,https,ssh \
--with-coroutine=ucontext \
--with-git=git \
--tls-priority=@QEMU,SYSTEM \
@@ -1197,6 +1208,7 @@ pushd %{qemu_kvm_build}
%endif
--enable-pie \
--enable-rbd \
+ --enable-vitastor \
%if 0%{have_librdma}
--enable-rdma \
%endif
@@ -1794,6 +1806,9 @@ sh %{_sysconfdir}/sysconfig/modules/kvm.
%files block-rbd
%{_libdir}/qemu-kvm/block-rbd.so
+%files block-vitastor
+%{_libdir}/qemu-kvm/block-vitastor.so
+
%files block-ssh
%{_libdir}/qemu-kvm/block-ssh.so

View File

@@ -0,0 +1,93 @@
--- qemu-kvm-7.2.spec.orig 2023-06-22 13:56:19.000000000 +0000
+++ qemu-kvm-7.2.spec 2023-07-18 07:55:22.347090196 +0000
@@ -100,8 +100,6 @@
%endif
%global target_list %{kvm_target}-softmmu
-%global block_drivers_rw_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,compress
-%global block_drivers_ro_list vdi,vmdk,vhdx,vpc,https
%define qemudocdir %{_docdir}/%{name}
%global firmwaredirs "%{_datadir}/qemu-firmware:%{_datadir}/ipxe/qemu:%{_datadir}/seavgabios:%{_datadir}/seabios"
@@ -126,6 +124,7 @@ Requires: %{name}-device-usb-host = %{ep
Requires: %{name}-device-usb-redirect = %{epoch}:%{version}-%{release} \
%endif \
Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
+Requires: %{name}-block-vitastor = %{epoch}:%{version}-%{release}\
Requires: %{name}-audio-pa = %{epoch}:%{version}-%{release}
# Since SPICE is removed from RHEL-9, the following Obsoletes:
@@ -148,7 +147,7 @@ Obsoletes: %{name}-block-ssh <= %{epoch}
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 7.2.0
-Release: 14%{?rcrel}%{?dist}%{?cc_suffix}.1
+Release: 14.vitastor%{?rcrel}%{?dist}%{?cc_suffix}.1
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
# Epoch 15 used for RHEL 8
# Epoch 17 used for RHEL 9 (due to release versioning offset in RHEL 8.5)
@@ -171,6 +170,7 @@ Source28: 95-kvm-memlock.conf
Source30: kvm-s390x.conf
Source31: kvm-x86.conf
Source36: README.tests
+Source37: qemu-vitastor.c
Patch0004: 0004-Initial-redhat-build.patch
@@ -418,6 +418,7 @@ Patch134: kvm-target-i386-Fix-BZHI-instr
Patch135: kvm-intel-iommu-fail-DEVIOTLB_UNMAP-without-dt-mode.patch
# For bz#2203745 - Disk detach is unsuccessful while the guest is still booting [rhel-9.2.0.z]
Patch136: kvm-acpi-pcihp-allow-repeating-hot-unplug-requests.patch
+Patch137: qemu-7.2-vitastor.patch
%if %{have_clang}
BuildRequires: clang
@@ -449,6 +450,7 @@ BuildRequires: libcurl-devel
%if %{have_block_rbd}
BuildRequires: librbd-devel
%endif
+BuildRequires: vitastor-client-devel
# We need both because the 'stap' binary is probed for by configure
BuildRequires: systemtap
BuildRequires: systemtap-sdt-devel
@@ -642,6 +644,14 @@ using the rbd protocol.
%endif
+%package block-vitastor
+Summary: QEMU Vitastor block driver
+Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
+
+%description block-vitastor
+This package provides the additional Vitastor block driver for QEMU.
+
+
%package audio-pa
Summary: QEMU PulseAudio audio driver
Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
@@ -719,6 +729,7 @@ This package provides usbredir support.
%prep
%setup -q -n qemu-%{version}%{?rcstr}
%autopatch -p1
+cp %{SOURCE37} ./block/vitastor.c
%global qemu_kvm_build qemu_kvm_build
mkdir -p %{qemu_kvm_build}
@@ -946,6 +957,7 @@ run_configure \
%if %{have_block_rbd}
--enable-rbd \
%endif
+ --enable-vitastor \
%if %{have_librdma}
--enable-rdma \
%endif
@@ -1426,6 +1438,9 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%files block-rbd
%{_libdir}/%{name}/block-rbd.so
%endif
+%files block-vitastor
+%{_libdir}/%{name}/block-vitastor.so
+
%files audio-pa
%{_libdir}/%{name}/audio-pa.so

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.9.3.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.0.0.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.9.3
Version: 1.0.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.9.3.el7.tar.gz
Source0: vitastor-1.0.0.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.9.3.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.0.0.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.9.3
Version: 1.0.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.9.3.el8.tar.gz
Source0: vitastor-1.0.0.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@@ -18,7 +18,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.9.3.el9.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.0.0.el9.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.9.3
Version: 1.0.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.9.3.el9.tar.gz
Source0: vitastor-1.0.0.el9.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.9.3")
add_definitions(-DVERSION="1.0.0")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)

View File

@@ -19,8 +19,8 @@ bool string_to_addr(std::string str, bool parse_port, int default_port, struct s
if (p != std::string::npos && !(str.length() > 0 && str[p-1] == ']')) // "[ipv6]" which contains ':'
{
char null_byte = 0;
int n = sscanf(str.c_str()+p+1, "%d%c", &default_port, &null_byte);
if (n != 1 || default_port >= 0x10000)
int scanned = sscanf(str.c_str()+p+1, "%d%c", &default_port, &null_byte);
if (scanned != 1 || default_port >= 0x10000)
return false;
str = str.substr(0, p);
}

View File

@@ -45,13 +45,13 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
meta_block_size = parse_size(config["meta_block_size"]);
bitmap_granularity = parse_size(config["bitmap_granularity"]);
meta_format = stoull_full(config["meta_format"]);
cached_read_data = config["cached_read_data"] == "true" || config["cached_read_data"] == "yes" || config["cached_read_data"] == "1";
cached_read_meta = cached_read_data && (meta_device == data_device || meta_device == "") &&
config.find("cached_read_meta") == config.end() ||
config["cached_read_meta"] == "true" || config["cached_read_meta"] == "yes" || config["cached_read_meta"] == "1";
cached_read_journal = cached_read_meta && (journal_device == meta_device || journal_device == "") &&
config.find("cached_read_journal") == config.end() ||
config["cached_read_journal"] == "true" || config["cached_read_journal"] == "yes" || config["cached_read_journal"] == "1";
cached_io_data = config["cached_io_data"] == "true" || config["cached_io_data"] == "yes" || config["cached_io_data"] == "1";
cached_io_meta = cached_io_data && (meta_device == data_device || meta_device == "") &&
config.find("cached_io_meta") == config.end() ||
config["cached_io_meta"] == "true" || config["cached_io_meta"] == "yes" || config["cached_io_meta"] == "1";
cached_io_journal = cached_io_meta && (journal_device == meta_device || journal_device == "") &&
config.find("cached_io_journal") == config.end() ||
config["cached_io_journal"] == "true" || config["cached_io_journal"] == "yes" || config["cached_io_journal"] == "1";
if (config["data_csum_type"] == "crc32c")
{
data_csum_type = BLOCKSTORE_CSUM_CRC32C;
@@ -274,7 +274,7 @@ static void check_size(int fd, uint64_t *size, uint64_t *sectsize, std::string n
void blockstore_disk_t::open_data()
{
data_fd = open(data_device.c_str(), O_DIRECT|O_RDWR);
data_fd = open(data_device.c_str(), (cached_io_data ? O_SYNC : O_DIRECT) | O_RDWR);
if (data_fd == -1)
{
throw std::runtime_error("Failed to open data device "+data_device+": "+std::string(strerror(errno)));
@@ -295,25 +295,13 @@ void blockstore_disk_t::open_data()
{
throw std::runtime_error(std::string("Failed to lock data device: ") + strerror(errno));
}
if (cached_read_data)
{
read_data_fd = open(data_device.c_str(), O_RDWR);
if (read_data_fd == -1)
{
throw std::runtime_error("Failed to open data device "+data_device+": "+std::string(strerror(errno)));
}
}
else
{
read_data_fd = data_fd;
}
}
void blockstore_disk_t::open_meta()
{
if (meta_device != data_device)
if (meta_device != data_device || cached_io_meta != cached_io_data)
{
meta_fd = open(meta_device.c_str(), O_DIRECT|O_RDWR);
meta_fd = open(meta_device.c_str(), (cached_io_meta ? O_SYNC : O_DIRECT) | O_RDWR);
if (meta_fd == -1)
{
throw std::runtime_error("Failed to open metadata device "+meta_device+": "+std::string(strerror(errno)));
@@ -323,22 +311,10 @@ void blockstore_disk_t::open_meta()
{
throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(meta_device_size));
}
if (!disable_flock && flock(meta_fd, LOCK_EX|LOCK_NB) != 0)
if (!disable_flock && meta_device != data_device && flock(meta_fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock metadata device: ") + strerror(errno));
}
if (cached_read_meta)
{
read_meta_fd = open(meta_device.c_str(), O_RDWR);
if (read_meta_fd == -1)
{
throw std::runtime_error("Failed to open metadata device "+meta_device+": "+std::string(strerror(errno)));
}
}
else
{
read_meta_fd = meta_fd;
}
}
else
{
@@ -357,35 +333,19 @@ void blockstore_disk_t::open_meta()
") is not a multiple of data device sector size ("+std::to_string(meta_device_sect)+")"
);
}
if (!cached_read_meta)
{
read_meta_fd = meta_fd;
}
else if (meta_device == data_device && cached_read_data)
{
read_meta_fd = read_data_fd;
}
else
{
read_meta_fd = open(meta_device.c_str(), O_RDWR);
if (read_meta_fd == -1)
{
throw std::runtime_error("Failed to open metadata device "+meta_device+": "+std::string(strerror(errno)));
}
}
}
void blockstore_disk_t::open_journal()
{
if (journal_device != meta_device)
if (journal_device != meta_device || cached_io_journal != cached_io_meta)
{
journal_fd = open(journal_device.c_str(), O_DIRECT|O_RDWR);
journal_fd = open(journal_device.c_str(), (cached_io_journal ? O_SYNC : O_DIRECT) | O_RDWR);
if (journal_fd == -1)
{
throw std::runtime_error("Failed to open journal device "+journal_device+": "+std::string(strerror(errno)));
}
check_size(journal_fd, &journal_device_size, &journal_device_sect, "journal device");
if (!disable_flock && flock(journal_fd, LOCK_EX|LOCK_NB) != 0)
if (!disable_flock && journal_device != meta_device && flock(journal_fd, LOCK_EX|LOCK_NB) != 0)
{
throw std::runtime_error(std::string("Failed to lock journal device: ") + strerror(errno));
}
@@ -407,26 +367,6 @@ void blockstore_disk_t::open_journal()
") is not a multiple of journal device sector size ("+std::to_string(journal_device_sect)+")"
);
}
if (!cached_read_journal)
{
read_journal_fd = journal_fd;
}
else if (journal_device == meta_device && cached_read_meta)
{
read_journal_fd = read_meta_fd;
}
else if (journal_device == data_device && cached_read_data)
{
read_journal_fd = read_data_fd;
}
else
{
read_journal_fd = open(journal_device.c_str(), O_RDWR);
if (read_journal_fd == -1)
{
throw std::runtime_error("Failed to open journal device "+journal_device+": "+std::string(strerror(errno)));
}
}
}
void blockstore_disk_t::close_all()
@@ -437,12 +377,5 @@ void blockstore_disk_t::close_all()
close(meta_fd);
if (journal_fd >= 0 && journal_fd != meta_fd)
close(journal_fd);
if (read_data_fd >= 0 && read_data_fd != data_fd)
close(read_data_fd);
if (read_meta_fd >= 0 && read_meta_fd != meta_fd)
close(read_meta_fd);
if (read_journal_fd >= 0 && read_journal_fd != journal_fd)
close(read_journal_fd);
data_fd = meta_fd = journal_fd = -1;
read_data_fd = read_meta_fd = read_journal_fd = -1;
}

View File

@@ -31,11 +31,10 @@ struct blockstore_disk_t
uint32_t csum_block_size = 4096;
// By default, Blockstore locks all opened devices exclusively. This option can be used to disable locking
bool disable_flock = false;
// Use linux page cache for reads. If enabled, separate buffered FDs will be opened for reading
bool cached_read_data = false, cached_read_meta = false, cached_read_journal = false;
// Use Linux page cache for reads and writes, i.e. open FDs with O_SYNC instead of O_DIRECT
bool cached_io_data = false, cached_io_meta = false, cached_io_journal = false;
int meta_fd = -1, data_fd = -1, journal_fd = -1;
int read_meta_fd = -1, read_data_fd = -1, read_journal_fd = -1;
uint64_t meta_offset, meta_device_sect, meta_device_size, meta_len, meta_format = 0;
uint64_t data_offset, data_device_sect, data_device_size, data_len;
uint64_t journal_offset, journal_device_sect, journal_device_size, journal_len;

View File

@@ -1087,7 +1087,7 @@ bool journal_flusher_co::read_dirty(int wait_base)
data->iov = (struct iovec){ vi.buf, vi.len };
data->callback = simple_callback_r;
my_uring_prep_readv(
sqe, bs->dsk.read_data_fd, &data->iov, 1, bs->dsk.data_offset + old_clean_loc + vi.offset
sqe, bs->dsk.data_fd, &data->iov, 1, bs->dsk.data_offset + old_clean_loc + vi.offset
);
wait_count++;
bs->find_holes(v, vi.offset, vi.offset+vi.len, [this, buf = (uint8_t*)vi.buf-vi.offset](int pos, bool alloc, uint32_t cur_start, uint32_t cur_end)
@@ -1119,7 +1119,7 @@ bool journal_flusher_co::read_dirty(int wait_base)
data->iov = (struct iovec){ v[i].buf, (size_t)v[i].len };
data->callback = simple_callback_rj;
my_uring_prep_readv(
sqe, bs->dsk.read_journal_fd, &data->iov, 1, bs->journal.offset + v[i].disk_offset
sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset + v[i].disk_offset
);
wait_journal_count++;
}
@@ -1212,7 +1212,7 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
data->callback = simple_callback_r;
wr.submitted = true;
my_uring_prep_readv(
sqe, bs->dsk.read_meta_fd, &data->iov, 1, bs->dsk.meta_offset + bs->dsk.meta_block_size + wr.sector
sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bs->dsk.meta_block_size + wr.sector
);
wait_count++;
}

View File

@@ -393,6 +393,7 @@ void blockstore_impl_t::init_op(blockstore_op_t *op)
{
// Call constructor without allocating memory. We'll call destructor before returning op back
new ((void*)op->private_data) blockstore_op_private_t;
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->wait_for = 0;
PRIV(op)->op_state = 0;
PRIV(op)->pending_ops = 0;

View File

@@ -210,7 +210,7 @@ struct blockstore_op_private_t
std::vector<copy_buffer_t> read_vec;
// Sync, write
int min_flushed_journal_sector, max_flushed_journal_sector;
uint64_t min_flushed_journal_sector, max_flushed_journal_sector;
// Write
struct iovec iov_zerofill[3];
@@ -220,7 +220,6 @@ struct blockstore_op_private_t
// Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
int sync_small_checked, sync_big_checked;
};
typedef uint32_t pool_id_t;

View File

@@ -65,7 +65,7 @@ int blockstore_init_meta::loop()
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.read_meta_fd, &data->iov, 1, bs->dsk.meta_offset);
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
bs->ringloop->submit();
submitted++;
resume_1:
@@ -202,7 +202,7 @@ resume_2:
data->iov = { bufs[i].buf, bufs[i].size };
data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->dsk.read_meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
else
{
// Fill metadata with zeroes
@@ -259,7 +259,7 @@ resume_2:
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.read_meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
submitted++;
resume_5:
if (submitted > 0)
@@ -467,7 +467,7 @@ int blockstore_init_journal::loop()
data = ((ring_data_t*)sqe->user_data);
data->iov = { submitted_buf, bs->journal.block_size };
data->callback = simple_callback;
my_uring_prep_readv(sqe, bs->dsk.read_journal_fd, &data->iov, 1, bs->journal.offset);
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
bs->ringloop->submit();
wait_count = 1;
resume_1:
@@ -607,7 +607,7 @@ resume_1:
end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE,
};
data->callback = [this](ring_data_t *data1) { handle_event(data1); };
my_uring_prep_readv(sqe, bs->dsk.read_journal_fd, &data->iov, 1, bs->journal.offset + journal_pos);
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset + journal_pos);
bs->ringloop->submit();
}
while (done.size() > 0)

View File

@@ -198,6 +198,7 @@ void blockstore_impl_t::prepare_journal_sector_write(int cur_sector, blockstore_
priv->pending_ops++;
if (!priv->min_flushed_journal_sector)
priv->min_flushed_journal_sector = 1+cur_sector;
assert(priv->min_flushed_journal_sector <= journal.sector_count);
priv->max_flushed_journal_sector = 1+cur_sector;
}

View File

@@ -85,11 +85,13 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
immediate_commit = IMMEDIATE_SMALL;
}
metadata_buf_size = strtoull(config["meta_buf_size"].c_str(), NULL, 10);
inmemory_meta = config["inmemory_metadata"] != "false";
inmemory_meta = config["inmemory_metadata"] != "false" && config["inmemory_metadata"] != "0" &&
config["inmemory_metadata"] != "no";
journal.sector_count = strtoull(config["journal_sector_buffer_count"].c_str(), NULL, 10);
journal.no_same_sector_overwrites = config["journal_no_same_sector_overwrites"] == "true" ||
config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes";
journal.inmemory = config["inmemory_journal"] != "false";
journal.inmemory = config["inmemory_journal"] != "false" && config["inmemory_journal"] != "0" &&
config["inmemory_journal"] != "no";
// Validate
if (journal.sector_count < 2)
{

View File

@@ -29,7 +29,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
PRIV(op)->pending_ops++;
my_uring_prep_readv(
sqe,
IS_JOURNAL(item_state) ? dsk.read_journal_fd : dsk.read_data_fd,
IS_JOURNAL(item_state) ? dsk.journal_fd : dsk.data_fd,
&data->iov, 1,
(IS_JOURNAL(item_state) ? dsk.journal_offset : dsk.data_offset) + offset
);
@@ -348,7 +348,7 @@ bool blockstore_impl_t::read_checksum_block(blockstore_op_t *op, int rv_pos, uin
.csum_buf = vi->csum_buf,
.dyn_data = vi->dyn_data,
};
int submit_fd = (vi->copy_flags & COPY_BUF_JOURNAL ? dsk.read_journal_fd : dsk.read_data_fd);
int submit_fd = (vi->copy_flags & COPY_BUF_JOURNAL ? dsk.journal_fd : dsk.data_fd);
uint64_t submit_offset = (vi->copy_flags & COPY_BUF_JOURNAL ? journal.offset : dsk.data_offset);
uint32_t d_pos = 0;
for (int n_pos = 0; n_pos < n_iov; n_pos += IOV_MAX)
@@ -702,7 +702,7 @@ uint8_t* blockstore_impl_t::read_clean_meta_block(blockstore_op_t *op, uint64_t
BS_SUBMIT_GET_SQE(sqe, data);
data->iov = (struct iovec){ buf, dsk.meta_block_size };
PRIV(op)->pending_ops++;
my_uring_prep_readv(sqe, dsk.read_meta_fd, &data->iov, 1, dsk.meta_offset + dsk.meta_block_size + sector);
my_uring_prep_readv(sqe, dsk.meta_fd, &data->iov, 1, dsk.meta_offset + dsk.meta_block_size + sector);
data->callback = [this, op](ring_data_t *data) { handle_read_event(data, op); };
// return pointer to checksums + bitmap
return buf + pos + sizeof(clean_disk_entry);

View File

@@ -27,8 +27,6 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
unsynced_big_write_count -= unsynced_big_writes.size();
PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
PRIV(op)->sync_small_writes.swap(unsynced_small_writes);
PRIV(op)->sync_small_checked = 0;
PRIV(op)->sync_big_checked = 0;
unsynced_big_writes.clear();
unsynced_small_writes.clear();
if (PRIV(op)->sync_big_writes.size() > 0)

View File

@@ -286,13 +286,18 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
#endif
auto prev_it = dirty_it;
prev_it--;
if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version)
if (prev_it != dirty_db.begin())
{
// Original version is still invalid
// All subsequent writes to the same object must be canceled too
cancel_all_writes(op, dirty_it, -EEXIST);
return 2;
prev_it--;
if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version)
{
// Original version is still invalid
// All subsequent writes to the same object must be canceled too
printf("Tried to write %lx:%lx v%lu after delete (old version v%lu), but already have v%lu\n",
op->oid.inode, op->oid.stripe, PRIV(op)->real_version, op->version, prev_it->first.version);
cancel_all_writes(op, dirty_it, -EEXIST);
return 2;
}
}
op->version = PRIV(op)->real_version;
PRIV(op)->real_version = 0;
@@ -378,7 +383,6 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
sqe, dsk.data_fd, PRIV(op)->iov_zerofill, vcnt, dsk.data_offset + (loc << dsk.block_order) + op->offset - stripe_offset
);
PRIV(op)->pending_ops = 1;
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
if (immediate_commit != IMMEDIATE_ALL)
{
// Increase the counter, but don't save into unsynced_writes yet (can't sync until the write is finished)
@@ -415,16 +419,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
write_iodepth++;
// Got SQEs. Prepare previous journal sector write if required
auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
if (immediate_commit == IMMEDIATE_NONE)
if (immediate_commit == IMMEDIATE_NONE &&
!journal.entry_fits(sizeof(journal_entry_small_write) + dyn_size))
{
if (!journal.entry_fits(sizeof(journal_entry_small_write) + dyn_size))
{
prepare_journal_sector_write(journal.cur_sector, op);
}
else
{
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
}
prepare_journal_sector_write(journal.cur_sector, op);
}
// Then pre-fill journal entry
journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
@@ -750,17 +748,11 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
}
write_iodepth++;
// Prepare journal sector write
if (immediate_commit == IMMEDIATE_NONE)
if (immediate_commit == IMMEDIATE_NONE &&
(dsk.journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
journal.sector_info[journal.cur_sector].dirty)
{
if ((dsk.journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
journal.sector_info[journal.cur_sector].dirty)
{
prepare_journal_sector_write(journal.cur_sector, op);
}
else
{
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
}
prepare_journal_sector_write(journal.cur_sector, op);
}
// Pre-fill journal entry
journal_entry_del *je = (journal_entry_del*)prefill_single_journal_entry(

View File

@@ -357,6 +357,8 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
p->ringloop = NULL;
}
// Print result
fflush(stderr);
fflush(stdout);
if (p->json_output && !result.data.is_null())
{
printf("%s\n", result.data.dump().c_str());

View File

@@ -77,8 +77,8 @@ struct alloc_osd_t
std::string key = base64_decode(kv["key"].string_value());
osd_num_t cur_osd;
char null_byte = 0;
sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%lu%c", &cur_osd, &null_byte);
if (!cur_osd || null_byte != 0)
int scanned = sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%lu%c", &cur_osd, &null_byte);
if (scanned != 1 || !cur_osd)
{
fprintf(stderr, "Invalid key in etcd: %s\n", key.c_str());
continue;

View File

@@ -67,8 +67,8 @@ resume_1:
// pool ID
pool_id_t pool_id;
char null_byte = 0;
sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
@@ -82,8 +82,8 @@ resume_1:
// osd ID
osd_num_t osd_num;
char null_byte = 0;
sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%lu%c", &osd_num, &null_byte);
if (!osd_num || osd_num >= POOL_ID_MAX || null_byte != 0)
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%lu%c", &osd_num, &null_byte);
if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;

View File

@@ -56,14 +56,15 @@ struct image_lister_t
{
continue;
}
auto & pool_cfg = parent->cli->st_cli.pool_config.at(INODE_POOL(ic.second.num));
auto pool_it = parent->cli->st_cli.pool_config.find(INODE_POOL(ic.second.num));
bool good_pool = pool_it != parent->cli->st_cli.pool_config.end();
auto item = json11::Json::object {
{ "name", ic.second.name },
{ "size", ic.second.size },
{ "used_size", 0 },
{ "readonly", ic.second.readonly },
{ "pool_id", (uint64_t)INODE_POOL(ic.second.num) },
{ "pool_name", pool_cfg.name },
{ "pool_name", good_pool ? pool_it->second.name : "? (ID:"+std::to_string(INODE_POOL(ic.second.num))+")" },
{ "inode_num", INODE_NO_POOL(ic.second.num) },
{ "inode_id", ic.second.num },
};
@@ -132,8 +133,8 @@ resume_1:
// pool ID
pool_id_t pool_id;
char null_byte = 0;
sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
@@ -148,9 +149,9 @@ resume_1:
pool_id_t pool_id;
inode_t only_inode_num;
char null_byte = 0;
sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
"/inode/stats/%u/%lu%c", &pool_id, &only_inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0 || null_byte != 0)
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
@@ -173,7 +174,7 @@ resume_1:
{ "size", 0 },
{ "readonly", false },
{ "pool_id", (uint64_t)INODE_POOL(inode_num) },
{ "pool_name", pool_it == parent->cli->st_cli.pool_config.end()
{ "pool_name", pool_it != parent->cli->st_cli.pool_config.end()
? (pool_it->second.name == "" ? "<Unnamed>" : pool_it->second.name) : "?" },
{ "inode_num", INODE_NO_POOL(inode_num) },
{ "inode_id", inode_num },
@@ -247,6 +248,8 @@ resume_1:
if (state == 1)
goto resume_1;
get_list();
if (state == 100)
return;
if (show_stats)
{
resume_1:
@@ -269,7 +272,7 @@ resume_1:
{ "key", "name" },
{ "title", "NAME" },
});
if (!list_pool_id)
if (list_pool_name == "")
{
cols.push_back(json11::Json::object{
{ "key", "pool_name" },
@@ -376,16 +379,18 @@ resume_1:
std::string print_table(json11::Json items, json11::Json header, bool use_esc)
{
int header_sizes[header.array_items().size()];
std::vector<int> sizes;
for (int i = 0; i < header.array_items().size(); i++)
{
sizes.push_back(header[i]["title"].string_value().length());
header_sizes[i] = utf8_length(header[i]["title"].string_value());
sizes.push_back(header_sizes[i]);
}
for (auto & item: items.array_items())
{
for (int i = 0; i < header.array_items().size(); i++)
{
int l = item[header[i]["key"].string_value()].as_string().length();
int l = utf8_length(item[header[i]["key"].string_value()].as_string());
sizes[i] = sizes[i] < l ? l : sizes[i];
}
}
@@ -397,7 +402,7 @@ std::string print_table(json11::Json items, json11::Json header, bool use_esc)
// Separator
str += " ";
}
int pad = sizes[i]-header[i]["title"].string_value().length();
int pad = sizes[i]-header_sizes[i];
if (header[i]["right"].bool_value())
{
// Align right
@@ -425,7 +430,7 @@ std::string print_table(json11::Json items, json11::Json header, bool use_esc)
// Separator
str += " ";
}
int pad = sizes[i] - item[header[i]["key"].string_value()].as_string().length();
int pad = sizes[i] - utf8_length(item[header[i]["key"].string_value()].as_string());
if (header[i]["right"].bool_value())
{
// Align right

View File

@@ -13,7 +13,7 @@ struct image_changer_t
std::string image_name;
std::string new_name;
uint64_t new_size = 0;
bool force_size = false;
bool force_size = false, inc_size = false;
bool set_readonly = false, set_readwrite = false, force = false;
// interval between fsyncs
int fsync_interval = 128;
@@ -81,14 +81,14 @@ struct image_changer_t
}
if ((!set_readwrite || !cfg.readonly) &&
(!set_readonly || cfg.readonly) &&
(!new_size && !force_size || cfg.size == new_size) &&
(!new_size && !force_size || cfg.size == new_size || cfg.size >= new_size && inc_size) &&
(new_name == "" || new_name == image_name))
{
result = (cli_result_t){ .text = "No change" };
state = 100;
return;
}
if (new_size != 0 || force_size)
if ((new_size != 0 || force_size) && (cfg.size < new_size || !inc_size))
{
if (cfg.size >= new_size)
{
@@ -233,6 +233,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_modify(json11::Json cfg)
changer->new_name = cfg["rename"].string_value();
changer->new_size = parse_size(cfg["resize"].as_string());
changer->force_size = cfg["force_size"].bool_value();
changer->inc_size = cfg["inc_size"].bool_value();
changer->force = cfg["force"].bool_value();
changer->set_readonly = cfg["readonly"].bool_value();
changer->set_readwrite = cfg["readwrite"].bool_value();

View File

@@ -384,8 +384,8 @@ resume_100:
pool_id_t pool_id = 0;
inode_t inode = 0;
char null_byte = 0;
sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
if (!inode || null_byte != 0)
int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
if (scanned != 2 || !inode)
{
result = (cli_result_t){ .err = EIO, .text = "Bad key returned from etcd: "+kv.key };
state = 100;

View File

@@ -132,8 +132,8 @@ resume_2:
auto kv = parent->cli->st_cli.parse_etcd_kv(osd_stats[i]);
osd_num_t stat_osd_num = 0;
char null_byte = 0;
sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.size(), "/osd/stats/%lu%c", &stat_osd_num, &null_byte);
if (!stat_osd_num || null_byte != 0)
int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.size(), "/osd/stats/%lu%c", &stat_osd_num, &null_byte);
if (scanned != 1 || !stat_osd_num)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;

View File

@@ -565,11 +565,11 @@ void cluster_client_t::copy_write(cluster_op_t *op, std::map<object_id, cluster_
while (len > 0)
{
uint64_t new_len = 0;
if (dirty_it == dirty_buffers.end())
if (dirty_it == dirty_buffers.end() || dirty_it->first.inode != op->inode)
{
new_len = len;
}
else if (dirty_it->first.inode != op->inode || dirty_it->first.stripe > pos)
else if (dirty_it->first.stripe > pos)
{
new_len = dirty_it->first.stripe - pos;
if (new_len > len)
@@ -881,6 +881,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
? (stripe + pg_block_size) : (op->offset + op->len);
op->parts[i].iov.reset();
op->parts[i].flags = 0;
if (op->cur_inode != op->inode)
{
// Read remaining parts from upper layers
@@ -918,7 +919,10 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
else
add_iov(cur-prev, skip_prev, op, iov_idx, iov_pos, op->parts[i].iov, scrap_buffer, scrap_buffer_size);
if (end == begin)
{
op->done_count++;
op->parts[i].flags = PART_DONE;
}
}
else if (op->opcode != OSD_OP_READ_BITMAP && op->opcode != OSD_OP_READ_CHAIN_BITMAP && op->opcode != OSD_OP_DELETE)
{
@@ -930,7 +934,6 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
op->opcode == OSD_OP_DELETE ? 0 : (uint32_t)(end - begin);
op->parts[i].pg_num = pg_num;
op->parts[i].osd_num = 0;
op->parts[i].flags = 0;
i++;
}
}

View File

@@ -74,7 +74,7 @@ static const char *help_text =
" If it doesn't succeed it issues a warning in the system log.\n"
" \n"
" You can also pass other OSD options here as arguments and they'll be persisted\n"
" in the superblock: cached_read_data, cached_read_meta, cached_read_journal,\n"
" in the superblock: cached_io_data, cached_io_meta, cached_io_journal,\n"
" inmemory_metadata, inmemory_journal, max_write_iodepth,\n"
" min_flusher_count, max_flusher_count, journal_sector_buffer_count,\n"
" journal_no_same_sector_overwrites, throttle_small_writes, throttle_target_iops,\n"

View File

@@ -8,9 +8,9 @@
int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_hdd)
{
static const char *allow_additional_params[] = {
"cached_read_data",
"cached_read_meta",
"cached_read_journal",
"cached_io_data",
"cached_io_meta",
"cached_io_journal",
"max_write_iodepth",
"max_write_iodepth",
"min_flusher_count",
@@ -119,7 +119,7 @@ int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_
try
{
dsk.parse_config(options);
dsk.cached_read_data = dsk.cached_read_meta = dsk.cached_read_journal = false;
dsk.cached_io_data = dsk.cached_io_meta = dsk.cached_io_journal = false;
dsk.open_data();
dsk.open_meta();
dsk.open_journal();
@@ -151,7 +151,7 @@ int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_
for (int i = 0; i < sizeof(allow_additional_params)/sizeof(allow_additional_params[0]); i++)
{
auto it = options.find(allow_additional_params[i]);
if (it != options.end())
if (it != options.end() && it->second != "")
{
sb[it->first] = it->second;
}
@@ -483,7 +483,7 @@ int disk_tool_t::get_meta_partition(std::vector<vitastor_dev_info_t> & ssds, std
{
blockstore_disk_t dsk;
dsk.parse_config(options);
dsk.cached_read_data = dsk.cached_read_meta = dsk.cached_read_journal = false;
dsk.cached_io_data = dsk.cached_io_meta = dsk.cached_io_journal = false;
dsk.open_data();
dsk.open_meta();
dsk.open_journal();
@@ -626,7 +626,7 @@ int disk_tool_t::prepare(std::vector<std::string> devices)
}
}
// Treat all disks as SSDs if not in the hybrid mode
prepare_one(options, hybrid && dev.is_hdd ? 1 : 0);
prepare_one(options, dev.is_hdd ? 1 : 0);
if (hybrid)
{
options.erase("journal_device");

View File

@@ -91,7 +91,7 @@ int disk_tool_t::resize_parse_params()
try
{
dsk.parse_config(options);
dsk.cached_read_data = dsk.cached_read_meta = dsk.cached_read_journal = false;
dsk.cached_io_data = dsk.cached_io_meta = dsk.cached_io_journal = false;
dsk.open_data();
dsk.open_meta();
dsk.open_journal();

View File

@@ -264,6 +264,7 @@ int write_zero(int fd, uint64_t offset, uint64_t size)
{
uint64_t buf_len = 1024*1024;
void *zero_buf = memalign_or_die(MEM_ALIGNMENT, buf_len);
memset(zero_buf, 0, buf_len);
ssize_t r;
while (size > 0)
{

View File

@@ -684,8 +684,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
// ID
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
int scanned = sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
@@ -829,8 +829,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
int scanned = sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
@@ -838,8 +838,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
for (auto & pg_item: pool_item.second.object_items())
{
pg_num_t pg_num = 0;
sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
if (!pg_num || null_byte != 0)
int scanned = sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
if (scanned != 1 || !pg_num)
{
fprintf(stderr, "Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
continue;
@@ -889,8 +889,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
int scanned = sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !pg_num)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
@@ -944,8 +944,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
int scanned = sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !pg_num)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
@@ -1015,8 +1015,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
uint64_t pool_id = 0;
uint64_t inode_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
int scanned = sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)))
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}

View File

@@ -242,6 +242,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
op.sec_rw.version = UINT64_MAX; // last unstable
op.sec_rw.offset = io->offset % bsd->block_size;
op.sec_rw.len = io->xfer_buflen;
op.sec_rw.attr_len = 0;
}
else
{
@@ -263,6 +264,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
op.sec_rw.version = 0; // assign automatically
op.sec_rw.offset = io->offset % bsd->block_size;
op.sec_rw.len = io->xfer_buflen;
op.sec_rw.attr_len = 0;
}
else
{

View File

@@ -19,12 +19,12 @@ std::string msgr_rdma_address_t::to_string()
bool msgr_rdma_address_t::from_string(const char *str, msgr_rdma_address_t *dest)
{
uint64_t* gid = (uint64_t*)&dest->gid;
int n = sscanf(
int scanned = sscanf(
str, "%hx:%x:%x:%16lx%16lx", &dest->lid, &dest->qpn, &dest->psn, gid, gid+1
);
gid[0] = be64toh(gid[0]);
gid[1] = be64toh(gid[1]);
return n == 5;
return scanned == 5;
}
msgr_rdma_context_t::~msgr_rdma_context_t()

View File

@@ -190,7 +190,15 @@ static int nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
{
if (handle == "roothandle" || self->parent->dir_by_hash.find(handle) != self->parent->dir_by_hash.end())
{
*reply = (SETATTR3res){ .status = NFS3ERR_ISDIR };
if (args->new_attributes.size.set_it)
{
*reply = (SETATTR3res){ .status = NFS3ERR_ISDIR };
}
else
{
// Silently ignore mode, uid, gid, atime, mtime changes
*reply = (SETATTR3res){ .status = NFS3_OK };
}
}
else
{
@@ -358,7 +366,6 @@ static int nfs3_read_proc(void *opaque, rpc_op_t *rop)
}
static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t new_size, uint64_t offset, uint64_t count, void *buf);
static void nfs_do_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t offset, uint64_t count, void *buf);
static int nfs3_write_proc(void *opaque, rpc_op_t *rop)
{
@@ -392,7 +399,6 @@ static int nfs3_write_proc(void *opaque, rpc_op_t *rop)
.resok = (WRITE3resok){
//.file_wcc = ...,
.count = (unsigned)count,
.committed = args->stable,
},
};
if ((args->offset % alignment) != 0 || (count % alignment) != 0)
@@ -436,42 +442,101 @@ static int nfs3_write_proc(void *opaque, rpc_op_t *rop)
return 1;
}
static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t new_size, uint64_t offset, uint64_t count, void *buf)
static void complete_extend_write(nfs_client_t *self, rpc_op_t *rop, inode_t inode, int res)
{
// Check if we have to resize the inode before writing
WRITE3args *args = (WRITE3args*)rop->request;
WRITE3res *reply = (WRITE3res*)rop->reply;
if (res < 0)
{
*reply = (WRITE3res){ .status = vitastor_nfs_map_err(res) };
rpc_queue_reply(rop);
return;
}
bool imm = self->parent->cli->get_immediate_commit(inode);
reply->resok.committed = args->stable != UNSTABLE || imm ? FILE_SYNC : UNSTABLE;
*(uint64_t*)reply->resok.verf = self->parent->server_id;
if (args->stable != UNSTABLE && !imm)
{
// Client requested a stable write. Add an fsync
auto op = new cluster_op_t;
op->opcode = OSD_OP_SYNC;
op->callback = [rop](cluster_op_t *op)
{
if (op->retval != 0)
{
WRITE3res *reply = (WRITE3res*)rop->reply;
*reply = (WRITE3res){ .status = vitastor_nfs_map_err(-op->retval) };
}
delete op;
rpc_queue_reply(rop);
};
self->parent->cli->execute(op);
}
else
{
rpc_queue_reply(rop);
}
}
static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size, int err)
{
auto ext_it = self->extend_writes.lower_bound((extend_size_t){ .inode = inode, .new_size = 0 });
while (ext_it != self->extend_writes.end() &&
ext_it->first.inode == inode &&
ext_it->first.new_size <= new_size)
{
ext_it->second.resize_res = err;
if (ext_it->second.write_res <= 0)
{
complete_extend_write(self, ext_it->second.rop, inode, ext_it->second.write_res < 0
? ext_it->second.write_res : ext_it->second.resize_res);
self->extend_writes.erase(ext_it++);
}
else
ext_it++;
}
}
static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size)
{
// Send an extend request
auto & ext = self->extends[inode];
ext.cur_extend = new_size;
auto inode_it = self->parent->cli->st_cli.inode_config.find(inode);
if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
inode_it->second.size < new_size)
{
self->parent->cmd->loop_and_wait(self->parent->cmd->start_modify(json11::Json::object {
// FIXME: Resizing by ID is probably more correct
{ "image", inode_it->second.name },
{ "resize", new_size },
{ "inc_size", true },
{ "force_size", true },
}), [=](const cli_result_t & r)
{
auto & ext = self->extends[inode];
if (r.err)
{
if (r.err == EAGAIN)
{
// Multiple concurrent resize requests received, try to repeat
nfs_resize_write(self, rop, inode, new_size, offset, count, buf);
return;
}
WRITE3res *reply = (WRITE3res*)rop->reply;
*reply = (WRITE3res){ .status = vitastor_nfs_map_err(r.err) };
rpc_queue_reply(rop);
fprintf(stderr, "Error extending inode %lu to %lu bytes: %s\n", inode, new_size, r.text.c_str());
}
if (r.err == EAGAIN || ext.next_extend > ext.cur_extend)
{
// Multiple concurrent resize requests received, try to repeat
extend_inode(self, inode, ext.next_extend > ext.cur_extend ? ext.next_extend : ext.cur_extend);
return;
}
nfs_do_write(self, rop, inode, offset, count, buf);
ext.cur_extend = ext.next_extend = 0;
complete_extend_inode(self, inode, new_size, r.err);
});
}
else
{
nfs_do_write(self, rop, inode, offset, count, buf);
complete_extend_inode(self, inode, new_size, 0);
}
}
static void nfs_do_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t offset, uint64_t count, void *buf)
static void nfs_do_write(nfs_client_t *self, std::multimap<extend_size_t, extend_write_t>::iterator ewr_it,
rpc_op_t *rop, uint64_t inode, uint64_t offset, uint64_t count, void *buf)
{
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_WRITE;
@@ -479,48 +544,61 @@ static void nfs_do_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint
op->offset = offset;
op->len = count;
op->iov.push_back(buf, count);
op->callback = [self, rop](cluster_op_t *op)
op->callback = [self, ewr_it, rop](cluster_op_t *op)
{
uint64_t inode = op->inode;
WRITE3args *args = (WRITE3args*)rop->request;
WRITE3res *reply = (WRITE3res*)rop->reply;
if (op->retval != op->len)
auto inode = op->inode;
int write_res = op->retval < 0 ? op->retval : (op->retval != op->len ? -ERANGE : 0);
if (ewr_it == self->extend_writes.end())
{
*reply = (WRITE3res){ .status = vitastor_nfs_map_err(-op->retval) };
delete op;
rpc_queue_reply(rop);
complete_extend_write(self, rop, inode, write_res);
}
else
{
*(uint64_t*)reply->resok.verf = self->parent->server_id;
delete op;
if (args->stable != UNSTABLE &&
!self->parent->cli->get_immediate_commit(inode))
ewr_it->second.write_res = write_res;
if (ewr_it->second.resize_res <= 0)
{
// Client requested a stable write. Add an fsync
op = new cluster_op_t;
op->opcode = OSD_OP_SYNC;
op->callback = [rop](cluster_op_t *op)
{
if (op->retval != 0)
{
WRITE3res *reply = (WRITE3res*)rop->reply;
*reply = (WRITE3res){ .status = vitastor_nfs_map_err(-op->retval) };
}
delete op;
rpc_queue_reply(rop);
};
self->parent->cli->execute(op);
}
else
{
rpc_queue_reply(rop);
complete_extend_write(self, rop, inode, write_res < 0 ? write_res : ewr_it->second.resize_res);
self->extend_writes.erase(ewr_it);
}
}
};
self->parent->cli->execute(op);
}
static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t new_size, uint64_t offset, uint64_t count, void *buf)
{
// Check if we have to resize the inode during write
auto inode_it = self->parent->cli->st_cli.inode_config.find(inode);
if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
inode_it->second.size < new_size)
{
auto ewr_it = self->extend_writes.emplace((extend_size_t){
.inode = inode,
.new_size = new_size,
}, (extend_write_t){
.rop = rop,
.resize_res = 1,
.write_res = 1,
});
auto & ext = self->extends[inode];
if (ext.cur_extend > 0)
{
// Already resizing, just wait
if (ext.next_extend < new_size)
ext.next_extend = new_size;
}
else
{
extend_inode(self, inode, new_size);
}
nfs_do_write(self, ewr_it, rop, inode, offset, count, buf);
}
else
{
nfs_do_write(self, self->extend_writes.end(), rop, inode, offset, count, buf);
}
}
static int nfs3_create_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
@@ -881,6 +959,27 @@ static int nfs3_link_proc(void *opaque, rpc_op_t *rop)
return 0;
}
static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop,
std::map<std::string, nfs_dir_t>::iterator dir_id_it, struct entryplus3 *entry, bool is_plus)
{
if (dir_id_it == self->parent->dir_info.end())
{
return;
}
entry->fileid = dir_id_it->second.id;
if (is_plus)
{
entry->name_attributes = (post_op_attr){
.attributes_follow = 1,
.attributes = get_dir_attributes(self, dir_id_it->first),
};
entry->name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(rop->xdrs, "S"+base64_encode(sha256(dir_id_it->first))),
};
}
}
static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{
nfs_client_t *self = (nfs_client_t*)opaque;
@@ -958,17 +1057,17 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
continue;
std::string subname = dir_id_it->first.substr(prefix.size());
// for directories, fileid changes when the user restarts proxy
entries[subname].fileid = dir_id_it->second.id;
if (is_plus)
fill_dir_entry(self, rop, dir_id_it, &entries[subname], is_plus);
}
// Add . and ..
{
auto dir_id_it = self->parent->dir_info.find(dir);
fill_dir_entry(self, rop, dir_id_it, &entries["."], is_plus);
auto sl = dir.rfind("/");
if (sl != std::string::npos)
{
entries[subname].name_attributes = (post_op_attr){
.attributes_follow = 1,
.attributes = get_dir_attributes(self, dir_id_it->first),
};
entries[subname].name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(rop->xdrs, "S"+base64_encode(sha256(dir_id_it->first))),
};
auto dir_id_it = self->parent->dir_info.find(dir.substr(0, sl));
fill_dir_entry(self, rop, dir_id_it, &entries[".."], is_plus);
}
}
// Offset results by the continuation cookie (equal to index in the listing)
@@ -1193,10 +1292,11 @@ static int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
cluster_op_t *op = new cluster_op_t;
// fsync. we don't know how to fsync a single inode, so just fsync everything
op->opcode = OSD_OP_SYNC;
op->callback = [rop](cluster_op_t *op)
op->callback = [self, rop](cluster_op_t *op)
{
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
*(uint64_t*)reply->resok.verf = self->parent->server_id;
rpc_queue_reply(rop);
};
self->parent->cli->execute(op);

View File

@@ -346,8 +346,8 @@ void nfs_proxy_t::parse_stats(etcd_kv_t & kv)
pool_id_t pool_id = 0;
inode_t inode_num = 0;
char null_byte = 0;
sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || null_byte != 0)
int scanned = sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !inode_num)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
@@ -360,8 +360,8 @@ void nfs_proxy_t::parse_stats(etcd_kv_t & kv)
{
pool_id_t pool_id = 0;
char null_byte = 0;
sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+12, "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX)
int scanned = sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+12, "%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}

View File

@@ -86,6 +86,28 @@ struct rpc_free_buffer_t
unsigned size;
};
struct extend_size_t
{
inode_t inode;
uint64_t new_size;
};
inline bool operator < (const extend_size_t &a, const extend_size_t &b)
{
return a.inode < b.inode || a.inode == b.inode && a.new_size < b.new_size;
}
struct extend_write_t
{
rpc_op_t *rop;
int resize_res, write_res; // 1 = started, 0 = completed OK, -errno = completed with error
};
struct extend_inode_t
{
uint64_t cur_extend = 0, next_extend = 0;
};
class nfs_client_t
{
public:
@@ -100,6 +122,8 @@ public:
rpc_cur_buffer_t cur_buffer = { 0 };
std::map<uint8_t*, rpc_used_buffer_t> used_buffers;
std::vector<rpc_free_buffer_t> free_buffers;
std::map<inode_t, extend_inode_t> extends;
std::multimap<extend_size_t, extend_write_t> extend_writes;
iovec read_iov;
msghdr read_msg = { 0 };

View File

@@ -649,7 +649,7 @@ void osd_t::apply_pg_config()
auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;
// Check pool block size and bitmap granularity
if (this->bs_block_size != pool_item.second.data_block_size ||
if (take && this->bs_block_size != pool_item.second.data_block_size ||
this->bs_bitmap_granularity != pool_item.second.bitmap_granularity)
{
if (!warned_block_size)
@@ -967,8 +967,8 @@ void osd_t::report_pg_states()
pool_id_t pool_id = 0;
pg_num_t pg_num = 0;
char null_byte = 0;
sscanf(kv.key.c_str() + st_cli.etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (null_byte == 0)
int scanned = sscanf(kv.key.c_str() + st_cli.etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (scanned == 2)
{
auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
if (pg_it != pgs.end() && pg_it->second.state != PG_OFFLINE && pg_it->second.state != PG_STARTING &&

View File

@@ -197,7 +197,11 @@ static void vitastor_parse_filename(const char *filename, QDict *options, Error
!strcmp(name, "rdma-mtu"))
{
unsigned long long num_val;
#if QEMU_VERSION_MAJOR < 8 || QEMU_VERSION_MAJOR == 8 && QEMU_VERSION_MINOR < 1
if (parse_uint_full(value, &num_val, 0))
#else
if (parse_uint_full(value, 0, &num_val))
#endif
{
error_setg(errp, "Illegal %s: %s", name, value);
goto out;
@@ -234,15 +238,13 @@ out:
return;
}
#if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
static void vitastor_uring_handler(void *opaque)
{
VitastorClient *client = (VitastorClient*)opaque;
qemu_mutex_lock(&client->mutex);
client->bh_uring_scheduled = 0;
do
{
vitastor_c_uring_handle_events(client->proxy);
} while (vitastor_c_uring_has_work(client->proxy));
vitastor_c_uring_handle_events(client->proxy);
qemu_mutex_unlock(&client->mutex);
}
@@ -266,20 +268,23 @@ static void vitastor_schedule_uring_handler(VitastorClient *client)
replay_bh_schedule_oneshot_event(client->ctx, vitastor_uring_handler, opaque);
#elif QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 8
aio_bh_schedule_oneshot(client->ctx, vitastor_uring_handler, opaque);
#elif QEMU_VERSION_MAJOR >= 2
#else
VitastorBH *vbh = (VitastorBH*)malloc(sizeof(VitastorBH));
vbh->cli = client;
#if QEMU_VERSION_MAJOR >= 2
vbh->bh = aio_bh_new(bdrv_get_aio_context(task->bs), vitastor_bh_uring_handler, vbh);
qemu_bh_schedule(vbh->bh);
#else
client->bh_uring_scheduled = 0;
do
{
vitastor_c_uring_handle_events(client->proxy);
} while (vitastor_c_uring_has_work(client->proxy));
vbh->bh = qemu_bh_new(vitastor_bh_uring_handler, vbh);
#endif
qemu_bh_schedule(vbh->bh);
#endif
}
}
#else
static void vitastor_schedule_uring_handler(VitastorClient *client)
{
}
#endif
static void coroutine_fn vitastor_co_get_metadata(VitastorRPC *task)
{
@@ -319,7 +324,7 @@ static void vitastor_aio_fd_write(void *fddv)
static void universal_aio_set_fd_handler(AioContext *ctx, int fd, IOHandler *fd_read, IOHandler *fd_write, void *opaque)
{
aio_set_fd_handler(ctx, fd,
#if QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 5 || QEMU_VERSION_MAJOR >= 3
#if QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 5 || QEMU_VERSION_MAJOR >= 3 && (QEMU_VERSION_MAJOR < 8 || QEMU_VERSION_MAJOR == 8 && QEMU_VERSION_MINOR < 1)
0 /*is_external*/,
#endif
fd_read,
@@ -406,20 +411,16 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
vitastor_aio_set_fd_handler, client, client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
#else
client->proxy = vitastor_c_create_uring(
client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
#endif
if (!client->proxy)
{
fprintf(stderr, "vitastor: failed to create io_uring: %s - I/O will be slower\n", strerror(errno));
client->uring_eventfd = -1;
#endif
client->proxy = vitastor_c_create_qemu(
vitastor_aio_set_fd_handler, client, client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
#if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
}
else
{
@@ -433,6 +434,7 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
}
universal_aio_set_fd_handler(client->ctx, client->uring_eventfd, vitastor_uring_handler, NULL, client);
}
#endif
image = client->image = g_strdup(qdict_get_try_str(options, "image"));
client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
// Get image metadata (size and readonly flag) or just wait until the client is ready
@@ -491,6 +493,10 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
return -1;
}
bs->total_sectors = client->size / BDRV_SECTOR_SIZE;
#if QEMU_VERSION_MAJOR > 5 || QEMU_VERSION_MAJOR == 5 && QEMU_VERSION_MINOR >= 1
/* When extending regular files, we get zeros from the OS */
bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
#endif
//client->aio_context = bdrv_get_aio_context(bs);
qdict_del(options, "use-rdma");
qdict_del(options, "rdma-mtu");
@@ -587,7 +593,11 @@ static int coroutine_fn vitastor_co_truncate(BlockDriverState *bs, int64_t offse
}
// TODO: Resize inode to <offset> bytes
client->size = offset / BDRV_SECTOR_SIZE;
#if QEMU_VERSION_MAJOR >= 4
client->size = exact || client->size < offset ? offset : client->size;
#else
client->size = offset;
#endif
return 0;
}
@@ -664,7 +674,8 @@ static void vitastor_co_generic_cb(void *opaque, long retval)
task->bh = aio_bh_new(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#else
vitastor_co_generic_bh_cb(opaque);
task->bh = qemu_bh_new(vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#endif
}
@@ -741,7 +752,6 @@ static void vitastor_co_read_bitmap_cb(void *opaque, long retval, uint8_t *bitma
VitastorRPC *task = opaque;
VitastorClient *client = task->bs->opaque;
task->ret = retval;
task->complete = 1;
if (retval >= 0)
{
task->bitmap = bitmap;
@@ -753,15 +763,17 @@ static void vitastor_co_read_bitmap_cb(void *opaque, long retval, uint8_t *bitma
client->last_bitmap = bitmap;
}
}
if (qemu_coroutine_self() != task->co)
{
#if QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR > 8
aio_co_wake(task->co);
#if QEMU_VERSION_MAJOR > 4 || QEMU_VERSION_MAJOR == 4 && QEMU_VERSION_MINOR >= 2
replay_bh_schedule_oneshot_event(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
#elif QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 8
aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
#elif QEMU_VERSION_MAJOR >= 2
task->bh = aio_bh_new(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#else
qemu_coroutine_enter(task->co, NULL);
qemu_aio_release(task);
task->bh = qemu_bh_new(vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#endif
}
}
static int coroutine_fn vitastor_co_block_status(

View File

@@ -66,6 +66,16 @@ void ring_loop_t::unregister_consumer(ring_consumer_t *consumer)
void ring_loop_t::loop()
{
if (ring_eventfd >= 0)
{
// Reset eventfd counter
uint64_t ctr = 0;
int r = read(ring_eventfd, &ctr, 8);
if (r < 0 && errno != EAGAIN && errno != EINTR)
{
fprintf(stderr, "Error resetting eventfd: %s\n", strerror(errno));
}
}
struct io_uring_cqe *cqe;
while (!io_uring_peek_cqe(&ring, &cqe))
{
@@ -84,7 +94,7 @@ void ring_loop_t::loop()
}
else
{
printf("Warning: empty callback in SQE\n");
fprintf(stderr, "Warning: empty callback in SQE\n");
free_ring_data[free_ring_data_ptr++] = d - ring_datas;
}
io_uring_cqe_seen(&ring, cqe);

View File

@@ -308,3 +308,19 @@ std::string str_repeat(const std::string & str, int times)
r += str;
return r;
}
size_t utf8_length(const std::string & s)
{
size_t len = 0;
for (size_t i = 0; i < s.size(); i++)
len += (s[i] & 0xC0) != 0x80;
return len;
}
size_t utf8_length(const char *s)
{
size_t len = 0;
for (; *s; s++)
len += (*s & 0xC0) != 0x80;
return len;
}

View File

@@ -18,3 +18,5 @@ void print_help(const char *help_text, std::string exe_name, std::string cmd, bo
uint64_t parse_time(std::string time_str, bool *ok = NULL);
std::string read_all_fd(int fd);
std::string str_repeat(const std::string & str, int times);
size_t utf8_length(const std::string & s);
size_t utf8_length(const char *s);

View File

@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor
Description: Vitastor client library
Version: 0.9.3
Version: 1.0.0
Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir}

View File

@@ -215,7 +215,7 @@ void vitastor_c_uring_wait_events(vitastor_c *client)
client->ringloop->wait();
}
bool vitastor_c_uring_has_work(vitastor_c *client)
int vitastor_c_uring_has_work(vitastor_c *client)
{
return client->ringloop->has_work();
}

View File

@@ -46,7 +46,7 @@ int vitastor_c_uring_register_eventfd(vitastor_c *client);
void vitastor_c_uring_wait_ready(vitastor_c *client);
void vitastor_c_uring_handle_events(vitastor_c *client);
void vitastor_c_uring_wait_events(vitastor_c *client);
bool vitastor_c_uring_has_work(vitastor_c *client);
int vitastor_c_uring_has_work(vitastor_c *client);
void vitastor_c_read(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len,
struct iovec *iov, int iovcnt, VitastorReadHandler cb, void *opaque);
void vitastor_c_write(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len, uint64_t check_version,

View File

@@ -7,6 +7,7 @@ PG_COUNT=${PG_COUNT:-1}
# OSD_COUNT
SCHEME=${SCHEME:-replicated}
# OSD_ARGS
# OFFSET_ARGS
# PG_SIZE
# PG_MINSIZE
@@ -24,17 +25,31 @@ else
$ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1}'
fi
start_osd()
start_osd_on()
{
local i=$1
local dev=$2
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $NO_SAME $OSD_ARGS --etcd_address $ETCD_URL \
$(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin $OFFSET_ARGS 2>/dev/null) \
$(build/src/vitastor-disk simple-offsets --format options $OFFSET_ARGS $dev $OFFSET_ARGS 2>/dev/null) \
>>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$!
}
if ! type -t osd_dev; then
osd_dev()
{
local i=$1
[[ -f ./testdata/test_osd$i.bin ]] || dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
echo ./testdata/test_osd$i.bin
}
fi
start_osd()
{
start_osd_on $1 $(osd_dev $1)
}
for i in $(seq 1 $OSD_COUNT); do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
start_osd $i
done
@@ -85,7 +100,7 @@ wait_up()
}
if [[ $OSD_COUNT -gt 0 ]]; then
wait_up 60
wait_up 120
fi
try_reweight()

View File

@@ -8,10 +8,7 @@ LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -end_fsync=1 \
-rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10
for i in 4; do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
start_osd $i
done
start_osd 4
sleep 2