Compare commits

10 Commits

Author SHA1 Message Date
1c10430ae1 Release 0.9.4
All checks were successful
Test / test_interrupted_rebalance (push) Successful in 1m54s
Test / test_interrupted_rebalance_imm (push) Successful in 2m4s
Test / test_interrupted_rebalance_ec (push) Successful in 1m40s
Test / test_interrupted_rebalance_ec_imm (push) Successful in 1m25s
Test / test_failure_domain (push) Successful in 15s
Test / test_snapshot (push) Successful in 25s
Test / test_snapshot_ec (push) Successful in 20s
Test / test_minsize_1 (push) Successful in 13s
Test / test_move_reappear (push) Successful in 16s
Test / test_rm (push) Successful in 13s
Test / test_snapshot_chain (push) Successful in 1m56s
Test / test_snapshot_chain_ec (push) Successful in 2m33s
Test / test_snapshot_down (push) Successful in 23s
Test / test_snapshot_down_ec (push) Successful in 22s
Test / test_splitbrain (push) Successful in 16s
Test / test_rebalance_verify (push) Successful in 3m3s
Test / test_rebalance_verify_imm (push) Successful in 3m2s
Test / test_rebalance_verify_ec (push) Successful in 3m13s
Test / test_rebalance_verify_ec_imm (push) Successful in 8m35s
Test / test_write (push) Successful in 33s
Test / test_write_xor (push) Successful in 40s
Test / test_write_no_same (push) Successful in 15s
Test / test_heal_pg_size_2 (push) Successful in 4m25s
Test / test_heal_ec (push) Successful in 3m9s
Test / test_scrub (push) Successful in 1m0s
Test / test_scrub_zero_osd_2 (push) Successful in 46s
Test / test_scrub_xor (push) Successful in 1m1s
Test / test_scrub_pg_size_3 (push) Successful in 1m55s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 1m25s
Test / test_scrub_ec (push) Successful in 52s
- Improve QEMU driver performance by integrating io_uring in it (up to 1.5x total iops improvement)
- Fix QEMU driver deadlocks that started to occur in qemu-img after the iothread fixes
- Fix `vitastor-cli status` reporting more etcds than actually exist (fix etcd address duplication in config on reload)
- Fix `vitastor-cli ls` crashing on inodes in non-existing pools
- Delete old garbage /pool/stats/ keys for non-existing (deleted) pools
- Reduce memory usage of etcds initialized by make-etcd script
- Fix OSDs almost always crashing on etcd restart due to "revisions were compacted" (support reloading state from etcd)
- Fix a crash and a stall that were possible mostly in HDD setups with a small journal and large (512k, 900k) random writes
- Add notes about HDDs to documentation. You are officially allowed to use HDD-only Vitastor with HGST/Toshiba/EXOS :)
2023-07-19 02:50:30 +03:00
dfce91d168 Change git url in docs, correct block/vitastor.c path 2023-07-19 01:02:12 +03:00
332a13ba30 Build patched QEMU against local packages 2023-07-19 00:05:02 +03:00
d0e257ee81 Fix non-existing pool handling in vitastor-cli ls
All checks were successful
Test / test_interrupted_rebalance (push) Successful in 2m6s
Test / test_interrupted_rebalance_imm (push) Successful in 3m11s
Test / test_interrupted_rebalance_ec (push) Successful in 2m6s
Test / test_interrupted_rebalance_ec_imm (push) Successful in 2m20s
Test / test_failure_domain (push) Successful in 22s
Test / test_snapshot (push) Successful in 50s
Test / test_snapshot_ec (push) Successful in 33s
Test / test_minsize_1 (push) Successful in 13s
Test / test_move_reappear (push) Successful in 1m23s
Test / test_rm (push) Successful in 13s
Test / test_snapshot_chain (push) Successful in 2m22s
Test / test_snapshot_chain_ec (push) Successful in 3m6s
Test / test_snapshot_down (push) Successful in 24s
Test / test_snapshot_down_ec (push) Successful in 22s
Test / test_splitbrain (push) Successful in 19s
Test / test_rebalance_verify (push) Successful in 3m28s
Test / test_rebalance_verify_imm (push) Successful in 3m27s
Test / test_rebalance_verify_ec (push) Successful in 9m10s
Test / test_rebalance_verify_ec_imm (push) Successful in 9m29s
Test / test_write (push) Successful in 1m36s
Test / test_write_xor (push) Successful in 2m17s
Test / test_write_no_same (push) Successful in 36s
Test / test_heal_pg_size_2 (push) Successful in 6m27s
Test / test_heal_ec (push) Successful in 5m53s
Test / test_scrub (push) Successful in 44s
Test / test_scrub_zero_osd_2 (push) Successful in 35s
Test / test_scrub_xor (push) Successful in 36s
Test / test_scrub_pg_size_3 (push) Successful in 1m1s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 46s
Test / test_scrub_ec (push) Successful in 36s
2023-07-18 23:52:02 +03:00
004912aac0 Add RPM spec patches for 6.2-el8 and 7.2-el9
Some checks failed
Test / test_interrupted_rebalance (push) Successful in 1m57s
Test / test_interrupted_rebalance_imm (push) Successful in 3m16s
Test / test_interrupted_rebalance_ec (push) Successful in 1m52s
Test / test_interrupted_rebalance_ec_imm (push) Successful in 1m25s
Test / test_failure_domain (push) Failing after 47s
Test / test_snapshot (push) Successful in 40s
Test / test_snapshot_ec (push) Successful in 24s
Test / test_minsize_1 (push) Successful in 16s
Test / test_move_reappear (push) Failing after 52s
Test / test_rm (push) Successful in 19s
Test / test_snapshot_chain (push) Successful in 2m27s
Test / test_snapshot_chain_ec (push) Failing after 3m9s
Test / test_snapshot_down (push) Successful in 22s
Test / test_snapshot_down_ec (push) Successful in 21s
Test / test_splitbrain (push) Successful in 21s
Test / test_rebalance_verify (push) Successful in 3m34s
Test / test_rebalance_verify_imm (push) Successful in 3m32s
Test / test_rebalance_verify_ec (push) Successful in 5m14s
Test / test_rebalance_verify_ec_imm (push) Successful in 5m18s
Test / test_write (push) Successful in 49s
Test / test_write_xor (push) Successful in 58s
Test / test_write_no_same (push) Successful in 13s
Test / test_heal_pg_size_2 (push) Successful in 3m55s
Test / test_heal_ec (push) Failing after 10m39s
Test / test_scrub (push) Successful in 33s
Test / test_scrub_zero_osd_2 (push) Successful in 29s
Test / test_scrub_xor (push) Successful in 30s
Test / test_scrub_pg_size_3 (push) Successful in 1m1s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 44s
Test / test_scrub_ec (push) Successful in 25s
2023-07-18 23:38:14 +03:00
c18e92273e Copy qemu 5.1 -> 5.2 patch for convenience 2023-07-18 23:37:53 +03:00
9815d70ffc It is impossible to use io_uring with older vitastor-client because it does not have vitastor_c_uring_has_work() 2023-07-18 23:37:53 +03:00
4a4627dcab Do not use bool in C library 2023-07-18 23:37:53 +03:00
b963f2fd93 Add QEMU 2.12 patch (basically the same as 3.1) 2023-07-18 23:37:06 +03:00
ba7427020e Fix deadlocks possible in qemu-img after fixing iothread
Deadlock was caused by switching QEMU coroutines directly inside
vitastor_co_read_bitmap_cb() callback. The correct way is to schedule a BH
/BH is a QEMU term for setImmediate() :)/, same as in read and write callbacks.
2023-07-18 23:32:16 +03:00
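
The pattern named in this commit message (queue the continuation instead of resuming it from inside the completion callback) can be shown by analogy in Python's asyncio. This is only a sketch of the general idea, not QEMU code: `on_read_done` and `resume_request` are hypothetical names, and `loop.call_soon()` merely stands in for scheduling a BH.

```python
import asyncio

events = []

def resume_request(tag):
    # Runs later, from the event loop, outside of the completion callback.
    events.append(f"resumed {tag}")

def on_read_done(loop, tag):
    # Completion callback: do not resume the waiter inline, queue it instead
    # (loosely analogous to scheduling a BH or calling setImmediate()).
    events.append(f"callback start {tag}")
    loop.call_soon(resume_request, tag)
    events.append(f"callback end {tag}")

async def main():
    loop = asyncio.get_running_loop()
    on_read_done(loop, "bitmap")
    await asyncio.sleep(0)  # yield once so the queued call can run
    return events

result = asyncio.run(main())
print(result)
```

The callback always finishes before the continuation starts, so the continuation never runs on top of a half-finished callback stack.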
76 changed files with 1397 additions and 3319 deletions

View File

@@ -622,114 +622,6 @@ jobs:
echo ""
done
test_heal_csum_32k_dmj:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 10
run: TEST_NAME=csum_32k_dmj OSD_ARGS="--data_csum_type crc32c --csum_block_size 32k --inmemory_metadata false --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS /root/vitastor/tests/test_heal.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_heal_csum_32k_dj:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 10
run: TEST_NAME=csum_32k_dj OSD_ARGS="--data_csum_type crc32c --csum_block_size 32k --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS /root/vitastor/tests/test_heal.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_heal_csum_32k:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 10
run: TEST_NAME=csum_32k OSD_ARGS="--data_csum_type crc32c --csum_block_size 32k" OFFSET_ARGS=$OSD_ARGS /root/vitastor/tests/test_heal.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_heal_csum_4k_dmj:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 10
run: TEST_NAME=csum_4k_dmj OSD_ARGS="--data_csum_type crc32c --inmemory_metadata false --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS /root/vitastor/tests/test_heal.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_heal_csum_4k_dj:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 10
run: TEST_NAME=csum_4k_dj OSD_ARGS="--data_csum_type crc32c --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS /root/vitastor/tests/test_heal.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_heal_csum_4k:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 10
run: TEST_NAME=csum_4k OSD_ARGS="--data_csum_type crc32c" OFFSET_ARGS=$OSD_ARGS /root/vitastor/tests/test_heal.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_scrub:
runs-on: ubuntu-latest
needs: build

View File

@@ -7,8 +7,7 @@ for my $line (<>)
if ($line =~ /\.\/(test_[^\.]+)/s)
{
chomp $line;
-my $base_name = $1;
-my $test_name = $base_name;
+my $test_name = $1;
my $timeout = 3;
if ($test_name eq 'test_etcd_fail' || $test_name eq 'test_heal' || $test_name eq 'test_add_osd' ||
$test_name eq 'test_interrupted_rebalance' || $test_name eq 'test_rebalance_verify')
@@ -17,12 +16,7 @@ for my $line (<>)
}
while ($line =~ /([^\s=]+)=(\S+)/gs)
{
-if ($1 eq 'TEST_NAME')
-{
-$test_name = $base_name.'_'.$2;
-last;
-}
-elsif ($1 eq 'SCHEME' && $2 eq 'ec')
+if ($1 eq 'SCHEME' && $2 eq 'ec')
{
$test_name .= '_ec';
}
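
As a reading aid, the naming rule left after this change can be sketched in Python. It models only what is visible in the hunk (the base name taken from `./test_*` and the `SCHEME=ec` suffix); any further rules in the truncated part of the script are not covered, and `derive_test_name` is a hypothetical helper name.

```python
import re

def derive_test_name(line):
    """Return the job name for one test invocation line, or None if it runs no test."""
    m = re.search(r"\./(test_[^.]+)", line)
    if m is None:
        return None
    test_name = m.group(1)
    # Scan KEY=VALUE pairs; TEST_NAME is no longer special-cased after this change.
    for key, value in re.findall(r"([^\s=]+)=(\S+)", line):
        if key == "SCHEME" and value == "ec":
            test_name += "_ec"
    return test_name

print(derive_test_name("SCHEME=ec ./test_heal.sh"))
```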

View File

@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)
project(vitastor)
-set(VERSION "0.9.3")
+set(VERSION "0.9.4")
add_subdirectory(src)

View File

@@ -1,4 +1,4 @@
-VERSION ?= v0.9.3
+VERSION ?= v0.9.4
all: build push

View File

@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
-image: vitalif/vitastor-csi:v0.9.3
+image: vitalif/vitastor-csi:v0.9.4
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -116,7 +116,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
-image: vitalif/vitastor-csi:v0.9.3
+image: vitalif/vitastor-csi:v0.9.4
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
-vitastorCSIDriverVersion = "0.9.3"
+vitastorCSIDriverVersion = "0.9.4"
)
// Config struct fills the parameters of request or user input

4
debian/changelog vendored
View File

@@ -1,10 +1,10 @@
-vitastor (0.9.3-1) unstable; urgency=medium
+vitastor (0.9.4-1) unstable; urgency=medium
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300
-vitastor (0.9.3-1) unstable; urgency=medium
+vitastor (0.9.4-1) unstable; urgency=medium
* Implement NFS proxy
* Add documentation

View File

@@ -28,13 +28,19 @@ RUN apt-get --download-only source qemu
ADD patches /root/vitastor/patches
ADD src/qemu_driver.c /root/vitastor/src/qemu_driver.c
+#RUN set -e; \
+# apt-get install -y wget; \
+# wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg; \
+# (echo deb http://vitastor.io/debian $REL main > /etc/apt/sources.list.d/vitastor.list); \
+# (echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
+# apt-get update; \
+# apt-get install -y vitastor-client vitastor-client-dev quilt
RUN set -e; \
-apt-get install -y wget; \
-wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg; \
-(echo deb http://vitastor.io/debian $REL main > /etc/apt/sources.list.d/vitastor.list); \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
+dpkg -i /root/packages/vitastor-$REL/vitastor-client_*.deb /root/packages/vitastor-$REL/vitastor-client-dev_*.deb; \
apt-get update; \
-apt-get install -y vitastor-client vitastor-client-dev quilt; \
+apt-get install -y quilt; \
mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \

View File

@@ -35,8 +35,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
-cp -r /root/vitastor vitastor-0.9.3; \
-cd vitastor-0.9.3; \
+cp -r /root/vitastor vitastor-0.9.4; \
+cd vitastor-0.9.4; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -49,8 +49,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
-tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.9.3.orig.tar.xz vitastor-0.9.3; \
-cd vitastor-0.9.3; \
+tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.9.4.orig.tar.xz vitastor-0.9.4; \
+cd vitastor-0.9.4; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
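
The `head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'` step above pulls the package version out of the changelog header, and `dch -v "$V""$REL"` then appends the release name to it. A rough Python equivalent of the extraction, for illustration only: `changelog_version` and `rebuild_version` are hypothetical helper names, and `bullseye` is just an example value for `$REL`.

```python
import re

def changelog_version(first_line):
    # Same substitution as the perl one-liner: replace the whole line with the
    # text inside the parentheses (a changelog header has exactly one pair).
    return re.sub(r"^.*\((.*?)\).*$", r"\1", first_line.rstrip("\n"))

def rebuild_version(version, rel):
    # Mirrors the "$V""$REL" concatenation passed to dch -v.
    return version + rel

header = "vitastor (0.9.4-1) unstable; urgency=medium\n"
version = changelog_version(header)
print(version, rebuild_version(version, "bullseye"))
```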

View File

@@ -24,8 +24,6 @@ initialization and can't be changed after it without losing data.
- [disable_journal_fsync](#disable_journal_fsync)
- [disable_device_lock](#disable_device_lock)
- [disk_alignment](#disk_alignment)
- [data_csum_type](#data_csum_type)
- [csum_block_size](#csum_block_size)
## data_device
@@ -176,42 +174,3 @@ Intel Optane (probably, not tested yet).
Clients don't need to be aware of disk_alignment, so it's not required to
put a modified value into etcd key /vitastor/config/global.
## data_csum_type
- Type: string
- Default: none
Data checksum type to use. May be "crc32c" or "none". Set to "crc32c" to
enable data checksums.
## csum_block_size
- Type: integer
- Default: 4096
Checksum calculation block size.
Must be equal to or a multiple of [bitmap_granularity](layout-cluster.en.md#bitmap_granularity)
(which is usually 4 KB).
Checksums increase metadata size by 4 bytes per each csum_block_size of data.
Checksums are always a compromise:
1. You either sacrifice +1 GB RAM per 1 TB of data
2. Or you raise csum_block_size, for example, to 32k and sacrifice
50% random write iops due to checksum read-modify-write
3. Or you turn off [inmemory_metadata](osd.en.md#inmemory_metadata) and
sacrifice 50% random read iops due to checksum reads
Option 1 (default) is recommended for all-flash setups because these usually
have enough RAM.
Option 2 is recommended for HDD-only setups. HDD-only setups usually do NOT
have enough RAM for the default 4 KB csum_block_size.
Option 3 is recommended for SSD+HDD setups (because metadata SSDs will handle
extra reads without any performance drop) and also *maybe* for NVMe all-flash
setups when you don't have enough RAM (because NVMe drives have plenty
of read iops to spare). You may also consider enabling
[cached_read_meta](osd.en.md#cached_read_meta) in this case.
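
The "+1 GB RAM per 1 TB" figure follows directly from 4 bytes of checksum per csum_block_size of data. A small worked calculation, assuming binary units (TiB, GiB) and that all checksums are kept in memory; `checksum_metadata_bytes` is a hypothetical helper, not part of Vitastor.

```python
def checksum_metadata_bytes(data_bytes, csum_block_size=4096, csum_size=4):
    # One csum_size-byte checksum per csum_block_size bytes of data.
    return data_bytes // csum_block_size * csum_size

TIB = 1 << 40
GIB = 1 << 30
MIB = 1 << 20

overhead_4k = checksum_metadata_bytes(TIB)          # default 4 KB blocks
overhead_32k = checksum_metadata_bytes(TIB, 32768)  # csum_block_size = 32k

print(overhead_4k // MIB, "MiB per TiB at 4k")
print(overhead_32k // MIB, "MiB per TiB at 32k")
```

Raising csum_block_size from 4k to 32k cuts the memory cost by a factor of 8, which is the trade described in option 2.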

View File

@@ -25,8 +25,6 @@
- [disable_journal_fsync](#disable_journal_fsync)
- [disable_device_lock](#disable_device_lock)
- [disk_alignment](#disk_alignment)
- [data_csum_type](#data_csum_type)
- [csum_block_size](#csum_block_size)
## data_device
@@ -185,52 +183,3 @@ journal_block_size and meta_block_size. However, the only SSDs
Clients don't need to know about disk_alignment, so there is no need to put
the value of this parameter into etcd under /vitastor/config/global.
## data_csum_type
- Type: string
- Default: none
The type of data checksums used by OSDs. May be "crc32c" or "none".
Set to "crc32c" to enable calculation and verification of data checksums.
Keep in mind that, depending on the checksum block size, checksums
either increase memory consumption or reduce performance.
See the description of the [csum_block_size](#csum_block_size) parameter for details.
## csum_block_size
- Type: integer
- Default: 4096
Checksum calculation block size.
Must be equal to or a multiple of [bitmap_granularity](layout-cluster.ru.md#bitmap_granularity)
(which is usually 4 KB).
Checksums increase metadata size by 4 bytes per csum_block_size
of data.
Checksums are always a compromise:
1. You either sacrifice +1 GB of RAM per 1 TB of disk space
2. Or you raise csum_block_size to, say, 32k and sacrifice 50%
of random write speed due to the read-modify-write cycle needed to calculate
new checksums
3. Or you turn off [inmemory_metadata](osd.ru.md#inmemory_metadata) and
sacrifice 50% of random read speed due to reading checksums
from disk
Option 1 (with default settings) is recommended for SSD (all-flash)
clusters because they usually have enough RAM.
Option 2 is recommended for HDD-only clusters (without SSDs
for metadata). Such clusters usually do NOT have enough RAM for a 4 KB
checksum block.
Option 3 is recommended for hybrid (SSD+HDD) clusters because
the metadata SSDs are fast enough to handle the extra reads
without a performance drop. Option 3 *may* also be recommended
for all-flash clusters built on NVMe drives when there is NOT enough RAM,
because NVMe drives have a huge reserve of read performance.
In such cases it may also make sense to enable the
[cached_read_meta](osd.ru.md#cached_read_meta) parameter.

View File

@@ -31,9 +31,6 @@ them, even without restarting by updating configuration in etcd.
- [max_flusher_count](#max_flusher_count)
- [inmemory_metadata](#inmemory_metadata)
- [inmemory_journal](#inmemory_journal)
- [cached_read_data](#cached_read_data)
- [cached_read_meta](#cached_read_meta)
- [cached_read_journal](#cached_read_journal)
- [journal_sector_buffer_count](#journal_sector_buffer_count)
- [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
- [throttle_small_writes](#throttle_small_writes)
@@ -258,46 +255,6 @@ is typically very small because it's sufficient to have 16-32 MB journal
for SSD OSDs. However, in theory it's possible that you'll want to turn it
off for hybrid (HDD+SSD) OSDs with large journals on quick devices.
## cached_read_data
- Type: boolean
- Default: false
Read data through Linux page cache, i.e. use a file descriptor opened without
O_DIRECT for data reads. May improve read performance for frequently accessed
data if it fits in RAM. Memory in page cache is shared by all processes and
not counted in OSD memory consumption.
## cached_read_meta
- Type: boolean
- Default: false
Read metadata through Linux page cache. May be beneficial when checksums
are enabled and [inmemory_metadata](#inmemory_metadata) is disabled, because
in this case metadata blocks are read from disk to verify checksums on every
read request and caching them may reduce this extra read load.
Absolutely pointless to enable with enabled inmemory_metadata because all
metadata is kept in memory anyway, and likely pointless without checksums,
because in that case, metadata blocks are read from disk only during journal
flushing.
If the same device is used for data and metadata, enabling [cached_read_data](#cached_read_data)
also enables this parameter, given that it isn't turned off explicitly.
## cached_read_journal
- Type: boolean
- Default: false
Read buffered data from journal through Linux page cache. Makes no sense
without disabling [inmemory_journal](#inmemory_journal), which, again, is
enabled by default.
If the same device is used for metadata and journal, enabling [cached_read_meta](#cached_read_meta)
also enables this parameter, given that it isn't turned off explicitly.
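
The cascading defaults described for the three cached_read_* parameters can be summarised in a few lines. This is a sketch of the documented rule only, not the OSD's actual code: `resolve_cached_reads` and its arguments are hypothetical, and an option missing from `cfg` is treated as "not set explicitly".

```python
def resolve_cached_reads(cfg, data_device, meta_device, journal_device):
    data = bool(cfg.get("cached_read_data", False))
    meta = cfg.get("cached_read_meta")
    if meta is None:
        # Inherit from cached_read_data only when metadata shares the data device.
        meta = data and meta_device == data_device
    journal = cfg.get("cached_read_journal")
    if journal is None:
        # Inherit from cached_read_meta only when the journal shares the metadata device.
        journal = bool(meta) and journal_device == meta_device
    return {"data": data, "meta": bool(meta), "journal": bool(journal)}

same = resolve_cached_reads({"cached_read_data": True}, "/dev/sda", "/dev/sda", "/dev/sda")
off = resolve_cached_reads({"cached_read_data": True, "cached_read_meta": False},
                           "/dev/sda", "/dev/sda", "/dev/sda")
split = resolve_cached_reads({"cached_read_data": True}, "/dev/sda", "/dev/nvme0n1", "/dev/nvme0n1")
print(same, off, split)
```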
## journal_sector_buffer_count
- Type: integer

View File

@@ -32,9 +32,6 @@
- [max_flusher_count](#max_flusher_count)
- [inmemory_metadata](#inmemory_metadata)
- [inmemory_journal](#inmemory_journal)
- [cached_read_data](#cached_read_data)
- [cached_read_meta](#cached_read_meta)
- [cached_read_journal](#cached_read_journal)
- [journal_sector_buffer_count](#journal_sector_buffer_count)
- [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
- [throttle_small_writes](#throttle_small_writes)
@@ -266,51 +263,6 @@ Flusher is a micro-thread (coroutine) that cop
the parameter may prove useful for hybrid OSDs (HDD+SSD) with large
journals located on a device that is fast compared to the HDD.
## cached_read_data
- Type: boolean
- Default: false
Read data through the Linux page cache, i.e. use a file descriptor opened
without the O_DIRECT flag for data reads. May improve read performance
for frequently accessed data if it fits in RAM.
Page cache memory is shared by all processes in the
system and is not counted in the memory consumption of the OSD process.
## cached_read_meta
- Type: boolean
- Default: false
Read metadata through the Linux page cache. May be useful when
checksums are enabled and [inmemory_metadata](#inmemory_metadata)
is disabled, because in this case metadata blocks are read from disk on every
read request to verify checksums, and caching them may reduce
the extra load on the disk.
It is absolutely pointless to enable this parameter when
inmemory_metadata is enabled (which is the default), and it is also likely
pointless to enable it when checksums are not enabled, because in
that case metadata blocks are read from disk only during journal
flushing.
If the same device is used for data and metadata, enabling
[cached_read_data](#cached_read_data) also enables this parameter,
provided that it isn't turned off explicitly.
## cached_read_journal
- Type: boolean
- Default: false
Read data buffered in the journal through the Linux page cache. Makes no
sense without disabling [inmemory_journal](#inmemory_journal),
which, again, is enabled by default.
If the same device is used for metadata and journal,
enabling [cached_read_meta](#cached_read_meta) also enables this
parameter, provided that it isn't turned off explicitly.
## journal_sector_buffer_count
- Type: integer

View File

@@ -204,77 +204,3 @@
Клиентам не обязательно знать про disk_alignment, так что помещать значение
этого параметра в etcd в /vitastor/config/global не нужно.
- name: data_csum_type
type: string
default: none
info: |
Data checksum type to use. May be "crc32c" or "none". Set to "crc32c" to
enable data checksums.
info_ru: |
Тип используемых OSD контрольных сумм данных. Может быть "crc32c" или "none".
Установите в "crc32c", чтобы включить расчёт и проверку контрольных сумм данных.
Следует понимать, что контрольные суммы в зависимости от размера блока их
расчёта либо увеличивают потребление памяти, либо снижают производительность.
Подробнее смотрите в описании параметра [csum_block_size](#csum_block_size).
- name: csum_block_size
type: int
default: 4096
info: |
Checksum calculation block size.
Must be equal to or a multiple of [bitmap_granularity](layout-cluster.en.md#bitmap_granularity)
(which is usually 4 KB).
Checksums increase metadata size by 4 bytes per each csum_block_size of data.
Checksums are always a compromise:
1. You either sacrifice +1 GB RAM per 1 TB of data
2. Or you raise csum_block_size, for example, to 32k and sacrifice
50% random write iops due to checksum read-modify-write
3. Or you turn off [inmemory_metadata](osd.en.md#inmemory_metadata) and
sacrifice 50% random read iops due to checksum reads
Option 1 (default) is recommended for all-flash setups because these usually
have enough RAM.
Option 2 is recommended for HDD-only setups. HDD-only setups usually do NOT
have enough RAM for the default 4 KB csum_block_size.
Option 3 is recommended for SSD+HDD setups (because metadata SSDs will handle
extra reads without any performance drop) and also *maybe* for NVMe all-flash
setups when you don't have enough RAM (because NVMe drives have plenty
of read iops to spare). You may also consider enabling
[cached_read_meta](osd.en.md#cached_read_meta) in this case.
info_ru: |
Размер блока расчёта контрольных сумм.
Должен быть равен или кратен [bitmap_granularity](layout-cluster.ru.md#bitmap_granularity)
(который обычно равен 4 КБ).
Контрольные суммы увеличивают размер метаданных на 4 байта на каждые
csum_block_size данных.
Контрольные суммы - это всегда компромисс:
1. Вы либо жертвуете потреблением +1 ГБ памяти на 1 ТБ дискового пространства
2. Либо вы повышаете csum_block_size до, скажем, 32k и жертвуете 50%
скорости случайной записи из-за цикла чтения-изменения-записи для расчёта
новых контрольных сумм
3. Либо вы отключаете [inmemory_metadata](osd.ru.md#inmemory_metadata) и
жертвуете 50% скорости случайного чтения из-за чтения контрольных сумм
с диска
Вариант 1 (при настройках по умолчанию) рекомендуется для SSD (All-Flash)
кластеров, потому что памяти в них обычно хватает.
Вариант 2 рекомендуется для кластеров на одних жёстких дисках (без SSD
под метаданные). На 4 кб блок контрольной суммы памяти в таких кластерах
обычно НЕ хватает.
Вариант 3 рекомендуется для гибридных кластеров (SSD+HDD), потому что
скорости SSD под метаданными хватит, чтобы обработать дополнительные чтения
без снижения производительности. Также вариант 3 *может* рекомендоваться
для All-Flash кластеров на основе NVMe-дисков, когда памяти НЕ достаточно,
потому что NVMe-диски имеют огромный запас производительности по чтению.
В таких случаях, возможно, также имеет смысл включать параметр
[cached_read_meta](osd.ru.md#cached_read_meta).

View File

@@ -260,70 +260,6 @@
достаточно 16- или 32-мегабайтного журнала. Однако в теории отключение
параметра может оказаться полезным для гибридных OSD (HDD+SSD) с большими
журналами, расположенными на быстром по сравнению с HDD устройстве.
- name: cached_read_data
type: bool
default: false
info: |
Read data through Linux page cache, i.e. use a file descriptor opened without
O_DIRECT for data reads. May improve read performance for frequently accessed
data if it fits in RAM. Memory in page cache is shared by all processes and
not counted in OSD memory consumption.
info_ru: |
Читать данные через системный кэш Linux (page cache), то есть, использовать
для чтения данных файловый дескриптор, открытый без флага O_DIRECT. Может
улучшить производительность чтения для часто используемых данных, если они
помещаются в память. Память кэша разделяется между всеми процессами в
системе и не учитывается в потреблении памяти процессом OSD.
- name: cached_read_meta
type: bool
default: false
info: |
Read metadata through Linux page cache. May be beneficial when checksums
are enabled and [inmemory_metadata](#inmemory_metadata) is disabled, because
in this case metadata blocks are read from disk to verify checksums on every
read request and caching them may reduce this extra read load.
Absolutely pointless to enable with enabled inmemory_metadata because all
metadata is kept in memory anyway, and likely pointless without checksums,
because in that case, metadata blocks are read from disk only during journal
flushing.
If the same device is used for data and metadata, enabling [cached_read_data](#cached_read_data)
also enables this parameter, given that it isn't turned off explicitly.
info_ru: |
Читать метаданные через системный кэш Linux. Может быть полезно, когда
включены контрольные суммы, а параметр [inmemory_metadata](#inmemory_metadata)
отключён, так как в этом случае блоки метаданных читаются с диска при каждом
запросе чтения для проверки контрольных сумм и их кэширование может снизить
дополнительную нагрузку на диск.
Абсолютно бессмысленно включать данный параметр, если параметр
inmemory_metadata включён (по умолчанию это так), и также вероятно
бессмысленно включать его, если не включены контрольные суммы, так как в
этом случае блоки метаданных читаются с диска только во время сброса
журнала.
Если одно и то же устройство используется для данных и метаданных, включение
[cached_read_data](#cached_read_data) также включает данный параметр, при
условии, что он не отключён явным образом.
- name: cached_read_journal
type: bool
default: false
info: |
Read buffered data from journal through Linux page cache. Makes no sense
without disabling [inmemory_journal](#inmemory_journal), which, again, is
enabled by default.
If the same device is used for metadata and journal, enabling [cached_read_meta](#cached_read_meta)
also enables this parameter, given that it isn't turned off explicitly.
info_ru: |
Читать буферизованные в журнале данные через системный кэш Linux. Не имеет
смысла без отключения параметра [inmemory_journal](#inmemory_journal),
который, опять же, по умолчанию включён.
Если одно и то же устройство используется для метаданных и журнала,
включение [cached_read_meta](#cached_read_meta) также включает данный
параметр, при условии, что он не отключён явным образом.
- name: journal_sector_buffer_count
type: int
default: 32

View File

@@ -21,7 +21,7 @@
## Basic instructions
Download source, for example using git: `git clone --recurse-submodules https://yourcmc.ru/git/vitalif/vitastor/`
Download source, for example using git: `git clone --recurse-submodules https://git.yourcmc.ru/vitalif/vitastor/`
Get `fio` source and symlink it into `<vitastor>/fio`. If you don't want to build fio engine,
you can disable it by passing `-DWITH_FIO=no` to cmake.
@@ -41,7 +41,7 @@ It's recommended to build the QEMU driver (qemu_driver.c) in-tree, as a part of
QEMU build process. To do that:
- Install vitastor client library headers (from source or from vitastor-client-dev package)
- Take a corresponding patch from `patches/qemu-*-vitastor.patch` and apply it to QEMU source
- Copy `src/qemu_driver.c` to QEMU source directory as `block/block-vitastor.c`
- Copy `src/qemu_driver.c` to QEMU source directory as `block/vitastor.c`
- Build QEMU as usual
But it is also possible to build it out-of-tree. To do that:

View File

@@ -21,7 +21,7 @@
## Basic instructions
Download the source code, for example from git: `git clone --recurse-submodules https://yourcmc.ru/git/vitalif/vitastor/`
Download the source code, for example from git: `git clone --recurse-submodules https://git.yourcmc.ru/vitalif/vitastor/`
Download the `fio` package source, unpack it and create a symbolic link to it
in the Vitastor source directory: `<vitastor>/fio`. Or, if you don't want to build the fio plugin,
@@ -41,7 +41,7 @@ cmake .. && make -j8 install
It's recommended to build the QEMU driver (qemu_driver.c) together with QEMU itself. To do that:
- Install the Vitastor client library headers (from source or from the vitastor-client-dev package)
- Take the corresponding patch from `patches/qemu-*-vitastor.patch` and apply it to the QEMU source
- Copy [src/qemu_driver.c](../../src/qemu_driver.c) to the QEMU source directory as `block/block-vitastor.c`
- Copy [src/qemu_driver.c](../../src/qemu_driver.c) to the QEMU source directory as `block/vitastor.c`
- Build QEMU as usual
However, for debugging purposes the driver can also be built separately from QEMU. To do that:
@@ -60,7 +60,7 @@ cmake .. && make -j8 install
* Для QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` и `qapi` нужны, т.к. в них содержатся автогенерируемые заголовки
- Сконфигурируйте cmake Vitastor с `WITH_QEMU=yes` (`cmake .. -DWITH_QEMU=yes`) и, если вы
используете RHEL-подобый дистрибутив, также с `QEMU_PLUGINDIR=qemu-kvm`.
используете RHEL-подобный дистрибутив, также с `QEMU_PLUGINDIR=qemu-kvm`.
- После этого в процессе сборки Vitastor также будет собираться подходящий для вашей
версии QEMU `block-vitastor.so`.
- Таким образом можно использовать драйвер даже с немодифицированным QEMU, но в этом случае


@@ -30,7 +30,6 @@
- [Write throttling to smooth random write workloads in SSD+HDD configurations](../config/osd.en.md#throttle_small_writes)
- [RDMA/RoCEv2 support via libibverbs](../config/network.en.md#rdma_device)
- [Scrubbing without checksums](../config/osd.en.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.en.md#data_csum_type)
## Plugins and tools
@@ -56,6 +55,7 @@ The following features are planned for the future:
- iSCSI proxy
- Multi-threaded client
- Faster failover
- Checksums
- Tiered storage (SSD caching)
- NVDIMM support
- Compression (possibly)


@@ -32,7 +32,6 @@
- [Сглаживание производительности случайной записи в SSD+HDD конфигурациях](../config/osd.ru.md#throttle_small_writes)
- [Поддержка RDMA/RoCEv2 через libibverbs](../config/network.ru.md#rdma_device)
- [Фоновая проверка целостности без контрольных сумм](../config/osd.ru.md#auto_scrub) (сверка копий)
- [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
## Драйверы и инструменты
@@ -56,6 +55,7 @@
- iSCSI-прокси
- Многопоточный клиент
- Более быстрое переключение при отказах
- Контрольные суммы
- Поддержка SSD-кэширования (tiered storage)
- Поддержка NVDIMM
- Возможно, сжатие


@@ -86,8 +86,6 @@ Options (both modes):
--journal_size 1G/32M Set journal size (area or partition size)
--block_size 1M/128k Set blockstore object size
--bitmap_granularity 4k Set bitmap granularity
--data_csum_type none Set data checksum type (crc32c or none)
--csum_block_size 4k Set data checksum block size
--data_device_block 4k Override data device block size
--meta_device_block 4k Override metadata device block size
--journal_device_block 4k Override journal device block size
@@ -102,9 +100,8 @@ checks the device cache status on start and tries to disable cache for SATA/SAS
If it doesn't succeed it issues a warning in the system log.
You can also pass other OSD options here as arguments and they'll be persisted
in the superblock: cached_read_data, cached_read_meta, cached_read_journal,
inmemory_metadata, inmemory_journal, max_write_iodepth,
min_flusher_count, max_flusher_count, journal_sector_buffer_count,
to the superblock: max_write_iodepth, max_write_iodepth, min_flusher_count,
max_flusher_count, inmemory_metadata, inmemory_journal, journal_sector_buffer_count,
journal_no_same_sector_overwrites, throttle_small_writes, throttle_target_iops,
throttle_target_mbs, throttle_target_parallelism, throttle_threshold_us.
See [Runtime OSD Parameters](../config/osd.en.md) for details.
@@ -252,9 +249,7 @@ Options (see also [Cluster-Wide Disk Layout Parameters](../config/layout-cluster
```
--object_size 128k Set blockstore block size
--bitmap_granularity 4k Set bitmap granularity
--journal_size 16M Set journal size
--data_csum_type none Set data checksum type (crc32c or none)
--csum_block_size 4k Set data checksum block size
--journal_size 32M Set journal size
--device_block_size 4k Set device block size
--journal_offset 0 Set journal offset
--device_size 0 Set device size


@@ -87,8 +87,6 @@ vitastor-disk - инструмент командной строки для уп
--journal_size 1G/32M Задать размер журнала (области или раздела журнала)
--block_size 1M/128k Задать размер объекта хранилища
--bitmap_granularity 4k Задать гранулярность битовых карт
--data_csum_type none Задать тип контрольных сумм (crc32c или none)
--csum_block_size 4k Задать размер блока расчёта контрольных сумм
--data_device_block 4k Задать размер блока устройства данных
--meta_device_block 4k Задать размер блока метаданных
--journal_device_block 4k Задать размер блока журнала
@@ -103,9 +101,8 @@ vitastor-disk - инструмент командной строки для уп
это не удаётся, в системный журнал выводится предупреждение.
Вы можете передать данной команде и некоторые другие опции OSD в качестве аргументов
и они тоже будут сохранены в суперблок: cached_read_data, cached_read_meta,
cached_read_journal, inmemory_metadata, inmemory_journal, max_write_iodepth,
min_flusher_count, max_flusher_count, journal_sector_buffer_count,
и они тоже будут сохранены в суперблок: max_write_iodepth, max_write_iodepth, min_flusher_count,
max_flusher_count, inmemory_metadata, inmemory_journal, journal_sector_buffer_count,
journal_no_same_sector_overwrites, throttle_small_writes, throttle_target_iops,
throttle_target_mbs, throttle_target_parallelism, throttle_threshold_us.
Читайте об этих параметрах подробнее в разделе [Изменяемые параметры OSD](../config/osd.ru.md).
@@ -257,9 +254,7 @@ OSD отключены fsync-и.
```
--object_size 128k Размер блока хранилища
--bitmap_granularity 4k Гранулярность битовых карт
--journal_size 16M Размер журнала
--data_csum_type none Задать тип контрольных сумм (crc32c или none)
--csum_block_size 4k Задать размер блока расчёта контрольных сумм
--journal_size 32M Размер журнала
--device_block_size 4k Размер блока устройства
--journal_offset 0 Смещение журнала
--device_size 0 Размер устройства


@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.9.3'
VERSION = '0.9.4'
LOG = logging.getLogger(__name__)


@@ -0,0 +1,176 @@
diff --git a/block/Makefile.objs b/block/Makefile.objs
index d644bac60a..e404236291 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -19,6 +19,7 @@ block-obj-$(if $(CONFIG_LIBISCSI),y,n) += iscsi-opts.o
block-obj-$(CONFIG_LIBNFS) += nfs.o
block-obj-$(CONFIG_CURL) += curl.o
block-obj-$(CONFIG_RBD) += rbd.o
+block-obj-$(CONFIG_VITASTOR) += vitastor.o
block-obj-$(CONFIG_GLUSTERFS) += gluster.o
block-obj-$(CONFIG_VXHS) += vxhs.o
block-obj-$(CONFIG_LIBSSH2) += ssh.o
@@ -39,6 +40,8 @@ curl.o-cflags := $(CURL_CFLAGS)
curl.o-libs := $(CURL_LIBS)
rbd.o-cflags := $(RBD_CFLAGS)
rbd.o-libs := $(RBD_LIBS)
+vitastor.o-cflags := $(VITASTOR_CFLAGS)
+vitastor.o-libs := $(VITASTOR_LIBS)
gluster.o-cflags := $(GLUSTERFS_CFLAGS)
gluster.o-libs := $(GLUSTERFS_LIBS)
vxhs.o-libs := $(VXHS_LIBS)
diff --git a/configure b/configure
index 0a19b033bc..58b7fbf24c 100755
--- a/configure
+++ b/configure
@@ -398,6 +398,7 @@ trace_backends="log"
trace_file="trace"
spice=""
rbd=""
+vitastor=""
smartcard=""
libusb=""
usb_redir=""
@@ -1213,6 +1214,10 @@ for opt do
;;
--enable-rbd) rbd="yes"
;;
+ --disable-vitastor) vitastor="no"
+ ;;
+ --enable-vitastor) vitastor="yes"
+ ;;
--disable-xfsctl) xfs="no"
;;
--enable-xfsctl) xfs="yes"
@@ -1601,6 +1606,7 @@ disabled with --disable-FEATURE, default is enabled if available:
vhost-crypto vhost-crypto acceleration support
spice spice
rbd rados block device (rbd)
+ vitastor vitastor block device
libiscsi iscsi support
libnfs nfs support
smartcard smartcard support (libcacard)
@@ -3594,6 +3600,27 @@ EOF
fi
fi
+##########################################
+# vitastor probe
+if test "$vitastor" != "no" ; then
+ cat > $TMPC <<EOF
+#include <vitastor_c.h>
+int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+}
+EOF
+ vitastor_libs="-lvitastor_client"
+ if compile_prog "" "$vitastor_libs" ; then
+ vitastor=yes
+ else
+ if test "$vitastor" = "yes" ; then
+ feature_not_found "vitastor block device" "Install vitastor-client-dev"
+ fi
+ vitastor=no
+ fi
+fi
+
##########################################
# libssh2 probe
min_libssh2_version=1.2.8
@@ -5837,6 +5864,7 @@ echo "Trace output file $trace_file-<pid>"
fi
echo "spice support $spice $(echo_version $spice $spice_protocol_version/$spice_server_version)"
echo "rbd support $rbd"
+echo "vitastor support $vitastor"
echo "xfsctl support $xfs"
echo "smartcard support $smartcard"
echo "libusb $libusb"
@@ -6416,6 +6444,11 @@ if test "$rbd" = "yes" ; then
echo "RBD_CFLAGS=$rbd_cflags" >> $config_host_mak
echo "RBD_LIBS=$rbd_libs" >> $config_host_mak
fi
+if test "$vitastor" = "yes" ; then
+ echo "CONFIG_VITASTOR=m" >> $config_host_mak
+ echo "VITASTOR_CFLAGS=$vitastor_cflags" >> $config_host_mak
+ echo "VITASTOR_LIBS=$vitastor_libs" >> $config_host_mak
+fi
echo "CONFIG_COROUTINE_BACKEND=$coroutine" >> $config_host_mak
if test "$coroutine_pool" = "yes" ; then
diff --git a/qapi/block-core.json b/qapi/block-core.json
index c50517bff3..c780bb2c1c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2514,7 +2514,7 @@
'dmg', 'file', 'ftp', 'ftps', 'gluster', 'host_cdrom',
'host_device', 'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs',
'null-aio', 'null-co', 'nvme', 'parallels', 'qcow', 'qcow2', 'qed',
- 'quorum', 'raw', 'rbd', 'replication', 'sheepdog', 'ssh',
+ 'quorum', 'raw', 'rbd', 'vitastor', 'replication', 'sheepdog', 'ssh',
'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3217,6 +3217,28 @@
'*snap-id': 'uint32',
'*tag': 'str' } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -3547,6 +3569,7 @@
'rbd': 'BlockdevOptionsRbd',
'replication':'BlockdevOptionsReplication',
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -3991,6 +4014,17 @@
'*subformat': 'BlockdevVhdxSubformat',
'*block-state-zero': 'bool' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVpcSubformat:
#
@@ -4074,6 +4108,7 @@
'rbd': 'BlockdevCreateOptionsRbd',
'replication': 'BlockdevCreateNotSupported',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'throttle': 'BlockdevCreateNotSupported',
'vdi': 'BlockdevCreateOptionsVdi',


@@ -0,0 +1,181 @@
Index: qemu-5.2+dfsg/qapi/block-core.json
===================================================================
--- qemu-5.2+dfsg.orig/qapi/block-core.json
+++ qemu-5.2+dfsg/qapi/block-core.json
@@ -2831,7 +2831,7 @@
'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
{ 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
- 'sheepdog',
+ 'sheepdog', 'vitastor',
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
##
@@ -3668,6 +3668,28 @@
'*tag': 'str' } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4015,6 +4037,7 @@
'replication': { 'type': 'BlockdevOptionsReplication',
'if': 'defined(CONFIG_REPLICATION)' },
'sheepdog': 'BlockdevOptionsSheepdog',
+ 'vitastor': 'BlockdevOptionsVitastor',
'ssh': 'BlockdevOptionsSsh',
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
@@ -4404,6 +4427,17 @@
'*cluster-size' : 'size' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -4665,6 +4699,7 @@
'qed': 'BlockdevCreateOptionsQed',
'rbd': 'BlockdevCreateOptionsRbd',
'sheepdog': 'BlockdevCreateOptionsSheepdog',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
Index: qemu-5.2+dfsg/block/meson.build
===================================================================
--- qemu-5.2+dfsg.orig/block/meson.build
+++ qemu-5.2+dfsg/block/meson.build
@@ -76,6 +76,7 @@ foreach m : [
['CONFIG_LIBNFS', 'nfs', libnfs, 'nfs.c'],
['CONFIG_LIBSSH', 'ssh', libssh, 'ssh.c'],
['CONFIG_RBD', 'rbd', rbd, 'rbd.c'],
+ ['CONFIG_VITASTOR', 'vitastor', vitastor, 'vitastor.c'],
]
if config_host.has_key(m[0])
if enable_modules
Index: qemu-5.2+dfsg/configure
===================================================================
--- qemu-5.2+dfsg.orig/configure
+++ qemu-5.2+dfsg/configure
@@ -372,6 +372,7 @@ trace_backends="log"
trace_file="trace"
spice=""
rbd=""
+vitastor=""
smartcard=""
u2f="auto"
libusb=""
@@ -1263,6 +1264,10 @@ for opt do
;;
--enable-rbd) rbd="yes"
;;
+ --disable-vitastor) vitastor="no"
+ ;;
+ --enable-vitastor) vitastor="yes"
+ ;;
--disable-xfsctl) xfs="no"
;;
--enable-xfsctl) xfs="yes"
@@ -1827,6 +1832,7 @@ disabled with --disable-FEATURE, default
vhost-vdpa vhost-vdpa kernel backend support
spice spice
rbd rados block device (rbd)
+ vitastor vitastor block device
libiscsi iscsi support
libnfs nfs support
smartcard smartcard support (libcacard)
@@ -3719,6 +3725,27 @@ EOF
fi
##########################################
+# vitastor probe
+if test "$vitastor" != "no" ; then
+ cat > $TMPC <<EOF
+#include <vitastor_c.h>
+int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+}
+EOF
+ vitastor_libs="-lvitastor_client"
+ if compile_prog "" "$vitastor_libs" ; then
+ vitastor=yes
+ else
+ if test "$vitastor" = "yes" ; then
+ feature_not_found "vitastor block device" "Install vitastor-client-dev"
+ fi
+ vitastor=no
+ fi
+fi
+
+##########################################
# libssh probe
if test "$libssh" != "no" ; then
if $pkg_config --exists libssh; then
@@ -6456,6 +6483,10 @@ if test "$rbd" = "yes" ; then
echo "CONFIG_RBD=y" >> $config_host_mak
echo "RBD_LIBS=$rbd_libs" >> $config_host_mak
fi
+if test "$vitastor" = "yes" ; then
+ echo "CONFIG_VITASTOR=y" >> $config_host_mak
+ echo "VITASTOR_LIBS=$vitastor_libs" >> $config_host_mak
+fi
echo "CONFIG_COROUTINE_BACKEND=$coroutine" >> $config_host_mak
if test "$coroutine_pool" = "yes" ; then
Index: qemu-5.2+dfsg/meson.build
===================================================================
--- qemu-5.2+dfsg.orig/meson.build
+++ qemu-5.2+dfsg/meson.build
@@ -596,6 +596,10 @@ rbd = not_found
if 'CONFIG_RBD' in config_host
rbd = declare_dependency(link_args: config_host['RBD_LIBS'].split())
endif
+vitastor = not_found
+if 'CONFIG_VITASTOR' in config_host
+ vitastor = declare_dependency(link_args: config_host['VITASTOR_LIBS'].split())
+endif
glusterfs = not_found
if 'CONFIG_GLUSTERFS' in config_host
glusterfs = declare_dependency(compile_args: config_host['GLUSTERFS_CFLAGS'].split(),
@@ -2145,6 +2149,7 @@ endif
# TODO: add back protocol and server version
summary_info += {'spice support': config_host.has_key('CONFIG_SPICE')}
summary_info += {'rbd support': config_host.has_key('CONFIG_RBD')}
+summary_info += {'vitastor support': config_host.has_key('CONFIG_VITASTOR')}
summary_info += {'xfsctl support': config_host.has_key('CONFIG_XFS')}
summary_info += {'smartcard support': config_host.has_key('CONFIG_SMARTCARD')}
summary_info += {'U2F support': u2f.found()}


@@ -24,4 +24,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.9.3/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.9.3$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.9.4/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.9.4$(rpm --eval '%dist').tar.gz *


@@ -22,7 +22,7 @@
Name: qemu-kvm
Version: 4.2.0
-Release: 29.vitastor%{?dist}.6
+Release: 32.vitastor%{?dist}.6
+Release: 34.vitastor%{?dist}.6
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY


@@ -13,7 +13,7 @@
Name: qemu-kvm
Version: 4.2.0
-Release: 29%{?dist}.6
+Release: 32.vitastor%{?dist}.6
+Release: 33.vitastor%{?dist}.6
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY


@@ -0,0 +1,103 @@
--- qemu-kvm-6.2.spec.orig 2023-07-18 13:52:57.636625440 +0000
+++ qemu-kvm-6.2.spec 2023-07-18 13:52:19.011683886 +0000
@@ -73,6 +73,7 @@ Requires: %{name}-hw-usbredir = %{epoch}
%endif \
Requires: %{name}-block-iscsi = %{epoch}:%{version}-%{release} \
Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
+Requires: %{name}-block-vitastor = %{epoch}:%{version}-%{release}\
Requires: %{name}-block-ssh = %{epoch}:%{version}-%{release}
# Macro to properly setup RHEL/RHEV conflict handling
@@ -83,7 +84,7 @@ Obsoletes: %1-rhev <= %{epoch}:%{version
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 6.2.0
-Release: 32%{?rcrel}%{?dist}
+Release: 32.vitastor%{?rcrel}%{?dist}
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
Epoch: 15
License: GPLv2 and GPLv2+ and CC-BY
@@ -122,6 +123,7 @@ Source37: tests_data_acpi_pc_SSDT.dimmpx
Source38: tests_data_acpi_q35_FACP.slic
Source39: tests_data_acpi_q35_SSDT.dimmpxm
Source40: tests_data_acpi_virt_SSDT.memhp
+Source41: qemu-vitastor.c
Patch0001: 0001-redhat-Adding-slirp-to-the-exploded-tree.patch
Patch0005: 0005-Initial-redhat-build.patch
@@ -652,6 +654,7 @@ Patch255: kvm-scsi-protect-req-aiocb-wit
Patch256: kvm-dma-helpers-prevent-dma_blk_cb-vs-dma_aio_cancel-rac.patch
# For bz#2090990 - qemu crash with error scsi_req_unref(SCSIRequest *): Assertion `req->refcount > 0' failed or scsi_dma_complete(void *, int): Assertion `r->req.aiocb != NULL' failed [8.7.0]
Patch257: kvm-virtio-scsi-reset-SCSI-devices-from-main-loop-thread.patch
+Patch258: qemu-6.2-vitastor.patch
BuildRequires: wget
BuildRequires: rpm-build
@@ -689,6 +692,7 @@ BuildRequires: libcurl-devel
BuildRequires: libssh-devel
BuildRequires: librados-devel
BuildRequires: librbd-devel
+BuildRequires: vitastor-client-devel
%if %{have_gluster}
# For gluster block driver
BuildRequires: glusterfs-api-devel
@@ -926,6 +930,14 @@ Install this package if you want to acce
using the rbd protocol.
+%package block-vitastor
+Summary: QEMU Vitastor block driver
+Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
+
+%description block-vitastor
+This package provides the additional Vitastor block driver for QEMU.
+
+
%package block-ssh
Summary: QEMU SSH block driver
Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
@@ -979,6 +991,7 @@ This package provides usbredir support.
rm -fr slirp
mkdir slirp
%autopatch -p1
+cp %{SOURCE41} ./block/vitastor.c
%global qemu_kvm_build qemu_kvm_build
mkdir -p %{qemu_kvm_build}
@@ -994,7 +1007,7 @@ cp -f %{SOURCE40} tests/data/acpi/virt/S
# --build-id option is used for giving info to the debug packages.
buildldflags="VL_LDFLAGS=-Wl,--build-id"
-%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle
+%global block_drivers_list qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle
%if 0%{have_gluster}
%global block_drivers_list %{block_drivers_list},gluster
@@ -1149,9 +1162,7 @@ pushd %{qemu_kvm_build}
--firmwarepath=%{_prefix}/share/qemu-firmware \
--meson="git" \
--target-list="%{buildarch}" \
- --block-drv-rw-whitelist=%{block_drivers_list} \
--audio-drv-list= \
- --block-drv-ro-whitelist=vmdk,vhdx,vpc,https,ssh \
--with-coroutine=ucontext \
--with-git=git \
--tls-priority=@QEMU,SYSTEM \
@@ -1197,6 +1208,7 @@ pushd %{qemu_kvm_build}
%endif
--enable-pie \
--enable-rbd \
+ --enable-vitastor \
%if 0%{have_librdma}
--enable-rdma \
%endif
@@ -1794,6 +1806,9 @@ sh %{_sysconfdir}/sysconfig/modules/kvm.
%files block-rbd
%{_libdir}/qemu-kvm/block-rbd.so
+%files block-vitastor
+%{_libdir}/qemu-kvm/block-vitastor.so
+
%files block-ssh
%{_libdir}/qemu-kvm/block-ssh.so


@@ -0,0 +1,93 @@
--- qemu-kvm-7.2.spec.orig 2023-06-22 13:56:19.000000000 +0000
+++ qemu-kvm-7.2.spec 2023-07-18 07:55:22.347090196 +0000
@@ -100,8 +100,6 @@
%endif
%global target_list %{kvm_target}-softmmu
-%global block_drivers_rw_list qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,compress
-%global block_drivers_ro_list vdi,vmdk,vhdx,vpc,https
%define qemudocdir %{_docdir}/%{name}
%global firmwaredirs "%{_datadir}/qemu-firmware:%{_datadir}/ipxe/qemu:%{_datadir}/seavgabios:%{_datadir}/seabios"
@@ -126,6 +124,7 @@ Requires: %{name}-device-usb-host = %{ep
Requires: %{name}-device-usb-redirect = %{epoch}:%{version}-%{release} \
%endif \
Requires: %{name}-block-rbd = %{epoch}:%{version}-%{release} \
+Requires: %{name}-block-vitastor = %{epoch}:%{version}-%{release}\
Requires: %{name}-audio-pa = %{epoch}:%{version}-%{release}
# Since SPICE is removed from RHEL-9, the following Obsoletes:
@@ -148,7 +147,7 @@ Obsoletes: %{name}-block-ssh <= %{epoch}
Summary: QEMU is a machine emulator and virtualizer
Name: qemu-kvm
Version: 7.2.0
-Release: 14%{?rcrel}%{?dist}%{?cc_suffix}.1
+Release: 14.vitastor%{?rcrel}%{?dist}%{?cc_suffix}.1
# Epoch because we pushed a qemu-1.0 package. AIUI this can't ever be dropped
# Epoch 15 used for RHEL 8
# Epoch 17 used for RHEL 9 (due to release versioning offset in RHEL 8.5)
@@ -171,6 +170,7 @@ Source28: 95-kvm-memlock.conf
Source30: kvm-s390x.conf
Source31: kvm-x86.conf
Source36: README.tests
+Source37: qemu-vitastor.c
Patch0004: 0004-Initial-redhat-build.patch
@@ -418,6 +418,7 @@ Patch134: kvm-target-i386-Fix-BZHI-instr
Patch135: kvm-intel-iommu-fail-DEVIOTLB_UNMAP-without-dt-mode.patch
# For bz#2203745 - Disk detach is unsuccessful while the guest is still booting [rhel-9.2.0.z]
Patch136: kvm-acpi-pcihp-allow-repeating-hot-unplug-requests.patch
+Patch137: qemu-7.2-vitastor.patch
%if %{have_clang}
BuildRequires: clang
@@ -449,6 +450,7 @@ BuildRequires: libcurl-devel
%if %{have_block_rbd}
BuildRequires: librbd-devel
%endif
+BuildRequires: vitastor-client-devel
# We need both because the 'stap' binary is probed for by configure
BuildRequires: systemtap
BuildRequires: systemtap-sdt-devel
@@ -642,6 +644,14 @@ using the rbd protocol.
%endif
+%package block-vitastor
+Summary: QEMU Vitastor block driver
+Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
+
+%description block-vitastor
+This package provides the additional Vitastor block driver for QEMU.
+
+
%package audio-pa
Summary: QEMU PulseAudio audio driver
Requires: %{name}-common%{?_isa} = %{epoch}:%{version}-%{release}
@@ -719,6 +729,7 @@ This package provides usbredir support.
%prep
%setup -q -n qemu-%{version}%{?rcstr}
%autopatch -p1
+cp %{SOURCE37} ./block/vitastor.c
%global qemu_kvm_build qemu_kvm_build
mkdir -p %{qemu_kvm_build}
@@ -946,6 +957,7 @@ run_configure \
%if %{have_block_rbd}
--enable-rbd \
%endif
+ --enable-vitastor \
%if %{have_librdma}
--enable-rdma \
%endif
@@ -1426,6 +1438,9 @@ useradd -r -u 107 -g qemu -G kvm -d / -s
%files block-rbd
%{_libdir}/%{name}/block-rbd.so
%endif
+%files block-vitastor
+%{_libdir}/%{name}/block-vitastor.so
+
%files audio-pa
%{_libdir}/%{name}/audio-pa.so


@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.9.3.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.9.4.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.9.3
Version: 0.9.4
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.9.3.el7.tar.gz
Source0: vitastor-0.9.4.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel


@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.9.3.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.9.4.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.9.3
Version: 0.9.4
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.9.3.el8.tar.gz
Source0: vitastor-0.9.4.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel


@@ -18,7 +18,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.9.3.el9.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.9.4.el9.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.9.3
Version: 0.9.4
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.9.3.el9.tar.gz
Source0: vitastor-0.9.4.el9.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel


@@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.9.3")
add_definitions(-DVERSION="0.9.4")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)


@@ -143,83 +143,34 @@ uint64_t allocator::get_free_count()
return free;
}
// FIXME: Move to utils?
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
if (start == 0 && len == 32*bitmap_granularity)
*((uint32_t*)bitmap) = UINT32_MAX;
else if (start == 0 && len == 64*bitmap_granularity)
*((uint64_t*)bitmap) = UINT64_MAX;
else
if (start == 0)
{
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
if (len == 32*bitmap_granularity)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}
void bitmap_clear(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
if (start == 0 && len == 32*bitmap_granularity)
*((uint32_t*)bitmap) = 0;
else if (start == 0 && len == 64*bitmap_granularity)
*((uint64_t*)bitmap) = 0;
else
{
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = 0;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] &= (0xFF ^ (1 << (bit_start % 8)));
bit_start++;
}
}
}
}
bool bitmap_check(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
bool r = false;
if (start == 0 && len == 32*bitmap_granularity)
r = !!*((uint32_t*)bitmap);
else if (start == 0 && len == 64*bitmap_granularity)
r = !!*((uint64_t*)bitmap);
else
{
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
r = r || !!((uint8_t*)bitmap)[bit_start / 8];
bit_start += 8;
}
else
{
r = r || (((uint8_t*)bitmap)[bit_start / 8] & (1 << (bit_start % 8)));
bit_start++;
}
}
}
return r;
}
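
The bitmap helpers in this hunk are hard to follow once the old and new lines are interleaved, so here is a minimal stand-alone sketch of the same idea: one bit per `bitmap_granularity` bytes of an object, with whole bytes handled in a single step when the range is byte-aligned. The `_sketch` function names are hypothetical and not part of the source, and the whole-word shortcuts for ranges of exactly 32 or 64 granules are left out for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only: names ending in _sketch do not exist in the source tree.
// Each bit covers `granularity` bytes of the object; [start, start+len) is a byte range.

// Mark every granule touched by the byte range as present.
static void bitmap_set_sketch(void *bitmap, uint64_t start, uint64_t len, uint64_t granularity)
{
    assert(granularity != 0);
    uint8_t *bytes = (uint8_t*)bitmap;
    uint64_t bit = start / granularity;
    uint64_t bit_end = (start + len + granularity - 1) / granularity; // round the end up
    while (bit < bit_end)
    {
        if (!(bit & 7) && bit_end >= bit + 8)
        {
            // Byte-aligned and at least 8 bits remain: fill the whole byte
            bytes[bit / 8] = UINT8_MAX;
            bit += 8;
        }
        else
        {
            bytes[bit / 8] |= (uint8_t)(1u << (bit % 8));
            bit++;
        }
    }
}

// Mark every granule touched by the byte range as absent.
static void bitmap_clear_sketch(void *bitmap, uint64_t start, uint64_t len, uint64_t granularity)
{
    assert(granularity != 0);
    uint8_t *bytes = (uint8_t*)bitmap;
    uint64_t bit = start / granularity;
    uint64_t bit_end = (start + len + granularity - 1) / granularity;
    while (bit < bit_end)
    {
        if (!(bit & 7) && bit_end >= bit + 8)
        {
            bytes[bit / 8] = 0;
            bit += 8;
        }
        else
        {
            bytes[bit / 8] &= (uint8_t)~(1u << (bit % 8));
            bit++;
        }
    }
}

// Return true if at least one granule in the byte range is marked.
static bool bitmap_check_sketch(const void *bitmap, uint64_t start, uint64_t len, uint64_t granularity)
{
    assert(granularity != 0);
    const uint8_t *bytes = (const uint8_t*)bitmap;
    uint64_t bit = start / granularity;
    uint64_t bit_end = (start + len + granularity - 1) / granularity;
    while (bit < bit_end)
    {
        if (!(bit & 7) && bit_end >= bit + 8)
        {
            if (bytes[bit / 8])
                return true;
            bit += 8;
        }
        else
        {
            if (bytes[bit / 8] & (1u << (bit % 8)))
                return true;
            bit++;
        }
    }
    return false;
}
```

For a 128k object with 4k granularity the bitmap is 4 bytes long; setting the byte range starting at 8192 with length 12288 marks granules 2, 3 and 4 in the first byte.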


@@ -23,5 +23,3 @@ public:
};
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);
void bitmap_clear(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);
bool bitmap_check(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);


@@ -77,7 +77,6 @@ Output:
-EINVAL = invalid input parameters
-ENOENT = requested object/version does not exist for reads
-ENOSPC = no space left in the store for writes
-EDOM = checksum error.
- version = the version actually read or written
## BS_OP_DELETE


@@ -40,31 +40,10 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
data_block_size = parse_size(config["block_size"]);
journal_device = config["journal_device"];
journal_offset = parse_size(config["journal_offset"]);
disk_alignment = parse_size(config["disk_alignment"]);
journal_block_size = parse_size(config["journal_block_size"]);
meta_block_size = parse_size(config["meta_block_size"]);
bitmap_granularity = parse_size(config["bitmap_granularity"]);
meta_format = stoull_full(config["meta_format"]);
cached_read_data = config["cached_read_data"] == "true" || config["cached_read_data"] == "yes" || config["cached_read_data"] == "1";
cached_read_meta = cached_read_data && (meta_device == data_device || meta_device == "") &&
config.find("cached_read_meta") == config.end() ||
config["cached_read_meta"] == "true" || config["cached_read_meta"] == "yes" || config["cached_read_meta"] == "1";
cached_read_journal = cached_read_meta && (journal_device == meta_device || journal_device == "") &&
config.find("cached_read_journal") == config.end() ||
config["cached_read_journal"] == "true" || config["cached_read_journal"] == "yes" || config["cached_read_journal"] == "1";
if (config["data_csum_type"] == "crc32c")
{
data_csum_type = BLOCKSTORE_CSUM_CRC32C;
}
else if (config["data_csum_type"] == "" || config["data_csum_type"] == "none")
{
data_csum_type = BLOCKSTORE_CSUM_NONE;
}
else
{
throw std::runtime_error("data_csum_type="+config["data_csum_type"]+" is unsupported, only \"crc32c\" and \"none\" are supported");
}
csum_block_size = parse_size(config["csum_block_size"]);
disk_alignment = strtoull(config["disk_alignment"].c_str(), NULL, 10);
journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
// Validate
if (!data_block_size)
{
@@ -112,23 +91,7 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
}
if (data_block_size % bitmap_granularity)
{
throw std::runtime_error("Data block size must be a multiple of sparse write tracking granularity");
}
if (!data_csum_type)
{
csum_block_size = 0;
}
else if (!csum_block_size)
{
csum_block_size = bitmap_granularity;
}
if (csum_block_size && (csum_block_size % bitmap_granularity))
{
throw std::runtime_error("Checksum block size must be a multiple of sparse write tracking granularity");
}
if (csum_block_size && (data_block_size % csum_block_size))
{
throw std::runtime_error("Checksum block size must be a divisor of data block size");
throw std::runtime_error("Block size must be a multiple of sparse write tracking granularity");
}
if (meta_device == "")
{
@@ -147,9 +110,7 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
throw std::runtime_error("journal_offset must be a multiple of journal_block_size = "+std::to_string(journal_block_size));
}
clean_entry_bitmap_size = data_block_size / bitmap_granularity / 8;
clean_dyn_size = clean_entry_bitmap_size*2 + (csum_block_size
? data_block_size/csum_block_size*(data_csum_type & 0xFF) : 0);
clean_entry_size = sizeof(clean_disk_entry) + clean_dyn_size + 4 /*entry_csum*/;
clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
}
void blockstore_disk_t::calc_lengths(bool skip_meta_check)
@@ -199,25 +160,6 @@ void blockstore_disk_t::calc_lengths(bool skip_meta_check)
// required metadata size
block_count = data_len / data_block_size;
meta_len = (1 + (block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
if (meta_format == BLOCKSTORE_META_FORMAT_V1 ||
!meta_format && !skip_meta_check && meta_area_size < meta_len && !data_csum_type)
{
uint64_t clean_entry_v0_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
uint64_t meta_v0_len = (1 + (block_count - 1 + meta_block_size / clean_entry_v0_size)
/ (meta_block_size / clean_entry_v0_size)) * meta_block_size;
if (meta_format == BLOCKSTORE_META_FORMAT_V1 || meta_area_size >= meta_v0_len)
{
// Old metadata fits.
printf("Warning: Using old metadata format without checksums because the new format doesn't fit into provided area\n");
clean_entry_size = clean_entry_v0_size;
meta_len = meta_v0_len;
meta_format = BLOCKSTORE_META_FORMAT_V1;
}
else
meta_format = BLOCKSTORE_META_FORMAT_V2;
}
else
meta_format = BLOCKSTORE_META_FORMAT_V2;
if (!skip_meta_check && meta_area_size < meta_len)
{
throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes");
@@ -295,18 +237,6 @@ void blockstore_disk_t::open_data()
{
throw std::runtime_error(std::string("Failed to lock data device: ") + strerror(errno));
}
if (cached_read_data)
{
read_data_fd = open(data_device.c_str(), O_RDWR);
if (read_data_fd == -1)
{
throw std::runtime_error("Failed to open data device "+data_device+": "+std::string(strerror(errno)));
}
}
else
{
read_data_fd = data_fd;
}
}
void blockstore_disk_t::open_meta()
@@ -327,18 +257,6 @@ void blockstore_disk_t::open_meta()
{
throw std::runtime_error(std::string("Failed to lock metadata device: ") + strerror(errno));
}
if (cached_read_meta)
{
read_meta_fd = open(meta_device.c_str(), O_RDWR);
if (read_meta_fd == -1)
{
throw std::runtime_error("Failed to open metadata device "+meta_device+": "+std::string(strerror(errno)));
}
}
else
{
read_meta_fd = meta_fd;
}
}
else
{
@@ -357,22 +275,6 @@ void blockstore_disk_t::open_meta()
") is not a multiple of data device sector size ("+std::to_string(meta_device_sect)+")"
);
}
if (!cached_read_meta)
{
read_meta_fd = meta_fd;
}
else if (meta_device == data_device && cached_read_data)
{
read_meta_fd = read_data_fd;
}
else
{
read_meta_fd = open(meta_device.c_str(), O_RDWR);
if (read_meta_fd == -1)
{
throw std::runtime_error("Failed to open metadata device "+meta_device+": "+std::string(strerror(errno)));
}
}
}
void blockstore_disk_t::open_journal()
@@ -407,26 +309,6 @@ void blockstore_disk_t::open_journal()
") is not a multiple of journal device sector size ("+std::to_string(journal_device_sect)+")"
);
}
if (!cached_read_journal)
{
read_journal_fd = journal_fd;
}
else if (journal_device == meta_device && cached_read_meta)
{
read_journal_fd = read_meta_fd;
}
else if (journal_device == data_device && cached_read_data)
{
read_journal_fd = read_data_fd;
}
else
{
read_journal_fd = open(journal_device.c_str(), O_RDWR);
if (read_journal_fd == -1)
{
throw std::runtime_error("Failed to open journal device "+journal_device+": "+std::string(strerror(errno)));
}
}
}
void blockstore_disk_t::close_all()
@@ -437,12 +319,5 @@ void blockstore_disk_t::close_all()
close(meta_fd);
if (journal_fd >= 0 && journal_fd != meta_fd)
close(journal_fd);
if (read_data_fd >= 0 && read_data_fd != data_fd)
close(read_data_fd);
if (read_meta_fd >= 0 && read_meta_fd != meta_fd)
close(read_meta_fd);
if (read_journal_fd >= 0 && read_journal_fd != journal_fd)
close(read_journal_fd);
data_fd = meta_fd = journal_fd = -1;
read_data_fd = read_meta_fd = read_journal_fd = -1;
}
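
The metadata sizing arithmetic in `parse_config()` and `calc_lengths()` above can be checked with a small standalone sketch. It follows the variant without checksums (`clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size`) and assumes that the fixed part of `clean_disk_entry` is 24 bytes (a 16-byte `object_id` plus an 8-byte version). The `meta_layout` struct and `calc_meta_layout()` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the arithmetic of parse_config()/calc_lengths().
// Assumption: the fixed part of clean_disk_entry is 24 bytes (object_id + version).
struct meta_layout
{
    uint64_t clean_entry_bitmap_size;
    uint64_t clean_entry_size;
    uint64_t entries_per_block;
    uint64_t meta_len;
};

inline meta_layout calc_meta_layout(uint64_t data_len, uint64_t data_block_size,
    uint64_t bitmap_granularity, uint64_t meta_block_size)
{
    assert(data_block_size && bitmap_granularity && meta_block_size);
    assert(data_block_size % bitmap_granularity == 0);
    const uint64_t fixed_entry_size = 24;
    meta_layout l;
    // One bit per bitmap_granularity bytes of a data block
    l.clean_entry_bitmap_size = data_block_size / bitmap_granularity / 8;
    // Two bitmaps per entry: the block bitmap and the external attribute bitmap
    l.clean_entry_size = fixed_entry_size + 2*l.clean_entry_bitmap_size;
    l.entries_per_block = meta_block_size / l.clean_entry_size;
    assert(l.entries_per_block > 0);
    uint64_t block_count = data_len / data_block_size;
    // 1 header block + ceil(block_count / entries_per_block) entry blocks
    l.meta_len = (1 + (block_count - 1 + l.entries_per_block) / l.entries_per_block) * meta_block_size;
    return l;
}
```

Under these assumptions, 1000 data blocks of 128 KiB with 4 KiB granularity and 4 KiB metadata blocks give 32-byte entries, 128 entries per metadata block, and 9 metadata blocks (36864 bytes) including the header block.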


@@ -8,10 +8,6 @@
#include <string>
#include <map>
#define BLOCKSTORE_CSUM_NONE 0
// Lower byte of checksum type is its length
#define BLOCKSTORE_CSUM_CRC32C 0x104
struct blockstore_disk_t
{
std::string data_device, meta_device, journal_device;
@@ -25,24 +21,17 @@ struct blockstore_disk_t
uint64_t meta_block_size = 4096;
// Sparse write tracking granularity. 4 KB is a good choice. Must be a multiple of disk_alignment
uint64_t bitmap_granularity = 4096;
// Data checksum type, BLOCKSTORE_CSUM_NONE or BLOCKSTORE_CSUM_CRC32C
uint32_t data_csum_type = BLOCKSTORE_CSUM_NONE;
// Checksum block size, must be a multiple of bitmap_granularity
uint32_t csum_block_size = 4096;
// By default, Blockstore locks all opened devices exclusively. This option can be used to disable locking
bool disable_flock = false;
// Use linux page cache for reads. If enabled, separate buffered FDs will be opened for reading
bool cached_read_data = false, cached_read_meta = false, cached_read_journal = false;
int meta_fd = -1, data_fd = -1, journal_fd = -1;
int read_meta_fd = -1, read_data_fd = -1, read_journal_fd = -1;
uint64_t meta_offset, meta_device_sect, meta_device_size, meta_len, meta_format = 0;
uint64_t meta_offset, meta_device_sect, meta_device_size, meta_len;
uint64_t data_offset, data_device_sect, data_device_size, data_len;
uint64_t journal_offset, journal_device_sect, journal_device_size, journal_len;
uint32_t block_order;
uint64_t block_count;
uint32_t clean_entry_bitmap_size = 0, clean_entry_size = 0, clean_dyn_size = 0;
uint32_t clean_entry_bitmap_size = 0, clean_entry_size = 0;
void parse_config(std::map<std::string, std::string> & config);
void open_data();
@@ -50,13 +39,4 @@ struct blockstore_disk_t
void open_journal();
void calc_lengths(bool skip_meta_check = false);
void close_all();
inline uint64_t dirty_dyn_size(uint64_t offset, uint64_t len)
{
// Checksums may be partial if write is not aligned with csum_block_size
return clean_entry_bitmap_size + (csum_block_size && len > 0
? ((offset+len+csum_block_size-1)/csum_block_size - offset/csum_block_size)
* (data_csum_type & 0xFF)
: 0);
}
};
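
The `dirty_dyn_size()` helper above counts one checksum per checksum block touched by a write, so an unaligned write needs more checksums than its length alone suggests. The sketch below repeats the same arithmetic as a free function with explicit parameters; the function signature and constant names are illustrative, while the `0x104` value and the "lower byte is the length" rule come from the header above.

```cpp
#include <cassert>
#include <cstdint>

// Values from the header above: the lower byte of the checksum type is its length
constexpr uint32_t CSUM_NONE = 0;
constexpr uint32_t CSUM_CRC32C = 0x104;

// Standalone copy of the dirty_dyn_size() arithmetic (hypothetical free-function form)
inline uint64_t dirty_dyn_size(uint64_t bitmap_size, uint32_t csum_type, uint32_t csum_block_size,
    uint64_t offset, uint64_t len)
{
    assert(offset + len >= offset); // no wraparound
    if (!csum_block_size || !len)
        return bitmap_size;
    // Index of the first checksum block touched by the write
    uint64_t first_block = offset / csum_block_size;
    // Index one past the last checksum block touched by the write
    uint64_t end_block = (offset + len + csum_block_size - 1) / csum_block_size;
    return bitmap_size + (end_block - first_block) * (csum_type & 0xFF);
}
```

For example, with a 4-byte bitmap and 4 KiB checksum blocks, an aligned 4 KiB write needs 8 bytes, while an 8 KiB write starting at offset 1000 touches three checksum blocks and needs 16 bytes.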

File diff suppressed because it is too large


@@ -1,22 +1,10 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#define COPY_BUF_JOURNAL 1
#define COPY_BUF_DATA 2
#define COPY_BUF_ZERO 4
#define COPY_BUF_CSUM_FILL 8
#define COPY_BUF_COALESCED 16
#define COPY_BUF_META_BLOCK 32
#define COPY_BUF_JOURNALED_BIG 64
struct copy_buffer_t
{
int copy_flags;
uint64_t offset, len, disk_offset;
uint64_t journal_sector; // only for reads: sector+1 if used and !journal.inmemory, otherwise 0
uint64_t offset, len;
void *buf;
uint8_t *csum_buf;
int *dyn_data;
};
struct meta_sector_t
@@ -49,7 +37,7 @@ class journal_flusher_co
{
blockstore_impl_t *bs;
journal_flusher_t *flusher;
int wait_state, wait_count, wait_journal_count;
int wait_state, wait_count;
struct io_uring_sqe *sqe;
struct ring_data_t *data;
@@ -58,39 +46,28 @@ class journal_flusher_co
obj_ver_id cur;
std::map<obj_ver_id, dirty_entry>::iterator dirty_it, dirty_start, dirty_end;
std::map<object_id, uint64_t>::iterator repeat_it;
std::function<void(ring_data_t*)> simple_callback_r, simple_callback_rj, simple_callback_w;
std::function<void(ring_data_t*)> simple_callback_r, simple_callback_w;
bool skip_copy, has_delete, has_writes;
std::vector<copy_buffer_t> v;
std::vector<copy_buffer_t>::iterator it;
int i;
bool fill_incomplete, cleared_incomplete;
int read_to_fill_incomplete;
int copy_count;
uint64_t clean_loc, clean_ver, old_clean_loc, old_clean_ver;
uint64_t clean_loc, old_clean_loc;
flusher_meta_write_t meta_old, meta_new;
bool clean_init_bitmap;
uint64_t clean_bitmap_offset, clean_bitmap_len;
uint8_t *clean_init_dyn_ptr;
uint8_t *new_clean_bitmap;
void *new_clean_bitmap;
uint64_t new_trim_pos;
// local: scan_dirty()
uint64_t offset, end_offset, submit_offset, submit_len;
friend class journal_flusher_t;
void scan_dirty();
bool read_dirty(int wait_base);
bool modify_meta_do_reads(int wait_base);
bool wait_meta_reads(int wait_base);
bool scan_dirty(int wait_base);
bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base);
bool clear_incomplete_csum_block_bits(int wait_base);
void calc_block_checksums(uint32_t *new_data_csums, bool skip_overwrites);
void update_metadata_entry();
bool write_meta_block(flusher_meta_write_t & meta_block, int wait_base);
void update_clean_db();
void free_data_blocks();
bool fsync_batch(bool fsync_meta, int wait_base);
bool trim_journal(int wait_base);
void free_buffers();
public:
journal_flusher_co();
bool loop();
@@ -118,10 +95,9 @@ class journal_flusher_t
std::map<uint64_t, meta_sector_t> meta_sectors;
std::deque<object_id> flush_queue;
std::map<object_id, uint64_t> flush_versions; // FIXME: consider unordered_map?
std::map<object_id, uint64_t> flush_versions;
bool try_find_older(std::map<obj_ver_id, dirty_entry>::iterator & dirty_end, obj_ver_id & cur);
bool try_find_other(std::map<obj_ver_id, dirty_entry>::iterator & dirty_end, obj_ver_id & cur);
public:
journal_flusher_t(blockstore_impl_t *bs);
@@ -136,5 +112,4 @@ public:
void unshift_flush(obj_ver_id oid, bool force);
void remove_flush(object_id oid);
void dump_diagnostics();
bool is_mutated(uint64_t clean_loc);
};


@@ -13,7 +13,6 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
initialized = 0;
parse_config(config, true);
zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, dsk.data_block_size);
alloc_dyn_data = dsk.clean_dyn_size > sizeof(void*) || dsk.csum_block_size > 0;
try
{
dsk.open_data();
@@ -39,8 +38,8 @@ blockstore_impl_t::~blockstore_impl_t()
dsk.close_all();
if (metadata_buffer)
free(metadata_buffer);
if (clean_bitmaps)
free(clean_bitmaps);
if (clean_bitmap)
free(clean_bitmap);
}
bool blockstore_impl_t::is_started()


@@ -93,10 +93,11 @@
// "VITAstor"
#define BLOCKSTORE_META_MAGIC_V1 0x726F747341544956l
#define BLOCKSTORE_META_FORMAT_V1 1
#define BLOCKSTORE_META_FORMAT_V2 2
#define BLOCKSTORE_META_VERSION_V1 1
// metadata header (superblock)
// FIXME: After adding the OSD superblock, add a key to metadata
// and journal headers to check if they belong to the same OSD
struct __attribute__((__packed__)) blockstore_meta_header_v1_t
{
uint64_t zero;
@@ -107,29 +108,14 @@ struct __attribute__((__packed__)) blockstore_meta_header_v1_t
uint32_t bitmap_granularity;
};
struct __attribute__((__packed__)) blockstore_meta_header_v2_t
{
uint64_t zero;
uint64_t magic;
uint64_t version;
uint32_t meta_block_size;
uint32_t data_block_size;
uint32_t bitmap_granularity;
uint32_t data_csum_type;
uint32_t csum_block_size;
uint32_t header_csum;
};
// 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
// per "clean" entry on disk with fixed metadata tables
// FIXME: maybe add crc32's to metadata
struct __attribute__((__packed__)) clean_disk_entry
{
object_id oid;
uint64_t version;
uint8_t bitmap[];
// Two more fields come after bitmap in metadata version 2:
// uint32_t data_csum[];
// uint32_t entry_csum;
};
// 32 = 16 + 16 bytes per "clean" entry in memory (object_id => clean_entry)
@@ -139,7 +125,7 @@ struct __attribute__((__packed__)) clean_entry
uint64_t location;
};
// 64 = 24 + 40 bytes per dirty entry in memory (obj_ver_id => dirty_entry). Plus checksums
// 64 = 24 + 40 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
struct __attribute__((__packed__)) dirty_entry
{
uint32_t state;
@@ -148,7 +134,7 @@ struct __attribute__((__packed__)) dirty_entry
uint32_t offset; // data offset within object (stripe)
uint32_t len; // data length
uint64_t journal_sector; // journal sector used for this entry
void* dyn_data; // dynamic data: external bitmap and data block checksums. may be a pointer to the in-memory journal
void* bitmap; // either external bitmap itself when it fits, or a pointer to it when it doesn't
};
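
The comment on `bitmap` describes a small-buffer trick: when the bitmap is no larger than a pointer, its bytes are stored in the pointer variable itself, otherwise the pointer refers to a heap allocation. A minimal sketch of that idea follows. The helper names are hypothetical and plain `malloc` stands in for `malloc_or_die`; only the inline-versus-heap decision is taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns the value to keep in a dirty_entry-like "bitmap" slot
inline void* store_bitmap(const void *src, size_t bitmap_size)
{
    assert(src != nullptr);
    void *bmp = nullptr;
    if (bitmap_size <= sizeof(void*))
    {
        // Small bitmap: copy the bytes into the pointer variable itself
        std::memcpy(&bmp, src, bitmap_size);
    }
    else
    {
        // Large bitmap: allocate and keep a real pointer
        bmp = std::malloc(bitmap_size);
        if (!bmp)
            std::abort();
        std::memcpy(bmp, src, bitmap_size);
    }
    return bmp;
}

// Returns the address of the bitmap bytes. The slot is taken by reference
// because in the inline case the bytes live inside the slot itself.
inline const uint8_t* bitmap_bytes(void* const & slot, size_t bitmap_size)
{
    return bitmap_size <= sizeof(void*) ? (const uint8_t*)&slot : (const uint8_t*)slot;
}

inline void free_bitmap(void* & slot, size_t bitmap_size)
{
    if (bitmap_size > sizeof(void*))
        std::free(slot);
    slot = nullptr;
}
```

The inline case avoids one allocation per dirty entry, at the cost that every reader has to know the bitmap size to decide how to interpret the field.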
// - Sync must be submitted after previous writes/deletes (not before!)
@@ -177,23 +163,12 @@ struct __attribute__((__packed__)) dirty_entry
// Suspend operation until there is some free space on the data device
#define WAIT_FREE 5
struct used_clean_obj_t
struct fulfill_read_t
{
int refs;
bool was_freed; // was freed by a parallel flush?
bool was_changed; // was changed by a parallel flush?
uint64_t offset, len;
uint64_t journal_sector; // sector+1 if used and !journal.inmemory, otherwise 0
};
// https://github.com/algorithm-ninja/cpp-btree
// https://github.com/greg7mdp/sparsepp/ was used previously, but it was TERRIBLY slow after resizing
// with sparsepp, random reads dropped to ~700 iops very fast with just as much as ~32k objects in the DB
typedef btree::btree_map<object_id, clean_entry> blockstore_clean_db_t;
typedef std::map<obj_ver_id, dirty_entry> blockstore_dirty_db_t;
#include "blockstore_init.h"
#include "blockstore_flush.h"
#define PRIV(op) ((blockstore_op_private_t*)(op)->private_data)
#define FINISH_OP(op) PRIV(op)->~blockstore_op_private_t(); std::function<void (blockstore_op_t*)>(op->callback)(op)
@@ -206,8 +181,7 @@ struct blockstore_op_private_t
int op_state;
// Read
uint64_t clean_block_used;
std::vector<copy_buffer_t> read_vec;
std::vector<fulfill_read_t> read_vec;
// Sync, write
int min_flushed_journal_sector, max_flushed_journal_sector;
@@ -223,6 +197,16 @@ struct blockstore_op_private_t
int sync_small_checked, sync_big_checked;
};
// https://github.com/algorithm-ninja/cpp-btree
// https://github.com/greg7mdp/sparsepp/ was used previously, but it was TERRIBLY slow after resizing
// with sparsepp, random reads dropped to ~700 iops very fast with just as much as ~32k objects in the DB
typedef btree::btree_map<object_id, clean_entry> blockstore_clean_db_t;
typedef std::map<obj_ver_id, dirty_entry> blockstore_dirty_db_t;
#include "blockstore_init.h"
#include "blockstore_flush.h"
typedef uint32_t pool_id_t;
typedef uint64_t pool_pg_id_t;
@@ -269,7 +253,7 @@ class blockstore_impl_t
std::map<pool_id_t, pool_shard_settings_t> clean_db_settings;
std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards;
uint8_t *clean_bitmaps = NULL;
uint8_t *clean_bitmap = NULL;
blockstore_dirty_db_t dirty_db;
std::vector<blockstore_op_t*> submit_queue;
std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
@@ -283,10 +267,6 @@ class blockstore_impl_t
journal_flusher_t *flusher;
int big_to_flush = 0;
int write_iodepth = 0;
bool alloc_dyn_data = false;
// clean data blocks referenced by read operations
std::map<uint64_t, used_clean_obj_t> used_clean_objects;
bool live = false, queue_stall = false;
ring_loop_t *ringloop;
@@ -330,30 +310,8 @@ class blockstore_impl_t
// Read
int dequeue_read(blockstore_op_t *read_op);
void find_holes(std::vector<copy_buffer_t> & read_vec, uint32_t item_start, uint32_t item_end,
std::function<int(int, bool, uint32_t, uint32_t)> callback);
int fulfill_read(blockstore_op_t *read_op,
uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
uint32_t item_state, uint64_t item_version, uint64_t item_location,
uint64_t journal_sector, uint8_t *csum, int *dyn_data);
bool fulfill_clean_read(blockstore_op_t *read_op, uint64_t & fulfilled,
uint8_t *clean_entry_bitmap, int *dyn_data,
uint32_t item_start, uint32_t item_end, uint64_t clean_loc, uint64_t clean_ver);
int fill_partial_checksum_blocks(std::vector<copy_buffer_t> & rv, uint64_t & fulfilled,
uint8_t *clean_entry_bitmap, int *dyn_data, bool from_journal, uint8_t *read_buf, uint64_t read_offset, uint64_t read_end);
int pad_journal_read(std::vector<copy_buffer_t> & rv, copy_buffer_t & cp,
uint64_t dirty_offset, uint64_t dirty_end, uint64_t dirty_loc, uint8_t *csum_ptr, int *dyn_data,
uint64_t offset, uint64_t submit_len, uint64_t & blk_begin, uint64_t & blk_end, uint8_t* & blk_buf);
bool read_range_fulfilled(std::vector<copy_buffer_t> & rv, uint64_t & fulfilled, uint8_t *read_buf,
uint8_t *clean_entry_bitmap, uint32_t item_start, uint32_t item_end);
bool read_checksum_block(blockstore_op_t *op, int rv_pos, uint64_t &fulfilled, uint64_t clean_loc);
uint8_t* read_clean_meta_block(blockstore_op_t *read_op, uint64_t clean_loc, int rv_pos);
bool verify_padded_checksums(uint8_t *clean_entry_bitmap, uint8_t *csum_buf, uint32_t offset,
iovec *iov, int n_iov, std::function<void(uint32_t, uint32_t, uint32_t)> bad_block_cb);
bool verify_journal_checksums(uint8_t *csums, uint32_t offset,
iovec *iov, int n_iov, std::function<void(uint32_t, uint32_t, uint32_t)> bad_block_cb);
bool verify_clean_padded_checksums(blockstore_op_t *op, uint64_t clean_loc, uint8_t *dyn_data, bool from_journal,
iovec *iov, int n_iov, std::function<void(uint32_t, uint32_t, uint32_t)> bad_block_cb);
int fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
uint32_t item_state, uint64_t item_version, uint64_t item_location, uint64_t journal_sector);
int fulfill_read_push(blockstore_op_t *op, void *buf, uint64_t offset, uint64_t len,
uint32_t item_state, uint64_t item_version);
void handle_read_event(ring_data_t *data, blockstore_op_t *op);
@@ -384,7 +342,6 @@ class blockstore_impl_t
int continue_rollback(blockstore_op_t *op);
void mark_rolled_back(const obj_ver_id & ov);
void erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc);
void free_dirty_dyn_data(dirty_entry & e);
// List
void process_list(blockstore_op_t *op);


@@ -65,7 +65,7 @@ int blockstore_init_meta::loop()
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.read_meta_fd, &data->iov, 1, bs->dsk.meta_offset);
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
bs->ringloop->submit();
submitted++;
resume_1:
@@ -77,20 +77,13 @@ resume_1:
if (iszero((uint64_t*)metadata_buffer, bs->dsk.meta_block_size / sizeof(uint64_t)))
{
{
blockstore_meta_header_v2_t *hdr = (blockstore_meta_header_v2_t *)metadata_buffer;
blockstore_meta_header_v1_t *hdr = (blockstore_meta_header_v1_t *)metadata_buffer;
hdr->zero = 0;
hdr->magic = BLOCKSTORE_META_MAGIC_V1;
hdr->version = bs->dsk.meta_format;
hdr->version = BLOCKSTORE_META_VERSION_V1;
hdr->meta_block_size = bs->dsk.meta_block_size;
hdr->data_block_size = bs->dsk.data_block_size;
hdr->bitmap_granularity = bs->dsk.bitmap_granularity;
if (bs->dsk.meta_format >= BLOCKSTORE_META_FORMAT_V2)
{
hdr->data_csum_type = bs->dsk.data_csum_type;
hdr->csum_block_size = bs->dsk.csum_block_size;
hdr->header_csum = 0;
hdr->header_csum = crc32c(0, hdr, sizeof(*hdr));
}
}
if (bs->readonly)
{
@@ -116,62 +109,28 @@ resume_1:
}
else
{
blockstore_meta_header_v2_t *hdr = (blockstore_meta_header_v2_t *)metadata_buffer;
if (hdr->zero != 0 || hdr->magic != BLOCKSTORE_META_MAGIC_V1 || hdr->version < BLOCKSTORE_META_FORMAT_V1)
blockstore_meta_header_v1_t *hdr = (blockstore_meta_header_v1_t *)metadata_buffer;
if (hdr->zero != 0 ||
hdr->magic != BLOCKSTORE_META_MAGIC_V1 ||
hdr->version != BLOCKSTORE_META_VERSION_V1)
{
printf(
"Metadata is corrupt or too old (pre-0.6.x).\n"
" If this is a new OSD, please zero out the metadata area before starting it.\n"
" If you need to upgrade from 0.5.x, convert metadata with vitastor-disk.\n"
);
exit(1);
}
if (hdr->version == BLOCKSTORE_META_FORMAT_V2)
{
uint32_t csum = hdr->header_csum;
hdr->header_csum = 0;
if (crc32c(0, hdr, sizeof(*hdr)) != csum)
{
printf("Metadata header is corrupt (checksum mismatch).\n");
exit(1);
}
hdr->header_csum = csum;
bs->dsk.meta_format = BLOCKSTORE_META_FORMAT_V2;
}
else if (hdr->version == BLOCKSTORE_META_FORMAT_V1)
{
hdr->data_csum_type = 0;
hdr->csum_block_size = 0;
hdr->header_csum = 0;
// Enable compatibility mode - entries without checksums
bs->dsk.clean_entry_size = sizeof(clean_disk_entry) + bs->dsk.clean_entry_bitmap_size*2;
bs->dsk.meta_len = (1 + (bs->dsk.block_count - 1 + bs->dsk.meta_block_size / bs->dsk.clean_entry_size)
/ (bs->dsk.meta_block_size / bs->dsk.clean_entry_size)) * bs->dsk.meta_block_size;
bs->dsk.meta_format = BLOCKSTORE_META_FORMAT_V1;
printf("Warning: Starting with metadata in the old format without checksums, as stored on disk\n");
}
else if (hdr->version > BLOCKSTORE_META_FORMAT_V2)
{
printf(
"Metadata format is too new for me (stored version is %lu, max supported %u).\n",
hdr->version, BLOCKSTORE_META_FORMAT_V2
"Metadata is corrupt or old version.\n"
" If this is a new OSD please zero out the metadata area before starting it.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n"
);
exit(1);
}
if (hdr->meta_block_size != bs->dsk.meta_block_size ||
hdr->data_block_size != bs->dsk.data_block_size ||
hdr->bitmap_granularity != bs->dsk.bitmap_granularity ||
hdr->data_csum_type != bs->dsk.data_csum_type ||
hdr->csum_block_size != bs->dsk.csum_block_size)
hdr->bitmap_granularity != bs->dsk.bitmap_granularity)
{
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u, data_csum_type=%u, csum_block_size=%u)"
" differs from OSD configuration (%lu/%u/%lu, %u/%u).\n",
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u)"
" differs from OSD configuration (%lu/%u/%lu).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
hdr->data_csum_type, hdr->csum_block_size,
bs->dsk.meta_block_size, bs->dsk.data_block_size, bs->dsk.bitmap_granularity,
bs->dsk.data_csum_type, bs->dsk.csum_block_size
bs->dsk.meta_block_size, bs->dsk.data_block_size, bs->dsk.bitmap_granularity
);
exit(1);
}
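
The v2 header handling above uses a common self-checksum pattern: the checksum field is zeroed, the CRC is computed over the whole struct, and the result is stored back; verification repeats the computation with the field temporarily zeroed. The sketch below illustrates the pattern on a simplified, hypothetical header. `crc32c_sw` is a slow bitwise CRC32C written for this example; whether it matches the bit conventions of the `crc32c()` used in the diff is an assumption that is not verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC32C (Castagnoli polynomial, reflected form). Dependency-free but slow.
inline uint32_t crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t*)buf;
    crc = ~crc;
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return ~crc;
}

// Simplified, hypothetical header: only the checksum handling matters here
struct __attribute__((__packed__)) demo_header_t
{
    uint64_t zero;
    uint64_t magic;
    uint64_t version;
    uint32_t meta_block_size;
    uint32_t header_csum;
};

// Store the checksum computed with the checksum field set to zero
inline void seal_header(demo_header_t *hdr)
{
    assert(hdr != nullptr);
    hdr->header_csum = 0;
    hdr->header_csum = crc32c_sw(0, hdr, sizeof(*hdr));
}

// Recompute with the field zeroed, compare, then restore the stored value
inline bool verify_header(demo_header_t *hdr)
{
    assert(hdr != nullptr);
    uint32_t stored = hdr->header_csum;
    hdr->header_csum = 0;
    bool ok = crc32c_sw(0, hdr, sizeof(*hdr)) == stored;
    hdr->header_csum = stored;
    return ok;
}
```

Restoring the stored value after verification keeps the in-memory header identical to what is on disk, which matters if the same buffer is written back later.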
@@ -202,7 +161,7 @@ resume_2:
data->iov = { bufs[i].buf, bufs[i].size };
data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->dsk.read_meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
else
{
// Fill metadata with zeroes
@@ -259,7 +218,7 @@ resume_2:
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.read_meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
submitted++;
resume_5:
if (submitted > 0)
@@ -320,22 +279,12 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
for (uint64_t i = 0; i < max_i; i++)
{
clean_disk_entry *entry = (clean_disk_entry*)(buf + i*bs->dsk.clean_entry_size);
if (!bs->inmemory_meta && bs->dsk.clean_entry_bitmap_size)
{
memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->dsk.clean_entry_bitmap_size, &entry->bitmap, 2*bs->dsk.clean_entry_bitmap_size);
}
if (entry->oid.inode > 0)
{
if (bs->dsk.meta_format >= BLOCKSTORE_META_FORMAT_V2)
{
// Check entry crc32
uint32_t *entry_csum = (uint32_t*)((uint8_t*)entry + bs->dsk.clean_entry_size - 4);
if (*entry_csum != crc32c(0, entry, bs->dsk.clean_entry_size - 4))
{
printf("Metadata entry %lu is corrupt (checksum mismatch), skipping\n", done_cnt+i);
continue;
}
}
if (!bs->inmemory_meta && bs->dsk.clean_entry_bitmap_size)
{
memcpy(bs->clean_bitmaps + (done_cnt+i) * 2 * bs->dsk.clean_entry_bitmap_size, &entry->bitmap, 2 * bs->dsk.clean_entry_bitmap_size);
}
auto & clean_db = bs->clean_db_shard(entry->oid);
auto clean_it = clean_db.find(entry->oid);
if (clean_it == clean_db.end() || clean_it->second.version < entry->version)
@@ -467,7 +416,7 @@ int blockstore_init_journal::loop()
data = ((ring_data_t*)sqe->user_data);
data->iov = { submitted_buf, bs->journal.block_size };
data->callback = simple_callback;
my_uring_prep_readv(sqe, bs->dsk.read_journal_fd, &data->iov, 1, bs->journal.offset);
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
bs->ringloop->submit();
wait_count = 1;
resume_1:
@@ -491,9 +440,7 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = bs->journal.block_size,
.version = JOURNAL_VERSION_V2,
.data_csum_type = bs->dsk.data_csum_type,
.csum_block_size = bs->dsk.csum_block_size,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf);
if (bs->readonly)
@@ -545,36 +492,18 @@ resume_1:
if (je_start->magic != JOURNAL_MAGIC ||
je_start->type != JE_START ||
je_crc32((journal_entry*)je_start) != je_start->crc32 ||
je_start->size != JE_START_V0_SIZE && je_start->size != JE_START_V1_SIZE && je_start->size != JE_START_V2_SIZE)
je_start->size != sizeof(journal_entry_start) && je_start->size != JE_START_LEGACY_SIZE)
{
// Entry is corrupt
fprintf(stderr, "First entry of the journal is corrupt or unsupported\n");
fprintf(stderr, "First entry of the journal is corrupt\n");
exit(1);
}
if (je_start->size == JE_START_V0_SIZE ||
(je_start->version != JOURNAL_VERSION_V1 || je_start->size != JE_START_V1_SIZE) &&
(je_start->version != JOURNAL_VERSION_V2 || je_start->size != JE_START_V2_SIZE))
if (je_start->size == JE_START_LEGACY_SIZE || je_start->version != JOURNAL_VERSION)
{
fprintf(
stderr, "The code only supports journal versions 2 and 1, but it is %lu on disk."
" Please use vitastor-disk to rewrite the journal\n",
je_start->size == JE_START_V0_SIZE ? 0 : je_start->version
);
exit(1);
}
if (je_start->version == JOURNAL_VERSION_V1)
{
je_start->data_csum_type = 0;
je_start->csum_block_size = 0;
}
if (je_start->data_csum_type != bs->dsk.data_csum_type ||
je_start->csum_block_size != bs->dsk.csum_block_size)
{
printf(
"Configuration stored in journal superblock (data_csum_type=%u, csum_block_size=%u)"
" differs from OSD configuration (%u/%u).\n",
je_start->data_csum_type, je_start->csum_block_size,
bs->dsk.data_csum_type, bs->dsk.csum_block_size
stderr, "The code only supports journal version %d, but it is %lu on disk."
" Please use the previous version to flush the journal before upgrading OSD\n",
JOURNAL_VERSION, je_start->size == JE_START_LEGACY_SIZE ? 0 : je_start->version
);
exit(1);
}
@@ -607,7 +536,7 @@ resume_1:
end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE,
};
data->callback = [this](ring_data_t *data1) { handle_event(data1); };
my_uring_prep_readv(sqe, bs->dsk.read_journal_fd, &data->iov, 1, bs->journal.offset + journal_pos);
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset + journal_pos);
bs->ringloop->submit();
}
while (done.size() > 0)
@@ -776,14 +705,11 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
snprintf(err, 1024, "BUG: calculated journal data offset (%08lx) != stored journal data offset (%08lx)", location, je->small_write.data_offset);
throw std::runtime_error(err);
}
small_write_data.clear();
uint32_t data_crc32 = 0;
if (location >= done_pos && location+je->small_write.len <= done_pos+len)
{
// data is within this buffer
small_write_data.push_back((iovec){
.iov_base = (uint8_t*)buf + location - done_pos,
.iov_len = je->small_write.len,
});
data_crc32 = crc32c(0, (uint8_t*)buf + location - done_pos, je->small_write.len);
}
else
{
@@ -798,10 +724,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
? location+je->small_write.len : done[i].pos+done[i].len);
uint64_t part_begin = (location < done[i].pos ? done[i].pos : location);
covered += part_end - part_begin;
small_write_data.push_back((iovec){
.iov_base = (uint8_t*)done[i].buf + part_begin - done[i].pos,
.iov_len = part_end - part_begin,
});
data_crc32 = crc32c(data_crc32, (uint8_t*)done[i].buf + part_begin - done[i].pos, part_end - part_begin);
}
}
if (covered < je->small_write.len)
@@ -811,102 +734,12 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
return 2;
}
}
bool data_csum_valid = true;
if (!bs->dsk.csum_block_size)
{
uint32_t data_crc32 = 0;
for (auto & sd: small_write_data)
{
data_crc32 = crc32c(data_crc32, sd.iov_base, sd.iov_len);
}
data_csum_valid = data_crc32 == je->small_write.crc32_data;
if (!data_csum_valid)
{
printf(
"Journal entry data is corrupt for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - data crc32 %x != %x\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
data_crc32, je->small_write.crc32_data
);
}
}
else if (je->small_write.len > 0)
{
// FIXME: deduplicate with disk_tool_journal.cpp
// like in enqueue_write()
uint32_t start = je->small_write.offset / bs->dsk.csum_block_size;
uint32_t end = (je->small_write.offset+je->small_write.len-1) / bs->dsk.csum_block_size;
uint32_t data_csum_size = (end-start+1) * (bs->dsk.data_csum_type & 0xFF);
uint32_t required_size = sizeof(journal_entry_small_write) + bs->dsk.clean_entry_bitmap_size + data_csum_size;
if (je->size != required_size)
{
printf(
"Journal entry data has invalid size for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - should be %u bytes but is %u bytes\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
required_size, je->size
);
data_csum_valid = false;
}
else
{
int sd_num = 0;
size_t sd_pos = 0;
uint32_t *block_csums = (uint32_t*)((uint8_t*)je + sizeof(journal_entry_small_write) + bs->dsk.clean_entry_bitmap_size);
for (uint32_t pos = start; pos <= end; pos++, block_csums++)
{
size_t block_left = (pos == start
? (start == end
? je->small_write.len
: bs->dsk.csum_block_size - je->small_write.offset%bs->dsk.csum_block_size)
: (pos < end
? bs->dsk.csum_block_size
: (je->small_write.offset + je->small_write.len)%bs->dsk.csum_block_size));
if (pos > start && pos == end && block_left == 0)
{
// full last block
block_left = bs->dsk.csum_block_size;
}
uint32_t block_crc32 = 0;
while (block_left > 0)
{
assert(sd_num < small_write_data.size());
if (small_write_data[sd_num].iov_len >= sd_pos+block_left)
{
block_crc32 = crc32c(block_crc32, (uint8_t*)small_write_data[sd_num].iov_base+sd_pos, block_left);
sd_pos += block_left;
break;
}
else
{
block_crc32 = crc32c(block_crc32, (uint8_t*)small_write_data[sd_num].iov_base+sd_pos, small_write_data[sd_num].iov_len-sd_pos);
block_left -= (small_write_data[sd_num].iov_len-sd_pos);
sd_pos = 0;
sd_num++;
}
}
if (block_crc32 != *block_csums)
{
printf(
"Journal entry data is corrupt for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - block %u crc32 %x != %x\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
pos, block_crc32, *block_csums
);
data_csum_valid = false;
break;
}
}
}
}
if (!data_csum_valid)
if (data_crc32 != je->small_write.crc32_data)
{
// journal entry is corrupt, stop here
// interesting thing is that we must clear the corrupt entry if we're not readonly,
// because we don't write next entries in the same journal block
printf("Journal entry data is corrupt (data crc32 %x != %x)\n", data_crc32, je->small_write.crc32_data);
memset((uint8_t*)buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
bs->journal.next_free = prev_free;
init_write_buf = (uint8_t*)buf + proc_pos - done_pos;
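The size check earlier in this hunk expects one checksum per checksum block touched by the small write, and the verification loop then walks those blocks, with partial first and last blocks. Below is a minimal sketch of that arithmetic only. The helper names (`csum_blocks_touched`, `bytes_in_block`, `expected_entry_size`) are hypothetical and not part of the codebase, and the sketch assumes `len > 0`, a non-zero `csum_block_size`, and that the per-block checksum size is passed in by the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of checksum blocks touched by the byte
// range [offset, offset+len). Mirrors the start/end computation in the
// size check. Assumes len > 0 and csum_block_size > 0.
static uint32_t csum_blocks_touched(uint32_t offset, uint32_t len, uint32_t csum_block_size)
{
    uint32_t start = offset / csum_block_size;
    uint32_t end = (offset + len - 1) / csum_block_size;
    return end - start + 1;
}

// Hypothetical helper: how many bytes of the write fall into checksum
// block number `pos` (0 if the write does not touch that block).
static uint32_t bytes_in_block(uint32_t offset, uint32_t len, uint32_t csum_block_size, uint32_t pos)
{
    uint32_t blk_begin = pos * csum_block_size;
    uint32_t blk_end = blk_begin + csum_block_size;
    uint32_t from = offset > blk_begin ? offset : blk_begin;
    uint32_t to = offset + len < blk_end ? offset + len : blk_end;
    return to > from ? to - from : 0;
}

// Hypothetical helper: expected entry size = fixed header + bitmap
// + one checksum of csum_size bytes per touched block.
static uint32_t expected_entry_size(uint32_t header_size, uint32_t bitmap_size, uint32_t csum_size,
    uint32_t offset, uint32_t len, uint32_t csum_block_size)
{
    return header_size + bitmap_size + csum_blocks_touched(offset, len, csum_block_size) * csum_size;
}
```

For example, with 4096-byte checksum blocks a write of 8192 bytes at offset 6144 touches blocks 1 to 3 and contributes 2048, 4096 and 2048 bytes to them respectively.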
@@ -922,14 +755,11 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->small_write.oid,
.version = je->small_write.version,
};
uint64_t dyn_size = bs->dsk.dirty_dyn_size(je->small_write.offset, je->small_write.len);
void *dyn = NULL;
void *dyn_from = (uint8_t*)je + sizeof(journal_entry_small_write);
if (!bs->alloc_dyn_data)
void *bmp = NULL;
void *bmp_from = (uint8_t*)je + sizeof(journal_entry_small_write);
if (bs->dsk.clean_entry_bitmap_size <= sizeof(void*))
{
// Bitmap without checksum is only 4 bytes for 128k objects, save it inline
// It can even contain 4 byte bitmap + 4 byte CRC32 for 4 kb writes :)
memcpy(&dyn, dyn_from, dyn_size);
memcpy(&bmp, bmp_from, bs->dsk.clean_entry_bitmap_size);
}
else
{
@@ -937,9 +767,8 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
dyn = malloc_or_die(dyn_size+sizeof(int));
*((int*)dyn) = 1;
memcpy((uint8_t*)dyn+sizeof(int), dyn_from, dyn_size);
bmp = malloc_or_die(bs->dsk.clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->dsk.clean_entry_bitmap_size);
}
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
@@ -948,7 +777,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.offset = je->small_write.offset,
.len = je->small_write.len,
.journal_sector = proc_pos,
.dyn_data = dyn,
.bitmap = bmp,
});
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
@@ -1007,13 +836,11 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->big_write.oid,
.version = je->big_write.version,
};
uint64_t dyn_size = bs->dsk.dirty_dyn_size(je->big_write.offset, je->big_write.len);
void *dyn = NULL;
void *dyn_from = (uint8_t*)je + sizeof(journal_entry_big_write);
if (!bs->alloc_dyn_data)
void *bmp = NULL;
void *bmp_from = (uint8_t*)je + sizeof(journal_entry_big_write);
if (bs->dsk.clean_entry_bitmap_size <= sizeof(void*))
{
// Bitmap without checksum is only 4 bytes for 128k objects, save it inline
memcpy(&dyn, dyn_from, dyn_size);
memcpy(&bmp, bmp_from, bs->dsk.clean_entry_bitmap_size);
}
else
{
@@ -1021,9 +848,8 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
dyn = malloc_or_die(dyn_size+sizeof(int));
*((int*)dyn) = 1;
memcpy((uint8_t*)dyn+sizeof(int), dyn_from, dyn_size);
bmp = malloc_or_die(bs->dsk.clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->dsk.clean_entry_bitmap_size);
}
auto dirty_it = bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
@@ -1032,7 +858,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.offset = je->big_write.offset,
.len = je->big_write.len,
.journal_sector = proc_pos,
.dyn_data = dyn,
.bitmap = bmp,
}).first;
if (bs->data_alloc->get(je->big_write.location >> bs->dsk.block_order))
{


@@ -50,7 +50,6 @@ class blockstore_init_journal
uint64_t next_free;
std::vector<bs_init_journal_done> done;
std::vector<obj_ver_id> double_allocs;
std::vector<iovec> small_write_data;
uint64_t journal_pos = 0;
uint64_t continue_pos = 0;
void *init_write_buf = NULL;


@@ -17,7 +17,6 @@ blockstore_journal_check_t::blockstore_journal_check_t(blockstore_impl_t *bs)
// Check if we can write <required> entries of <size> bytes and <data_after> data bytes after them to the journal
int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries_required, int size, int data_after)
{
uint64_t prev_next = next_sector;
int required = entries_required;
while (1)
{
@@ -36,19 +35,11 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
}
required -= fits;
next_in_pos += fits * size;
if (next_sector != prev_next || !sectors_to_write)
{
// Except the previous call to this function
sectors_to_write++;
}
sectors_to_write++;
}
else if (bs->journal.sector_info[next_sector].dirty)
{
if (next_sector != prev_next || !sectors_to_write)
{
// Except the previous call to this function
sectors_to_write++;
}
sectors_to_write++;
}
if (required <= 0)
{
@@ -298,31 +289,3 @@ void journal_t::dump_diagnostics()
journal_used_it == used_sectors.end() ? 0 : journal_used_it->second
);
}
static uint64_t zero_page[4096];
uint32_t crc32c_pad(uint32_t prev_crc, const void *buf, size_t len, size_t left_pad, size_t right_pad)
{
uint32_t r = prev_crc;
while (left_pad >= 4096)
{
r = crc32c(r, zero_page, 4096);
left_pad -= 4096;
}
if (left_pad > 0)
r = crc32c(r, zero_page, left_pad);
r = crc32c(r, buf, len);
while (right_pad >= 4096)
{
r = crc32c(r, zero_page, 4096);
right_pad -= 4096;
}
if (right_pad > 0)
r = crc32c(r, zero_page, right_pad);
return r;
}
uint32_t crc32c_nopad(uint32_t prev_crc, const void *buf, size_t len, size_t left_pad, size_t right_pad)
{
return crc32c(0, buf, len);
}
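The idea behind `crc32c_pad` is to compute the checksum of a buffer as if it were surrounded by zero bytes, without building the padded buffer. The following is a minimal sketch of that idea, not the project's implementation: `crc32c_sw` is a slow bitwise stand-in for the real `crc32c()`, and `crc32c_zeros`, `crc32c_pad_sketch` and `crc32c_padded_reference` are hypothetical names introduced only for illustration. It relies on the usual chaining property, where feeding consecutive chunks through the function gives the same result as a single call on the concatenation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
// Stand-in for the project's crc32c(); pass 0 as the initial value.
static uint32_t crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = static_cast<const uint8_t*>(buf);
    crc = ~crc;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static const uint8_t zero_chunk[4096] = {};

// Feed n zero bytes into the running CRC, one chunk at a time.
static uint32_t crc32c_zeros(uint32_t crc, size_t n)
{
    while (n > 0)
    {
        size_t cur = n < sizeof(zero_chunk) ? n : sizeof(zero_chunk);
        crc = crc32c_sw(crc, zero_chunk, cur);
        n -= cur;
    }
    return crc;
}

// CRC of: left_pad zero bytes, then buf[0..len), then right_pad zero bytes.
static uint32_t crc32c_pad_sketch(uint32_t prev, const void *buf, size_t len, size_t left_pad, size_t right_pad)
{
    uint32_t r = crc32c_zeros(prev, left_pad);
    r = crc32c_sw(r, buf, len);
    return crc32c_zeros(r, right_pad);
}

// Reference: materialize the padded buffer and checksum it in one call.
static uint32_t crc32c_padded_reference(const void *buf, size_t len, size_t left_pad, size_t right_pad)
{
    std::vector<uint8_t> full(left_pad + len + right_pad, 0);
    if (len)
        std::memcpy(full.data() + left_pad, buf, len);
    return crc32c_sw(0, full.data(), full.size());
}
```

A useful case to check is a non-zero right pad combined with a zero left pad, since the two pads are handled by separate branches.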


@@ -8,8 +8,7 @@
#define MIN_JOURNAL_SIZE 4*1024*1024
#define JOURNAL_MAGIC 0x4A33
#define JOURNAL_VERSION_V1 1
#define JOURNAL_VERSION_V2 2
#define JOURNAL_VERSION 1
#define JOURNAL_BUFFER_SIZE 4*1024*1024
#define JOURNAL_ENTRY_HEADER_SIZE 16
@@ -33,7 +32,7 @@
#define JE_BIG_WRITE_INSTANT 0x08
#define JE_MAX 0x08
// crc32c comes first to ease calculation
// crc32c comes first to ease calculation and is equal to crc32()
struct __attribute__((__packed__)) journal_entry_start
{
uint32_t crc32;
@@ -43,12 +42,8 @@ struct __attribute__((__packed__)) journal_entry_start
uint32_t reserved;
uint64_t journal_start;
uint64_t version;
uint32_t data_csum_type;
uint32_t csum_block_size;
};
#define JE_START_V0_SIZE 24
#define JE_START_V1_SIZE 32
#define JE_START_V2_SIZE 40
#define JE_START_LEGACY_SIZE 24
struct __attribute__((__packed__)) journal_entry_small_write
{
@@ -64,12 +59,10 @@ struct __attribute__((__packed__)) journal_entry_small_write
// small_write entries contain <len> bytes of data which is stored in next sectors
// data_offset is its offset within journal
uint64_t data_offset;
uint32_t crc32_data; // zero when data_csum_type != 0
uint32_t crc32_data;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
// and then data checksums if data_csum_type != 0
// uint32_t data_crc32c[];
};
struct __attribute__((__packed__)) journal_entry_big_write
@@ -87,8 +80,6 @@ struct __attribute__((__packed__)) journal_entry_big_write
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
// and then data checksums if data_csum_type != 0
// uint32_t data_crc32c[];
};
struct __attribute__((__packed__)) journal_entry_stable
@@ -227,6 +218,3 @@ struct blockstore_journal_check_t
};
journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size);
uint32_t crc32c_pad(uint32_t prev_crc, const void *buf, size_t len, size_t left_pad, size_t right_pad);
uint32_t crc32c_nopad(uint32_t prev_crc, const void *buf, size_t len, size_t left_pad, size_t right_pad);


@@ -133,24 +133,19 @@ void blockstore_impl_t::calc_lengths()
{
metadata_buffer = memalign(MEM_ALIGNMENT, dsk.meta_len);
if (!metadata_buffer)
throw std::runtime_error("Failed to allocate memory for the metadata ("+std::to_string(dsk.meta_len/1024/1024)+" MB)");
throw std::runtime_error("Failed to allocate memory for the metadata");
}
else if (dsk.clean_entry_bitmap_size || dsk.data_csum_type)
else if (dsk.clean_entry_bitmap_size)
{
clean_bitmaps = (uint8_t*)malloc(dsk.block_count * 2 * dsk.clean_entry_bitmap_size);
if (!clean_bitmaps)
{
throw std::runtime_error(
"Failed to allocate memory for the metadata sparse write bitmap ("+
std::to_string(dsk.block_count * 2 * dsk.clean_entry_bitmap_size / 1024 / 1024)+" MB)"
);
}
clean_bitmap = (uint8_t*)malloc(dsk.block_count * 2*dsk.clean_entry_bitmap_size);
if (!clean_bitmap)
throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap");
}
if (journal.inmemory)
{
journal.buffer = memalign(MEM_ALIGNMENT, journal.len);
if (!journal.buffer)
throw std::runtime_error("Failed to allocate memory for journal ("+std::to_string(journal.len/1024/1024)+" MB)");
throw std::runtime_error("Failed to allocate memory for journal");
}
else
{


@@ -1,7 +1,6 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <limits.h>
#include "blockstore_impl.h"
int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_t offset, uint64_t len,
@@ -9,7 +8,12 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
{
if (!len)
{
// Zero-length read
// Zero-length version - skip
return 1;
}
else if (IS_IN_FLIGHT(item_state))
{
// Write not finished yet - skip
return 1;
}
else if (IS_DELETE(item_state))
@@ -18,7 +22,6 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
memset(buf, 0, len);
return 1;
}
assert(!IS_IN_FLIGHT(item_state));
if (journal.inmemory && IS_JOURNAL(item_state))
{
memcpy(buf, (uint8_t*)journal.buffer + offset, len);
@@ -29,7 +32,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
PRIV(op)->pending_ops++;
my_uring_prep_readv(
sqe,
IS_JOURNAL(item_state) ? dsk.read_journal_fd : dsk.read_data_fd,
IS_JOURNAL(item_state) ? dsk.journal_fd : dsk.data_fd,
&data->iov, 1,
(IS_JOURNAL(item_state) ? dsk.journal_offset : dsk.data_offset) + offset
);
@@ -37,115 +40,59 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
return 1;
}
void blockstore_impl_t::find_holes(std::vector<copy_buffer_t> & read_vec,
uint32_t item_start, uint32_t item_end,
std::function<int(int, bool, uint32_t, uint32_t)> callback)
// FIXME I've seen a bug here so I want some tests
int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfilled, uint32_t item_start, uint32_t item_end,
uint32_t item_state, uint64_t item_version, uint64_t item_location, uint64_t journal_sector)
{
auto cur_start = item_start;
int i = 0;
while (cur_start < item_end)
uint32_t cur_start = item_start;
if (cur_start < read_op->offset + read_op->len && item_end > read_op->offset)
{
// COPY_BUF_CSUM_FILL items are fake items inserted in the end, their offsets aren't in order
if (i >= read_vec.size() || read_vec[i].copy_flags & COPY_BUF_CSUM_FILL || read_vec[i].offset >= item_end)
cur_start = cur_start < read_op->offset ? read_op->offset : cur_start;
item_end = item_end > read_op->offset + read_op->len ? read_op->offset + read_op->len : item_end;
auto it = PRIV(read_op)->read_vec.begin();
while (1)
{
// Hole (at end): cur_start .. item_end
i += callback(i, false, cur_start, item_end);
break;
}
else if (read_vec[i].offset > cur_start)
{
// Hole: cur_start .. min(read_vec[i].offset, item_end)
auto cur_end = read_vec[i].offset > item_end ? item_end : read_vec[i].offset;
i += callback(i, false, cur_start, cur_end);
cur_start = cur_end;
}
else if (read_vec[i].offset + read_vec[i].len > cur_start)
{
// Allocated: cur_start .. min(read_vec[i].offset + read_vec[i].len, item_end)
auto cur_end = read_vec[i].offset + read_vec[i].len;
cur_end = cur_end > item_end ? item_end : cur_end;
i += callback(i, true, cur_start, cur_end);
cur_start = cur_end;
i++;
}
else
i++;
}
}
int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op,
uint64_t &fulfilled, uint32_t item_start, uint32_t item_end, // FIXME: Rename item_* to dirty_*
uint32_t item_state, uint64_t item_version, uint64_t item_location,
uint64_t journal_sector, uint8_t *csum, int *dyn_data)
{
int r = 1;
if (item_start < read_op->offset + read_op->len && item_end > read_op->offset)
{
auto & rv = PRIV(read_op)->read_vec;
auto rd_start = item_start < read_op->offset ? read_op->offset : item_start;
auto rd_end = item_end > read_op->offset + read_op->len ? read_op->offset + read_op->len : item_end;
find_holes(rv, rd_start, rd_end, [&](int pos, bool alloc, uint32_t start, uint32_t end)
{
if (!r || alloc)
return 0;
if (!journal.inmemory && dsk.csum_block_size > dsk.bitmap_granularity && IS_JOURNAL(item_state) && !IS_DELETE(item_state))
for (; it != PRIV(read_op)->read_vec.end(); it++)
{
uint32_t blk_begin = (start/dsk.csum_block_size) * dsk.csum_block_size;
blk_begin = blk_begin < item_start ? item_start : blk_begin;
uint32_t blk_end = ((end-1) / dsk.csum_block_size + 1) * dsk.csum_block_size;
blk_end = blk_end > item_end ? item_end : blk_end;
rv.push_back((copy_buffer_t){
.copy_flags = COPY_BUF_JOURNAL|COPY_BUF_CSUM_FILL,
.offset = blk_begin,
.len = blk_end-blk_begin,
.csum_buf = (csum + (blk_begin/dsk.csum_block_size -
item_start/dsk.csum_block_size) * (dsk.data_csum_type & 0xFF)),
.dyn_data = dyn_data,
});
if (dyn_data)
if (it->offset >= cur_start)
{
(*dyn_data)++;
break;
}
// Submit the journal checksum block read
if (!read_checksum_block(read_op, 1, fulfilled, item_location - item_start))
else if (it->offset + it->len > cur_start)
{
r = 0;
cur_start = it->offset + it->len;
if (cur_start >= item_end)
{
goto endwhile;
}
}
return 0;
}
copy_buffer_t el = {
.copy_flags = (IS_JOURNAL(item_state) ? COPY_BUF_JOURNAL : COPY_BUF_DATA),
.offset = start,
.len = end-start,
.disk_offset = item_location + start - item_start,
.journal_sector = (IS_JOURNAL(item_state) ? journal_sector : 0),
.csum_buf = !csum ? NULL : (csum + (start - item_start) / dsk.csum_block_size * (dsk.data_csum_type & 0xFF)),
.dyn_data = dyn_data,
};
if (dyn_data)
if (it == PRIV(read_op)->read_vec.end() || it->offset > cur_start)
{
(*dyn_data)++;
fulfill_read_t el = {
.offset = cur_start,
.len = it == PRIV(read_op)->read_vec.end() || it->offset >= item_end ? item_end-cur_start : it->offset-cur_start,
.journal_sector = journal_sector,
};
it = PRIV(read_op)->read_vec.insert(it, el);
if (!fulfill_read_push(read_op,
(uint8_t*)read_op->buf + el.offset - read_op->offset,
item_location + el.offset - item_start,
el.len, item_state, item_version))
{
return 0;
}
fulfilled += el.len;
}
if (IS_BIG_WRITE(item_state))
cur_start = it->offset + it->len;
if (it == PRIV(read_op)->read_vec.end() || cur_start >= item_end)
{
// If we don't track it then we may IN THEORY read another object's data:
// submit read -> remove the object -> flush remove -> overwrite with another object -> finish read
// Very improbable, but possible
PRIV(read_op)->clean_block_used = 1;
break;
}
rv.insert(rv.begin() + pos, el);
fulfilled += el.len;
if (!fulfill_read_push(read_op,
(uint8_t*)read_op->buf + el.offset - read_op->offset,
item_location + el.offset - item_start,
el.len, item_state, item_version))
{
r = 0;
}
return 1;
});
}
}
return r;
endwhile:
return 1;
}
uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offset)
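The `find_holes` variant of the read path walks a sorted list of already fulfilled extents and reports, for a requested range, which parts are covered and which are holes. Here is a simplified sketch of that walk. It is an illustration under assumptions: `extent_t`, `piece_t` and `split_by_extents` are hypothetical names, the extent list is assumed sorted by offset and non-overlapping, and the result is returned as a vector instead of being passed to a callback that may insert into the list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct extent_t { uint32_t offset, len; };           // already fulfilled range
struct piece_t { bool alloc; uint32_t start, end; }; // covered piece or hole

// Split [start, end) into pieces covered by `filled` and holes between them.
// `filled` must be sorted by offset and must not contain overlapping extents.
static std::vector<piece_t> split_by_extents(const std::vector<extent_t> & filled, uint32_t start, uint32_t end)
{
    std::vector<piece_t> out;
    uint32_t cur = start;
    size_t i = 0;
    while (cur < end)
    {
        if (i >= filled.size() || filled[i].offset >= end)
        {
            // Hole up to the end of the requested range
            out.push_back({ false, cur, end });
            break;
        }
        else if (filled[i].offset > cur)
        {
            // Hole before the next extent
            out.push_back({ false, cur, filled[i].offset });
            cur = filled[i].offset;
        }
        else if (filled[i].offset + filled[i].len > cur)
        {
            // Covered part, clipped to the requested range
            uint32_t e = filled[i].offset + filled[i].len;
            if (e > end)
                e = end;
            out.push_back({ true, cur, e });
            cur = e;
            i++;
        }
        else
        {
            // Extent lies entirely before cur, skip it
            i++;
        }
    }
    return out;
}
```

With extents 10..20 and 30..35, the range 0..40 splits into five pieces: hole, covered, hole, covered, hole.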
@@ -159,225 +106,10 @@ uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offse
clean_entry_bitmap = ((uint8_t*)metadata_buffer + sector + pos*dsk.clean_entry_size + sizeof(clean_disk_entry) + offset);
}
else
clean_entry_bitmap = (uint8_t*)(clean_bitmaps + meta_loc*2*dsk.clean_entry_bitmap_size + offset);
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*dsk.clean_entry_bitmap_size + offset);
return clean_entry_bitmap;
}
int blockstore_impl_t::fill_partial_checksum_blocks(std::vector<copy_buffer_t> & rv, uint64_t & fulfilled,
uint8_t *clean_entry_bitmap, int *dyn_data, bool from_journal, uint8_t *read_buf, uint64_t read_offset, uint64_t read_end)
{
if (read_end == read_offset)
return 0;
int required = 0;
read_buf -= read_offset;
uint32_t last_block = (read_end-1)/dsk.csum_block_size;
uint32_t start_block = read_offset/dsk.csum_block_size;
uint32_t end_block = 0;
while (start_block <= last_block)
{
if (read_range_fulfilled(rv, fulfilled, read_buf, clean_entry_bitmap,
start_block*dsk.csum_block_size < read_offset ? read_offset : start_block*dsk.csum_block_size,
(start_block+1)*dsk.csum_block_size > read_end ? read_end : (start_block+1)*dsk.csum_block_size))
{
// read_range_fulfilled() also adds zero-filled areas
start_block++;
}
else
{
// Find a sequence of checksum blocks required to be read
end_block = start_block;
while ((end_block+1)*dsk.csum_block_size < read_end &&
!read_range_fulfilled(rv, fulfilled, read_buf, clean_entry_bitmap,
(end_block+1)*dsk.csum_block_size < read_offset ? read_offset : (end_block+1)*dsk.csum_block_size,
(end_block+2)*dsk.csum_block_size > read_end ? read_end : (end_block+2)*dsk.csum_block_size))
{
end_block++;
}
end_block++;
// OK, mark this range as required
rv.push_back((copy_buffer_t){
.copy_flags = COPY_BUF_CSUM_FILL | (from_journal ? COPY_BUF_JOURNALED_BIG : 0),
.offset = start_block*dsk.csum_block_size,
.len = (end_block-start_block)*dsk.csum_block_size,
// save clean_entry_bitmap if we're reading clean data from the journal
.csum_buf = from_journal ? clean_entry_bitmap : NULL,
.dyn_data = dyn_data,
});
if (dyn_data)
{
(*dyn_data)++;
}
start_block = end_block;
required++;
}
}
return required;
}
// read_buf should be == op->buf - op->offset
bool blockstore_impl_t::read_range_fulfilled(std::vector<copy_buffer_t> & rv, uint64_t & fulfilled, uint8_t *read_buf,
uint8_t *clean_entry_bitmap, uint32_t item_start, uint32_t item_end)
{
bool all_done = true;
find_holes(rv, item_start, item_end, [&](int pos, bool alloc, uint32_t cur_start, uint32_t cur_end)
{
if (alloc)
return 0;
int diff = 0;
uint32_t bmp_start = cur_start/dsk.bitmap_granularity;
uint32_t bmp_end = cur_end/dsk.bitmap_granularity;
uint32_t bmp_pos = bmp_start;
while (bmp_pos < bmp_end)
{
while (bmp_pos < bmp_end && !(clean_entry_bitmap[bmp_pos >> 3] & (1 << (bmp_pos & 0x7))))
bmp_pos++;
if (bmp_pos > bmp_start)
{
// zero fill
copy_buffer_t el = {
.copy_flags = COPY_BUF_ZERO,
.offset = bmp_start*dsk.bitmap_granularity,
.len = (bmp_pos-bmp_start)*dsk.bitmap_granularity,
};
rv.insert(rv.begin() + pos, el);
if (read_buf)
memset(read_buf + el.offset, 0, el.len);
fulfilled += el.len;
diff++;
}
bmp_start = bmp_pos;
while (bmp_pos < bmp_end && (clean_entry_bitmap[bmp_pos >> 3] & (1 << (bmp_pos & 0x7))))
bmp_pos++;
if (bmp_pos > bmp_start)
{
// something is to be read
all_done = false;
}
bmp_start = bmp_pos;
}
return diff;
});
return all_done;
}
bool blockstore_impl_t::read_checksum_block(blockstore_op_t *op, int rv_pos, uint64_t &fulfilled, uint64_t clean_loc)
{
auto & rv = PRIV(op)->read_vec;
auto *vi = &rv[rv.size()-rv_pos];
uint32_t item_start = vi->offset, item_end = vi->offset+vi->len;
uint32_t fill_size = 0;
int n_iov = 0;
find_holes(rv, item_start, item_end, [&](int pos, bool alloc, uint32_t cur_start, uint32_t cur_end)
{
if (alloc)
{
fill_size += cur_end-cur_start;
n_iov++;
}
else
{
if (cur_start < op->offset)
{
fill_size += op->offset-cur_start;
n_iov++;
cur_start = op->offset;
}
if (cur_end > op->offset+op->len)
{
fill_size += cur_end-(op->offset+op->len);
n_iov++;
cur_end = op->offset+op->len;
}
if (cur_end > cur_start)
{
n_iov++;
}
}
return 0;
});
void *buf = memalign_or_die(MEM_ALIGNMENT, fill_size + n_iov*sizeof(struct iovec));
iovec *iov = (struct iovec*)((uint8_t*)buf+fill_size);
n_iov = 0;
fill_size = 0;
find_holes(rv, item_start, item_end, [&](int pos, bool alloc, uint32_t cur_start, uint32_t cur_end)
{
int res = 0;
if (alloc)
{
iov[n_iov++] = (struct iovec){ (uint8_t*)buf+fill_size, cur_end-cur_start };
fill_size += cur_end-cur_start;
}
else
{
if (cur_start < op->offset)
{
iov[n_iov++] = (struct iovec){ (uint8_t*)buf+fill_size, op->offset-cur_start };
fill_size += op->offset-cur_start;
cur_start = op->offset;
}
auto lim_end = cur_end > op->offset+op->len ? op->offset+op->len : cur_end;
if (lim_end > cur_start)
{
iov[n_iov++] = (struct iovec){ (uint8_t*)op->buf+cur_start-op->offset, lim_end-cur_start };
rv.insert(rv.begin() + pos, (copy_buffer_t){
.copy_flags = COPY_BUF_DATA,
.offset = cur_start,
.len = lim_end-cur_start,
});
fulfilled += lim_end-cur_start;
res++;
}
if (cur_end > op->offset+op->len)
{
iov[n_iov++] = (struct iovec){ (uint8_t*)buf+fill_size, cur_end - (op->offset+op->len) };
fill_size += cur_end - (op->offset+op->len);
cur_end = op->offset+op->len;
}
}
return res;
});
vi = &rv[rv.size()-rv_pos];
// Save buf into read_vec too, packing n_iov and fill_size into the len field
// FIXME: This is a hack, a cleaner way to pass buf and the iovec count is needed
*vi = (copy_buffer_t){
.copy_flags = vi->copy_flags,
.offset = vi->offset,
.len = ((uint64_t)n_iov << 32) | fill_size,
.disk_offset = clean_loc + item_start,
.buf = (uint8_t*)buf,
.csum_buf = vi->csum_buf,
.dyn_data = vi->dyn_data,
};
int submit_fd = (vi->copy_flags & COPY_BUF_JOURNAL ? dsk.read_journal_fd : dsk.read_data_fd);
uint64_t submit_offset = (vi->copy_flags & COPY_BUF_JOURNAL ? journal.offset : dsk.data_offset);
uint32_t d_pos = 0;
for (int n_pos = 0; n_pos < n_iov; n_pos += IOV_MAX)
{
int n_cur = n_iov-n_pos < IOV_MAX ? n_iov-n_pos : IOV_MAX;
BS_SUBMIT_GET_SQE(sqe, data);
PRIV(op)->pending_ops++;
my_uring_prep_readv(sqe, submit_fd, iov + n_pos, n_cur, submit_offset + clean_loc + item_start + d_pos);
data->callback = [this, op](ring_data_t *data) { handle_read_event(data, op); };
if (n_pos > 0 || n_pos + IOV_MAX < n_iov)
{
uint32_t d_len = 0;
for (int i = 0; i < IOV_MAX; i++)
d_len += iov[n_pos+i].iov_len;
data->iov.iov_len = d_len;
d_pos += d_len;
}
else
data->iov.iov_len = item_end-item_start;
}
if (!(vi->copy_flags & COPY_BUF_JOURNAL))
{
// Reads running parallel to flushes of the same clean block may read
// a mixture of old and new data. So we don't verify checksums for such blocks.
PRIV(op)->clean_block_used = 1;
}
return true;
}
int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{
auto & clean_db = clean_db_shard(read_op->oid);
@@ -399,8 +131,6 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
}
uint64_t fulfilled = 0;
PRIV(read_op)->pending_ops = 0;
PRIV(read_op)->clean_block_used = 0;
auto & rv = PRIV(read_op)->read_vec;
uint64_t result_version = 0;
if (dirty_found)
{
@@ -418,36 +148,23 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
FINISH_OP(read_op);
return 2;
}
int *dyn_data = (int*)(dsk.csum_block_size > 0 && alloc_dyn_data ? dirty.dyn_data : NULL);
uint8_t *bmp_ptr = (alloc_dyn_data
? (uint8_t*)dirty.dyn_data + sizeof(int) : (uint8_t*)&dirty.dyn_data);
if (!result_version)
{
result_version = dirty_it->first.version;
if (read_op->bitmap)
{
void *bmp_ptr = (dsk.clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(read_op->bitmap, bmp_ptr, dsk.clean_entry_bitmap_size);
}
}
// If inmemory_journal is false, journal trim will have to wait until the read is completed
if (!IS_JOURNAL(dirty.state))
if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset),
(IS_JOURNAL(dirty.state) ? dirty.journal_sector+1 : 0)))
{
// Read from data disk, possibly checking checksums
if (!fulfill_clean_read(read_op, fulfilled, bmp_ptr, dyn_data,
dirty.offset, dirty.offset+dirty.len, dirty.location, dirty_it->first.version))
{
goto undo_read;
}
}
else
{
// Copy from memory or read from journal, possibly checking checksums
if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
dirty.state, dirty_it->first.version, dirty.location, dirty.journal_sector+1,
journal.inmemory ? NULL : bmp_ptr+dsk.clean_entry_bitmap_size, dyn_data))
{
goto undo_read;
}
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
}
}
if (fulfilled == read_op->len || dirty_it == dirty_db.begin())
@@ -470,10 +187,50 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
}
if (fulfilled < read_op->len)
{
if (!fulfill_clean_read(read_op, fulfilled, NULL, NULL, 0, dsk.data_block_size,
clean_it->second.location, clean_it->second.version))
if (!dsk.clean_entry_bitmap_size)
{
goto undo_read;
if (!fulfill_read(read_op, fulfilled, 0, dsk.data_block_size,
(BS_ST_BIG_WRITE | BS_ST_STABLE), 0, clean_it->second.location, 0))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
}
}
else
{
uint8_t *clean_entry_bitmap = get_clean_entry_bitmap(clean_it->second.location, 0);
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = dsk.data_block_size/dsk.bitmap_granularity;
while (bmp_start < bmp_size)
{
while (bmp_end < bmp_size && !(clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))))
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
// fill with zeroes
assert(fulfill_read(read_op, fulfilled, bmp_start * dsk.bitmap_granularity,
bmp_end * dsk.bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0, 0));
}
bmp_start = bmp_end;
while (bmp_end < bmp_size && (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))))
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
if (!fulfill_read(read_op, fulfilled, bmp_start * dsk.bitmap_granularity,
bmp_end * dsk.bitmap_granularity, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0,
clean_it->second.location + bmp_start * dsk.bitmap_granularity, 0))
{
// need to wait. undo added requests, don't dequeue op
PRIV(read_op)->read_vec.clear();
return 0;
}
bmp_start = bmp_end;
}
}
}
}
}
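The clean read path in this hunk alternates between runs of clear bits (zero-filled) and runs of set bits (read from disk) in the per-object bitmap, where each bit stands for one `bitmap_granularity`-sized piece. Below is a minimal sketch of that run scan in isolation. The names `run_t`, `bit_at` and `bitmap_runs` are hypothetical, bits are taken LSB-first within each byte as in the code above, and the bounds check is done before each bit is read.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One run of equal bits, in bitmap positions: [start, end)
struct run_t { bool set; uint64_t start, end; };

// Read bit `pos` of the bitmap, LSB-first within each byte.
static bool bit_at(const uint8_t *bmp, uint64_t pos)
{
    return (bmp[pos >> 3] >> (pos & 0x7)) & 1;
}

// Split the first nbits bits of the bitmap into maximal runs of equal bits.
// A caller would zero-fill the clear runs and read the set runs from disk.
static std::vector<run_t> bitmap_runs(const uint8_t *bmp, uint64_t nbits)
{
    std::vector<run_t> runs;
    uint64_t pos = 0;
    while (pos < nbits)
    {
        uint64_t start = pos;
        bool v = bit_at(bmp, pos);
        // Check the bound first so the bitmap is never read past nbits
        while (pos < nbits && bit_at(bmp, pos) == v)
            pos++;
        runs.push_back({ v, start, pos });
    }
    return runs;
}
```

Multiplying `start` and `end` of a run by the bitmap granularity gives the byte range inside the object.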
@@ -485,7 +242,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
FINISH_OP(read_op);
return 2;
}
assert(fulfilled == read_op->len);
if (fulfilled < read_op->len)
{
assert(fulfill_read(read_op, fulfilled, 0, dsk.data_block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0, 0));
assert(fulfilled == read_op->len);
}
read_op->version = result_version;
if (!PRIV(read_op)->pending_ops)
{
@@ -510,309 +271,6 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
}
read_op->retval = 0;
return 2;
undo_read:
// need to wait. undo added requests, don't dequeue op
if (dsk.csum_block_size > dsk.bitmap_granularity)
{
for (auto & vec: rv)
{
if ((vec.copy_flags & COPY_BUF_CSUM_FILL) && vec.buf)
{
free(vec.buf);
vec.buf = NULL;
}
if (vec.dyn_data && --(*vec.dyn_data) == 0) // refcount
{
free(vec.dyn_data);
vec.dyn_data = NULL;
}
}
}
rv.clear();
return 0;
}
int blockstore_impl_t::pad_journal_read(std::vector<copy_buffer_t> & rv, copy_buffer_t & cp,
// FIXME Passing dirty_entry& would be nicer
uint64_t dirty_offset, uint64_t dirty_end, uint64_t dirty_loc, uint8_t *csum_ptr, int *dyn_data,
uint64_t offset, uint64_t submit_len, uint64_t & blk_begin, uint64_t & blk_end, uint8_t* & blk_buf)
{
if (offset % dsk.csum_block_size || submit_len % dsk.csum_block_size)
{
if (offset < blk_end)
{
// Already being read as a part of the previous checksum block series
cp.buf = blk_buf + offset - blk_begin;
cp.copy_flags |= COPY_BUF_COALESCED;
if (offset+submit_len > blk_end)
cp.len = blk_end-offset;
return 2;
}
else
{
// We don't use fill_partial_checksum_blocks for journal because journal writes never have holes (internal bitmap)
blk_begin = (offset/dsk.csum_block_size) * dsk.csum_block_size;
blk_begin = blk_begin < dirty_offset ? dirty_offset : blk_begin;
blk_end = ((offset+submit_len-1)/dsk.csum_block_size + 1) * dsk.csum_block_size;
blk_end = blk_end > dirty_end ? dirty_end : blk_end;
if (blk_begin < offset || blk_end > offset+submit_len)
{
blk_buf = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, blk_end-blk_begin);
cp.buf = blk_buf + offset - blk_begin;
cp.copy_flags |= COPY_BUF_COALESCED;
rv.push_back((copy_buffer_t){
.copy_flags = COPY_BUF_JOURNAL|COPY_BUF_CSUM_FILL,
.offset = blk_begin,
.len = blk_end-blk_begin,
.disk_offset = dirty_loc + blk_begin - dirty_offset,
.buf = blk_buf,
.csum_buf = (csum_ptr + (blk_begin/dsk.csum_block_size -
dirty_offset/dsk.csum_block_size) * (dsk.data_csum_type & 0xFF)),
.dyn_data = dyn_data,
});
if (dyn_data)
{
(*dyn_data)++;
}
return 1;
}
}
}
return 0;
}
bool blockstore_impl_t::fulfill_clean_read(blockstore_op_t *read_op, uint64_t & fulfilled,
uint8_t *clean_entry_bitmap, int *dyn_data, uint32_t item_start, uint32_t item_end, uint64_t clean_loc, uint64_t clean_ver)
{
bool from_journal = clean_entry_bitmap != NULL;
if (!clean_entry_bitmap)
{
// NULL clean_entry_bitmap means we're reading from data, not from the journal,
// and the bitmap location is obvious
clean_entry_bitmap = get_clean_entry_bitmap(clean_loc, 0);
}
if (dsk.csum_block_size > dsk.bitmap_granularity)
{
auto & rv = PRIV(read_op)->read_vec;
int req = fill_partial_checksum_blocks(rv, fulfilled, clean_entry_bitmap, dyn_data, from_journal,
(uint8_t*)read_op->buf, read_op->offset, read_op->offset+read_op->len);
if (!inmemory_meta && !from_journal && req > 0)
{
// Read checksums from disk
uint8_t *csum_buf = read_clean_meta_block(read_op, clean_loc, rv.size()-req);
for (int i = req; i > 0; i--)
{
rv[rv.size()-i].csum_buf = csum_buf;
}
}
for (int i = req; i > 0; i--)
{
if (!read_checksum_block(read_op, i, fulfilled, clean_loc))
{
return false;
}
}
PRIV(read_op)->clean_block_used = req > 0;
}
else if (from_journal)
{
// Don't scan bitmap - journal writes don't have holes (internal bitmap)!
uint8_t *csum = !dsk.csum_block_size ? 0 : (clean_entry_bitmap + dsk.clean_entry_bitmap_size +
item_start/dsk.csum_block_size*(dsk.data_csum_type & 0xFF));
if (!fulfill_read(read_op, fulfilled, item_start, item_end,
(BS_ST_BIG_WRITE | BS_ST_STABLE), 0, clean_loc + item_start, 0, csum, dyn_data))
{
return false;
}
if (item_start > 0 && fulfilled < read_op->len)
{
// fill with zeroes
assert(fulfill_read(read_op, fulfilled, 0, item_start, (BS_ST_DELETE | BS_ST_STABLE), 0, 0, 0, NULL, NULL));
}
if (item_end < dsk.data_block_size && fulfilled < read_op->len)
{
// fill with zeroes
assert(fulfill_read(read_op, fulfilled, item_end, dsk.data_block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0, 0, NULL, NULL));
}
}
else
{
bool csum_done = !dsk.csum_block_size || inmemory_meta;
uint8_t *csum_buf = clean_entry_bitmap;
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = dsk.data_block_size/dsk.bitmap_granularity;
while (bmp_start < bmp_size)
{
while (bmp_end < bmp_size && !(clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))))
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
// fill with zeroes
assert(fulfill_read(read_op, fulfilled, bmp_start * dsk.bitmap_granularity,
bmp_end * dsk.bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0, 0, NULL, NULL));
}
bmp_start = bmp_end;
while (bmp_end < bmp_size && (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7))))
{
bmp_end++;
}
if (bmp_end > bmp_start)
{
if (!csum_done)
{
// Read checksums from disk
csum_buf = read_clean_meta_block(read_op, clean_loc, PRIV(read_op)->read_vec.size());
csum_done = true;
}
uint8_t *csum = !dsk.csum_block_size ? 0 : (csum_buf + 2*dsk.clean_entry_bitmap_size + bmp_start*(dsk.data_csum_type & 0xFF));
if (!fulfill_read(read_op, fulfilled, bmp_start * dsk.bitmap_granularity,
bmp_end * dsk.bitmap_granularity, (BS_ST_BIG_WRITE | BS_ST_STABLE), 0,
clean_loc + bmp_start * dsk.bitmap_granularity, 0, csum, dyn_data))
{
return false;
}
bmp_start = bmp_end;
}
}
}
// Increment reference counter if clean data is being read from the disk
if (PRIV(read_op)->clean_block_used)
{
auto & uo = used_clean_objects[clean_loc];
uo.refs++;
if (dsk.csum_block_size && flusher->is_mutated(clean_loc))
uo.was_changed = true;
PRIV(read_op)->clean_block_used = clean_loc;
}
return true;
}
uint8_t* blockstore_impl_t::read_clean_meta_block(blockstore_op_t *op, uint64_t clean_loc, int rv_pos)
{
auto & rv = PRIV(op)->read_vec;
auto sector = ((clean_loc >> dsk.block_order) / (dsk.meta_block_size / dsk.clean_entry_size)) * dsk.meta_block_size;
auto pos = ((clean_loc >> dsk.block_order) % (dsk.meta_block_size / dsk.clean_entry_size)) * dsk.clean_entry_size;
uint8_t *buf = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, dsk.meta_block_size);
rv.insert(rv.begin()+rv_pos, (copy_buffer_t){
.copy_flags = COPY_BUF_META_BLOCK|COPY_BUF_CSUM_FILL,
.offset = pos,
.buf = buf,
});
BS_SUBMIT_GET_SQE(sqe, data);
data->iov = (struct iovec){ buf, dsk.meta_block_size };
PRIV(op)->pending_ops++;
my_uring_prep_readv(sqe, dsk.read_meta_fd, &data->iov, 1, dsk.meta_offset + dsk.meta_block_size + sector);
data->callback = [this, op](ring_data_t *data) { handle_read_event(data, op); };
// return pointer to checksums + bitmap
return buf + pos + sizeof(clean_disk_entry);
}
bool blockstore_impl_t::verify_padded_checksums(uint8_t *clean_entry_bitmap, uint8_t *csum_buf, uint32_t offset,
iovec *iov, int n_iov, std::function<void(uint32_t, uint32_t, uint32_t)> bad_block_cb)
{
assert(!(offset % dsk.csum_block_size));
uint32_t *csums = (uint32_t*)csum_buf;
uint32_t block_csum = 0;
uint32_t block_done = 0;
uint32_t block_num = clean_entry_bitmap ? offset/dsk.csum_block_size : 0;
uint32_t bmp_pos = offset/dsk.bitmap_granularity;
for (int i = 0; i < n_iov; i++)
{
uint32_t pos = 0;
while (pos < iov[i].iov_len)
{
uint32_t start = pos;
uint8_t bit = (clean_entry_bitmap[bmp_pos >> 3] >> (bmp_pos & 0x7)) & 1;
while (pos < iov[i].iov_len && ((clean_entry_bitmap[bmp_pos >> 3] >> (bmp_pos & 0x7)) & 1) == bit)
{
pos += dsk.bitmap_granularity;
bmp_pos++;
}
uint32_t len = pos-start;
auto buf = (uint8_t*)iov[i].iov_base+start;
while (block_done+len >= dsk.csum_block_size)
{
auto cur_len = dsk.csum_block_size-block_done;
block_csum = crc32c_pad(block_csum, buf, bit ? cur_len : 0, bit ? 0 : cur_len, 0);
if (block_csum != csums[block_num])
{
if (bad_block_cb)
bad_block_cb(block_num*dsk.csum_block_size, block_csum, csums[block_num]);
else
return false;
}
block_num++;
buf += cur_len;
len -= cur_len;
block_done = block_csum = 0;
}
if (len > 0)
{
block_csum = crc32c_pad(block_csum, buf, bit ? len : 0, bit ? 0 : len, 0);
block_done += len;
}
}
}
assert(!block_done);
return true;
}
bool blockstore_impl_t::verify_journal_checksums(uint8_t *csums, uint32_t offset,
iovec *iov, int n_iov, std::function<void(uint32_t, uint32_t, uint32_t)> bad_block_cb)
{
uint32_t block_csum = 0;
uint32_t block_num = 0;
uint32_t block_done = offset%dsk.csum_block_size;
for (int i = 0; i < n_iov; i++)
{
uint32_t len = iov[i].iov_len;
auto buf = (uint8_t*)iov[i].iov_base;
while (block_done+len >= dsk.csum_block_size)
{
auto cur_len = dsk.csum_block_size-block_done;
block_csum = crc32c(block_csum, buf, cur_len);
if (block_csum != ((uint32_t*)csums)[block_num])
{
if (bad_block_cb)
bad_block_cb(block_num*dsk.csum_block_size, block_csum, ((uint32_t*)csums)[block_num]);
else
return false;
}
block_num++;
buf += cur_len;
len -= cur_len;
block_done = block_csum = 0;
}
if (len > 0)
{
block_csum = crc32c(block_csum, buf, len);
block_done += len;
}
}
if (block_done > 0 && block_csum != ((uint32_t*)csums)[block_num])
{
if (bad_block_cb)
bad_block_cb(block_num*dsk.csum_block_size, block_csum, ((uint32_t*)csums)[block_num]);
else
return false;
}
return true;
}
bool blockstore_impl_t::verify_clean_padded_checksums(blockstore_op_t *op, uint64_t clean_loc, uint8_t *dyn_data, bool from_journal,
iovec *iov, int n_iov, std::function<void(uint32_t, uint32_t, uint32_t)> bad_block_cb)
{
uint32_t offset = clean_loc % dsk.data_block_size;
if (from_journal)
return verify_padded_checksums(dyn_data, dyn_data + dsk.clean_entry_bitmap_size, offset, iov, n_iov, bad_block_cb);
clean_loc = (clean_loc >> dsk.block_order) << dsk.block_order;
if (!dyn_data)
{
assert(inmemory_meta);
dyn_data = get_clean_entry_bitmap(clean_loc, 0);
}
return verify_padded_checksums(dyn_data, dyn_data + 2*dsk.clean_entry_bitmap_size, offset, iov, n_iov, bad_block_cb);
}
void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op)
@@ -826,139 +284,6 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
}
if (PRIV(op)->pending_ops == 0)
{
if (dsk.csum_block_size)
{
// verify checksums if required
auto & rv = PRIV(op)->read_vec;
void *meta_block = NULL;
if (dsk.csum_block_size > dsk.bitmap_granularity)
{
for (int i = rv.size()-1; i >= 0 && (rv[i].copy_flags & COPY_BUF_CSUM_FILL); i--)
{
if (rv[i].copy_flags & COPY_BUF_META_BLOCK)
{
// Metadata read. Skip
assert(!meta_block);
meta_block = rv[i].buf;
rv[i].buf = NULL;
continue;
}
struct iovec *iov = (struct iovec*)((uint8_t*)rv[i].buf + (rv[i].len & 0xFFFFFFFF));
int n_iov = rv[i].len >> 32;
bool ok = true;
if (rv[i].copy_flags & COPY_BUF_JOURNAL)
{
// SMALL_WRITE from journal
verify_journal_checksums(
rv[i].csum_buf, rv[i].offset, iov, n_iov,
[&](uint32_t bad_block, uint32_t calc_csum, uint32_t stored_csum)
{
ok = false;
printf(
"Checksum mismatch in object %lx:%lx v%lu in journal at 0x%lx, checksum block #%u: got %08x, expected %08x\n",
op->oid.inode, op->oid.stripe, op->version,
rv[i].disk_offset, bad_block / dsk.csum_block_size, calc_csum, stored_csum
);
}
);
}
else
{
// BIG_WRITE from journal or clean data
// Do not verify checksums if the data location is/was mutated by flushers
auto & uo = used_clean_objects.at((rv[i].disk_offset >> dsk.block_order) << dsk.block_order);
if (!uo.was_changed)
{
verify_clean_padded_checksums(
op, rv[i].disk_offset, rv[i].csum_buf, (rv[i].copy_flags & COPY_BUF_JOURNALED_BIG), iov, n_iov,
[&](uint32_t bad_block, uint32_t calc_csum, uint32_t stored_csum)
{
ok = false;
printf(
"Checksum mismatch in object %lx:%lx v%lu in %s data at 0x%lx, checksum block #%u: got %08x, expected %08x\n",
op->oid.inode, op->oid.stripe, op->version,
(rv[i].copy_flags & COPY_BUF_JOURNALED_BIG ? "redirect-write" : "clean"),
rv[i].disk_offset, bad_block / dsk.csum_block_size, calc_csum, stored_csum
);
}
);
}
}
if (!ok)
{
op->retval = -EDOM;
}
free(rv[i].buf);
rv[i].buf = NULL;
if (rv[i].dyn_data && --(*rv[i].dyn_data) == 0) // refcount
{
free(rv[i].dyn_data);
rv[i].dyn_data = NULL;
}
}
}
else
{
for (auto & vec: rv)
{
if (vec.copy_flags & COPY_BUF_META_BLOCK)
{
// Metadata read. Skip
assert(!meta_block);
meta_block = vec.buf;
vec.buf = NULL;
continue;
}
if (vec.csum_buf)
{
uint32_t *csum = (uint32_t*)vec.csum_buf;
for (size_t p = 0; p < vec.len; p += dsk.csum_block_size, csum++)
{
if (crc32c(0, (uint8_t*)op->buf + vec.offset - op->offset + p, dsk.csum_block_size) != *csum)
{
// checksum error
printf(
"Checksum mismatch in object %lx:%lx v%lu in %s area at offset 0x%lx+0x%lx: %08x vs %08x\n",
op->oid.inode, op->oid.stripe, op->version,
(vec.copy_flags & COPY_BUF_JOURNAL) ? "journal" : "data", vec.disk_offset, p,
crc32c(0, (uint8_t*)op->buf + vec.offset - op->offset + p, dsk.csum_block_size), *csum
);
op->retval = -EDOM;
break;
}
}
}
if (vec.dyn_data && --(*vec.dyn_data) == 0) // refcount
{
free(vec.dyn_data);
vec.dyn_data = NULL;
}
}
}
if (meta_block)
{
// Free after checking
free(meta_block);
meta_block = NULL;
}
}
if (PRIV(op)->clean_block_used)
{
// Release clean data block
auto uo_it = used_clean_objects.find(PRIV(op)->clean_block_used);
if (uo_it != used_clean_objects.end())
{
uo_it->second.refs--;
if (uo_it->second.refs <= 0)
{
if (uo_it->second.was_freed)
{
data_alloc->set(PRIV(op)->clean_block_used, false);
}
used_clean_objects.erase(uo_it);
}
}
}
if (!journal.inmemory)
{
// Release journal sector usage
@@ -999,9 +324,8 @@ int blockstore_impl_t::read_bitmap(object_id oid, uint64_t target_version, void
*result_version = dirty_it->first.version;
if (bitmap)
{
void *dyn_ptr = (alloc_dyn_data
? (uint8_t*)dirty_it->second.dyn_data + sizeof(int) : (uint8_t*)&dirty_it->second.dyn_data);
memcpy(bitmap, dyn_ptr, dsk.clean_entry_bitmap_size);
void *bmp_ptr = (dsk.clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(bitmap, bmp_ptr, dsk.clean_entry_bitmap_size);
}
return 0;
}
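
The per-block verification loop in `verify_journal_checksums()` above can be sketched in isolation. This is a minimal illustration, not the blockstore code: `crc32c_sw`, `chunk_t` and `first_bad_block` are hypothetical names, the CRC is a slow bitwise CRC-32C instead of the optimized routine the project uses, and the bad-block callback is replaced by a return value. As in the hunk, a partial first block (when `offset` is not aligned to the checksum block size) is checksummed over only the bytes actually present.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Pass 0 to start; pass the previous result to continue over more bytes.
uint32_t crc32c_sw(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Hypothetical stand-in for struct iovec
struct chunk_t { const uint8_t *base; size_t len; };

// Returns the index of the first checksum block that does not match,
// or -1 if every block (including a trailing partial one) matches.
int first_bad_block(const std::vector<chunk_t> & chunks, uint32_t offset,
    uint32_t csum_block_size, const std::vector<uint32_t> & csums)
{
    assert(csum_block_size > 0);
    uint32_t block_csum = 0;
    size_t block_num = 0;
    size_t block_done = offset % csum_block_size;
    bool partial = false; // true while the current block has hashed bytes
    for (const auto & c: chunks)
    {
        const uint8_t *buf = c.base;
        size_t len = c.len;
        while (block_done + len >= csum_block_size)
        {
            // Finish the current checksum block and compare
            size_t cur_len = csum_block_size - block_done;
            block_csum = crc32c_sw(block_csum, buf, cur_len);
            if (block_num >= csums.size() || block_csum != csums[block_num])
                return (int)block_num;
            block_num++;
            buf += cur_len;
            len -= cur_len;
            block_done = 0;
            block_csum = 0;
            partial = false;
        }
        if (len > 0)
        {
            // Carry the unfinished block over to the next chunk
            block_csum = crc32c_sw(block_csum, buf, len);
            block_done += len;
            partial = true;
        }
    }
    if (partial && (block_num >= csums.size() || block_csum != csums[block_num]))
        return (int)block_num;
    return -1;
}
```

The running checksum is carried across chunk boundaries, so the split of the buffer into chunks does not need to line up with the checksum block grid.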

@@ -227,7 +227,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
journal.used_sectors.erase(dirty_it->second.journal_sector);
flusher->mark_trim_possible();
}
free_dirty_dyn_data(dirty_it->second);
if (dsk.clean_entry_bitmap_size > sizeof(void*))
{
free(dirty_it->second.bitmap);
dirty_it->second.bitmap = NULL;
}
if (dirty_it == dirty_start)
{
break;
@@ -236,18 +240,3 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
}
dirty_db.erase(dirty_start, dirty_end);
}
void blockstore_impl_t::free_dirty_dyn_data(dirty_entry & e)
{
if (e.dyn_data)
{
if (alloc_dyn_data &&
--*((int*)e.dyn_data) == 0) // refcount
{
// dyn_data contains the bitmap and checksums
// free it if it doesn't refer to the in-memory journal
free(e.dyn_data);
}
e.dyn_data = NULL;
}
}
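
The `dsk.clean_entry_bitmap_size > sizeof(void*) ? bitmap : &bitmap` pattern that recurs in these hunks stores a small bitmap directly in the bytes of the pointer field and only allocates when it does not fit. A minimal sketch of that idea follows; `bitmap_slot_t` and the three helpers are hypothetical names introduced for illustration, and plain `calloc`/`abort` stands in for the project's `calloc_or_die`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical holder: either a heap pointer or the bitmap bytes themselves
struct bitmap_slot_t
{
    void *bitmap = nullptr;
};

// Address of the bitmap bytes for a given bitmap size
inline uint8_t *bitmap_bytes(bitmap_slot_t & s, size_t bitmap_size)
{
    return bitmap_size > sizeof(void*) ? (uint8_t*)s.bitmap : (uint8_t*)&s.bitmap;
}

// Allocate zeroed storage only when the bitmap does not fit into the pointer
inline void bitmap_init(bitmap_slot_t & s, size_t bitmap_size)
{
    s.bitmap = nullptr;
    if (bitmap_size > sizeof(void*))
    {
        s.bitmap = calloc(1, bitmap_size);
        if (!s.bitmap)
            abort();
    }
}

// Free heap storage if there is any; the inline case has nothing to free
inline void bitmap_release(bitmap_slot_t & s, size_t bitmap_size)
{
    if (bitmap_size > sizeof(void*))
        free(s.bitmap);
    s.bitmap = nullptr;
}
```

Every reader and writer has to agree on the same `bitmap_size`, which is why each access site in the diff repeats the size comparison instead of checking the pointer for NULL.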

@@ -78,23 +78,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
// 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this);
if (dsk.csum_block_size)
{
// More complex check because all journal entries have different lengths
int left = PRIV(op)->sync_big_writes.size();
for (auto & sbw: PRIV(op)->sync_big_writes)
{
left--;
auto & dirty_entry = dirty_db.at(sbw);
uint64_t dyn_size = dsk.dirty_dyn_size(dirty_entry.offset, dirty_entry.len);
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
left == 0 ? JOURNAL_STABILIZE_RESERVATION : 0))
{
return 0;
}
}
}
else if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
@@ -106,17 +90,16 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
int s = 0;
while (it != PRIV(op)->sync_big_writes.end())
{
auto & dirty_entry = dirty_db.at(*it);
uint64_t dyn_size = dsk.dirty_dyn_size(dirty_entry.offset, dirty_entry.len);
if (!journal.entry_fits(sizeof(journal_entry_big_write) + dyn_size) &&
if (!journal.entry_fits(sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size) &&
journal.sector_info[journal.cur_sector].dirty)
{
prepare_journal_sector_write(journal.cur_sector, op);
s++;
}
auto & dirty_entry = dirty_db.at(*it);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + dyn_size
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size
);
dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@@ -132,8 +115,8 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
je->offset = dirty_entry.offset;
je->len = dirty_entry.len;
je->location = dirty_entry.location;
memcpy((void*)(je+1), (alloc_dyn_data
? (uint8_t*)dirty_entry.dyn_data+sizeof(int) : (uint8_t*)&dirty_entry.dyn_data), dyn_size);
memcpy((void*)(je+1), (dsk.clean_entry_bitmap_size > sizeof(void*)
? dirty_entry.bitmap : &dirty_entry.bitmap), dsk.clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
it++;

@@ -8,21 +8,12 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// Check or assign version number
bool found = false, deleted = false, unsynced = false, is_del = (op->opcode == BS_OP_DELETE);
bool wait_big = false, wait_del = false;
void *dyn = NULL;
if (is_del)
{
op->len = 0;
}
size_t dyn_size = dsk.dirty_dyn_size(op->offset, op->len);
if (!is_del && alloc_dyn_data)
{
// FIXME: Working with `dyn_data` has to be refactored somehow but I first have to decide how :)
// +sizeof(int) = refcount
dyn = calloc_or_die(1, dyn_size+sizeof(int));
*((int*)dyn) = 1;
}
uint8_t *dyn_ptr = (uint8_t*)(alloc_dyn_data ? dyn+sizeof(int) : &dyn);
void *bmp = NULL;
uint64_t version = 1;
if (!is_del && dsk.clean_entry_bitmap_size > sizeof(void*))
{
bmp = calloc_or_die(1, dsk.clean_entry_bitmap_size);
}
if (dirty_db.size() > 0)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
@@ -42,9 +33,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
if (!is_del && !deleted)
{
void *dyn_from = alloc_dyn_data
? (uint8_t*)dirty_it->second.dyn_data + sizeof(int) : (uint8_t*)&dirty_it->second.dyn_data;
memcpy(dyn_ptr, dyn_from, dsk.clean_entry_bitmap_size);
if (dsk.clean_entry_bitmap_size > sizeof(void*))
memcpy(bmp, dirty_it->second.bitmap, dsk.clean_entry_bitmap_size);
else
bmp = dirty_it->second.bitmap;
}
}
}
@@ -58,7 +50,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
if (!is_del)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, dsk.clean_entry_bitmap_size);
memcpy(dyn_ptr, bmp_ptr, dsk.clean_entry_bitmap_size);
memcpy((dsk.clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, dsk.clean_entry_bitmap_size);
}
}
else
@@ -120,9 +112,9 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
printf("Write %lx:%lx v%lu requested, but we already have v%lu\n", op->oid.inode, op->oid.stripe, op->version, version);
#endif
op->retval = -EEXIST;
if (!is_del && alloc_dyn_data)
if (!is_del && dsk.clean_entry_bitmap_size > sizeof(void*))
{
free(dyn);
free(bmp);
}
return false;
}
@@ -166,50 +158,26 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
if (op->bitmap)
{
// Only allow to overwrite part of the object bitmap respective to the write's offset/len
uint8_t *bmp_ptr = (uint8_t*)(dsk.clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp);
uint32_t bit = op->offset/dsk.bitmap_granularity;
uint32_t bits_left = op->len/dsk.bitmap_granularity;
while (!(bit % 8) && bits_left >= 8)
{
// Copy bytes
dyn_ptr[bit/8] = ((uint8_t*)op->bitmap)[bit/8];
bmp_ptr[bit/8] = ((uint8_t*)op->bitmap)[bit/8];
bit += 8;
bits_left -= 8;
}
while (bits_left > 0)
{
// Copy bits
dyn_ptr[bit/8] = (dyn_ptr[bit/8] & ~(1 << (bit%8)))
bmp_ptr[bit/8] = (bmp_ptr[bit/8] & ~(1 << (bit%8)))
| (((uint8_t*)op->bitmap)[bit/8] & (1 << bit%8));
bit++;
bits_left--;
}
}
}
// Calculate checksums
// FIXME: Allow to receive checksums from outside?
if (!is_del && dsk.data_csum_type && op->len > 0)
{
uint32_t *data_csums = (uint32_t*)(dyn_ptr + dsk.clean_entry_bitmap_size);
uint32_t start = op->offset / dsk.csum_block_size;
uint32_t end = (op->offset+op->len-1) / dsk.csum_block_size;
auto fn = state & BS_ST_BIG_WRITE ? crc32c_pad : crc32c_nopad;
if (start == end)
data_csums[0] = fn(0, op->buf, op->len, op->offset - start*dsk.csum_block_size, end*dsk.csum_block_size - (op->offset+op->len));
else
{
// First block
data_csums[0] = fn(0, op->buf, dsk.csum_block_size*(start+1)-op->offset, op->offset - start*dsk.csum_block_size, 0);
// Intermediate blocks
for (uint32_t i = start+1; i < end; i++)
data_csums[i-start] = crc32c(0, (uint8_t*)op->buf + dsk.csum_block_size*i-op->offset, dsk.csum_block_size);
// Last block
data_csums[end-start] = fn(
0, (uint8_t*)op->buf + end*dsk.csum_block_size - op->offset,
op->offset+op->len - end*dsk.csum_block_size,
0, (end+1)*dsk.csum_block_size - (op->offset+op->len)
);
}
}
dirty_db.emplace((obj_ver_id){
.oid = op->oid,
.version = op->version,
@@ -220,7 +188,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
.offset = is_del ? 0 : op->offset,
.len = is_del ? 0 : op->len,
.journal_sector = 0,
.dyn_data = dyn,
.bitmap = bmp,
});
return true;
}
@@ -229,7 +197,8 @@ void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_
{
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
free_dirty_dyn_data(dirty_it->second);
if (dsk.clean_entry_bitmap_size > sizeof(void*))
free(dirty_it->second.bitmap);
dirty_db.erase(dirty_it++);
}
bool found = false;
@@ -311,7 +280,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_write_count + 1,
sizeof(journal_entry_big_write) + dsk.clean_dyn_size,
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size,
(dirty_it->second.state & BS_ST_INSTANT) ? JOURNAL_INSTANT_RESERVATION : JOURNAL_STABILIZE_RESERVATION))
{
return 0;
@@ -394,13 +363,12 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
// Small (journaled) write
// First check if the journal has sufficient space
uint64_t dyn_size = dsk.dirty_dyn_size(op->offset, op->len);
blockstore_journal_check_t space_check(this);
if (unsynced_big_write_count &&
!space_check.check_available(op, unsynced_big_write_count,
sizeof(journal_entry_big_write) + dsk.clean_dyn_size, 0)
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size, 0)
|| !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + dyn_size,
sizeof(journal_entry_small_write) + dsk.clean_entry_bitmap_size,
op->len + ((dirty_it->second.state & BS_ST_INSTANT) ? JOURNAL_INSTANT_RESERVATION : JOURNAL_STABILIZE_RESERVATION)))
{
return 0;
@@ -409,7 +377,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
BS_SUBMIT_CHECK_SQES(
// Write current journal sector only if it's dirty and full, or in the immediate_commit mode
(immediate_commit != IMMEDIATE_NONE ||
!journal.entry_fits(sizeof(journal_entry_small_write) + dyn_size) ? 1 : 0) +
!journal.entry_fits(sizeof(journal_entry_small_write) + dsk.clean_entry_bitmap_size) ? 1 : 0) +
(op->len > 0 ? 1 : 0)
);
write_iodepth++;
@@ -417,7 +385,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
if (immediate_commit == IMMEDIATE_NONE)
{
if (!journal.entry_fits(sizeof(journal_entry_small_write) + dyn_size))
if (!journal.entry_fits(sizeof(journal_entry_small_write) + dsk.clean_entry_bitmap_size))
{
prepare_journal_sector_write(journal.cur_sector, op);
}
@@ -429,7 +397,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Then pre-fill journal entry
journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
sizeof(journal_entry_small_write) + dyn_size
sizeof(journal_entry_small_write) + dsk.clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@@ -463,9 +431,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
je->offset = op->offset;
je->len = op->len;
je->data_offset = journal.next_free;
je->crc32_data = dsk.csum_block_size ? 0 : crc32c(0, op->buf, op->len);
memcpy((void*)(je+1), (alloc_dyn_data
? (uint8_t*)dirty_it->second.dyn_data+sizeof(int) : (uint8_t*)&dirty_it->second.dyn_data), dyn_size);
je->crc32_data = crc32c(0, op->buf, op->len);
memcpy((void*)(je+1), (dsk.clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), dsk.clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (immediate_commit != IMMEDIATE_NONE)
@@ -534,9 +501,9 @@ resume_2:
.version = op->version,
});
assert(dirty_it != dirty_db.end());
uint64_t dyn_size = dsk.dirty_dyn_size(op->offset, op->len);
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
if (!space_check.check_available(op, 1,
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size,
((dirty_it->second.state & BS_ST_INSTANT) ? JOURNAL_INSTANT_RESERVATION : JOURNAL_STABILIZE_RESERVATION)))
{
return 0;
@@ -544,7 +511,7 @@ resume_2:
BS_SUBMIT_CHECK_SQES(1);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + dyn_size
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@@ -560,8 +527,7 @@ resume_2:
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
memcpy((void*)(je+1), (alloc_dyn_data
? (uint8_t*)dirty_it->second.dyn_data+sizeof(int) : (uint8_t*)&dirty_it->second.dyn_data), dyn_size);
memcpy((void*)(je+1), (dsk.clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), dsk.clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal.cur_sector, op);
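
The bitmap update in `enqueue_write()` above only overwrites the bits that correspond to the written range: whole bytes while the position is byte-aligned, then single bits. A standalone sketch of the same loop structure is shown below; `merge_bitmap_range` is a hypothetical name, and the parameters replace the `op` and `dsk` fields used in the hunk.

```cpp
#include <cassert>
#include <cstdint>

// Copy the bits of `src` that cover [offset, offset+len) into `dst`,
// leaving all other bits of `dst` untouched. One bit describes
// `granularity` bytes of object data.
void merge_bitmap_range(uint8_t *dst, const uint8_t *src,
    uint32_t offset, uint32_t len, uint32_t granularity)
{
    assert(granularity > 0);
    uint32_t bit = offset / granularity;
    uint32_t bits_left = len / granularity;
    while (!(bit % 8) && bits_left >= 8)
    {
        // Byte-aligned position with at least 8 bits left: copy a whole byte
        dst[bit/8] = src[bit/8];
        bit += 8;
        bits_left -= 8;
    }
    while (bits_left > 0)
    {
        // Copy a single bit
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        dst[bit/8] = (uint8_t)((dst[bit/8] & ~mask) | (src[bit/8] & mask));
        bit++;
        bits_left--;
    }
}
```

With a 4096-byte granularity, a write at offset 16384 of length 32768 touches bits 4 to 11, so only the upper half of the first bitmap byte and the lower half of the second are replaced.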

@@ -56,14 +56,15 @@ struct image_lister_t
{
continue;
}
auto & pool_cfg = parent->cli->st_cli.pool_config.at(INODE_POOL(ic.second.num));
auto pool_it = parent->cli->st_cli.pool_config.find(INODE_POOL(ic.second.num));
bool good_pool = pool_it != parent->cli->st_cli.pool_config.end();
auto item = json11::Json::object {
{ "name", ic.second.name },
{ "size", ic.second.size },
{ "used_size", 0 },
{ "readonly", ic.second.readonly },
{ "pool_id", (uint64_t)INODE_POOL(ic.second.num) },
{ "pool_name", pool_cfg.name },
{ "pool_name", good_pool ? pool_it->second.name : "? (ID:"+std::to_string(INODE_POOL(ic.second.num))+")" },
{ "inode_num", INODE_NO_POOL(ic.second.num) },
{ "inode_id", ic.second.num },
};
@@ -247,6 +248,8 @@ resume_1:
if (state == 1)
goto resume_1;
get_list();
if (state == 100)
return;
if (show_stats)
{
resume_1:
@@ -269,7 +272,7 @@ resume_1:
{ "key", "name" },
{ "title", "NAME" },
});
if (!list_pool_id)
if (list_pool_name == "")
{
cols.push_back(json11::Json::object{
{ "key", "pool_name" },

@@ -10,7 +10,6 @@
#include "json11/json11.hpp"
#include "str_util.h"
#include "blockstore.h"
#include "blockstore_disk.h"
// Calculate offsets for a block device and print OSD command line parameters
void disk_tool_simple_offsets(json11::Json cfg, bool json_output)
@@ -21,39 +20,23 @@ void disk_tool_simple_offsets(json11::Json cfg, bool json_output)
fprintf(stderr, "Device path is missing\n");
exit(1);
}
uint64_t data_block_size = parse_size(cfg["object_size"].string_value());
uint64_t object_size = parse_size(cfg["object_size"].string_value());
uint64_t bitmap_granularity = parse_size(cfg["bitmap_granularity"].string_value());
uint64_t journal_size = parse_size(cfg["journal_size"].string_value());
uint64_t device_block_size = parse_size(cfg["device_block_size"].string_value());
uint64_t journal_offset = parse_size(cfg["journal_offset"].string_value());
uint64_t device_size = parse_size(cfg["device_size"].string_value());
uint32_t csum_block_size = parse_size(cfg["csum_block_size"].string_value());
uint32_t data_csum_type = BLOCKSTORE_CSUM_NONE;
if (cfg["data_csum_type"] == "crc32c")
data_csum_type = BLOCKSTORE_CSUM_CRC32C;
else if (cfg["data_csum_type"].string_value() != "" && cfg["data_csum_type"].string_value() != "none")
{
fprintf(
stderr, "data_csum_type=%s is unsupported, only \"crc32c\" and \"none\" are supported",
cfg["data_csum_type"].string_value().c_str()
);
exit(1);
}
std::string format = cfg["format"].string_value();
if (json_output)
format = "json";
if (!data_block_size)
data_block_size = 1 << DEFAULT_DATA_BLOCK_ORDER;
if (!object_size)
object_size = 1 << DEFAULT_DATA_BLOCK_ORDER;
if (!bitmap_granularity)
bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
if (!journal_size)
journal_size = 16*1024*1024;
if (!device_block_size)
device_block_size = 4096;
if (!data_csum_type)
csum_block_size = 0;
else if (!csum_block_size)
csum_block_size = bitmap_granularity;
uint64_t orig_device_size = device_size;
if (!device_size)
{
@@ -102,30 +85,22 @@ void disk_tool_simple_offsets(json11::Json cfg, bool json_output)
fprintf(stderr, "Invalid device block size specified: %lu\n", device_block_size);
exit(1);
}
if (data_block_size < device_block_size || data_block_size > MAX_DATA_BLOCK_SIZE ||
data_block_size & (data_block_size-1) != 0)
if (object_size < device_block_size || object_size > MAX_DATA_BLOCK_SIZE ||
object_size & (object_size-1) != 0)
{
fprintf(stderr, "Invalid object size specified: %lu\n", data_block_size);
fprintf(stderr, "Invalid object size specified: %lu\n", object_size);
exit(1);
}
if (bitmap_granularity < device_block_size || bitmap_granularity > data_block_size ||
if (bitmap_granularity < device_block_size || bitmap_granularity > object_size ||
bitmap_granularity & (bitmap_granularity-1) != 0)
{
fprintf(stderr, "Invalid bitmap granularity specified: %lu\n", bitmap_granularity);
exit(1);
}
if (csum_block_size && (data_block_size % csum_block_size))
{
fprintf(stderr, "csum_block_size must be a divisor of data_block_size\n");
exit(1);
}
journal_offset = ((journal_offset+device_block_size-1)/device_block_size)*device_block_size;
uint64_t meta_offset = journal_offset + ((journal_size+device_block_size-1)/device_block_size)*device_block_size;
uint64_t data_csum_size = (data_csum_type ? data_block_size/csum_block_size*(data_csum_type & 0xFF) : 0);
uint64_t clean_entry_bitmap_size = data_block_size/bitmap_granularity/8;
uint64_t clean_entry_size = 24 /*sizeof(clean_disk_entry)*/ + 2*clean_entry_bitmap_size + data_csum_size + 4 /*entry_csum*/;
uint64_t entries_per_block = device_block_size / clean_entry_size;
uint64_t object_count = ((device_size-meta_offset)/data_block_size);
uint64_t entries_per_block = (device_block_size / (24 + 2*object_size/bitmap_granularity/8));
uint64_t object_count = ((device_size-meta_offset)/object_size);
uint64_t meta_size = (1 + (object_count+entries_per_block-1)/entries_per_block) * device_block_size;
uint64_t data_offset = meta_offset + meta_size;
if (format == "json")
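
The offset arithmetic in `disk_tool_simple_offsets()` can be followed with a small worked example. The sketch below mirrors only the variant without checksums (`24 + 2*object_size/bitmap_granularity/8` bytes per metadata entry); `simple_layout_t`, `compute_simple_layout`, `is_pow2` and `round_up` are hypothetical names. It also parenthesizes the power-of-two test, because in C++ `!=` binds tighter than `&`, so `x & (x-1) != 0` is parsed as `x & ((x-1) != 0)`.

```cpp
#include <cassert>
#include <cstdint>

struct simple_layout_t
{
    uint64_t journal_offset, meta_offset, data_offset;
    uint64_t entries_per_block, object_count, meta_size;
};

inline bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x-1)) == 0;
}

inline uint64_t round_up(uint64_t v, uint64_t align)
{
    return ((v + align - 1) / align) * align;
}

simple_layout_t compute_simple_layout(uint64_t device_size, uint64_t journal_offset,
    uint64_t journal_size, uint64_t object_size, uint64_t bitmap_granularity,
    uint64_t device_block_size)
{
    assert(is_pow2(object_size) && is_pow2(bitmap_granularity) && device_block_size > 0);
    simple_layout_t l;
    // Journal first, aligned to the device block size
    l.journal_offset = round_up(journal_offset, device_block_size);
    // Metadata follows the journal
    l.meta_offset = l.journal_offset + round_up(journal_size, device_block_size);
    // 24 bytes of fixed entry data plus two bitmaps per object
    uint64_t clean_entry_size = 24 + 2*object_size/bitmap_granularity/8;
    l.entries_per_block = device_block_size / clean_entry_size;
    l.object_count = (device_size - l.meta_offset) / object_size;
    // One extra block for the metadata header
    l.meta_size = (1 + (l.object_count + l.entries_per_block - 1) / l.entries_per_block)
        * device_block_size;
    l.data_offset = l.meta_offset + l.meta_size;
    return l;
}
```

For a 1 GiB device with a 16 MiB journal at offset 0, 128 KiB objects, 4 KiB bitmap granularity and 4 KiB device blocks, this gives 128 entries per metadata block, 8064 objects, a 262144-byte metadata area and a data offset of 17039360.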

@@ -59,8 +59,6 @@ static const char *help_text =
" --journal_size 32M/1G Set journal size (area or partition size)\n"
" --block_size 128k/1M Set blockstore object size\n"
" --bitmap_granularity 4k Set bitmap granularity\n"
" --data_csum_type none Set data checksum type (crc32c or none)\n"
" --csum_block_size 4k Set data checksum block size\n"
" --data_device_block 4k Override data device block size\n"
" --meta_device_block 4k Override metadata device block size\n"
" --journal_device_block 4k Override journal device block size\n"
@@ -74,9 +72,8 @@ static const char *help_text =
" If it doesn't succeed it issues a warning in the system log.\n"
" \n"
" You can also pass other OSD options here as arguments and they'll be persisted\n"
" in the superblock: cached_read_data, cached_read_meta, cached_read_journal,\n"
" inmemory_metadata, inmemory_journal, max_write_iodepth,\n"
" min_flusher_count, max_flusher_count, journal_sector_buffer_count,\n"
" to the superblock: max_write_iodepth, max_write_iodepth, min_flusher_count,\n"
" max_flusher_count, inmemory_metadata, inmemory_journal, journal_sector_buffer_count,\n"
" journal_no_same_sector_overwrites, throttle_small_writes, throttle_target_iops,\n"
" throttle_target_mbs, throttle_target_parallelism, throttle_threshold_us.\n"
"\n"
@@ -164,8 +161,6 @@ static const char *help_text =
" --object_size 128k Set blockstore block size\n"
" --bitmap_granularity 4k Set bitmap granularity\n"
" --journal_size 16M Set journal size\n"
" --data_csum_type none Set data checksum type (crc32c or none)\n"
" --csum_block_size 4k Set data checksum block size\n"
" --device_block_size 4k Set device block size\n"
" --journal_offset 0 Set journal offset\n"
" --device_size 0 Set device size\n"
@@ -275,19 +270,6 @@ int main(int argc, char *argv[])
fprintf(stderr, "Invalid JSON: %s\n", json_err.c_str());
return 1;
}
if (entries[0]["type"] == "start")
{
self.dsk.data_csum_type = csum_type_from_str(entries[0]["data_csum_type"].string_value());
self.dsk.csum_block_size = entries[0]["csum_block_size"].uint64_value();
}
if (self.options["data_csum_type"] != "")
{
self.dsk.data_csum_type = csum_type_from_str(self.options["data_csum_type"]);
}
if (self.options["csum_block_size"] != "")
{
self.dsk.csum_block_size = stoull_full(self.options["csum_block_size"], 0);
}
return self.write_json_journal(entries);
}
else if (!strcmp(cmd[0], "dump-meta"))

@@ -64,19 +64,17 @@ struct disk_tool_t
ring_loop_t *ringloop;
ring_consumer_t ring_consumer;
int remap_active;
journal_entry_start je_start;
uint8_t *new_journal_buf, *new_meta_buf, *new_journal_ptr, *new_journal_data;
uint64_t new_journal_in_pos;
int64_t data_idx_diff;
uint64_t total_blocks, free_first, free_last;
uint64_t new_clean_entry_bitmap_size, new_data_csum_size, new_clean_entry_size, new_entries_per_block;
uint64_t new_clean_entry_bitmap_size, new_clean_entry_size, new_entries_per_block;
int new_journal_fd, new_meta_fd;
resizer_data_moving_t *moving_blocks;
bool started;
void *small_write_data;
uint32_t data_crc32;
bool data_csum_valid;
uint32_t crc32_last;
uint32_t new_crc32_prev;
@@ -86,11 +84,11 @@ struct disk_tool_t
void dump_journal_entry(int num, journal_entry *je, bool json);
int process_journal(std::function<int(void*)> block_fn);
int process_journal_block(void *buf, std::function<void(int, journal_entry*)> iter_fn);
int process_meta(std::function<void(blockstore_meta_header_v2_t *)> hdr_fn,
int process_meta(std::function<void(blockstore_meta_header_v1_t *)> hdr_fn,
std::function<void(uint64_t, clean_disk_entry*, uint8_t*)> record_fn);
int dump_meta();
void dump_meta_header(blockstore_meta_header_v2_t *hdr);
void dump_meta_header(blockstore_meta_header_v1_t *hdr);
void dump_meta_entry(uint64_t block_num, clean_disk_entry *entry, uint8_t *bitmap);
int write_json_journal(json11::Json entries);
@@ -98,7 +96,7 @@ struct disk_tool_t
int resize_data();
int resize_parse_params();
void resize_init(blockstore_meta_header_v2_t *hdr);
void resize_init(blockstore_meta_header_v1_t *hdr);
int resize_remap_blocks();
int resize_copy_data();
int resize_rewrite_journal();
@@ -143,5 +141,3 @@ json11::Json read_parttable(std::string dev);
uint64_t dev_size_from_parttable(json11::Json pt);
uint64_t free_from_parttable(json11::Json pt);
int fix_partition_type(std::string dev_by_uuid);
std::string csum_type_str(uint32_t data_csum_type);
uint32_t csum_type_from_str(std::string data_csum_type);

@@ -55,23 +55,6 @@ int disk_tool_t::dump_journal()
printf("offset %08lx:\n", journal_pos);
else
printf(",\"entries\":[\n");
if (journal_pos == 0)
{
// Fill journal header to know checksum type & size
journal_entry *je = (journal_entry*)journal_buf;
if (je->magic == JOURNAL_MAGIC && je->type == JE_START &&
(je->start.version == JOURNAL_VERSION_V1 || je->start.version == JOURNAL_VERSION_V2))
{
memcpy(&je_start, je, sizeof(je_start));
if (je_start.size == JE_START_V0_SIZE)
je_start.version = 0;
if (je_start.version < JOURNAL_VERSION_V2)
{
je_start.data_csum_type = 0;
je_start.csum_block_size = 0;
}
}
}
first_entry = true;
process_journal_block(journal_buf, [this](int num, journal_entry *je) { dump_journal_entry(num, je, json); });
if (json)
@@ -137,22 +120,8 @@ int disk_tool_t::process_journal(std::function<int(void*)> block_fn)
fprintf(stderr, "offset %08lx: journal superblock is invalid\n", journal_pos);
r = 1;
}
else if (je->start.size != JE_START_V0_SIZE && je->start.version != JOURNAL_VERSION_V1 && je->start.version != JOURNAL_VERSION_V2)
{
fprintf(stderr, "offset %08lx: journal superblock contains version %lu, but I only understand 0, 1 and 2\n",
journal_pos, je->start.size == JE_START_V0_SIZE ? 0 : je->start.version);
r = 1;
}
else
{
memcpy(&je_start, je, sizeof(je_start));
if (je_start.size == JE_START_V0_SIZE)
je_start.version = 0;
if (je_start.version < JOURNAL_VERSION_V2)
{
je_start.data_csum_type = 0;
je_start.csum_block_size = 0;
}
started = false;
crc32_last = 0;
block_fn(data);
@@ -214,49 +183,7 @@ int disk_tool_t::process_journal_block(void *buf, std::function<void(int, journa
}
small_write_data = memalign_or_die(MEM_ALIGNMENT, je->small_write.len);
assert(pread(dsk.journal_fd, small_write_data, je->small_write.len, dsk.journal_offset+je->small_write.data_offset) == je->small_write.len);
data_crc32 = je_start.csum_block_size ? 0 : crc32c(0, small_write_data, je->small_write.len);
data_csum_valid = (data_crc32 == je->small_write.crc32_data);
if (je_start.csum_block_size && je->small_write.len > 0)
{
// like in enqueue_write()
uint32_t start = je->small_write.offset / je_start.csum_block_size;
uint32_t end = (je->small_write.offset+je->small_write.len-1) / je_start.csum_block_size;
uint32_t data_csum_size = (end-start+1) * (je_start.data_csum_type & 0xFF);
if (je->size < sizeof(journal_entry_small_write) + data_csum_size)
{
data_csum_valid = false;
}
else
{
uint32_t calc_csum = 0;
uint32_t *block_csums = (uint32_t*)((uint8_t*)je + je->size - data_csum_size);
if (start == end)
{
calc_csum = crc32c(0, (uint8_t*)small_write_data, je->small_write.len);
data_csum_valid = data_csum_valid && (calc_csum == *block_csums++);
}
else
{
// First block
calc_csum = crc32c(0, (uint8_t*)small_write_data,
je_start.csum_block_size*(start+1)-je->small_write.offset);
data_csum_valid = data_csum_valid && (calc_csum == *block_csums++);
// Intermediate blocks
for (uint32_t i = start+1; i < end; i++)
{
calc_csum = crc32c(0, (uint8_t*)small_write_data +
je_start.csum_block_size*i-je->small_write.offset, je_start.csum_block_size);
data_csum_valid = data_csum_valid && (calc_csum == *block_csums++);
}
// Last block
calc_csum = crc32c(
0, (uint8_t*)small_write_data + end*je_start.csum_block_size - je->small_write.offset,
je->small_write.offset+je->small_write.len - end*je_start.csum_block_size
);
data_csum_valid = data_csum_valid && (calc_csum == *block_csums++);
}
}
}
data_crc32 = crc32c(0, small_write_data, je->small_write.len);
}
iter_fn(entry, je);
if (je->type == JE_SMALL_WRITE || je->type == JE_SMALL_WRITE_INSTANT)
@@ -296,40 +223,29 @@ void disk_tool_t::dump_journal_entry(int num, journal_entry *je, bool json)
if (je->type == JE_START)
{
printf(
json ? ",\"type\":\"start\",\"start\":\"0x%lx\"" : "je_start start=%08lx",
json ? ",\"type\":\"start\",\"start\":\"0x%lx\"}" : "je_start start=%08lx\n",
je->start.journal_start
);
if (je->start.data_csum_type)
{
printf(
json ? ",\"data_csum_type\":\"%s\",\"csum_block_size\":%u" : " data_csum_type=%s csum_block_size=%u",
csum_type_str(je->start.data_csum_type).c_str(), je->start.csum_block_size
);
}
printf(json ? "}" : "\n");
}
else if (je->type == JE_SMALL_WRITE || je->type == JE_SMALL_WRITE_INSTANT)
{
auto & sw = je->small_write;
printf(
json ? ",\"type\":\"small_write%s\",\"inode\":\"0x%lx\",\"stripe\":\"0x%lx\",\"ver\":\"%lu\",\"offset\":%u,\"len\":%u,\"loc\":\"0x%lx\""
: "je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u loc=%08lx",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
sw.oid.inode, sw.oid.stripe, sw.version, sw.offset, sw.len, sw.data_offset
je->small_write.oid.inode, je->small_write.oid.stripe,
je->small_write.version, je->small_write.offset, je->small_write.len,
je->small_write.data_offset
);
if (journal_calc_data_pos != sw.data_offset)
if (journal_calc_data_pos != je->small_write.data_offset)
{
printf(json ? ",\"bad_loc\":true,\"calc_loc\":\"0x%lx\""
: " (mismatched, calculated = %lu)", journal_pos);
}
uint32_t data_csum_size = (!je_start.csum_block_size
? 0
: ((sw.offset + sw.len - 1)/je_start.csum_block_size - sw.offset/je_start.csum_block_size + 1)
*(je_start.data_csum_type & 0xFF));
if (je->size > sizeof(journal_entry_small_write) + data_csum_size)
if (je->small_write.size > sizeof(journal_entry_small_write))
{
printf(json ? ",\"bitmap\":\"" : " (bitmap: ");
for (int i = sizeof(journal_entry_small_write); i < je->size - data_csum_size; i++)
for (int i = sizeof(journal_entry_small_write); i < je->small_write.size; i++)
{
printf("%02x", ((uint8_t*)je)[i]);
}
@@ -338,56 +254,34 @@ void disk_tool_t::dump_journal_entry(int num, journal_entry *je, bool json)
if (dump_with_data)
{
printf(json ? ",\"data\":\"" : " (data: ");
for (int i = 0; i < sw.len; i++)
for (int i = 0; i < je->small_write.len; i++)
{
printf("%02x", ((uint8_t*)small_write_data)[i]);
}
printf(json ? "\"" : ")");
}
if (data_csum_size > 0 && je->size >= sizeof(journal_entry_small_write) + data_csum_size)
{
printf(json ? ",\"block_csums\":\"" : " block_csums=");
uint8_t *block_csums = (uint8_t*)je + je->size - data_csum_size;
for (int i = 0; i < data_csum_size; i++)
printf("%02x", block_csums[i]);
printf(json ? "\"" : "");
}
else
{
printf(json ? ",\"data_crc32\":\"%08x\"" : " data_crc32=%08x", sw.crc32_data);
}
printf(
json ? ",\"data_valid\":%s}" : "%s\n",
(data_csum_valid
? (json ? "true" : " (valid)")
: (json ? "false" : " (invalid)"))
json ? ",\"data_crc32\":\"%08x\",\"data_valid\":%s}" : " data_crc32=%08x%s\n",
je->small_write.crc32_data,
(data_crc32 != je->small_write.crc32_data
? (json ? "false" : " (invalid)")
: (json ? "true" : " (valid)"))
);
}
else if (je->type == JE_BIG_WRITE || je->type == JE_BIG_WRITE_INSTANT)
{
auto & bw = je->big_write;
printf(
json ? ",\"type\":\"big_write%s\",\"inode\":\"0x%lx\",\"stripe\":\"0x%lx\",\"ver\":\"%lu\",\"offset\":%u,\"len\":%u,\"loc\":\"0x%lx\""
: "je_big_write%s oid=%lx:%lx ver=%lu offset=%u len=%u loc=%08lx",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
bw.oid.inode, bw.oid.stripe, bw.version, bw.offset, bw.len, bw.location
je->big_write.oid.inode, je->big_write.oid.stripe,
je->big_write.version, je->big_write.offset, je->big_write.len,
je->big_write.location
);
uint32_t data_csum_size = (!je_start.csum_block_size
? 0
: ((bw.offset + bw.len - 1)/je_start.csum_block_size - bw.offset/je_start.csum_block_size + 1)
*(je_start.data_csum_type & 0xFF));
if (data_csum_size > 0 && je->size >= sizeof(journal_entry_big_write) + data_csum_size)
{
printf(json ? ",\"block_csums\":\"" : " block_csums=");
uint8_t *block_csums = (uint8_t*)je + je->size - data_csum_size;
for (int i = 0; i < data_csum_size; i++)
printf("%02x", block_csums[i]);
printf(json ? "\"" : "");
}
if (bw.size > sizeof(journal_entry_big_write) + data_csum_size)
if (je->big_write.size > sizeof(journal_entry_big_write))
{
printf(json ? ",\"bitmap\":\"" : " (bitmap: ");
for (int i = sizeof(journal_entry_big_write); i < bw.size - data_csum_size; i++)
for (int i = sizeof(journal_entry_big_write); i < je->big_write.size; i++)
{
printf("%02x", ((uint8_t*)je)[i]);
}
@@ -444,9 +338,7 @@ int disk_tool_t::write_json_journal(json11::Json entries)
.type = JE_START,
.size = sizeof(journal_entry_start),
.journal_start = dsk.journal_block_size,
.version = JOURNAL_VERSION_V2,
.data_csum_type = dsk.data_csum_type,
.csum_block_size = dsk.csum_block_size,
.version = JOURNAL_VERSION,
};
((journal_entry*)new_journal_buf)->crc32 = je_crc32((journal_entry*)new_journal_buf);
new_journal_ptr += dsk.journal_block_size;
@@ -466,11 +358,9 @@ int disk_tool_t::write_json_journal(json11::Json entries)
uint32_t entry_size = (type == JE_START
? sizeof(journal_entry_start)
: (type == JE_SMALL_WRITE || type == JE_SMALL_WRITE_INSTANT
? sizeof(journal_entry_small_write) + dsk.clean_entry_bitmap_size +
(dsk.data_csum_type ? rec["len"].uint64_value()/dsk.csum_block_size*(dsk.data_csum_type & 0xFF) : 0)
? sizeof(journal_entry_small_write) + dsk.clean_entry_bitmap_size
: (type == JE_BIG_WRITE || type == JE_BIG_WRITE_INSTANT
? sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size +
(dsk.data_csum_type ? rec["len"].uint64_value()/dsk.csum_block_size*(dsk.data_csum_type & 0xFF) : 0)
? sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size
: sizeof(journal_entry_del))));
if (dsk.journal_block_size < new_journal_in_pos + entry_size)
{
@@ -512,24 +402,12 @@ int disk_tool_t::write_json_journal(json11::Json entries)
.offset = (uint32_t)rec["offset"].uint64_value(),
.len = (uint32_t)rec["len"].uint64_value(),
.data_offset = (uint64_t)(new_journal_data-new_journal_buf),
.crc32_data = !dsk.data_csum_type ? 0 : (uint32_t)sscanf_json("%x", rec["data_crc32"]),
.crc32_data = (uint32_t)sscanf_json("%x", rec["data_crc32"]),
};
uint32_t data_csum_size = !dsk.data_csum_type ? 0 : ne->small_write.len/dsk.csum_block_size*(dsk.data_csum_type & 0xFF);
fromhexstr(rec["bitmap"].string_value(), dsk.clean_entry_bitmap_size, ((uint8_t*)ne) + sizeof(journal_entry_small_write) + data_csum_size);
fromhexstr(rec["bitmap"].string_value(), dsk.clean_entry_bitmap_size, ((uint8_t*)ne) + sizeof(journal_entry_small_write));
fromhexstr(rec["data"].string_value(), ne->small_write.len, new_journal_data);
if (dsk.data_csum_type)
fromhexstr(rec["block_csums"].string_value(), data_csum_size, ((uint8_t*)ne) + sizeof(journal_entry_small_write));
if (rec["data"].is_string())
{
if (!dsk.data_csum_type)
ne->small_write.crc32_data = crc32c(0, new_journal_data, ne->small_write.len);
else if (dsk.data_csum_type == BLOCKSTORE_CSUM_CRC32C)
{
uint32_t *block_csums = (uint32_t*)(((uint8_t*)ne) + sizeof(journal_entry_small_write));
for (uint32_t i = 0; i < ne->small_write.len; i += dsk.csum_block_size, block_csums++)
*block_csums = crc32c(0, new_journal_data+i, dsk.csum_block_size);
}
}
ne->small_write.crc32_data = crc32c(0, new_journal_data, ne->small_write.len);
new_journal_data += ne->small_write.len;
}
else if (type == JE_BIG_WRITE || type == JE_BIG_WRITE_INSTANT)
@@ -548,10 +426,7 @@ int disk_tool_t::write_json_journal(json11::Json entries)
.len = (uint32_t)rec["len"].uint64_value(),
.location = sscanf_json(NULL, rec["loc"]),
};
uint32_t data_csum_size = !dsk.data_csum_type ? 0 : ne->big_write.len/dsk.csum_block_size*(dsk.data_csum_type & 0xFF);
fromhexstr(rec["bitmap"].string_value(), dsk.clean_entry_bitmap_size, ((uint8_t*)ne) + sizeof(journal_entry_big_write) + data_csum_size);
if (dsk.data_csum_type)
fromhexstr(rec["block_csums"].string_value(), data_csum_size, ((uint8_t*)ne) + sizeof(journal_entry_big_write));
fromhexstr(rec["bitmap"].string_value(), dsk.clean_entry_bitmap_size, ((uint8_t*)ne) + sizeof(journal_entry_big_write));
}
else if (type == JE_STABLE || type == JE_ROLLBACK || type == JE_DELETE)
{
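Review note: the lines removed from process_journal_block() above verify one CRC32C per checksum block touched by a small write, with partial first and last pieces. The sketch below restates that split in isolation. It is an illustration only, not the project's code: `crc32c_sw`, `csum_block_count` and `block_csums` are hypothetical names, and the bitwise CRC is a slow stand-in for the project's `crc32c()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
// Hypothetical stand-in for the project's crc32c(); slow but dependency-free.
uint32_t crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t*)buf;
    crc = ~crc;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return ~crc;
}

// Number of checksum blocks touched by the byte range [offset, offset+len)
uint32_t csum_block_count(uint32_t offset, uint32_t len, uint32_t csum_block_size)
{
    assert(csum_block_size > 0 && len > 0);
    uint32_t start = offset / csum_block_size;
    uint32_t end = (offset + len - 1) / csum_block_size;
    return end - start + 1;
}

// One CRC32C per touched checksum block; `data` holds only the written bytes,
// so the first and last pieces may be shorter than csum_block_size
std::vector<uint32_t> block_csums(const uint8_t *data, uint32_t offset, uint32_t len, uint32_t csum_block_size)
{
    assert(csum_block_size > 0);
    std::vector<uint32_t> out;
    uint32_t pos = offset, end = offset + len;
    while (pos < end)
    {
        // End of the checksum block that contains `pos`
        uint32_t block_end = (pos / csum_block_size + 1) * csum_block_size;
        uint32_t piece_end = block_end < end ? block_end : end;
        out.push_back(crc32c_sw(0, data + (pos - offset), piece_end - pos));
        pos = piece_end;
    }
    return out;
}
```

For writes whose offset and length are both multiples of the checksum block size, `csum_block_count()` equals `len / csum_block_size`, which is the expression used by the lines removed from write_json_journal(). For unaligned writes the two expressions differ.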


@@ -5,7 +5,7 @@
#include "rw_blocking.h"
#include "osd_id.h"
int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v2_t *)> hdr_fn,
int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v1_t *)> hdr_fn,
std::function<void(uint64_t, clean_disk_entry*, uint8_t*)> record_fn)
{
if (dsk.meta_block_size % DIRECT_IO_ALIGNMENT)
@@ -28,38 +28,12 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v2_t *)>
lseek64(dsk.meta_fd, dsk.meta_offset, 0);
read_blocking(dsk.meta_fd, data, dsk.meta_block_size);
// Check superblock
blockstore_meta_header_v2_t *hdr = (blockstore_meta_header_v2_t *)data;
if (hdr->zero == 0 && hdr->magic == BLOCKSTORE_META_MAGIC_V1)
blockstore_meta_header_v1_t *hdr = (blockstore_meta_header_v1_t *)data;
if (hdr->zero == 0 &&
hdr->magic == BLOCKSTORE_META_MAGIC_V1 &&
hdr->version == BLOCKSTORE_META_VERSION_V1)
{
if (hdr->version == BLOCKSTORE_META_FORMAT_V1)
{
// Vitastor 0.6-0.8 - static array of clean_disk_entry with bitmaps
hdr->data_csum_type = 0;
hdr->csum_block_size = 0;
hdr->header_csum = 0;
}
else if (hdr->version == BLOCKSTORE_META_FORMAT_V2)
{
// Vitastor 0.9 - static array of clean_disk_entry with bitmaps and checksums
if (hdr->data_csum_type != 0 &&
hdr->data_csum_type != BLOCKSTORE_CSUM_CRC32C)
{
fprintf(stderr, "I don't know checksum format %u, the only supported format is crc32c = %u.\n", hdr->data_csum_type, BLOCKSTORE_CSUM_CRC32C);
free(data);
close(dsk.meta_fd);
dsk.meta_fd = -1;
return 1;
}
}
else
{
// Unsupported version
fprintf(stderr, "Metadata format is too new for me (stored version is %lu, max supported %u).\n", hdr->version, BLOCKSTORE_META_FORMAT_V2);
free(data);
close(dsk.meta_fd);
dsk.meta_fd = -1;
return 1;
}
// Vitastor 0.6-0.7 - static array of clean_disk_entry with bitmaps
if (hdr->meta_block_size != dsk.meta_block_size)
{
fprintf(stderr, "Using block size of %u bytes based on information from the superblock\n", hdr->meta_block_size);
@@ -71,24 +45,14 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v2_t *)>
memcpy(new_data, data, dsk.meta_block_size);
free(data);
data = new_data;
hdr = (blockstore_meta_header_v2_t *)data;
hdr = (blockstore_meta_header_v1_t *)data;
}
}
dsk.meta_format = hdr->version;
dsk.data_block_size = hdr->data_block_size;
dsk.csum_block_size = hdr->csum_block_size;
dsk.data_csum_type = hdr->data_csum_type;
dsk.bitmap_granularity = hdr->bitmap_granularity;
dsk.clean_entry_bitmap_size = (hdr->data_block_size / hdr->bitmap_granularity + 7) / 8;
dsk.clean_entry_size = sizeof(clean_disk_entry) + 2*dsk.clean_entry_bitmap_size
+ (hdr->data_csum_type
? ((hdr->data_block_size+hdr->csum_block_size-1)/hdr->csum_block_size
*(hdr->data_csum_type & 0xff))
: 0)
+ (dsk.meta_format == BLOCKSTORE_META_FORMAT_V2 ? 4 /*entry_csum*/ : 0);
dsk.clean_entry_bitmap_size = hdr->data_block_size / hdr->bitmap_granularity / 8;
dsk.clean_entry_size = sizeof(clean_disk_entry) + 2*dsk.clean_entry_bitmap_size;
uint64_t block_num = 0;
hdr_fn(hdr);
hdr = NULL;
meta_pos = dsk.meta_block_size;
lseek64(dsk.meta_fd, dsk.meta_offset+meta_pos, 0);
while (meta_pos < dsk.meta_len)
@@ -103,15 +67,6 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v2_t *)>
clean_disk_entry *entry = (clean_disk_entry*)((uint8_t*)data + blk + ioff);
if (entry->oid.inode)
{
if (dsk.data_csum_type)
{
uint32_t *entry_csum = (uint32_t*)((uint8_t*)entry + dsk.clean_entry_size - 4);
if (*entry_csum != crc32c(0, entry, dsk.clean_entry_size - 4))
{
fprintf(stderr, "Metadata entry %lu is corrupt (checksum mismatch), skipping\n", block_num);
continue;
}
}
record_fn(block_num, entry, entry->bitmap);
}
}
@@ -152,35 +107,21 @@ int disk_tool_t::process_meta(std::function<void(blockstore_meta_header_v2_t *)>
int disk_tool_t::dump_meta()
{
int r = process_meta(
[this](blockstore_meta_header_v2_t *hdr) { dump_meta_header(hdr); },
[this](blockstore_meta_header_v1_t *hdr) { dump_meta_header(hdr); },
[this](uint64_t block_num, clean_disk_entry *entry, uint8_t *bitmap) { dump_meta_entry(block_num, entry, bitmap); }
);
if (r == 0)
printf("\n]}\n");
printf("\n]}\n");
return r;
}
void disk_tool_t::dump_meta_header(blockstore_meta_header_v2_t *hdr)
void disk_tool_t::dump_meta_header(blockstore_meta_header_v1_t *hdr)
{
if (hdr)
{
if (hdr->version == BLOCKSTORE_META_FORMAT_V1)
{
printf(
"{\"version\":\"0.6\",\"meta_block_size\":%u,\"data_block_size\":%u,\"bitmap_granularity\":%u,"
"\"entries\":[\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity
);
}
else if (hdr->version == BLOCKSTORE_META_FORMAT_V2)
{
printf(
"{\"version\":\"0.9\",\"meta_block_size\":%u,\"data_block_size\":%u,\"bitmap_granularity\":%u,"
"\"data_csum_type\":%s,\"csum_block_size\":%u,\"entries\":[\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
csum_type_str(hdr->data_csum_type).c_str(), hdr->csum_block_size
);
}
printf(
"{\"version\":\"0.6\",\"meta_block_size\":%u,\"data_block_size\":%u,\"bitmap_granularity\":%u,\"entries\":[\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity
);
}
else
{
@@ -210,15 +151,6 @@ void disk_tool_t::dump_meta_entry(uint64_t block_num, clean_disk_entry *entry, u
{
printf("%02x", bitmap[dsk.clean_entry_bitmap_size + i]);
}
if (dsk.csum_block_size && dsk.data_csum_type)
{
uint8_t *csums = bitmap + dsk.clean_entry_bitmap_size*2;
printf("\",\"block_csums\":\"");
for (uint64_t i = 0; i < (dsk.data_block_size+dsk.csum_block_size-1)/dsk.csum_block_size*(dsk.data_csum_type & 0xFF); i++)
{
printf("%02x", csums[i]);
}
}
printf("\"}");
}
else
@@ -232,30 +164,18 @@ int disk_tool_t::write_json_meta(json11::Json meta)
{
new_meta_buf = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, new_meta_len);
memset(new_meta_buf, 0, new_meta_len);
blockstore_meta_header_v2_t *new_hdr = (blockstore_meta_header_v2_t *)new_meta_buf;
blockstore_meta_header_v1_t *new_hdr = (blockstore_meta_header_v1_t *)new_meta_buf;
new_hdr->zero = 0;
new_hdr->magic = BLOCKSTORE_META_MAGIC_V1;
new_hdr->version = meta["version"].uint64_value() == BLOCKSTORE_META_FORMAT_V1
? BLOCKSTORE_META_FORMAT_V1 : BLOCKSTORE_META_FORMAT_V2;
new_hdr->version = BLOCKSTORE_META_VERSION_V1;
new_hdr->meta_block_size = meta["meta_block_size"].uint64_value()
? meta["meta_block_size"].uint64_value() : 4096;
new_hdr->data_block_size = meta["data_block_size"].uint64_value()
? meta["data_block_size"].uint64_value() : 131072;
new_hdr->bitmap_granularity = meta["bitmap_granularity"].uint64_value()
? meta["bitmap_granularity"].uint64_value() : 4096;
new_hdr->data_csum_type = meta["data_csum_type"].is_number()
? meta["data_csum_type"].uint64_value()
: (meta["data_csum_type"].string_value() == "crc32c"
? BLOCKSTORE_CSUM_CRC32C
: BLOCKSTORE_CSUM_NONE);
new_hdr->csum_block_size = meta["csum_block_size"].uint64_value();
uint32_t new_clean_entry_header_size = (new_hdr->version == BLOCKSTORE_META_FORMAT_V1
? sizeof(clean_disk_entry) : sizeof(clean_disk_entry) + 4 /*entry_csum*/);
new_clean_entry_bitmap_size = (new_hdr->data_block_size / new_hdr->bitmap_granularity + 7) / 8;
new_data_csum_size = (new_hdr->data_csum_type
? ((new_hdr->data_block_size+new_hdr->csum_block_size-1)/new_hdr->csum_block_size*(new_hdr->data_csum_type & 0xFF))
: 0);
new_clean_entry_size = new_clean_entry_header_size + 2*new_clean_entry_bitmap_size + new_data_csum_size;
new_clean_entry_bitmap_size = new_hdr->data_block_size / new_hdr->bitmap_granularity / 8;
new_clean_entry_size = sizeof(clean_disk_entry) + 2*new_clean_entry_bitmap_size;
new_entries_per_block = new_hdr->meta_block_size / new_clean_entry_size;
for (const auto & e: meta["entries"].array_items())
{
@@ -274,21 +194,8 @@ int disk_tool_t::write_json_meta(json11::Json meta)
new_entry->oid.inode = (sscanf_json(NULL, e["pool"]) << (64-POOL_ID_BITS)) | sscanf_json(NULL, e["inode"]);
new_entry->oid.stripe = sscanf_json(NULL, e["stripe"]);
new_entry->version = sscanf_json(NULL, e["version"]);
fromhexstr(e["bitmap"].string_value(), new_clean_entry_bitmap_size,
((uint8_t*)new_entry) + sizeof(clean_disk_entry));
fromhexstr(e["ext_bitmap"].string_value(), new_clean_entry_bitmap_size,
((uint8_t*)new_entry) + sizeof(clean_disk_entry) + new_clean_entry_bitmap_size);
if (new_hdr->version == BLOCKSTORE_META_FORMAT_V2)
{
if (new_hdr->data_csum_type != 0)
{
fromhexstr(e["data_csum"].string_value(), new_data_csum_size,
((uint8_t*)new_entry) + sizeof(clean_disk_entry) + 2*new_clean_entry_bitmap_size);
}
uint32_t *new_entry_csum = (uint32_t*)(((uint8_t*)new_entry) + sizeof(clean_disk_entry) +
2*new_clean_entry_bitmap_size + new_data_csum_size);
*new_entry_csum = crc32c(0, new_entry, new_clean_entry_size - 4);
}
fromhexstr(e["bitmap"].string_value(), new_clean_entry_bitmap_size, ((uint8_t*)new_entry) + sizeof(clean_disk_entry));
fromhexstr(e["ext_bitmap"].string_value(), new_clean_entry_bitmap_size, ((uint8_t*)new_entry) + sizeof(clean_disk_entry) + new_clean_entry_bitmap_size);
}
int r = resize_write_new_meta();
free(new_meta_buf);
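Review note: the two sides of this diff size the per-object bitmap differently. One uses `(data_block_size / bitmap_granularity + 7) / 8` (rounding up to a whole byte), the other uses `data_block_size / bitmap_granularity / 8` (truncating). The sketch below compares the resulting entry geometry without checksums. `meta_geometry` and `meta_geometry_t` are hypothetical names, and `entry_header_size` stands in for `sizeof(clean_disk_entry)`, whose value is not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>

struct meta_geometry_t
{
    uint32_t bitmap_size;        // bytes per bitmap
    uint32_t entry_size;         // header + bitmap + ext_bitmap
    uint32_t entries_per_block;  // entries that fit in one metadata block
};

// entry_header_size is an assumption standing in for sizeof(clean_disk_entry)
meta_geometry_t meta_geometry(uint32_t meta_block_size, uint32_t data_block_size,
    uint32_t bitmap_granularity, uint32_t entry_header_size, bool round_up)
{
    assert(bitmap_granularity > 0 && entry_header_size > 0);
    uint32_t bits = data_block_size / bitmap_granularity;
    meta_geometry_t g;
    g.bitmap_size = round_up ? (bits + 7) / 8 : bits / 8;
    // Each entry stores two bitmaps: "bitmap" and "ext_bitmap"
    g.entry_size = entry_header_size + 2 * g.bitmap_size;
    g.entries_per_block = meta_block_size / g.entry_size;
    return g;
}
```

With the defaults that appear in write_json_meta() (128 KiB data blocks, 4 KiB granularity) both formulas give a 4-byte bitmap. They only diverge when the number of bits is not a multiple of 8, where truncation yields a bitmap too small to hold every bit.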


@@ -8,9 +8,6 @@
int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_hdd)
{
static const char *allow_additional_params[] = {
"cached_read_data",
"cached_read_meta",
"cached_read_journal",
"max_write_iodepth",
"max_write_iodepth",
"min_flusher_count",
@@ -119,7 +116,6 @@ int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_
try
{
dsk.parse_config(options);
dsk.cached_read_data = dsk.cached_read_meta = dsk.cached_read_journal = false;
dsk.open_data();
dsk.open_meta();
dsk.open_journal();
@@ -483,7 +479,6 @@ int disk_tool_t::get_meta_partition(std::vector<vitastor_dev_info_t> & ssds, std
{
blockstore_disk_t dsk;
dsk.parse_config(options);
dsk.cached_read_data = dsk.cached_read_meta = dsk.cached_read_journal = false;
dsk.open_data();
dsk.open_meta();
dsk.open_journal();


@@ -29,7 +29,7 @@ int disk_tool_t::resize_data()
fprintf(stderr, "Reading metadata\n");
data_alloc = new allocator((new_data_len < dsk.data_len ? dsk.data_len : new_data_len) / dsk.data_block_size);
r = process_meta(
[this](blockstore_meta_header_v2_t *hdr)
[this](blockstore_meta_header_v1_t *hdr)
{
resize_init(hdr);
},
@@ -91,7 +91,6 @@ int disk_tool_t::resize_parse_params()
try
{
dsk.parse_config(options);
dsk.cached_read_data = dsk.cached_read_meta = dsk.cached_read_journal = false;
dsk.open_data();
dsk.open_meta();
dsk.open_journal();
@@ -140,7 +139,7 @@ int disk_tool_t::resize_parse_params()
return 0;
}
void disk_tool_t::resize_init(blockstore_meta_header_v2_t *hdr)
void disk_tool_t::resize_init(blockstore_meta_header_v1_t *hdr)
{
if (hdr && dsk.data_block_size != hdr->data_block_size)
{
@@ -150,15 +149,6 @@ void disk_tool_t::resize_init(blockstore_meta_header_v2_t *hdr)
}
dsk.data_block_size = hdr->data_block_size;
}
if (hdr && (dsk.data_csum_type != hdr->data_csum_type || dsk.csum_block_size != hdr->csum_block_size))
{
if (dsk.data_csum_type)
{
fprintf(stderr, "Using data checksum type %s from metadata superblock\n", csum_type_str(hdr->data_csum_type).c_str());
}
dsk.data_csum_type = hdr->data_csum_type;
dsk.csum_block_size = hdr->csum_block_size;
}
if (((new_data_len-dsk.data_len) % dsk.data_block_size) ||
((new_data_offset-dsk.data_offset) % dsk.data_block_size))
{
@@ -170,12 +160,8 @@ void disk_tool_t::resize_init(blockstore_meta_header_v2_t *hdr)
free_last = (new_data_offset+new_data_len < dsk.data_offset+dsk.data_len)
? (dsk.data_offset+dsk.data_len-new_data_offset-new_data_len) / dsk.data_block_size
: 0;
uint32_t new_clean_entry_header_size = sizeof(clean_disk_entry) + 4 /*entry_csum*/;
new_clean_entry_bitmap_size = dsk.data_block_size / (hdr ? hdr->bitmap_granularity : 4096) / 8;
new_data_csum_size = (dsk.data_csum_type
? ((dsk.data_block_size+dsk.csum_block_size-1)/dsk.csum_block_size*(dsk.data_csum_type & 0xFF))
: 0);
new_clean_entry_size = new_clean_entry_header_size + 2*new_clean_entry_bitmap_size + new_data_csum_size;
new_clean_entry_size = sizeof(clean_disk_entry) + 2 * new_clean_entry_bitmap_size;
new_entries_per_block = dsk.meta_block_size/new_clean_entry_size;
uint64_t new_meta_blocks = 1 + (new_data_len/dsk.data_block_size + new_entries_per_block-1) / new_entries_per_block;
if (!new_meta_len)
@@ -363,25 +349,13 @@ int disk_tool_t::resize_rewrite_journal()
{
if (je->type == JE_START)
{
if (je_start.data_csum_type != dsk.data_csum_type ||
je_start.csum_block_size != dsk.csum_block_size)
{
fprintf(
stderr, "Error: journal header has different checksum parameters: %s/%u vs %s/%u\n",
csum_type_str(je_start.data_csum_type).c_str(), je_start.csum_block_size,
csum_type_str(dsk.data_csum_type).c_str(), dsk.csum_block_size
);
exit(1);
}
journal_entry *ne = (journal_entry*)(new_journal_ptr + new_journal_in_pos);
*((journal_entry_start*)ne) = (journal_entry_start){
.magic = JOURNAL_MAGIC,
.type = JE_START,
.size = sizeof(journal_entry_start),
.journal_start = dsk.journal_block_size,
.version = JOURNAL_VERSION_V2,
.data_csum_type = dsk.data_csum_type,
.csum_block_size = dsk.csum_block_size,
.version = JOURNAL_VERSION,
};
ne->crc32 = je_crc32(ne);
new_journal_ptr += dsk.journal_block_size;
@@ -462,17 +436,15 @@ int disk_tool_t::resize_rewrite_meta()
new_meta_buf = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, new_meta_len);
memset(new_meta_buf, 0, new_meta_len);
int r = process_meta(
[this](blockstore_meta_header_v2_t *hdr)
[this](blockstore_meta_header_v1_t *hdr)
{
blockstore_meta_header_v2_t *new_hdr = (blockstore_meta_header_v2_t *)new_meta_buf;
blockstore_meta_header_v1_t *new_hdr = (blockstore_meta_header_v1_t *)new_meta_buf;
new_hdr->zero = 0;
new_hdr->magic = BLOCKSTORE_META_MAGIC_V1;
new_hdr->version = BLOCKSTORE_META_FORMAT_V1;
new_hdr->version = BLOCKSTORE_META_VERSION_V1;
new_hdr->meta_block_size = dsk.meta_block_size;
new_hdr->data_block_size = dsk.data_block_size;
new_hdr->bitmap_granularity = dsk.bitmap_granularity ? dsk.bitmap_granularity : 4096;
new_hdr->data_csum_type = dsk.data_csum_type;
new_hdr->csum_block_size = dsk.csum_block_size;
},
[this](uint64_t block_num, clean_disk_entry *entry, uint8_t *bitmap)
{
@@ -491,7 +463,7 @@ int disk_tool_t::resize_rewrite_meta()
new_entry->oid = entry->oid;
new_entry->version = entry->version;
if (bitmap)
memcpy(new_entry->bitmap, bitmap, 2*new_clean_entry_bitmap_size + new_data_csum_size);
memcpy(new_entry->bitmap, bitmap, 2*new_clean_entry_bitmap_size);
else
memset(new_entry->bitmap, 0xff, 2*new_clean_entry_bitmap_size);
}


@@ -373,22 +373,3 @@ int fix_partition_type(std::string dev_by_uuid)
std::string out;
return shell_exec({ "sfdisk", "--no-reread", "--force", "/dev/"+parent_dev }, script, &out, NULL);
}
std::string csum_type_str(uint32_t data_csum_type)
{
std::string csum_type;
if (data_csum_type == BLOCKSTORE_CSUM_NONE)
csum_type = "none";
else if (data_csum_type == BLOCKSTORE_CSUM_CRC32C)
csum_type = "crc32c";
else
csum_type = std::to_string(data_csum_type);
return csum_type;
}
uint32_t csum_type_from_str(std::string data_csum_type)
{
if (data_csum_type == "crc32c")
return BLOCKSTORE_CSUM_CRC32C;
return stoull_full(data_csum_type, 0);
}


@@ -145,11 +145,6 @@ resume_3:
// Mark object corrupted and retry
op_data->object_state = mark_object_corrupted(pg, op_data->oid, op_data->object_state, op_data->stripes, true, false);
op_data->prev_set = op_data->object_state ? op_data->object_state->read_target.data() : pg.cur_set.data();
if (cur_op->rmw_buf)
{
free(cur_op->rmw_buf);
cur_op->rmw_buf = NULL;
}
goto retry_1;
}
deref_object_state(pg, &op_data->object_state, true);


@@ -234,6 +234,7 @@ out:
return;
}
#if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
static void vitastor_uring_handler(void *opaque)
{
VitastorClient *client = (VitastorClient*)opaque;
@@ -266,20 +267,23 @@ static void vitastor_schedule_uring_handler(VitastorClient *client)
replay_bh_schedule_oneshot_event(client->ctx, vitastor_uring_handler, opaque);
#elif QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 8
aio_bh_schedule_oneshot(client->ctx, vitastor_uring_handler, opaque);
#elif QEMU_VERSION_MAJOR >= 2
#else
VitastorBH *vbh = (VitastorBH*)malloc(sizeof(VitastorBH));
vbh->cli = client;
#if QEMU_VERSION_MAJOR >= 2
vbh->bh = aio_bh_new(bdrv_get_aio_context(task->bs), vitastor_bh_uring_handler, vbh);
qemu_bh_schedule(vbh->bh);
#else
client->bh_uring_scheduled = 0;
do
{
vitastor_c_uring_handle_events(client->proxy);
} while (vitastor_c_uring_has_work(client->proxy));
vbh->bh = qemu_bh_new(vitastor_bh_uring_handler, vbh);
#endif
qemu_bh_schedule(vbh->bh);
#endif
}
}
#else
static void vitastor_schedule_uring_handler(VitastorClient *client)
{
}
#endif
static void coroutine_fn vitastor_co_get_metadata(VitastorRPC *task)
{
@@ -406,20 +410,16 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
vitastor_aio_set_fd_handler, client, client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
#else
client->proxy = vitastor_c_create_uring(
client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
#endif
if (!client->proxy)
{
fprintf(stderr, "vitastor: failed to create io_uring: %s - I/O will be slower\n", strerror(errno));
client->uring_eventfd = -1;
#endif
client->proxy = vitastor_c_create_qemu(
vitastor_aio_set_fd_handler, client, client->config_path, client->etcd_host, client->etcd_prefix,
client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
);
#if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
}
else
{
@@ -433,6 +433,7 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
}
universal_aio_set_fd_handler(client->ctx, client->uring_eventfd, vitastor_uring_handler, NULL, client);
}
#endif
image = client->image = g_strdup(qdict_get_try_str(options, "image"));
client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
// Get image metadata (size and readonly flag) or just wait until the client is ready
@@ -664,7 +665,8 @@ static void vitastor_co_generic_cb(void *opaque, long retval)
task->bh = aio_bh_new(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#else
vitastor_co_generic_bh_cb(opaque);
task->bh = qemu_bh_new(vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#endif
}
@@ -741,7 +743,6 @@ static void vitastor_co_read_bitmap_cb(void *opaque, long retval, uint8_t *bitma
VitastorRPC *task = opaque;
VitastorClient *client = task->bs->opaque;
task->ret = retval;
task->complete = 1;
if (retval >= 0)
{
task->bitmap = bitmap;
@@ -753,15 +754,17 @@ static void vitastor_co_read_bitmap_cb(void *opaque, long retval, uint8_t *bitma
client->last_bitmap = bitmap;
}
}
if (qemu_coroutine_self() != task->co)
{
#if QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR > 8
aio_co_wake(task->co);
#if QEMU_VERSION_MAJOR > 4 || QEMU_VERSION_MAJOR == 4 && QEMU_VERSION_MINOR >= 2
replay_bh_schedule_oneshot_event(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
#elif QEMU_VERSION_MAJOR >= 3 || QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 8
aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
#elif QEMU_VERSION_MAJOR >= 2
task->bh = aio_bh_new(bdrv_get_aio_context(task->bs), vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#else
qemu_coroutine_enter(task->co, NULL);
qemu_aio_release(task);
task->bh = qemu_bh_new(vitastor_co_generic_bh_cb, opaque);
qemu_bh_schedule(task->bh);
#endif
}
}
static int coroutine_fn vitastor_co_block_status(
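Review note: the preprocessor guards in this file rely on `&&` binding tighter than `||`, so `MAJOR >= 3 || MAJOR == 2 && MINOR >= 8` reads as "version 2.8 or newer". One side of the read_bitmap hunk uses `MINOR > 8` instead, which means "2.9 or newer". The sketch below spells the comparison out as a function so the boundary is explicit. `version_at_least` is a hypothetical helper, not something the driver defines.

```cpp
#include <cassert>

// True when (major, minor) is the same as or newer than (want_major, want_minor).
// Equivalent to: major > want_major || (major == want_major && minor >= want_minor)
constexpr bool version_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major || (major == want_major && minor >= want_minor);
}

// "MAJOR >= 3 || MAJOR == 2 && MINOR >= 8" accepts 2.8 ...
static_assert(version_at_least(2, 8, 2, 8), "2.8 satisfies >= 2.8");
// ... while "MAJOR >= 3 || MAJOR == 2 && MINOR > 8" is the same as >= 2.9
static_assert(!version_at_least(2, 8, 2, 9), "2.8 does not satisfy >= 2.9");
```

The same function covers the 4.2 boundary used for `replay_bh_schedule_oneshot_event`: `version_at_least(major, minor, 4, 2)`.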


@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor
Description: Vitastor client library
Version: 0.9.3
Version: 0.9.4
Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir}


@@ -215,7 +215,7 @@ void vitastor_c_uring_wait_events(vitastor_c *client)
client->ringloop->wait();
}
bool vitastor_c_uring_has_work(vitastor_c *client)
int vitastor_c_uring_has_work(vitastor_c *client)
{
return client->ringloop->has_work();
}


@@ -46,7 +46,7 @@ int vitastor_c_uring_register_eventfd(vitastor_c *client);
void vitastor_c_uring_wait_ready(vitastor_c *client);
void vitastor_c_uring_handle_events(vitastor_c *client);
void vitastor_c_uring_wait_events(vitastor_c *client);
bool vitastor_c_uring_has_work(vitastor_c *client);
int vitastor_c_uring_has_work(vitastor_c *client);
void vitastor_c_read(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len,
struct iovec *iov, int iovcnt, VitastorReadHandler cb, void *opaque);
void vitastor_c_write(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len, uint64_t check_version,


@@ -27,9 +27,7 @@ fi
start_osd()
{
local i=$1
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $NO_SAME $OSD_ARGS --etcd_address $ETCD_URL \
$(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin $OFFSET_ARGS 2>/dev/null) \
>>./testdata/osd$i.log 2>&1 &
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $NO_SAME $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$!
}


@@ -53,13 +53,6 @@ SCHEME=xor ./test_write.sh
PG_SIZE=2 ./test_heal.sh
SCHEME=ec ./test_heal.sh
TEST_NAME=csum_32k_dmj OSD_ARGS="--data_csum_type crc32c --csum_block_size 32k --inmemory_metadata false --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS ./test_heal.sh
TEST_NAME=csum_32k_dj OSD_ARGS="--data_csum_type crc32c --csum_block_size 32k --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS ./test_heal.sh
TEST_NAME=csum_32k OSD_ARGS="--data_csum_type crc32c --csum_block_size 32k" OFFSET_ARGS=$OSD_ARGS ./test_heal.sh
TEST_NAME=csum_4k_dmj OSD_ARGS="--data_csum_type crc32c --inmemory_metadata false --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS ./test_heal.sh
TEST_NAME=csum_4k_dj OSD_ARGS="--data_csum_type crc32c --inmemory_journal false" OFFSET_ARGS=$OSD_ARGS ./test_heal.sh
TEST_NAME=csum_4k OSD_ARGS="--data_csum_type crc32c" OFFSET_ARGS=$OSD_ARGS ./test_heal.sh
./test_scrub.sh
ZERO_OSD=2 ./test_scrub.sh
SCHEME=xor ./test_scrub.sh

@@ -1,37 +0,0 @@
#!/bin/bash -ex
OSD_ARGS="--data_csum_type crc32c --csum_block_size 32k --inmemory_journal false $OSD_ARGS"
OFFSET_ARGS="--data_csum_type crc32c --csum_block_size 32k --inmemory_journal false $OFFSET_ARGS"
PG_COUNT=${PG_COUNT:-64}
. `dirname $0`/run_3osds.sh
check_qemu
IMG_SIZE=128
$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"testimg","size":'$((IMG_SIZE*1024*1024))'}'
# Write
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=1M -direct=1 -iodepth=4 \
-mirror_file=./testdata/mirror.bin -end_fsync=1 -rw=write -etcd=$ETCD_URL -image=testimg -runtime=10
# Intentionally corrupt OSD data and restart it
kill $OSD1_PID
data_offset=$(build/src/vitastor-disk simple-offsets ./testdata/test_osd1.bin $OFFSET_ARGS | grep data_offset | awk '{print $2}')
truncate -s $data_offset ./testdata/test_osd1.bin
dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
start_osd 1
# FIXME: corrupt the journal WHEN OSD IS RUNNING and check reads too
# Wait until start
wait_up 10
# Read everything back
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \
-O raw ./testdata/read.bin
diff ./testdata/read.bin ./testdata/mirror.bin
format_green OK

@@ -12,7 +12,6 @@ PG_COUNT=32
. `dirname $0`/run_3osds.sh
check_qemu
# FIXME: Fix space rebalance priorities :)
IMG_SIZE=960
$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"testimg","size":'$((IMG_SIZE*1024*1024))'}'
@@ -25,7 +24,6 @@ kill_osds()
{
sleep 5
echo Killing OSD 1
kill -9 $OSD1_PID
$ETCDCTL del /vitastor/osd/state/1
@@ -40,7 +38,6 @@ kill_osds()
done
sleep 5
echo Starting OSD 7
start_osd 7
sleep 5
@@ -49,19 +46,13 @@ kill_osds()
kill_osds &
LD_PRELOAD="build/src/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bsrange=4k-128k -blockalign=4k -direct=1 -iodepth=32 -fsync=256 -rw=randrw \
fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bsrange=4k-128k -direct=1 -iodepth=32 -fsync=256 -rw=randrw \
-randrepeat=0 -refill_buffers=1 -mirror_file=./testdata/mirror.bin -etcd=$ETCD_URL -image=testimg -loops=10 -runtime=120
qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \
-O raw ./testdata/read.bin
if ! diff -q ./testdata/read.bin ./testdata/mirror.bin; then
format_error Data lost during self-heal
fi
if grep -qP 'Checksum mismatch|BUG' ./testdata/osd*.log; then
format_error Checksum mismatches or BUGs detected during test
fi
diff ./testdata/read.bin ./testdata/mirror.bin
format_green OK

@@ -4,7 +4,7 @@
OSD_SIZE=1024
OSD_COUNT=5
OSD_ARGS="$OSD_ARGS"
OSD_ARGS=
for i in $(seq 1 $OSD_COUNT); do
dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &

@@ -8,8 +8,7 @@ etcdctl --endpoints=http://127.0.0.1:12379/v3 del --prefix /vitastor/pg/state
etcdctl --endpoints=http://127.0.0.1:12379/v3 del --prefix /vitastor/osd/state
OSD_COUNT=3
OSD_ARGS="$OSD_ARGS"
OFFSET_ARGS="$OFFSET_ARGS"
OSD_ARGS=
for i in $(seq 1 $OSD_COUNT); do
build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
eval OSD${i}_PID=$!

@@ -1,7 +1,7 @@
#!/bin/bash -ex
# Test the `no_same_sector_overwrites` mode
OSD_ARGS="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all $OSD_ARGS"
OSD_ARGS="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all"
GLOBAL_CONF='{"immediate_commit":"all"}'
. `dirname $0`/run_3osds.sh