Non-capacitor SSD flush journal error (with "immediate_commit" set to none) #3

Closed
opened 2021-03-16 11:56:13 +03:00 by DongliSi · 6 comments

Hi, I think the bug is in journal_flusher_co. The problem can be reproduced with one node and one OSD; the Vitastor version is 0.5.9 :-).

run osd:
osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1 \
    --immediate_commit none \
    --journal_offset 0 \
    --meta_offset 16777216 \
    --data_offset 260964352 \
    --data_size 858993459200 \
    --flusher_count 256 \
    --data_device /dev/sdb

run qemu-img:
qemu-img convert -p a.raw -O raw 'vitastor:etcd_host=127.0.0.1\:2379/v3:pool=1:inode=1:size=85899345920'
The conversion stalls here: (4.03/100%)
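For context, this is the device layout implied by the options above. The arithmetic is mine, not from the original report, and it assumes the journal, metadata and data regions are packed back to back:

# derived sizes (plain shell arithmetic over the offsets above)
echo "journal region:  $(( (16777216 - 0) / 1048576 )) MiB at most (journal_offset .. meta_offset)"
echo "metadata region: $(( (260964352 - 16777216) / 1048576 )) MiB (meta_offset .. data_offset)"
echo "data region:     $(( 858993459200 / 1073741824 )) GiB (data_size)"
echo "target image:    $(( 85899345920 / 1073741824 )) GiB (inode size in the qemu-img command)"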

osd log:
Still waiting to flush journal offset 00001000
[OSD 1] Slow op from client 9: primary_write id=6558 inode=1000000000001 offset=36463000 len=1d000

The larger the flusher_count setting, the more likely this problem is to occur. Setting flusher_count to 1 works fine.
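A rough way to confirm the correlation with flusher_count is to sweep it and count the stall messages. This is only a sketch, not from the original report: it assumes the same device, etcd and pool/inode setup as above, osd-$n.log is an arbitrary log destination, and the 10-minute timeout is just a cut-off so a stalled convert does not hang the loop:

for n in 1 16 64 256; do
    echo "=== flusher_count=$n ==="
    osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1 \
        --immediate_commit none --journal_offset 0 --meta_offset 16777216 \
        --data_offset 260964352 --data_size 858993459200 \
        --flusher_count $n --data_device /dev/sdb > osd-$n.log 2>&1 &
    osd_pid=$!
    sleep 5    # give the OSD time to start and register in etcd
    timeout 600 qemu-img convert -p a.raw -O raw \
        'vitastor:etcd_host=127.0.0.1\:2379/v3:pool=1:inode=1:size=85899345920' \
        || echo "convert did not finish within 10 minutes (possible stall)"
    # count how often the stall message appeared for this flusher_count
    grep -c 'Still waiting to flush journal offset' osd-$n.log
    kill $osd_pid
    wait $osd_pid 2>/dev/null
done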

Try v0.5.10 - I fixed exactly this problem during the last few days :)

Hi, I can still reproduce this problem with v0.5.10.

The problem appears easily under very high I/O load.

Eg: run "fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting" in a Linux virtual machine

I.e. does "Still waiting to flush journal offset 00001000" still happen?

Hi, this problem still exists in v0.5.13, but now the message is "Still waiting to flush journal offset 00418000".

Interestingly, this problem never appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting

But this problem always appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=10G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=600 --group_reporting
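One way to see which of the two runs actually triggers the stall is to watch the OSD log on the host while fio runs in the guest. The log path below is an assumption, not from the report; point it at wherever the OSD's stdout/stderr is actually captured:

# hypothetical log path; prints a line whenever the OSD reports the flush stall
tail -F /var/log/vitastor/osd1.log | grep --line-buffered 'Still waiting to flush journal offset'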

Hm... that's strange; I thought I caught most "flush stall" bugs. OK, I'll try to reproduce it.

vitalif reopened this issue 2021-04-16 13:22:38 +03:00

I'll close it as it's really outdated now. Several other flush stalls were fixed in the meantime; the last fix is even going into 0.8.4.
