Non-capacitor SSD flush journal error (with "immediate_commit" set to none) #3
Hi, I think the bug is in journal_flusher_co. One node with one OSD can reproduce this problem; the vitastor version is 0.5.9 :-).
Run the OSD:
osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1
--immediate_commit none
--journal_offset 0
--meta_offset 16777216
--data_offset 260964352
--data_size 858993459200
--flusher_count 256
--data_device /dev/sdb
Run qemu-img:
qemu-img convert -p a.raw -O raw 'vitastor:etcd_host=127.0.0.1:2379/v3:pool=1:inode=1:size=85899345920'
It stalls here: (4.03/100%)
OSD log:
Still waiting to flush journal offset 00001000
[OSD 1] Slow op from client 9: primary_write id=6558 inode=1000000000001 offset=36463000 len=1d000
The larger the flusher_count setting, the more likely this problem is to occur. Setting flusher_count to 1 works fine.
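Based on that observation, a temporary workaround is to restart the OSD with the flusher count reduced to 1 (the value reported to work fine). This is a sketch reusing the exact flags from the invocation above; the devices and offsets are specific to this setup and must be adjusted for other machines:

```shell
# Same invocation as before, but with --flusher_count 1 as a workaround.
# All other flags unchanged; /dev/sdb and the offsets match this report's setup.
osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1 \
    --immediate_commit none \
    --journal_offset 0 \
    --meta_offset 16777216 \
    --data_offset 260964352 \
    --data_size 858993459200 \
    --flusher_count 1 \
    --data_device /dev/sdb
```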
Try v0.5.10 - I fixed exactly this problem during the last few days :)
Hi, I can still reproduce this problem with v0.5.10.
The problem appears easily under very high I/O load.
E.g. run "fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting" in a Linux virtual machine.
I.e. does "Still waiting to flush journal offset 00001000" still happen?
Hi, This problem still exists in v0.5.13, but now it is "Still waiting to flush journal offset 00418000".
Interestingly, this problem never appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting
But this problem always appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=10G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=600 --group_reporting
Hm... it's strange, I thought I had caught most of the "flush stall" bugs. OK, I'll try to reproduce it.
I'll close this as it's really outdated now. Several other flush stalls were fixed in the meantime; the last fix is even coming in 0.8.4.