Non-capacitor SSD flush journal error (with "immediate_commit" set to none) #3

Closed
opened 2021-03-16 11:56:13 +03:00 by DongliSi · 6 comments

Hi, I think the bug is in journal_flusher_co. The problem can be reproduced with one node and one OSD; the Vitastor version is 0.5.9 :-).

run osd:
osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1 \
    --immediate_commit none \
    --journal_offset 0 \
    --meta_offset 16777216 \
    --data_offset 260964352 \
    --data_size 858993459200 \
    --flusher_count 256 \
    --data_device /dev/sdb

run qemu-img:
qemu-img convert -p a.raw -O raw 'vitastor:etcd_host=127.0.0.1\:2379/v3:pool=1:inode=1:size=85899345920'
The conversion stalls here: (4.03/100%)
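For context, this is the device layout implied by the options above. The arithmetic is mine, not from the original report, and it assumes the journal, metadata and data regions are packed back to back:

# derived sizes (plain shell arithmetic over the offsets above)
echo "journal region:  $(( (16777216 - 0) / 1048576 )) MiB at most (journal_offset .. meta_offset)"
echo "metadata region: $(( (260964352 - 16777216) / 1048576 )) MiB (meta_offset .. data_offset)"
echo "data region:     $(( 858993459200 / 1073741824 )) GiB (data_size)"
echo "target image:    $(( 85899345920 / 1073741824 )) GiB (inode size in the qemu-img command)"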

osd log:
Still waiting to flush journal offset 00001000
[OSD 1] Slow op from client 9: primary_write id=6558 inode=1000000000001 offset=36463000 len=1d000

The larger the flusher_count setting, the more likely this problem is to occur. Setting flusher_count to 1 works fine.
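A rough way to confirm the correlation with flusher_count is to sweep it and count the stall messages. This is only a sketch, not from the original report: it assumes the same device, etcd and pool/inode setup as above, osd-$n.log is an arbitrary log destination, and the 10-minute timeout is just a cut-off so a stalled convert does not hang the loop:

for n in 1 16 64 256; do
    echo "=== flusher_count=$n ==="
    osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1 \
        --immediate_commit none --journal_offset 0 --meta_offset 16777216 \
        --data_offset 260964352 --data_size 858993459200 \
        --flusher_count $n --data_device /dev/sdb > osd-$n.log 2>&1 &
    osd_pid=$!
    sleep 5    # give the OSD time to start and register in etcd
    timeout 600 qemu-img convert -p a.raw -O raw \
        'vitastor:etcd_host=127.0.0.1\:2379/v3:pool=1:inode=1:size=85899345920' \
        || echo "convert did not finish within 10 minutes (possible stall)"
    # count how often the stall message appeared for this flusher_count
    grep -c 'Still waiting to flush journal offset' osd-$n.log
    kill $osd_pid
    wait $osd_pid 2>/dev/null
done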

Try v0.5.10 - I fixed exactly this problem during the last few days :)

Hi, I can still reproduce this problem with v0.5.10.

The problem appears easily under very high I/O load.

Eg: run "fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting" in a Linux virtual machine

I.e. does "Still waiting to flush journal offset 00001000" still happen?

Hi, this problem still exists in v0.5.13, but now the message is "Still waiting to flush journal offset 00418000".

Interestingly, this problem never appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting

But this problem always appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=10G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=600 --group_reporting
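One way to see which of the two runs actually triggers the stall is to watch the OSD log on the host while fio runs in the guest. The log path below is an assumption, not from the report; point it at wherever the OSD's stdout/stderr is actually captured:

# hypothetical log path; prints a line whenever the OSD reports the flush stall
tail -F /var/log/vitastor/osd1.log | grep --line-buffered 'Still waiting to flush journal offset'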

Hm... that's strange; I thought I caught most "flush stall" bugs. OK, I'll try to reproduce it.

vitalif reopened this issue 2021-04-16 13:22:38 +03:00

I'll close it as it's really outdated now. Several other flush stalls were fixed in the meantime; the last fix is even going into 0.8.4.
