Vitastor cannot tolerate network error #2
Hi. My Vitastor cluster is configured with 3 nodes. When the network of one of the nodes is shut down, the entire cluster stops working.
For example, shutting down the network on node3 causes node1 and node2 to stop responding to client requests.
This problem is 100% reproducible.
Have you tested this situation?
The following is my environment and configuration information:
vitastor version: 0.5.5
etcd Version: 3.4.14
etcdctl --endpoints http://172.16.7.3:2379 put /vitastor/config/global '{"immediate_commit":"all"}'
etcdctl --endpoints http://172.16.7.3:2379 put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":48,"failure_domain":"host"}}'
Command parameters of one of the nodes:
etcd --name node2-ssd --initial-advertise-peer-urls http://172.16.7.4:2380 --listen-peer-urls http://172.16.7.4:2380 --listen-client-urls http://172.16.7.4:2379,http://127.0.0.1:2379 --advertise-client-urls http://172.16.7.4:2379 --initial-cluster-token vitastor --initial-cluster node1-ssd=http://172.16.7.3:2380,node2-ssd=http://172.16.7.4:2380,node3-ssd=http://172.16.7.5:2380 --initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
vitastor-osd --etcd_address 172.16.7.4:2379/v3 --bind_address 172.16.7.4 --osd_num 3 --disable_data_fsync 1 --immediate_commit all --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 --journal_sector_buffer_count 1024 --journal_offset 0 --meta_offset 16777216 --data_offset 138870784 --data_size 429496729600 --data_device /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S3Z3NB1KB15171L
vitastor-osd --etcd_address 172.16.7.4:2379/v3 --bind_address 172.16.7.4 --osd_num 4 --disable_data_fsync 1 --immediate_commit all --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 --journal_sector_buffer_count 1024 --journal_offset 0 --meta_offset 16777216 --data_offset 138870784 --data_size 429496729600 --data_device /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S3Z3NB1KB15085V
node /ovpdatastore/pkg/vitastor/mon/mon-main.js --etcd_url "http://172.16.7.4:2379" --etcd_prefix "/vitastor" --etcd_start_timeout 5
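For reference, here is a rough sketch of why this configuration would be expected, on paper, to survive the loss of one node. It assumes that a replicated PG stays active while at least pg_minsize replicas are reachable, and it uses the usual majority rule for etcd quorum. The helper names are hypothetical and are not part of Vitastor or etcd.

```python
import json

# Pool definition copied from the etcdctl command above.
POOLS_JSON = (
    '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,'
    '"pg_minsize":1,"pg_count":48,"failure_domain":"host"}}'
)

# etcd members taken from the --initial-cluster argument above.
ETCD_MEMBERS = ["node1-ssd", "node2-ssd", "node3-ssd"]


def etcd_quorum(member_count):
    """Majority of members needed for etcd to keep accepting writes."""
    return member_count // 2 + 1


def tolerated_replica_losses(pool):
    """Replicas a PG can lose while staying active.

    Assumption: a PG remains active while at least pg_minsize
    of its pg_size replicas are up.
    """
    return pool["pg_size"] - pool["pg_minsize"]


pools = json.loads(POOLS_JSON)
pool = pools["1"]

etcd_failures_ok = len(ETCD_MEMBERS) - etcd_quorum(len(ETCD_MEMBERS))
print("etcd tolerates member failures:", etcd_failures_ok)
print("pool tolerates host failures:", tolerated_replica_losses(pool))
```

Under these assumptions both layers tolerate one failed host, so a full outage after disconnecting node3 points at something other than the redundancy settings.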
Hi. First I wanted to tell you a lot of things including that I just released 0.5.7 and so on, but then I realized you're talking about the lack of TCP timeouts.
So yes, current versions of Vitastor don't use timeouts and don't detect dead connections... The Linux defaults for net.ipv4.tcp_keepalive_{time,probes,intvl} are 7200, 9, 75, so connections only die after 2 hours of inactivity, which is of course unacceptable :))) I thought about it, but I saved it for the future for some reason. :-)) I'll implement timeouts in the next few days, ok.
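To make the numbers concrete, here is a minimal Python sketch of the keepalive arithmetic and of tightening keepalive on a single socket. It is a generic illustration of the Linux TCP keepalive options, not the heartbeat mechanism Vitastor implements. The function names and the short example values (5, 3, 2) are assumptions chosen for illustration. Keepalive probes are only sent on sockets that have SO_KEEPALIVE enabled.

```python
import socket

# Linux defaults quoted above: idle seconds, probe count, probe interval.
KEEPALIVE_TIME = 7200
KEEPALIVE_PROBES = 9
KEEPALIVE_INTVL = 75


def dead_peer_detection_seconds(idle, probes, interval):
    """Approximate time until keepalive gives up on a silent peer:
    the idle period plus one interval per unanswered probe."""
    return idle + probes * interval


def enable_fast_keepalive(sock, idle=5, probes=3, interval=2):
    """Enable keepalive on one socket with short, illustrative timings.

    Returns the approximate detection time in seconds.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are Linux-specific,
    # so set them only where the platform exposes them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return dead_peer_detection_seconds(idle, probes, interval)


default_delay = dead_peer_detection_seconds(
    KEEPALIVE_TIME, KEEPALIVE_PROBES, KEEPALIVE_INTVL
)
print("default detection delay, seconds:", default_delay)
```

With the defaults, the delay is 7200 + 9 * 75 = 7875 seconds, a little over two hours, which matches the behaviour described above.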
OK, try v0.5.8, it has heartbeats. Packages are updated :-)
A small correction: I found another bug which may result in lost objects (not physically lost, but unable to be found because of the incorrect PG configuration) in some cases so I'll fix it and release v0.5.9))
OK, now you can test 0.5.9 :)