vitalif/etcd - etcd

Commit Graph

Author	SHA1	Message	Date
Gyu-Ho Lee	45fd8279f0	etcdserver: add leaseExpired debugging metrics Fix https://github.com/coreos/etcd/issues/8050. Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>	2017-06-08 10:36:25 -07:00
Hitoshi Mitake	0c655902f2	auth, etcdserver: protect revoking lease with auth Currently clients can revoke any lease without permission. This commit lets etcdserver protect revoking with write permission. This commit adds a mechanism for generating internal token. It is used for indicating that LeaseRevoke was issued internally so it should be able to delete any attached keys.	2017-06-07 17:46:14 -07:00
Anthony Romano	887db5a3db	*: fix go tool vet -all -shadow errors	2017-06-03 21:32:36 -07:00
Anthony Romano	a20e667c5b	Merge pull request #7967 from heyitsanthony/purge-snapdb etcdserver: purge old snap.db files	2017-05-30 16:15:11 -07:00
fanmin shi	9e7740011b	etcdserver: add --max-request-bytes flag	2017-05-25 11:01:38 -07:00
Anthony Romano	c1c9a2c96c	etcdserver: close mvcc.KV on init error path Scheduled compaction will panic if KV is not stopped before closing the backend.	2017-05-23 10:41:37 -07:00
Anthony Romano	ab16fa1f07	etcdserver: purge old snap.db files Lots of garbage db files in #7957. Should purge.	2017-05-22 15:44:21 -07:00
Anthony Romano	f6cd4d4f5b	snap, etcdserver: tighten up snapshot path handling Computing the snapshot file path is error prone; snapshot recovery was constructing file paths missing a path separator so the snapshot would never be loaded. Instead, refactor the backend path handling to use helper functions where possible.	2017-05-11 13:46:59 -07:00
fanmin shi	8b7b7222dd	etcdserver: renaming db happens after snapshot persists to wal and snap files If a follower receives a snapshot from the leader and crashes before renaming xxx.snap.db to db but after the snapshot has persisted to .wal and .snap, restarting the follower results in loading the old db, new .wal, and new .snap. This causes an index mismatch between the snap metadata index and the consistent index from db. This PR forces an ordering where saving/renaming db must happen after the snapshot is persisted to the wal and snap files. This guarantees that the wal and snap files are newer than db. On server restart, etcd server checks if snap index > db consistent index. If yes, etcd server attempts to load xxx.snap.db, where xxx=snap index, if there is any, and panics otherwise. FIXES #7628	2017-05-09 14:00:12 -07:00
fanmin shi	5533c3058a	etcdserver: apply() sets consistIndex for any entry type Previously, apply() didn't set consistIndex for the EntryConfChange type. This causes a misalignment between consistIndex and applied index, where an EntryConfChange entry results in setting the applied index but not consistIndex. Suppose that addMember() is called and the leader reflects that change. 1. applied index and consistIndex are now misaligned. 2. a new follower node joins. 3. leader sends the snapshot to the follower, where the applied index is the snapshot metadata index. 4. follower node saves the snapshot and database (includes consistIndex) from the leader. 5. restarting the follower loads the snapshot and database. 6. follower checks the snapshot metadata index (same as applied index) and database consistIndex, finds they don't match, and then panics. FIXES #7834	2017-05-02 14:57:36 -07:00
Gyu-Ho Lee	91f6aee4f2	etcdserver: ensure waitForApply sync with applyAll Problem is: `Step1`: `etcdserver/raft.go`'s `Ready` process routine sends config-change entries via `r.applyc <- ap` (https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L193-L203) `Step2`: `etcdserver/server.go`'s `*EtcdServer.run` routine receives this via `ap := <-s.r.apply()` (https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L735-L738) `StepA`: `Step1` proceeds without sync, right after sending `r.applyc <- ap`. `StepB`: `Step2` proceeds without sync, right after `sched.Schedule(s.applyAll(&ep,&ap))`. `StepC`: `etcdserver` tries to sync with `s.applyAll(&ep,&ap)` by calling `rh.waitForApply()`. `rh.waitForApply()` waits for all pending jobs to finish in `pkg/schedule` side. However, the order of `StepA`,`StepB`,`StepC` is not guaranteed. It is possible that `StepC` happens first, and proceeds without waiting on apply. And the restarting member comes back as a leader in single-node cluster, when there is no synchronization between apply-layer and config-change Raft entry apply. Confirmed with more debugging lines below, only reproducible with slow CPU VM (~2 vCPU). ``` ~:24.005397 I \| etcdserver: starting server... 
[version: 3.2.0+git, cluster version: to_be_decided] ~:24.011136 I \| etcdserver: [DEBUG] 29b2d24047a277df waitForApply before ~:24.011194 I \| etcdserver: [DEBUG] 29b2d24047a277df starts wait for 0 pending jobs ~:24.011234 I \| etcdserver: [DEBUG] 29b2d24047a277df finished wait for 0 pending jobs (current pending 0) ~:24.011268 I \| etcdserver: [DEBUG] 29b2d24047a277df waitForApply after ~:24.011348 I \| etcdserver: [DEBUG] [0] 29b2d24047a277df is scheduling conf change on 29b2d24047a277df ~:24.011396 I \| etcdserver: [DEBUG] [1] 29b2d24047a277df is scheduling conf change on 5edf80e32a334cf0 ~:24.011437 I \| etcdserver: [DEBUG] [2] 29b2d24047a277df is scheduling conf change on e32e31e76c8d2678 ~:24.011477 I \| etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on 29b2d24047a277df ~:24.011509 I \| etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on 5edf80e32a334cf0 ~:24.011545 I \| etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on e32e31e76c8d2678 ~:24.012500 I \| etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df before ~:24.013014 I \| etcdserver/membership: added member 29b2d24047a277df [unix://127.0.0.1:2100515039] to cluster 9250d4ae34216949 ~:24.013066 I \| etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df after ~:24.013113 I \| etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df after trigger ~:24.013158 I \| etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 5edf80e32a334cf0 before ~:24.013666 W \| etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 11.964739ms) ~:24.013709 W \| etcdserver: server is likely overloaded ~:24.013750 W \| etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 12.057265ms) ~:24.013775 W \| etcdserver: server is likely overloaded ~:24.013950 I \| raft: 29b2d24047a277df is starting a new election at term 4 ~:24.014012 I \| raft: 29b2d24047a277df became candidate at term 5 
~:24.014051 I \| raft: 29b2d24047a277df received MsgVoteResp from 29b2d24047a277df at term 5 ~:24.014107 I \| raft: 29b2d24047a277df became leader at term 5 ~:24.014146 I \| raft: raft.node: 29b2d24047a277df elected leader 29b2d24047a277df at term 5 ``` I am printing out the number of pending jobs before we call `sched.WaitFinish(0)`, and there were no pending jobs, so it returned immediately (before we schedule `applyAll`). This is the root cause of: - https://github.com/coreos/etcd/issues/7595 - https://github.com/coreos/etcd/issues/7739 - https://github.com/coreos/etcd/issues/7802 `sched.WaitFinish(0)` doesn't work when `len(f.pendings)==0` and `f.finished==0`. Config-change is the first job to apply, so `f.finished` is 0 in this case. `f.finished` monotonically increases, so we need `WaitFinish(finished+1)`. And `finished` must be the value read before calling `Schedule`. This is safe because `Schedule(applyAll)` is the only place adding jobs to `sched`. Then the scheduler waits on the single job of `applyAll`, by getting the current number of finished jobs before sending `Schedule`. Or just make it block until the `applyAll` routine triggers on the config-change job. This patch just removes `waitForApply`, and signals `raftDone` to wait until `applyAll` finishes applying entries.
Confirmed that it fixes the issue, as below: ``` ~:43.198354 I \| rafthttp: started streaming with peer 36cda5222aba364b (stream MsgApp v2 reader) ~:43.198740 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c waitForApply before ~:43.198836 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c starts wait for 0 pending jobs, 1 finished jobs ~:43.200696 I \| integration: launched 3169361310155633349 () ~:43.201784 I \| etcdserver: [DEBUG] [0] 3988bc20c2b2e40c is scheduling conf change on 36cda5222aba364b ~:43.201884 I \| etcdserver: [DEBUG] [1] 3988bc20c2b2e40c is scheduling conf change on 3988bc20c2b2e40c ~:43.201965 I \| etcdserver: [DEBUG] [2] 3988bc20c2b2e40c is scheduling conf change on cf5d6cbc2a121727 ~:43.202070 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on 36cda5222aba364b ~:43.202139 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on 3988bc20c2b2e40c ~:43.202204 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on cf5d6cbc2a121727 ~:43.202444 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) before ~:43.204486 I \| etcdserver/membership: added member 36cda5222aba364b [unix://127.0.0.1:2100913646] to cluster 425d73f1b7b01674 ~:43.204588 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) after ~:43.204703 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) after trigger ~:43.204791 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) before ~:43.205689 I \| etcdserver/membership: added member 3988bc20c2b2e40c [unix://127.0.0.1:2101113646] to cluster 425d73f1b7b01674 ~:43.205783 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) after ~:43.205929 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) after trigger ~:43.206056 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c 
applyConfChange on cf5d6cbc2a121727 (request ID: 0) before ~:43.207353 I \| etcdserver/membership: added member cf5d6cbc2a121727 [unix://127.0.0.1:2100713646] to cluster 425d73f1b7b01674 ~:43.207516 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) after ~:43.207619 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) after trigger ~:43.207710 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on 36cda5222aba364b ~:43.207781 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on 3988bc20c2b2e40c ~:43.207843 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on cf5d6cbc2a121727 ~:43.207951 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c finished wait for 0 pending jobs (current pending 0, finished 1) ~:43.208029 I \| rafthttp: started HTTP pipelining with peer cf5d6cbc2a121727 ~:43.210339 I \| rafthttp: peer 3988bc20c2b2e40c became active ~:43.210435 I \| rafthttp: established a TCP streaming connection with peer 3988bc20c2b2e40c (stream MsgApp v2 reader) ~:43.210861 I \| rafthttp: started streaming with peer 3988bc20c2b2e40c (writer) ~:43.211732 I \| etcdserver: [DEBUG] 3988bc20c2b2e40c waitForApply after ``` Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>	2017-04-25 10:22:27 -07:00
Anthony Romano	714b48a4b4	etcdserver: initialize raftNode with constructor raftNode was being initialized in start(), which was causing hangs when trying to stop the etcd server since the stop channel would not be initialized in time for the stop call. Instead, setup non-configurable bits in a constructor. Fixes #7668	2017-04-18 09:33:59 -07:00
Anthony Romano	d9ec6b4d22	*: return updated member list in v3 rpcs Now it's possible to atomically know the new member configuration from issuing a membership change RPC.	2017-04-12 16:24:51 -07:00
Anthony Romano	8ad935ef2c	etcdserver: use cancelable context for server initiated requests	2017-03-31 19:19:33 -07:00
Anthony Romano	7ef75e373a	Merge pull request #7525 from heyitsanthony/big-backend etcdserver, backend: configure mmap size based on quota	2017-03-23 10:06:00 -07:00
Hitoshi Mitake	5594f695bc	e2e, etcdserver: fix wrong usages of ordinal They must be "ordinary".	2017-03-21 23:50:16 +09:00
Anthony Romano	5e4b008106	*: base initial mmap size on quota size	2017-03-17 15:38:49 -07:00
Anthony Romano	2f1542c06d	*: use filepath.Join for files	2017-03-16 07:46:06 -07:00
Gyu-Ho Lee	80c10e150f	etcdserver: remove possibly compacted entry look-up Fix https://github.com/coreos/etcd/issues/7470. This patch removes an unnecessary term look-up in 'createMergedSnapshotMessage', which can trigger a panic if the raft entry at etcdProgress.appliedi was compacted by subsequent 'MsgSnap' messages. If a follower is slow (in this case, because of network latency spikes), it can receive subsequent 'MsgSnap' requests from the leader. The etcd server-side 'applyAll' routine and raft's Ready processing routine become asynchronous after raft entries are persisted. Given that the raft Ready routine takes less time to finish, it is possible that a second 'MsgSnap' is being handled while the slow 'applyAll' is still processing the first (old) 'MsgSnap'. The raft Ready routine can then compact log entries at an index ahead of 'applyAll'. That is how 'createMergedSnapshotMessage' tried to look up the raft term with an outdated etcdProgress.appliedi. Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>	2017-03-15 12:46:56 -07:00
Xiang	7f0733cf46	etcdserver: candidate should wait for applying all configuration changes	2017-03-14 17:20:20 -07:00
Anthony Romano	58da8b17ee	etcdserver: support mvcc txn	2017-03-08 20:54:15 -08:00
Hitoshi Mitake	f8a290e7ca	*: support jwt token in v3 auth API This commit adds jwt token support in v3 auth API. Remaining major ToDos: - Currently token type isn't hidden from etcdserver. In the near future the information should be completely invisible from etcdserver package. - Configurable expiration of token. Currently tokens can be valid until keys are changed. How to use: 1. generate keys for signing and verifying jwt tokens: $ openssl genrsa -out app.rsa 1024 $ openssl rsa -in app.rsa -pubout > app.rsa.pub 2. add command line options to etcd like below: --auth-token-type jwt \ --auth-jwt-pub-key app.rsa.pub --auth-jwt-priv-key app.rsa \ --auth-jwt-sign-method RS512 3. launch etcd cluster Below is a performance comparison of serializable read w/ and w/o jwt token. All three etcd nodes run on a single machine. Signing method is RS512 and key length is 1024 bit. As the results show, jwt based token introduces a performance overhead, but it would be acceptable for a case that requires authentication. w/o jwt token auth (no auth): Summary: Total: 1.6172 secs. Slowest: 0.0125 secs. Fastest: 0.0001 secs. Average: 0.0002 secs. Stddev: 0.0004 secs. Requests/sec: 6183.5877 Response time histogram: 0.000 [1] \| 0.001 [9982] \|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ 0.003 [1] \| 0.004 [1] \| 0.005 [0] \| 0.006 [0] \| 0.008 [6] \| 0.009 [0] \| 0.010 [1] \| 0.011 [5] \| 0.013 [3] \| Latency distribution: 10% in 0.0001 secs. 25% in 0.0001 secs. 50% in 0.0001 secs. 75% in 0.0001 secs. 90% in 0.0002 secs. 95% in 0.0002 secs. 99% in 0.0003 secs. w/ jwt token auth: Summary: Total: 2.5364 secs. Slowest: 0.0182 secs. Fastest: 0.0002 secs. Average: 0.0003 secs. Stddev: 0.0005 secs. Requests/sec: 3942.5185 Response time histogram: 0.000 [1] \| 0.002 [9975] \|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ 0.004 [0] \| 0.006 [1] \| 0.007 [11] \| 0.009 [2] \| 0.011 [4] \| 0.013 [5] \| 0.015 [0] \| 0.016 [0] \| 0.018 [1] \| Latency distribution: 10% in 0.0002 secs.
25% in 0.0002 secs. 50% in 0.0002 secs. 75% in 0.0002 secs. 90% in 0.0003 secs. 95% in 0.0003 secs. 99% in 0.0004 secs.	2017-03-06 19:46:03 -08:00
Gyu-Ho Lee	3d75395875	*: remove never-unused vars, minor lint fix Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>	2017-03-06 14:59:12 -08:00
Hitoshi Mitake	0191509637	auth, etcdserver: authenticate clients based on certificate CommonName This commit lets v3 auth mechanism authenticate clients based on CommonName of certificate like v2 auth.	2017-01-31 17:22:12 +09:00
Xiang Li	699b1e5b3a	Merge pull request #7160 from xiang90/snapshotcount etcdserver: increase snapshot to 100,000	2017-01-14 16:53:44 -08:00
Hitoshi Mitake	9886e9448e	auth, etcdserver: let maintenance services require root role This commit lets maintenance services require root privilege. It also moves AuthInfoFromCtx() from etcdserver to auth pkg for cleaning purpose.	2017-01-14 19:36:24 +09:00
Xiang Li	c5a9d54835	etcdserver: increase snapshot to 100,000 Keep more wal entries in memory for fast follower recovery. 10,000 was too small a number and triggered quite a few snapshots. ZK proves that 100,000 is a reasonable number even for old, less powerful machines. Eventually we should provide both count and max memory (for large entries).	2017-01-13 18:05:25 -08:00
vimalk78	5fac6b8d15	etcdserver: resume compactor only if leader	2017-01-04 05:01:14 +05:30
fanmin shi	2a1bae0c2a	etcdserver: consistent naming in raftReadyHandler	2016-12-29 11:27:16 -08:00
fanmin shi	2faf72f47c	etcdserver: rework update committed index logic	2016-12-27 10:11:40 -08:00
fanmin shi	fef4a79528	lease: force leader to apply its pending committed index for lease operations Suppose a lease granting request from a follower goes through and is followed by a lease look up or renewal; the leader might not have applied the lease grant request locally. So the leader might not find the lease for the lease look up or renewal request, which results in a lease not found error. To fix this issue, we force the leader to apply its pending committed index before looking up the lease. FIX #6978	2016-12-22 14:24:38 -08:00
Xiang Li	35fd5dc9fc	Merge pull request #6903 from mitake/auth-member protect membership change RPCs with auth	2016-12-15 08:04:31 -08:00
Hitoshi Mitake	86d7390804	auth, etcdserver: protect membership change operations with auth This commit protects membership change operations with auth. Only users that have root role can issue the operations. Implements https://github.com/coreos/etcd/issues/6899	2016-12-15 22:54:20 +09:00
Anthony Romano	2c06def8ca	etcdserver, embed, v2http: move pprof setup to embed Seems like a better place for prof setup since it's not specific to v2.	2016-12-09 12:37:35 -08:00
Xiang Li	2f96a68a20	etcdserver: do not send v2 sync if ttl keys do not exist	2016-12-07 14:48:15 -08:00
Vimal Kumar	dfe853ebff	auth: add a timeout mechanism to simple token	2016-11-28 17:21:13 +05:30
Xiang Li	c33d04fb54	etcdserver: print out warning when waiting for file lock	2016-11-01 17:55:16 -07:00
Gyu-Ho Lee	6ec03d3f7c	etcdserver: move 'EtcdServer.send' to raft.go Clear 'TODO'	2016-10-26 16:26:00 -07:00
Gyu-Ho Lee	0c61d8804a	etcdserver: make WaitGroup.Add sync with Wait	2016-10-12 13:11:35 -07:00
Gyu-Ho Lee	e011ea25ca	etcdserver: separate EtcdServer from raftNode	2016-10-07 13:18:39 -07:00
Xiang Li	0f0c048e29	etcdserver: fix early lessor promotion issue If we promote the lessor before we finish applying all entries from the last term, we might incorrectly renew already revoked leases. Here is an example: - Term 1: revoke lease A accepted by raft - Old leader failed, new election happened - Term 2: promote - Term 2: keep alive A succeeds. A now has 10 seconds TTL - Term 2: revoke lease A from Term 1 gets committed and applied - Term 2: the lease A with 10 seconds TTL is revoked To solve this, the new leader MUST apply all entries from the old term before promoting its lessor to start accepting renew requests.	2016-10-05 14:41:47 -07:00
Xiang Li	e3e3993022	etcdserver: support read index Use read index to achieve l-read.	2016-09-27 13:41:40 +08:00
fanmin shi	690a0b6f00	etcdserver: parallelize expired leases process When 1000 leases expire at the same time, etcd takes more than 5 seconds to clean them up. This means that even after the leases have expired, keys associated with the leases are still accessible. This commit increases the deletion throughput by parallelizing the lease deletion process.	2016-09-19 16:17:49 -07:00
Anthony Romano	3866e78c26	etcdserver: tighten up goroutine management All outstanding goroutines now go into the etcdserver waitgroup. Goroutines are shut down with a "stopping" channel which is closed when the run() goroutine shuts down. The done channel will only close once the waitgroup is totally cleared.	2016-09-19 12:10:41 -07:00
Xiang Li	771ee43169	etcdserver: allow zero kv index for cluster upgrade If a user upgrades etcd from 2.3.x to 3.0 and shuts down the cluster immediately without triggering any new backend writes, then the consistent index in the backend would be zero. The user cannot restart etcdserver due to today's strict index match checking. We now have to loosen this a bit for this case.	2016-08-30 11:28:18 -07:00
Xiang Li	7f3d4bfae5	etcdserver: kv.commit needs to be serialized with apply kv.commit updates the consistent index in backend. When executing in parallel with apply, it might grab the tx lock after apply updates the consistent index and before apply starts to execute the operation. If the server dies right after kv.commit, the consistent index is updated but the operation is not executed. If we restart etcd server, etcd will skip the operation. :( There are a few other places that we need to take care of, but let us fix this first.	2016-08-23 09:16:09 -07:00
Xiang Li	83de13e4a8	etcdserver: support apply wait	2016-08-19 16:18:35 -07:00
Xiang Li	d0fa390048	etcdserver: improve logging for leadership transfer	2016-08-17 11:40:46 -07:00
Gyu-Ho Lee	64a0e34602	etcdserver: transfer leadership when stopping	2016-08-13 14:31:58 -07:00
Gyu-Ho Lee	c6c6cfb502	etcdserver: implement 'CutPeer', 'MendPeer'	2016-08-12 07:38:52 -07:00


531 Commits (1f206c027a37d2ed80b59465d37e8c22477209c1)