vitastor/docs/usage/admin.en.md

216 lines
11 KiB
Markdown

[Documentation](../../README.md#documentation) → Usage → Administration
-----
[Читать на русском](admin.ru.md)
# Administration
- [Pool states](#pool-states)
- [PG states](#pg-states)
- [Base PG states](#base-pg-states)
- [Additional PG states](#additional-pg-states)
- [Removing a healthy disk](#removing-a-healthy-disk)
- [Removing a failed disk](#removing-a-failed-disk)
- [Adding a disk](#adding-a-disk)
- [Restoring from lost pool configuration](#restoring-from-lost-pool-configuration)
- [Upgrading Vitastor](#upgrading-vitastor)
- [OSD memory usage](#osd-memory-usage)
## Pool states
Pool is active — that is, fully available for client input/output — when all its PGs are
'active' (maybe with some additional state flags).
If at least 1 PG is inactive, pool is also inactive and all clients suspend their I/O and
wait until you fix the cluster. :-)
## PG states
PG states may be seen in [vitastor-cli status](cli.en.md#status) output.
PG state consists of exactly 1 base state and an arbitrary number of additional states.
### Base PG states
PG state always includes exactly 1 of the following base states:
- **active** — PG is active and handles user I/O.
- **incomplete** — Not enough OSDs are available to activate this PG. That is, more disks
are lost than it's allowed by the pool's redundancy scheme. For example, if the pool has
pg_size=3 and pg_minsize=1, part of the data may be written only to 1 OSD. If that exact
OSD is lost, PG will become **incomplete**.
- **offline** — PG isn't activated by any OSD at all. Either primary OSD isn't set for
this PG at all (if the pool is just created), or an unavailable OSD is set as primary,
or the primary OSD refuses to start this PG (for example, because of wrong block_size),
or the PG is stopped by the monitor using `pause: true` flag in `/vitastor/config/pgs` in etcd.
- **starting** — primary OSD has acquired PG lock in etcd, PG is starting.
- **peering** — primary OSD requests PG object listings from secondary OSDs and calculates
the PG state.
- **repeering** — PG is waiting for current I/O operations to complete and will
then transition to **peering**.
- **stopping** — PG is waiting for current I/O operations to complete and will
then transition to **offline** or be activated by another OSD.
All states except **active** mean that PG is inactive and client I/O is suspended.
**peering** state is normally visible only for a short period of time during OSD restarts
and during switching primary OSD of PGs.
**starting**, **repeering**, **stopping** states normally almost aren't visible at all.
If you notice them for any noticeable time — chances are some operations on some OSDs hung.
Search for "slow op" in OSD logs to find them — operations hung for more than
[slow_log_interval](../config/osd.en.md#slow_log_interval) are logged as "slow ops".
State transition diagram:
![PG state transitions](pg_states.svg "PG state transitions")
### Additional PG states
If a PG is active it can also have any number of the following additional states:
- **degraded** — PG is running on reduced number of drives (OSDs), redundancy of all
objects in this PG is reduced.
- **has_incomplete** — some objects in this PG are incomplete (unrecoverable), that is,
they have too many lost EC parts (more than pool's [parity_chunks](../config/pool.en.md#parity_chunks)).
- **has_degraded** — some objects in this PG have reduced redundancy
compared to the rest of the PG (so PG can be degraded+has_degraded at the same time).
These objects should be healed automatically by recovery process, unless
it's disabled by [no_recovery](../config/osd.en.md#no_recovery).
- **has_misplaced** — some objects in this PG are stored on an OSD set different from
the target set of the PG. These objects should be moved automatically, unless
rebalance is disabled by [no_rebalance](../config/osd.en.md#no_rebalance). Objects
that are degraded and misplaced at the same time are treated as just degraded.
- **has_unclean** — one more state normally noticeable only for very short time during
PG activation. It's used only with EC pools and means that some objects of this PG
have started but not finished modifications. All such objects are either quickly
committed or rolled back by the primary OSD when starting the PG, that is why the
state shouldn't be noticeable. If you notice it, it probably means that commit or
rollback operations are hung.
- **has_invalid** — PG contains objects with incorrect part ID. Never occurs normally.
It can only occur if you delete a non-empty EC pool and then recreate it as a replica
pool or with smaller data part count.
- **has_corrupted** — PG has corrupted objects, discovered by checking checksums during
read or during scrub. When possible, such objects should be recovered automatically.
If objects remain corrupted, use [vitastor-cli describe](cli.en.md#describe) to find
out details and/or look into the log of the primary OSD of the PG.
- **has_inconsistent** — PG has objects with non-matching parts or copies on different OSDs,
and it's impossible to determine which copy is correct automatically. It may happen
if you use a pool with 2 replica and you don't enable checksums, and if data on one
of replicas becomes corrupted. You should also use vitastor-cli [describe](cli.en.md#describe)
and [fix](cli.en.md#fix) commands to remove the incorrect version in this case.
- **left_on_dead** — part of the data of this PG is left on unavailable OSD that isn't
fully removed from the cluster. You should either start the corresponding OSD back and
let it remove the unneeded data or remove it from cluster using vitastor-cli
[rm-osd](cli.en.md#rm-osd) if you know that it's gone forever (for example, if the disk died).
- **scrubbing** — data [scrub](../config/osd.en.md#auto_scrub) is running for this PG.
## Removing a healthy disk
Befor removing a healthy disk from the cluster set its OSD weight(s) to 0 to
move data away. To do that, add `"reweight":0` to etcd key `/vitastor/config/osd/<OSD_NUMBER>`.
For example:
```
etcdctl --endpoints=http://1.1.1.1:2379/v3 put /vitastor/config/osd/1 '{"reweight":0}'
```
Then wait until rebalance finishes and remove OSD by running `vitastor-disk purge /dev/vitastor/osdN-data`.
## Removing a failed disk
If a disk is already dead, its OSD(s) are likely already stopped.
In this case just remove OSD(s) from the cluster by running `vitastor-cli rm-osd OSD_NUMBER`.
## Adding a disk
If you're adding a server, first install Vitastor packages and copy the
`/etc/vitastor/vitastor.conf` configuration file to it.
After that you can just run `vitastor-disk prepare /dev/nvmeXXX`, of course with
the same parameters which you used for other OSDs in your cluster before.
## Restoring from lost pool configuration
If you remove or corrupt `/vitastor/config/pools` key in etcd all pools will
be deleted. Don't worry, the data won't be lost, but you'll need to perform
a specific recovery procedure.
First you need to restore previous configuration of the pool with the same ID
and EC/replica parameters and wait until pool PGs appear in `vitastor-cli status`.
Then add all OSDs into the history records of all PGs. You can do it by running
the following script (just don't forget to use your own PG_COUNT and POOL_ID):
```
PG_COUNT=32
POOL_ID=1
ALL_OSDS=$(etcdctl --endpoints=your_etcd_address:2379 get --keys-only --prefix /vitastor/osd/stats/ | \
perl -e '$/ = undef; $a = <>; $a =~ s/\s*$//; $a =~ s!/vitastor/osd/stats/!!g; $a =~ s/\s+/,/g; print $a')
for i in $(seq 1 $PG_COUNT); do
etcdctl --endpoints=your_etcd_address:2379 put /vitastor/pg/history/$POOL_ID/$i '{"all_peers":['$ALL_OSDS']}'; done
done
```
After that all PGs should peer and find all previous data.
## Upgrading Vitastor
Every upcoming Vitastor version is usually compatible with previous both forward
and backward regarding the network protocol and etcd data structures.
So, by default, if this page doesn't contain explicit different instructions, you
can upgrade your Vitastor cluster by simply upgrading packages and restarting all
OSDs and monitors in any order.
Upgrading is performed without stopping clients (VMs/containers), you just need to
upgrade and restart servers one by one. However, ideally you should restart VMs too
to make them use the new version of the client library.
Exceptions (specific upgrade instructions):
- Upgrading <= 1.1.x to 1.2.0 or later, if you use EC n+k with k>=2, is recommended
to be performed with full downtime: first you should stop all clients, then all OSDs,
then upgrade and start everything back — because versions before 1.2.0 have several
bugs leading to invalid data being read in EC n+k, k>=2 configurations in degraded pools.
- Versions <= 0.8.7 are incompatible with versions >= 0.9.0, so you should first
upgrade from <= 0.8.7 to 0.8.8 or 0.8.9, and only then to >= 0.9.x. If you upgrade
without this intermediate step, client I/O will hang until the end of upgrade process.
- Upgrading from <= 0.5.x to >= 0.6.x is not supported.
Rollback:
- Version 1.0.0 has a new disk format, so OSDs initiaziled on 1.0.0 can't be rolled
back to 0.9.x or previous versions.
- Versions before 0.8.0 don't have vitastor-disk, so OSDs, initialized by it, won't
start with 0.7.x or 0.6.x. :-)
## OSD memory usage
OSD uses RAM mainly for:
- Metadata index: `data_size`/[`block_size`](../config/layout-cluster.en.md#block_size) * `approximately 1.1` * `32` bytes.
Consumed always.
- Copy of the on-disk metadata area: `data_size`/[`block_size`](../config/layout-cluster.en.md#block_size) * `28` bytes.
Consumed if [inmemory_metadata](../config/osd.en.md#inmemory_metadata) isn't disabled.
- Bitmaps: `data_size`/[`bitmap_granularity`](../config/layout-cluster.en.md#bitmap_granularity)/`8` * `2` bytes.
Consumed always.
- Journal index: between 0 and, approximately, journal size. Consumed always.
- Copy of the on-disk journal area: exactly journal size. Consumed if
[inmemory_journal](../config/osd.en.md#inmemory_journal) isn't disabled.
- Checksums: `data_size`/[`csum_block_size`](../config/osd.en.md#csum_block_size) * 4 bytes.
Consumed if checksums are enabled and [inmemory_metadata](../config/osd.en.md#inmemory_metadata) isn't disabled.
bitmap_granularity is almost always 4 KB.
So with default SSD settings (block_size=128k, journal_size=32M, csum_block_size=4k) memory usage is:
- Metadata and bitmaps: ~600 MB per 1 TB of data.
- Journal: up to 64 MB per 1 OSD.
- Checksums: 1 GB per 1 TB of data.
With default HDD settings (block_size=1M, journal_size=128M, csum_block_size=32k):
- Metadata and bitmaps: ~128 MB per 1 TB of data.
- Journal: up to 256 MB per 1 OSD.
- Checksums: 128 MB per 1 TB of data.