[Documentation](../../README.md#documentation) → Usage → Administration

-----

[Читать на русском](admin.ru.md)

# Administration

- [Pool states](#pool-states)
- [PG states](#pg-states)
  - [Base PG states](#base-pg-states)
  - [Additional PG states](#additional-pg-states)
- [Removing a healthy disk](#removing-a-healthy-disk)
- [Removing a failed disk](#removing-a-failed-disk)
- [Adding a disk](#adding-a-disk)
- [Restoring from lost pool configuration](#restoring-from-lost-pool-configuration)
- [Upgrading Vitastor](#upgrading-vitastor)
- [OSD memory usage](#osd-memory-usage)

## Pool states

A pool is active, that is, fully available for client input/output, when all its PGs are
'active' (possibly with some additional state flags). If at least 1 PG is inactive, the pool
is also inactive and all clients suspend their I/O and wait until you fix the cluster. :-)

## PG states

PG states can be seen in the [vitastor-cli status](cli.en.md#status) output.

A PG state consists of exactly 1 base state and an arbitrary number of additional states.

### Base PG states

A PG state always includes exactly 1 of the following base states:
- **active** — PG is active and handles user I/O.
- **incomplete** — Not enough OSDs are available to activate this PG. That is, more disks
  are lost than allowed by the pool's redundancy scheme. For example, if the pool has
  pg_size=3 and pg_minsize=1, part of the data may be written only to 1 OSD. If that exact
  OSD is lost, the PG becomes **incomplete**.
- **offline** — PG isn't activated by any OSD at all. Either the primary OSD isn't set for
  this PG at all (if the pool was just created), or an unavailable OSD is set as primary,
  or the primary OSD refuses to start this PG (for example, because of a wrong block_size),
  or the PG is stopped by the monitor using the `pause: true` flag in `/vitastor/config/pgs`
  in etcd.
- **starting** — primary OSD has acquired the PG lock in etcd, PG is starting.
- **peering** — primary OSD requests PG object listings from secondary OSDs and calculates
  the PG state.
- **repeering** — PG is waiting for current I/O operations to complete and will then
  transition to **peering**.
- **stopping** — PG is waiting for current I/O operations to complete and will then
  transition to **offline** or be activated by another OSD.

All states except **active** mean that the PG is inactive and client I/O is suspended.

The **peering** state is normally visible only for a short period of time during OSD restarts
and while switching the primary OSD of PGs.

The **starting**, **repeering** and **stopping** states are normally almost not visible at all.
If you notice them for any significant amount of time, chances are that some operations on
some OSDs are hung. Search for "slow op" in OSD logs to find them — operations hung for more
than [slow_log_interval](../config/osd.en.md#slow_log_interval) are logged as "slow ops"
(see the example after the diagram).

State transition diagram:

![PG state transitions](pg_states.svg "PG state transitions")
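For example, a quick way to check a single OSD for slow ops (a minimal sketch, assuming the
OSD runs as a `vitastor-osd@<number>` systemd unit and logs to the systemd journal; adjust
the unit number and the time range to your setup):

```
# Show "slow op" messages logged by OSD 1 during the last hour
journalctl -u vitastor-osd@1 --since "1 hour ago" | grep -i 'slow op'
```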
### Additional PG states

If a PG is active it can also have any number of the following additional states:
- **degraded** — PG is running on a reduced number of drives (OSDs), redundancy of all
  objects in this PG is reduced.
- **has_incomplete** — some objects in this PG are incomplete (unrecoverable), that is,
  they have too many lost EC parts (more than the pool's
  [parity_chunks](../config/pool.en.md#parity_chunks)).
- **has_degraded** — some objects in this PG have reduced redundancy compared to the rest
  of the PG (so a PG can be degraded+has_degraded at the same time). These objects should be
  healed automatically by the recovery process, unless it's disabled by
  [no_recovery](../config/osd.en.md#no_recovery).
- **has_misplaced** — some objects in this PG are stored on an OSD set different from the
  target set of the PG. These objects should be moved automatically, unless rebalance is
  disabled by [no_rebalance](../config/osd.en.md#no_rebalance). Objects that are degraded
  and misplaced at the same time are treated as just degraded.
- **has_unclean** — one more state, normally noticeable only for a very short time during
  PG activation. It's used only with EC pools and means that some objects of this PG have
  started but not finished modifications. All such objects are either quickly committed or
  rolled back by the primary OSD when starting the PG, which is why the state shouldn't be
  noticeable. If you notice it, it probably means that commit or rollback operations are hung.
- **has_invalid** — PG contains objects with an incorrect part ID. Never occurs normally.
  It can only occur if you delete a non-empty EC pool and then recreate it as a replica pool
  or with a smaller data part count.
- **has_corrupted** — PG has corrupted objects, discovered by checking checksums during read
  or during scrub. When possible, such objects should be recovered automatically. If objects
  remain corrupted, use [vitastor-cli describe](cli.en.md#describe) to find out the details
  and/or look into the log of the primary OSD of the PG.
- **has_inconsistent** — PG has objects with non-matching parts or copies on different OSDs,
  and it's impossible to determine which copy is correct automatically. It may happen if you
  use a pool with 2 replicas without checksums enabled, and the data on one of the replicas
  becomes corrupted. In this case you should use the vitastor-cli [describe](cli.en.md#describe)
  and [fix](cli.en.md#fix) commands to remove the incorrect version.
- **left_on_dead** — part of the data of this PG is left on an unavailable OSD that isn't
  fully removed from the cluster. You should either bring the corresponding OSD back up and
  let it remove the unneeded data, or remove it from the cluster using vitastor-cli
  [rm-osd](cli.en.md#rm-osd) if you know that it's gone forever (for example, if the disk died).
- **scrubbing** — data [scrub](../config/osd.en.md#auto_scrub) is running for this PG.

## Removing a healthy disk

Before removing a healthy disk from the cluster, set its OSD weight(s) to 0 to move the data
away. To do that, add `"reweight":0` to the etcd key `/vitastor/config/osd/<OSD_NUMBER>`.
For example:

```
etcdctl --endpoints=http://1.1.1.1:2379/v3 put /vitastor/config/osd/1 '{"reweight":0}'
```

Then wait until rebalance finishes and remove the OSD by running
`vitastor-disk purge /dev/vitastor/osdN-data`.

## Removing a failed disk

If a disk is already dead, its OSD(s) are likely already stopped.

In this case just remove the OSD(s) from the cluster by running `vitastor-cli rm-osd OSD_NUMBER`.

## Adding a disk

If you're adding a server, first install Vitastor packages and copy the
`/etc/vitastor/vitastor.conf` configuration file to it.

After that you can just run `vitastor-disk prepare /dev/nvmeXXX`, of course with the same
parameters which you used for other OSDs in your cluster before.

## Restoring from lost pool configuration

If you remove or corrupt the `/vitastor/config/pools` key in etcd, all pools will be deleted.
Don't worry, the data won't be lost, but you'll need to perform a specific recovery procedure.

First you need to restore the previous configuration of the pool with the same ID and
EC/replica parameters and wait until the pool PGs appear in `vitastor-cli status`.
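For example, if the lost pool had ID 1 and was a 3-replica pool with 32 PGs, the key could be
restored like this (a sketch with hypothetical values; take the exact parameters of your
original pool, see [Pool configuration](../config/pool.en.md) for the full list):

```
etcdctl --endpoints=your_etcd_address:2379 put /vitastor/config/pools \
    '{"1":{"name":"testpool","scheme":"replicated","pg_size":3,"pg_minsize":2,"pg_count":32,"failure_domain":"host"}}'
```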
Then add all OSDs into the history records of all PGs. You can do it by running the following
script (just don't forget to use your own PG_COUNT and POOL_ID):

```
PG_COUNT=32
POOL_ID=1
ALL_OSDS=$(etcdctl --endpoints=your_etcd_address:2379 get --keys-only --prefix /vitastor/osd/stats/ | \
    perl -e '$/ = undef; $a = <>; $a =~ s/\s*$//; $a =~ s!/vitastor/osd/stats/!!g; $a =~ s/\s+/,/g; print $a')
for i in $(seq 1 $PG_COUNT); do
    etcdctl --endpoints=your_etcd_address:2379 put /vitastor/pg/history/$POOL_ID/$i '{"all_peers":['$ALL_OSDS']}'
done
```

After that all PGs should peer and find all previous data.

## Upgrading Vitastor

Every new Vitastor version is usually compatible with previous versions both forward and
backward regarding the network protocol and etcd data structures.

So, by default, if this page doesn't contain explicit instructions to the contrary, you can
upgrade your Vitastor cluster by simply upgrading packages and restarting all OSDs and
monitors in any order.

Upgrading is performed without stopping clients (VMs/containers): you just need to upgrade
and restart servers one by one. However, ideally you should also restart VMs to make them
use the new version of the client library.

Exceptions (specific upgrade instructions):
- It's recommended to perform the upgrade from <= 1.1.x to 1.2.0 or later with full downtime
  if you use EC n+k with k>=2: first stop all clients, then all OSDs, then upgrade and start
  everything back. This is because versions before 1.2.0 have several bugs leading to invalid
  data being read in EC n+k, k>=2 configurations in degraded pools.
- Versions <= 0.8.7 are incompatible with versions >= 0.9.0, so you should first upgrade from
  <= 0.8.7 to 0.8.8 or 0.8.9, and only then to >= 0.9.x. If you upgrade without this
  intermediate step, client I/O will hang until the end of the upgrade process.
- Upgrading from <= 0.5.x to >= 0.6.x is not supported.

Rollback:
- Version 1.0.0 has a new disk format, so OSDs initialized on 1.0.0 can't be rolled back to
  0.9.x or previous versions.
- Versions before 0.8.0 don't have vitastor-disk, so OSDs initialized by it won't start with
  0.7.x or 0.6.x. :-)

## OSD memory usage

An OSD uses RAM mainly for:

- Metadata index: `data_size`/[`block_size`](../config/layout-cluster.en.md#block_size) *
  `approximately 1.1` * `32` bytes. Consumed always.
- Copy of the on-disk metadata area: `data_size`/[`block_size`](../config/layout-cluster.en.md#block_size) *
  `28` bytes. Consumed if [inmemory_metadata](../config/osd.en.md#inmemory_metadata) isn't disabled.
- Bitmaps: `data_size`/[`bitmap_granularity`](../config/layout-cluster.en.md#bitmap_granularity)/`8` *
  `2` bytes. Consumed always.
- Journal index: between 0 and, approximately, journal size. Consumed always.
- Copy of the on-disk journal area: exactly journal size. Consumed if
  [inmemory_journal](../config/osd.en.md#inmemory_journal) isn't disabled.
- Checksums: `data_size`/[`csum_block_size`](../config/osd.en.md#csum_block_size) * `4` bytes.
  Consumed if checksums are enabled and [inmemory_metadata](../config/osd.en.md#inmemory_metadata)
  isn't disabled.
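As a worked example, the following sketch (a hypothetical helper script, not shipped with
Vitastor) turns these formulas into a rough per-OSD estimate; plug in your own settings:

```
# Rough per-OSD RAM estimate based on the formulas above (all sizes in bytes)
DATA_SIZE=$((1024**4))            # usable data size on the OSD, here 1 TiB
BLOCK_SIZE=$((128*1024))          # block_size (128k is the SSD default)
BITMAP_GRANULARITY=4096           # bitmap_granularity
CSUM_BLOCK_SIZE=4096              # csum_block_size, set to 0 if checksums are disabled
JOURNAL_SIZE=$((32*1024*1024))    # journal size

META_INDEX=$((DATA_SIZE/BLOCK_SIZE*32*11/10))    # metadata index, ~1.1*32 bytes per block
META_COPY=$((DATA_SIZE/BLOCK_SIZE*28))           # copy of the on-disk metadata area
BITMAPS=$((DATA_SIZE/BITMAP_GRANULARITY/8*2))    # bitmaps
CSUMS=$((CSUM_BLOCK_SIZE > 0 ? DATA_SIZE/CSUM_BLOCK_SIZE*4 : 0))  # checksums

echo "metadata + bitmaps:   $(( (META_INDEX+META_COPY+BITMAPS)/1024/1024 )) MiB"
echo "journal (worst case): $(( JOURNAL_SIZE*2/1024/1024 )) MiB"
echo "checksums:            $(( CSUMS/1024/1024 )) MiB"
```

With the values shown it prints roughly 570 MiB of metadata and bitmaps, 64 MiB of journal
and 1 GiB of checksums per 1 TiB of data, in line with the SSD defaults listed below.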
bitmap_granularity is almost always 4 KB.

So with default SSD settings (block_size=128k, journal_size=32M, csum_block_size=4k) memory
usage is:

- Metadata and bitmaps: ~600 MB per 1 TB of data.
- Journal: up to 64 MB per 1 OSD.
- Checksums: 1 GB per 1 TB of data.

With default HDD settings (block_size=1M, journal_size=128M, csum_block_size=32k):

- Metadata and bitmaps: ~128 MB per 1 TB of data.
- Journal: up to 256 MB per 1 OSD.
- Checksums: 128 MB per 1 TB of data.
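To compare such estimates with reality, you can check the resident memory of the running OSD
processes; a minimal sketch, assuming the OSD processes are named `vitastor-osd`:

```
# RSS is reported by ps in KiB; largest consumers first
ps -C vitastor-osd -o pid,rss,cmd --sort=-rss
```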