[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Pool configuration

-----

[Читать на русском](pool.ru.md)

# Pool configuration

Pool configuration is set in etcd key `/vitastor/config/pools` in the following
JSON format:

```
{
  "<Numeric ID>": {
    "name": "<name>",
    ...other parameters...
  }
}
```

Pool configuration is also affected by:

- [OSD Placement Tree](#placement-tree)
- [Separate OSD settings](#osd-settings)

Parameters:

- [name](#name)
- [scheme](#scheme)
- [pg_size](#pg_size)
- [parity_chunks](#parity_chunks)
- [pg_minsize](#pg_minsize)
- [pg_count](#pg_count)
- [failure_domain](#failure_domain)
- [level_placement](#level_placement)
- [raw_placement](#raw_placement)
- [max_osd_combinations](#max_osd_combinations)
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
- [pg_stripe_size](#pg_stripe_size)
- [root_node](#root_node)
- [osd_tags](#osd_tags)
- [primary_affinity_tags](#primary_affinity_tags)
- [scrub_interval](#scrub_interval)
- [used_for_fs](#used_for_fs)

Examples:

- [Replicated Pool](#replicated-pool)
- [Erasure-coded Pool](#erasure-coded-pool)

# Placement Tree

OSD placement tree is set in a separate etcd key `/vitastor/config/node_placement`
in the following JSON format:

```
{
  "<node name or OSD number>": {
    "level": "<level>",
    "parent": "<parent node name, if any>"
  },
  ...
}
```

Here, if a node name is a number, then it is assumed to refer to an OSD.
The level of an OSD is always "osd" and cannot be overridden. You may only
override the parent node of an OSD, which is its host by default.

Non-numeric node names refer to other placement tree nodes like hosts, racks,
datacenters and so on.

Hosts of all OSDs are auto-created in the tree with level "host" and a name
equal to the host name reported by the corresponding OSD. You can refer to them
without adding them to this JSON tree manually.

Level may be "host", "osd" or refer to some other placement tree level
from [placement_levels](monitor.en.md#placement_levels).

Parent node reference is required for intermediate tree nodes.
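
For illustration, a hypothetical tree with two racks (all names here are examples,
and the "rack" level is assumed to be declared in [placement_levels](monitor.en.md#placement_levels)):

```
{
  "rack1": { "level": "rack" },
  "rack2": { "level": "rack" },
  "host1": { "level": "host", "parent": "rack1" },
  "host2": { "level": "host", "parent": "rack2" },
  "7": { "parent": "host2" }
}
```

Here the hosts are attached to racks, and OSD 7 is additionally reparented from its
auto-detected host to "host2".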

# OSD settings

Separate OSD settings are set in etcd keys `/vitastor/config/osd/<number>`
in JSON format `{"<key>":<value>}`.

As of now, the following settings are supported:

- [reweight](#reweight)
- [tags](#tags)
- [noout](#noout)
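
For example, a hypothetical `/vitastor/config/osd/5` key combining all three settings
(the OSD number and values are purely illustrative):

```
{
  "reweight": 0.5,
  "tags": ["ssd"],
  "noout": true
}
```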

## reweight

- Type: number, between 0 and 1
- Default: 1

Every OSD receives PGs proportional to its size. Reweight is a multiplier for
OSD size used during PG distribution.

This means an OSD configured with reweight lower than 1 receives fewer PGs than
it normally would. An OSD with reweight = 0 won't store any data. You can set
reweight to 0 to trigger rebalance and remove all data from an OSD.

## tags

- Type: string or array of strings

Sets one or multiple tags for this OSD. Tags can be used to group OSDs into
subsets and then use a specific subset for a pool instead of all OSDs.
For example, you can mark SSD OSDs with the tag "ssd" and HDD OSDs with "hdd",
and such tags will work as device classes.
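
To illustrate (names and values are hypothetical), tagging an OSD with "ssd" and then
restricting a pool to that tag via [osd_tags](#osd_tags) could look like this:

```
/vitastor/config/osd/1 = {"tags": ["ssd"]}
/vitastor/config/pools = {"1": {"name": "ssdpool", "scheme": "replicated",
    "pg_size": 2, "pg_minsize": 1, "pg_count": 128, "osd_tags": "ssd"}}
```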

## noout

- Type: boolean
- Default: false

If set to true, [osd_out_time](monitor.en.md#osd_out_time) is ignored for this
OSD and it's never removed from data distribution by the monitor.

# Pool parameters

## name

- Type: string
- Required

Pool name.

## scheme

- Type: string
- Required
- One of: "replicated", "xor", "ec" or "jerasure"

Redundancy scheme used for data in this pool. "jerasure" is an alias for "ec";
both use Reed-Solomon-Vandermonde codes based on the ISA-L or jerasure libraries.
The fast ISA-L based implementation is used automatically when it's available,
the slower jerasure version is used otherwise.

## pg_size

- Type: integer
- Required

Total number of disks for PGs of this pool - i.e., number of replicas for
replicated pools and number of data plus parity disks for EC/XOR pools.
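
For example, pg_size=3 means 3 full copies of each block in a replicated pool, and
2 data disks plus 1 parity disk in an EC pool with parity_chunks=1.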

## parity_chunks

- Type: integer

Number of parity chunks for EC/XOR pools. For such pools, data will be lost
if you lose more than parity_chunks disks at once, so this parameter can be
equally described as FTT (number of failures to tolerate).

Required for EC/XOR pools, ignored for replicated pools.

## pg_minsize

- Type: integer
- Required

Number of available live disks for PGs of this pool to remain active.
That is, if it becomes impossible to place PG data on at least (pg_minsize)
OSDs, the PG is deactivated for both read and write. So you know that a fresh
write always goes to at least (pg_minsize) OSDs (disks).

For example, the difference between pg_minsize 2 and 1 in a 3-way replicated
pool (pg_size=3) is:
- If 2 hosts go down with pg_minsize=2, the pool becomes inactive and remains
  inactive for [osd_out_time](monitor.en.md#osd_out_time) (10 minutes). After
  this timeout, the monitor selects replacement hosts/OSDs and the pool comes
  up and starts to heal. Therefore, if you don't have replacement OSDs, i.e.
  if you only have 3 hosts with OSDs and 2 of them are down, the pool remains
  inactive until you add or return at least 1 host (or change failure_domain
  to "osd").
- If 2 hosts go down with pg_minsize=1, the pool only experiences a short
  I/O pause until the monitor notices that the OSDs are down (5-10 seconds with
  the default [etcd_report_interval](osd.en.md#etcd_report_interval)). After
  this pause, I/O resumes, but new data is temporarily written in only 1 copy.
  Then, after osd_out_time, the monitor also selects replacement OSDs and the
  pool starts to heal.

So, pg_minsize regulates the number of failures that a pool can tolerate
without temporary downtime for [osd_out_time](monitor.en.md#osd_out_time),
but at the cost of slightly reduced storage reliability.

FIXME: pg_minsize behaviour may be changed in the future to only make PGs
read-only instead of deactivating them.

## pg_count

- Type: integer
- Required

Number of PGs for this pool. The value should be big enough for the monitor /
LP solver to be able to optimize data placement.

"Enough" is usually around 10-100 PGs per OSD, i.e. you set pg_count for a pool
to (total OSD count * 10 / pg_size). You can round it to the closest power of 2,
because that makes it easier to reduce or increase PG count later by dividing or
multiplying it by 2.
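
For example, a hypothetical cluster of 24 OSDs with a 3-replica pool:

```
pg_count = 24 * 10 / 3 = 80  ->  rounded to a power of 2: 64 or 128
```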

In Vitastor, PGs are ephemeral, so you can change the pool PG count anytime just
by overwriting the pool configuration in etcd. The amount of data affected by
rebalance will be smaller if the new PG count is a multiple of the old PG count
or vice versa.

## failure_domain

- Type: string
- Default: host

Failure domain specification. Must be "host" or "osd" or refer to one of the
placement tree levels, defined in [placement_levels](monitor.en.md#placement_levels).

Two replicas, or two parts in case of EC/XOR, of the same block of data are
never put on OSDs in the same failure domain (for example, on the same host).
So the failure domain specifies the unit whose failure you are protecting
yourself from.
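
For example, with a hypothetical "rack" level declared in [placement_levels](monitor.en.md#placement_levels)
and racks added to [node_placement](#placement-tree), setting failure_domain to "rack"
guarantees that no two copies or chunks of the same block ever land in the same rack.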

## level_placement

- Type: string

Additional failure domain rules, applied in conjunction with failure_domain.
Must be specified in the following form:

`<placement level>=<sequence of characters>, <level2>=<sequence2>, ...`

The sequence should be exactly [pg_size](#pg_size) characters long. Each character
corresponds to an OSD in the PG of this pool. Equal characters mean that the
corresponding items of the PG should be placed into the same placement tree
item at this level. Different characters mean that items should be placed into
different items.

For example, if you want an EC 4+2 pool and you want every 2 chunks to be stored
in its own datacenter and you also want each chunk to be stored on a different
host, you should set `level_placement` to `dc=112233 host=123456`.

Or you can set `level_placement` to `dc=112233` and leave `failure_domain` empty,
because `host` is the default `failure_domain` and it will be applied anyway.

Without this rule, it may happen that 3 chunks are stored on OSDs in the
same datacenter, and the data will become inaccessible if that datacenter goes
down.

Of course, you should group your hosts into datacenters before applying the rule,
by setting [placement_levels](monitor.en.md#placement_levels) to something like
`{"dc":90,"host":100,"osd":110}` and adding DCs to [node_placement](#placement-tree),
like `{"dc1":{"level":"dc"},"host1":{"parent":"dc1"},...}`.
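
A hypothetical end-to-end sketch for the EC 4+2 example above (datacenter and host
names are illustrative, and `{"dc":90,"host":100,"osd":110}` placement levels are assumed):

```
/vitastor/config/node_placement:
{
  "dc1": { "level": "dc" }, "dc2": { "level": "dc" }, "dc3": { "level": "dc" },
  "host1": { "parent": "dc1" }, "host2": { "parent": "dc1" },
  "host3": { "parent": "dc2" }, "host4": { "parent": "dc2" },
  "host5": { "parent": "dc3" }, "host6": { "parent": "dc3" }
}

/vitastor/config/pools:
{
  "3": {
    "name": "ec-multi-dc",
    "scheme": "ec",
    "pg_size": 6,
    "parity_chunks": 2,
    "pg_minsize": 4,
    "pg_count": 256,
    "level_placement": "dc=112233"
  }
}
```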

## raw_placement

- Type: string

Raw PG placement rules, specified in the form of a DSL (domain-specific language).
Use only if you really know what you're doing :)

DSL specification:

```
dsl := item | item ("\n" | ",") items
item := "any" | rules
rules := rule | rule rules
rule := level operator arg
level := /\w+/
operator := "!=" | "=" | ">" | "?="
arg := value | "(" values ")"
values := value | value "," values
value := item_ref | constant_id
item_ref := /\d+/
constant_id := /"([^"]+)"/
```

The "?=" operator means "preferred". I.e. `dc ?= "meow"` means "prefer datacenter meow
for this chunk, but put it into another dc if it's unavailable".

Examples:

- Simple 3 replicas with failure_domain=host: `any, host!=1, host!=(1,2)`
- EC 4+2 in 3 DC: `any, dc=1 host!=1, dc!=1, dc=3 host!=3, dc!=(1,3), dc=5 host!=5`
- 1 replica in fixed DC + 2 in random DCs: `dc?=meow, dc!=1, dc!=(1,2)`
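
Like level_placement, the rule is passed as a plain string inside the pool configuration.
A hypothetical sketch using the "EC 4+2 in 3 DC" rule from the examples above:

```
{
  "4": {
    "name": "ec-raw-dc",
    "scheme": "ec",
    "pg_size": 6,
    "parity_chunks": 2,
    "pg_minsize": 4,
    "pg_count": 256,
    "raw_placement": "any, dc=1 host!=1, dc!=1, dc=3 host!=3, dc!=(1,3), dc=5 host!=5"
  }
}
```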

## max_osd_combinations

- Type: integer
- Default: 10000

The Vitastor data placement algorithm is based on an LP solver, and the OSD
combinations fed to it are generated randomly. This parameter specifies the maximum
number of combinations to generate when optimising PG placement.

This parameter usually doesn't need to be changed.

## block_size

- Type: integer
- Default: 131072

Block size for this pool. The value from /vitastor/config/global is used when
unspecified. Only OSDs with matching block_size are used for each pool. If you
want to further restrict OSDs for the pool, use [osd_tags](#osd_tags).

Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#block_size).

## bitmap_granularity

- Type: integer
- Default: 4096

"Sector" size of virtual disks in this pool. The value from /vitastor/config/global
is used when unspecified. Similarly to block_size, only OSDs with matching
bitmap_granularity are used for each pool.

Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#bitmap_granularity).

## immediate_commit

- Type: string, one of "all", "small" and "none"
- Default: none

Immediate commit setting for this pool. The value from /vitastor/config/global
is used when unspecified. Similarly to block_size, only OSDs with a compatible
immediate_commit setting are used for each pool. "Compatible" means that a pool with
non-immediate commit will use OSDs with immediate commit enabled, but not vice
versa. I.e., pools with "none" use all OSDs, pools with "small" only use OSDs
with "all" or "small", and pools with "all" only use OSDs with "all".

Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#immediate_commit).

## pg_stripe_size

- Type: integer
- Default: 0

Specifies the stripe size for this pool according to which images are split into
different PGs. Stripe size can't be smaller than [block_size](layout-cluster.en.md#block_size)
multiplied by (pg_size - parity_chunks) for EC/XOR pools, or by 1 for replicated pools,
and the same value is used by default.

This means the first `pg_stripe_size = (block_size * (pg_size-parity_chunks))` bytes
of an image go to one PG, the next `pg_stripe_size` bytes go to another PG and so on.
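
For example, with the default 128 KB block_size and an EC 4+2 pool (pg_size=6,
parity_chunks=2), the default stripe size works out to:

```
pg_stripe_size = block_size * (pg_size - parity_chunks) = 128 KB * 4 = 512 KB
```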

Usually doesn't need to be changed separately from the block size.

## root_node

- Type: string

Specifies the root node of the OSD tree to restrict this pool's OSDs to.
The referenced root node must exist in /vitastor/config/node_placement.

## osd_tags

- Type: string or array of strings

Specifies OSD tags to restrict this pool to. If multiple tags are specified,
only OSDs having all of these tags will be used for this pool.

## primary_affinity_tags

- Type: string or array of strings

Specifies OSD tags to prefer when selecting primary OSDs for this pool.
Note that for EC/XOR pools Vitastor always prefers to put the primary OSD on one
of the OSDs containing a data chunk for a PG.

## scrub_interval

- Type: time interval (number + unit s/m/h/d/M/y)

Automatic scrubbing interval for this pool. Overrides the
[global scrub_interval setting](osd.en.md#scrub_interval).
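
For example, `"scrub_interval": "30d"` (an illustrative value) sets the automatic scrub
interval for this pool to 30 days.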

## used_for_fs

- Type: string

If non-empty, the pool is marked as used for VitastorFS with metadata stored
in a block image (a regular Vitastor volume) named as the value of this pool parameter.

When a pool is marked as used for VitastorFS, regular block volume creation in it
is disabled (vitastor-cli refuses to create images without --force) to protect
the user from block volume and FS file ID collisions and data loss.

[vitastor-nfs](../usage/nfs.en.md), in turn, refuses to use pools not marked
for the corresponding FS when starting. This also implies that you can use one
pool only for one VitastorFS.

The second thing that is disabled for VitastorFS pools is reporting per-inode space
usage statistics in etcd, because a FS pool may store a very large number of files
and statistics for them all would take a lot of space in etcd.
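
For example, `"used_for_fs": "myfs"` (a hypothetical FS name) dedicates the pool to a
VitastorFS whose metadata is stored in a block image named `myfs`.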

# Examples

## Replicated pool

```
{
  "1": {
    "name": "testpool",
    "scheme": "replicated",
    "pg_size": 2,
    "pg_minsize": 1,
    "pg_count": 256,
    "failure_domain": "host"
  }
}
```

## Erasure-coded pool

```
{
  "2": {
    "name": "ecpool",
    "scheme": "ec",
    "pg_size": 3,
    "parity_chunks": 1,
    "pg_minsize": 2,
    "pg_count": 256,
    "failure_domain": "host"
  }
}
```