---
title: Architecture
weight: 3
---

For people familiar with Ceph, Vitastor is quite similar:

- Vitastor also has Pools, PGs, OSDs, Monitors, Failure Domains and a Placement Tree:
  - OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
  - PG (Placement Group) is a container for data that (normally) shares the same replicas.
  - Pool is a container for data that has the same redundancy scheme and placement rules.
  - Monitor is a separate daemon that watches cluster state and controls data distribution.
  - Failure Domain is a group of OSDs that you allow to fail together. It's "host" by default.
  - Placement Tree groups OSDs into a hierarchy that is later split into Failure Domains.
- Vitastor also distributes the data of every image across the whole cluster.
- Vitastor is also transactional (every write to the cluster is atomic).
- OSDs also have a journal and metadata, and both can be put on separate drives.
- Just like in Ceph, the client library attempts to recover from any cluster failure, so
  you can basically reboot the whole cluster and only pause, but not crash, your clients
  (please report a bug if a client does crash in that case).

However, there are also differences:

- Vitastor's main focus is on SSDs. Hybrid SSD+HDD setups are also possible.
- Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core
  per drive, you should run multiple OSDs, each on a different partition of the drive.
  Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
- Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
  and data store block size, which is 128 KB by default. With 128 KB blocks, metadata should occupy
  around 512 MB per 1 TB (which is still less than Ceph wants); see the example calculation after
  this list. The journal doesn't have to be big: the example test below was conducted with only a
  16 MB journal. A big journal is probably even harmful, as dirty write metadata also takes some memory.
- Vitastor's storage layer doesn't have internal copy-on-write or redirect-write. I know that it may
  be possible to create a good copy-on-write storage, but it's much harder and makes performance
  less deterministic, so CoW isn't used in Vitastor.
- The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with
  rich semantics like in Ceph (RADOS).
- There's a "lazy fsync" mode which allows batching writes before flushing them to disk.
  It makes it possible to use Vitastor with desktop SSDs, but it still lowers performance due to
  additional network round trips, so use server SSDs with capacitor-based power loss protection
  ("Advanced Power Loss Protection") for best performance.
- PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory
  while OSDs are running.
- The recovery process is per-object (per-block), not per-PG. Also, there are no PGLOGs.
- Monitors don't store data. Cluster configuration and state are stored in etcd as simple
  human-readable JSON structures (a sketch of reading them directly from etcd is shown after this
  list). Monitors only watch cluster state and handle data movement. Thus Vitastor's Monitor isn't
  a critical component of the system and is more similar to Ceph's Manager. Vitastor's Monitor is
  implemented in node.js.
- PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd.
  Rebalancing PGs between OSDs is done by mathematical optimization: the data distribution problem
  is reduced to a linear programming problem and solved by lp_solve (a toy formulation is sketched
  after this list). This allows for almost perfect (96-99% uniformity, compared to Ceph's 80-90%)
  data distribution in most cases, the ability to map PGs by hand without breaking rebalancing
  logic, reduced OSD peer-to-peer communication (on average, OSDs have fewer peers) and less data
  movement. It probably also has a drawback: this method may fail in very large clusters, but up
  to several hundred OSDs it's perfectly fine. It's also easy to add consistent hashes in the
  future if something proves their necessity.
- There's no separate CRUSH layer. You select the pool redundancy scheme, placement root, failure
  domain and so on directly in the pool configuration (see the example pool definition after this list).
- Images are global, i.e. you can't create multiple images with the same name in different pools.

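To make the metadata estimate above concrete, here is a back-of-the-envelope calculation.
The per-entry size of about 64 bytes is an assumption inferred from the "512 MB per 1 TB" figure,
not a number taken from the Vitastor source:

```js
// Rough in-memory metadata estimate for one OSD drive.
// ASSUMPTION: ~64 bytes of RAM per data block, inferred from the
// "around 512 MB per 1 TB" figure above; the real per-entry size may differ.
const driveSizeBytes = 1024 ** 4;    // 1 TiB drive
const blockSizeBytes = 128 * 1024;   // 128 KB default data store block size
const bytesPerEntry  = 64;           // assumed per-block metadata overhead

const blockCount    = driveSizeBytes / blockSizeBytes; // 8,388,608 blocks
const metadataBytes = blockCount * bytesPerEntry;      // = 512 MiB

console.log(`${blockCount} blocks -> ~${metadataBytes / 1024 ** 2} MiB of metadata RAM`);
// Prints: 8388608 blocks -> ~512 MiB of metadata RAM
```
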
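Because cluster state lives in etcd as plain JSON, it can be inspected without any Vitastor-specific
tooling. Below is a minimal sketch, assuming a local etcd 3.4+ with its JSON gateway and the default
/vitastor key prefix; the exact key layout may differ between versions, so treat the prefix as an example:

```js
// Minimal sketch: list Vitastor cluster state straight from etcd via its
// JSON/gRPC gateway (etcd 3.4+). Assumes etcd on localhost and the default
// "/vitastor" key prefix; adjust both to your setup.
const ETCD = 'http://127.0.0.1:2379';
const b64 = s => Buffer.from(s).toString('base64');

async function listPrefix(prefix) {
    // range_end = prefix with its last byte incremented, i.e. "everything under prefix"
    const end = prefix.slice(0, -1) + String.fromCharCode(prefix.charCodeAt(prefix.length - 1) + 1);
    const res = await fetch(ETCD + '/v3/kv/range', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ key: b64(prefix), range_end: b64(end) }),
    });
    const data = await res.json();
    for (const kv of data.kvs || []) {
        const key = Buffer.from(kv.key, 'base64').toString();
        const value = Buffer.from(kv.value, 'base64').toString();
        console.log(key, '=>', value); // values are plain human-readable JSON
    }
}

listPrefix('/vitastor/config/').catch(console.error);
```
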
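The following toy sketch illustrates the kind of linear programming formulation referred to above.
It is not Vitastor's actual optimizer; the OSD capacities and PG counts are made up. It prints a tiny
model in lp_solve's LP format that maps PGs to OSD pairs while keeping each OSD's load proportional
to its capacity:

```js
// Toy illustration of LP-based PG placement (NOT the real optimizer from the
// Vitastor monitor). Variables x_i_j = number of PGs mapped to the OSD pair (i, j).
const osds = { 1: 1.0, 2: 1.0, 3: 2.0 }; // hypothetical OSD id -> capacity, TB
const pgCount = 8, pgSize = 2;           // 8 PGs, 2 replicas each

const ids = Object.keys(osds);
const total = Object.values(osds).reduce((a, b) => a + b, 0);
const pairs = [];
for (let a = 0; a < ids.length; a++)
    for (let b = a + 1; b < ids.length; b++)
        pairs.push([ids[a], ids[b]]);

const sum = list => list.map(([a, b]) => `x_${a}_${b}`).join(' + ');
let lp = '/* place as many PGs as possible */\n';
lp += `max: ${sum(pairs)};\n`;
lp += `pg_total: ${sum(pairs)} <= ${pgCount};\n`;
for (const id of ids) {
    // each OSD gets at most its capacity-proportional share of PG slots
    const share = Math.round(pgCount * pgSize * osds[id] / total);
    lp += `osd_${id}: ${sum(pairs.filter(p => p.includes(id)))} <= ${share};\n`;
}
lp += `int ${pairs.map(([a, b]) => `x_${a}_${b}`).join(', ')};\n`;
console.log(lp); // save to a file and run lp_solve on it to get the PG -> OSD pair counts
```
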
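As an illustration of CRUSH-less pool configuration, here is what a pool definition can look like.
The key names follow the example from the Vitastor README (stored at the /vitastor/config/pools etcd
key), but treat them as an assumption and consult the pool configuration reference for the
authoritative list of parameters:

```js
// Illustrative pool definition: redundancy scheme, PG count and failure domain
// are set directly in the pool configuration, without a separate CRUSH map.
// Key names are taken from the README example and may differ between versions.
const pools = {
    1: {
        name: 'testpool',
        scheme: 'replicated',   // redundancy scheme of the pool
        pg_size: 2,             // replicas per PG
        pg_minsize: 1,          // minimum live replicas required to accept writes
        pg_count: 256,          // number of PGs in the pool
        failure_domain: 'host', // all OSDs of one host are allowed to fail together
    },
};
// Print the JSON to store at the /vitastor/config/pools key (e.g. with `etcdctl put`):
console.log(JSON.stringify(pools));
```
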
## Implementation Principles

- I like architecturally simple solutions. Vitastor is and will always be designed
  exactly like that.
- I also like reinventing the wheel to some extent, like writing my own HTTP client
  for etcd interaction instead of using prebuilt libraries, because in this case
  I'm confident about what my code does and what it doesn't do.
- I don't care about C++ "best practices" like RAII, proper inheritance or usage of
  smart pointers, and I don't intend to change my mind, so if you're here looking for
  ideal reference C++ code, this probably isn't the right place.
- I like node.js better than any other dynamically-typed language interpreter
  because it's faster than any other interpreter in the world, has neutral C-like
  syntax and a built-in event loop. That's why the Monitor is implemented in node.js.