pve-qemu/debian/patches/pve/0048-qapi-blockdev-backup-a...

278 lines
12 KiB
Diff
Raw Normal View History

implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Date: Thu, 11 Apr 2024 11:29:25 +0200
Subject: [PATCH] qapi: blockdev-backup: add discard-source parameter
Add a parameter that enables discard-after-copy. That is mostly useful
in "push backup with fleecing" scheme, when source is snapshot-access
format driver node, based on copy-before-write filter snapshot-access
API:
[guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target]
| |
| root | file
v v
[copy-before-write]
| |
| file | target
v v
[active disk] [temp.img]
In this case discard-after-copy does two things:
- discard data in temp.img to save disk space
- avoid further copy-before-write operation in discarded area
Note that we have to declare WRITE permission on source in
copy-before-write filter, for discard to work. Still we can't take it
unconditionally, as it will break normal backup from RO source. So, we
have to add a parameter and pass it thorough bdrv_open flags.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
block/backup.c | 5 +++--
block/block-copy.c | 9 +++++++++
block/copy-before-write.c | 15 +++++++++++++--
block/copy-before-write.h | 1 +
block/replication.c | 4 ++--
blockdev.c | 2 +-
include/block/block-common.h | 2 ++
include/block/block-copy.h | 1 +
include/block/block_int-global-state.h | 2 +-
qapi/block-core.json | 4 ++++
10 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/block/backup.c b/block/backup.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index aec140e0c8..f19b751fe6 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/block/backup.c
+++ b/block/backup.c
@@ -332,7 +332,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
BlockDriverState *target, int64_t speed,
MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
BitmapSyncMode bitmap_mode,
- bool compress,
+ bool compress, bool discard_source,
const char *filter_node_name,
BackupPerf *perf,
BlockdevOnError on_source_error,
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -433,7 +433,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
goto error;
}
- cbw = bdrv_cbw_append(bs, target, filter_node_name, &bcs, errp);
+ cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source,
+ &bcs, errp);
if (!cbw) {
goto error;
}
diff --git a/block/block-copy.c b/block/block-copy.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index 8fca2c3698..7e3b378528 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -137,6 +137,7 @@ typedef struct BlockCopyState {
CoMutex lock;
int64_t in_flight_bytes;
BlockCopyMethod method;
+ bool discard_source;
BlockReqList reqs;
QLIST_HEAD(, BlockCopyCallState) calls;
/*
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -353,6 +354,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
BlockDriverState *copy_bitmap_bs,
const BdrvDirtyBitmap *bitmap,
+ bool discard_source,
Error **errp)
{
ERRP_GUARD();
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -418,6 +420,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
cluster_size),
};
+ s->discard_source = discard_source;
block_copy_set_copy_opts(s, false, false);
ratelimit_init(&s->rate_limit);
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -589,6 +592,12 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
co_put_to_shres(s->mem, t->req.bytes);
block_copy_task_end(t, ret);
+ if (s->discard_source && ret == 0) {
+ int64_t nbytes =
+ MIN(t->req.offset + t->req.bytes, s->len) - t->req.offset;
+ bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+ }
+
return ret;
}
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index afa5f473d2..5506f66857 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -44,6 +44,7 @@ typedef struct BDRVCopyBeforeWriteState {
BdrvChild *target;
OnCbwError on_cbw_error;
uint32_t cbw_timeout_ns;
+ bool discard_source;
/*
* @lock: protects access to @access_bitmap, @done_bitmap and
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -357,6 +358,8 @@ cbw_child_perm(BlockDriverState *bs, BdrvChild *c, BdrvChildRole role,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
{
+ BDRVCopyBeforeWriteState *s = bs->opaque;
+
if (!(role & BDRV_CHILD_FILTERED)) {
/*
* Target child
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -381,6 +384,10 @@ cbw_child_perm(BlockDriverState *bs, BdrvChild *c, BdrvChildRole role,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
* start
*/
*nperm = *nperm | BLK_PERM_CONSISTENT_READ;
+ if (s->discard_source) {
+ *nperm = *nperm | BLK_PERM_WRITE;
+ }
+
*nshared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
}
}
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -472,7 +479,9 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
bs->file->bs->supported_zero_flags);
- s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap, errp);
+ s->discard_source = flags & BDRV_O_CBW_DISCARD_SOURCE;
+ s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap,
+ flags & BDRV_O_CBW_DISCARD_SOURCE, errp);
if (!s->bcs) {
error_prepend(errp, "Cannot create block-copy-state: ");
ret = -EINVAL;
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -546,12 +555,14 @@ static BlockDriver bdrv_cbw_filter = {
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
BlockDriverState *target,
const char *filter_node_name,
+ bool discard_source,
BlockCopyState **bcs,
Error **errp)
{
BDRVCopyBeforeWriteState *state;
BlockDriverState *top;
QDict *opts;
+ int flags = BDRV_O_RDWR | (discard_source ? BDRV_O_CBW_DISCARD_SOURCE : 0);
assert(source->total_sectors == target->total_sectors);
GLOBAL_STATE_CODE();
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -564,7 +575,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
qdict_put_str(opts, "file", bdrv_get_node_name(source));
qdict_put_str(opts, "target", bdrv_get_node_name(target));
- top = bdrv_insert_node(source, opts, BDRV_O_RDWR, errp);
+ top = bdrv_insert_node(source, opts, flags, errp);
if (!top) {
return NULL;
}
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 6e72bb25e9..01af0cd3c4 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -39,6 +39,7 @@
BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
BlockDriverState *target,
const char *filter_node_name,
+ bool discard_source,
BlockCopyState **bcs,
Error **errp);
void bdrv_cbw_drop(BlockDriverState *bs);
diff --git a/block/replication.c b/block/replication.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index 5ded5f1ca9..bd75a6aee3 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/block/replication.c
+++ b/block/replication.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -604,8 +604,8 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
s->backup_job = backup_job_create(
NULL, s->secondary_disk->bs, s->hidden_disk->bs,
- 0, MIRROR_SYNC_MODE_NONE, NULL, 0, false, NULL,
- &perf,
+ 0, MIRROR_SYNC_MODE_NONE, NULL, 0, false, false,
+ NULL, &perf,
BLOCKDEV_ON_ERROR_REPORT,
BLOCKDEV_ON_ERROR_REPORT, JOB_INTERNAL,
backup_job_completed, bs, NULL, &local_err);
diff --git a/blockdev.c b/blockdev.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index 3049811be8..e167ad1e54 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/blockdev.c
+++ b/blockdev.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -2854,7 +2854,7 @@ static BlockJob *do_backup_common(BackupCommon *backup,
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
backup->sync, bmap, backup->bitmap_mode,
- backup->compress,
+ backup->compress, backup->discard_source,
backup->filter_node_name,
&perf,
backup->on_source_error,
diff --git a/include/block/block-common.h b/include/block/block-common.h
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index d7599564db..7f56364e73 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -246,6 +246,8 @@ typedef enum {
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
read-write fails */
#define BDRV_O_IO_URING 0x40000 /* use io_uring instead of the thread pool */
+#define BDRV_O_CBW_DISCARD_SOURCE 0x80000 /* for copy-before-write filter */
+
#define BDRV_O_CACHE_MASK (BDRV_O_NOCACHE | BDRV_O_NO_FLUSH)
diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index 8b41643bfa..bdc703bacd 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -27,6 +27,7 @@ typedef struct BlockCopyCallState BlockCopyCallState;
BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
BlockDriverState *copy_bitmap_bs,
const BdrvDirtyBitmap *bitmap,
+ bool discard_source,
Error **errp);
/* Function should be called prior any actual copy request */
diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index 57265a617a..df731688b4 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -189,7 +189,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
MirrorSyncMode sync_mode,
BdrvDirtyBitmap *sync_bitmap,
BitmapSyncMode bitmap_mode,
- bool compress,
+ bool compress, bool discard_source,
const char *filter_node_name,
BackupPerf *perf,
BlockdevOnError on_source_error,
diff --git a/qapi/block-core.json b/qapi/block-core.json
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
index 7b977459fa..82960797dc 100644
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -1834,6 +1834,9 @@
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
# node specified by @drive. If this option is not given, a node
# name is autogenerated. (Since: 4.2)
#
+# @discard-source: Discard blocks on source which are already copied
+# to the target. (Since 9.0)
+#
# @x-perf: Performance options. (Since 6.0)
#
# Features:
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -1855,6 +1858,7 @@
implement support for backup fleecing Excerpt from Fiona's v3 cover-letter [0]: When a backup for a VM is started, QEMU will install a "copy-before-write" filter in its block layer. This filter ensures that upon new guest writes, old data still needed for the backup is sent to the backup target first. The guest write blocks until this operation is finished so guest IO to not-yet-backed-up sectors will be limited by the speed of the backup target. With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space. With this series it will be possible to enable backup-fleecing via e.g. `vzdump 123 --fleecing enabled=1,storage=local-lvm` with fleecing images created on the storage `local-lvm`. The fleecing storage should be a fast local storage which supports thin-provisioning and discard. If the storage supports qcow2, that is used as the fleecing image format. If the underlying file system does not support discard, with qcow2 and preallocation=off, at least already allocated parts of the image can be re-used later. Fleecing images are created by qemu-server via pve-storage and attached to QEMU before the backup starts, and cleaned up after the backup finished or failed. The naming schema for fleecing images is 'vm-ID-fleece-N(.FORMAT)'. The allocated images are recorded in the guest configuration, so that even after a hard failure, clean-up can be re-attempted. While not too bad, it's a non-trivial amount of code and I'm not 100% sure about the cost-benefit, so sending those as RFC. The fleecing image needs to be the exact same size as the source, but luckily, an explicit size can be specified when attaching a raw image to QEMU so there are no size issues when using storages that have coarser allocation/round up. For qcow2, it seems that virtual size can be nearly arbitrary (i.e. modulo 512 byte granularity) during allocation. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-04-11 18:38:26 +03:00
'*on-target-error': 'BlockdevOnError',
'*auto-finalize': 'bool', '*auto-dismiss': 'bool',
'*filter-node-name': 'str',
+ '*discard-source': 'bool',
'*x-perf': { 'type': 'BackupPerf',
'features': [ 'unstable' ] } } }