pve-qemu/debian/patches/pve/0029-PVE-Add-sequential-job...

102 lines
2.9 KiB
Diff
Raw Permalink Normal View History

From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Stefan Reiter <s.reiter@proxmox.com>
Date: Thu, 20 Aug 2020 14:31:59 +0200
Subject: [PATCH] PVE: Add sequential job transaction support
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
include/qemu/job.h | 12 ++++++++++++
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
job.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 46 insertions(+)
diff --git a/include/qemu/job.h b/include/qemu/job.h
index 2b873f2576..528cd6acb9 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -362,6 +362,18 @@ void job_unlock(void);
*/
JobTxn *job_txn_new(void);
+/**
+ * Create a new transaction and set it to sequential mode, i.e. run all jobs
+ * one after the other instead of at the same time.
+ */
+JobTxn *job_txn_new_seq(void);
+
+/**
+ * Helper method to start the first job in a sequential transaction to kick it
+ * off. Other jobs will be run after this one completes.
+ */
+void job_txn_start_seq(JobTxn *txn);
+
/**
* Release a reference that was previously acquired with job_txn_add_job or
* job_txn_new. If it's the last reference to the object, it will be freed.
diff --git a/job.c b/job.c
index baf54c8d60..3ac5e5cde2 100644
--- a/job.c
+++ b/job.c
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -94,6 +94,8 @@ struct JobTxn {
/* Reference count */
int refcnt;
+
+ bool sequential;
};
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
void job_lock(void)
update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2024-04-25 18:21:28 +03:00
@@ -119,6 +121,25 @@ JobTxn *job_txn_new(void)
return txn;
}
+JobTxn *job_txn_new_seq(void)
+{
+ JobTxn *txn = job_txn_new();
+ txn->sequential = true;
+ return txn;
+}
+
+void job_txn_start_seq(JobTxn *txn)
+{
+ assert(txn->sequential);
+ assert(!txn->aborting);
+
+ Job *first = QLIST_FIRST(&txn->jobs);
+ assert(first);
+ assert(first->status == JOB_STATUS_CREATED);
+
+ job_start(first);
+}
+
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
/* Called with job_mutex held. */
static void job_txn_ref_locked(JobTxn *txn)
{
@@ -1042,6 +1063,12 @@ static void job_completed_txn_success_locked(Job *job)
*/
QLIST_FOREACH(other_job, &txn->jobs, txn_list) {
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
if (!job_is_completed_locked(other_job)) {
+ if (txn->sequential) {
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
+ job_unlock();
+ /* Needs to be called without holding the job lock */
+ job_start(other_job);
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
+ job_lock();
+ }
return;
}
assert(other_job->ret == 0);
@@ -1253,6 +1280,13 @@ int job_finish_sync_locked(Job *job,
return -EBUSY;
}
+ /* in a sequential transaction jobs with status CREATED can appear at time
+ * of cancelling, these have not begun work so job_enter won't do anything,
+ * let's ensure they are marked as ABORTING if required */
+ if (job->status == JOB_STATUS_CREATED && job->txn->sequential) {
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
+ job_update_rc_locked(job);
+ }
+
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
job_unlock();
AIO_WAIT_WHILE_UNLOCKED(job->aio_context,
(job_enter(job), !job_is_completed(job)));