pve-qemu/debian/patches/pve/0058-PVE-Backup-ensure-jobs...

74 lines
2.5 KiB
Diff
Raw Normal View History

From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Fabian Ebner <f.ebner@proxmox.com>
Date: Wed, 25 May 2022 13:59:38 +0200
Subject: [PATCH] PVE-Backup: ensure jobs in di_list are referenced
Ensures that qmp_backup_cancel doesn't pick a job that's already been
freed. With unlucky timings it seems possible that:
1. job_exit -> job_completed -> job_finalize_single starts
2. pvebackup_co_complete_stream gets spawned in completion callback
3. job finalize_single finishes -> job's refcount hits zero -> job is
freed
4. qmp_backup_cancel comes in and locks backup_state.backup_mutex
before pvebackup_co_complete_stream can remove the job from the
di_list
5. qmp_backup_cancel will pick a job that's already been freed
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
[FE: adapt for new job lock mechanism replacing AioContext locks]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
pve-backup.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/pve-backup.c b/pve-backup.c
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
index fde3554133..0cf30e1ced 100644
--- a/pve-backup.c
+++ b/pve-backup.c
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
@@ -316,6 +316,13 @@ static void coroutine_fn pvebackup_co_complete_stream(void *opaque)
}
}
+ if (di->job) {
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
+ WITH_JOB_LOCK_GUARD() {
+ job_unref_locked(&di->job->job);
+ di->job = NULL;
+ }
+ }
+
// remove self from job list
backup_state.di_list = g_list_remove(backup_state.di_list, di);
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
@@ -491,6 +498,11 @@ static void create_backup_jobs_bh(void *opaque) {
aio_context_release(aio_context);
di->job = job;
+ if (job) {
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
+ WITH_JOB_LOCK_GUARD() {
+ job_ref_locked(&job->job);
+ }
+ }
if (!job || local_err) {
error_setg(errp, "backup_job_create failed: %s",
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
@@ -518,11 +530,15 @@ static void create_backup_jobs_bh(void *opaque) {
di->target = NULL;
}
- if (!canceled && di->job) {
+ if (di->job) {
update submodule and patches to 7.2.0 User-facing breaking change: The slirp submodule for user networking got removed. It would be necessary to add the --enable-slirp option to the build and/or install the appropriate library to continue building it. Since PVE is not explicitly supporting it, it would require additionally installing the libslirp0 package on all installations and there is *very* little mention on the community forum when searching for "slirp" or "netdev user", the plan is to only enable it again if there is some real demand for it. Notable changes: * The big change for this release is the rework of job locking, using a job mutex and introducing _locked() variants of job API functions moving away from call-side AioContext locking. See (in the qemu submodule) commit 6f592e5aca ("job.c: enable job lock/unlock and remove Aiocontext locks") and previous commits for context. Changes required for the backup patches: * Use WITH_JOB_LOCK_GUARD() and call the _locked() variant of job API functions where appropriate (many are only availalbe as a _locked() variant). * Remove acquiring/releasing AioContext around functions taking the job mutex lock internally. The patch introducing sequential transaction support for jobs needs to temporarily unlock the job mutex to call job_start() when starting the next job in the transaction. * The zeroinit block driver now marks its child as primary. The documentation in include/block/block-common.h states: > Filter node has exactly one FILTERED|PRIMARY child, and may have > other children which must not have these bits Without this, an assert will trigger when copying to a zeroinit target with qemu-img convert, because bdrv_child_cb_attach() expects any non-PRIMARY child to be not FILTERED: > qemu-img convert -n -p -f raw -O raw input.raw zeroinit:output.raw > qemu-img: ../block.c:1476: bdrv_child_cb_attach: Assertion > `!(child->role & BDRV_CHILD_FILTERED)' failed. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2022-12-14 17:16:32 +03:00
WITH_JOB_LOCK_GUARD() {
- job_cancel_sync_locked(&di->job->job, true);
+ if (!canceled) {
+ job_cancel_sync_locked(&di->job->job, true);
+ canceled = true;
+ }
+ job_unref_locked(&di->job->job);
+ di->job = NULL;
}
- canceled = true;
}
}
}