pve-qemu

Commit Graph

Author	SHA1	Message	Date
Fiona Ebner	4fbd50e2f9	update submodule and patches to QEMU 9.0.0 Biggest change is that AioContext locking got removed, but no changes required other than dropping the calls to acquire and release it. As a consequence, the single parameter for the bdrv_graph_wrlock() call got removed which also required adaptation. QAPI docs became stricter requiring to document all members. Other minor changes: - Single parameter from migration_is_running() was dropped. - qemu_mutex_(un)lock_iothread() got renamed to bql_(un)lock(). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2024-04-29 15:29:52 +02:00
Fiona Ebner	f1eed34ac7	update submodule and patches to QEMU 8.2.2 This version includes both the AioContext lock and the block graph lock, so there might be some deadlocks lurking. It's not possible to disable the block graph lock like was done in QEMU 8.1, because there are no changes like the function bdrv_schedule_unref() that require it. QEMU 9.0 will finally get rid of the AioContext locking. During live-restore with a VirtIO SCSI drive with iothread there is a known racy deadlock related to the AioContext lock. Not new [1], but not sure if more likely now. Should be fixed in QEMU 9.0. The block graph lock comes with annotations that can be checked by clang's TSA. This required changes to the block drivers, i.e. alloc-track, pbs, zeroinit as well as taking the appropriate locks in pve-backup, savevm-async, vma-reader. Local variable shadowing is prohibited via a compiler flag now, required slight adaptation in vma.c. Major changes only affect alloc-track: * It is not possible to call a generated co-wrapper like bdrv_get_info() while holding the block graph lock exclusively [0], which does happen during initialization of alloc-track when the backing hd is set and the refresh_limits driver callback is invoked. The bdrv_get_info() call to get the cluster size is moved to directly after opening the file child in track_open(). The important thing is that at least the request alignment for the write target is used, because then the RMW cycle in bdrv_pwritev will gather enough data from the backing file. Partial cluster allocations in the target are not a fundamental issue, because the driver returns its allocation status based on the bitmap, so any other data that maps to the same cluster will still be copied later by a stream job (or during writes to that cluster). * Replacing the node cannot be done in the track_co_change_backing_file() callback, because it is a coroutine and cannot hold the block graph lock exclusively. So it is moved to the stream job itself with the auto-remove option not having an effect anymore (qemu-server would always set it anyways). In the future, there could either be a special option for the stream job, or maybe the upcoming blockdev-replace QMP command can be used. Replacing the backing child is actually already done in the stream job, so no need to do it in the track_co_change_backing_file() callback. It also cannot be called from a coroutine. Looking at the implementation in the qcow2 driver, it doesn't seem to be intended to change the backing child itself, just update driver-internal state. Other changes: * alloc-track: Error out early when used without auto-remove. Since replacing the node now happens in the stream job, where the option cannot be read from (it's internal to the driver), it will always be treated as 'on'. Makes sure to have users beside qemu-server notice the change (should they even exist). The option can be fully dropped in the future while adding a version guard in qemu-server. * alloc-track: Avoid seemingly superfluous child permission update. Doesn't seem necessary nowadays (maybe after commit "alloc-track: fix deadlock during drop" where the dropping is not rescheduled and delayed anymore or some upstream change). Replacing the block node will already update the permissions of the new node (which was the file child before). Should there really be some issue, instead of having a drop state, this could also be just based off the fact whether there is still a backing child. Dumping the cumulative (shared) permissions for the BDS with a debug print yields the same values after this patch and with QEMU 8.1, namely 3 and 5. * PBS block driver: compile unconditionally. Proxmox VE always needs it and something in the build process changed to make it not enabled by default. Probably would need to move the build option to meson otherwise. * backup: job unreferencing during cleanup needs to happen outside of coroutine, so it was moved to before invoking the clean * mirror: Cherry-pick stable fix to avoid potential deadlock. * savevm-async: migrate_init now can fail, so propagate potential error. * savevm-async: compression counters are not accessible outside migration/ram-compress now, so drop code that prophylactically set it to zero. [0]: https://lore.kernel.org/qemu-devel/220be383-3b0d-4938-b584-69ad214e5d5d@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/e13b488e-bf13-44f2-acca-e724d14f43fd@proxmox.com/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2024-04-26 14:14:06 +02:00
Thomas Lamprecht	8dd76cc52d	backup: factor out & clean up gathering device info into helper Squash the two original patches [0][1] from Fiona, which got send separate to be easier to review, into the big patch that adds the Proxmox backup integration. [0]: https://lists.proxmox.com/pipermail/pve-devel/2024-January/061479.html [1]: https://lists.proxmox.com/pipermail/pve-devel/2024-January/061478.html Originally-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2024-03-12 13:55:00 +01:00
Fiona Ebner	cd7676f3e6	backup: avoid bubbling up first ECANCELED error With pvebackup_propagate_error(), the first error wins. When one job in the transaction fails, it is expected that later jobs get the ECANCELED error. Those are not interesting and by skipping them a more interesting error, which is likely the actual root cause, can win. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2024-03-12 13:20:28 +01:00
Fiona Ebner	4b7975e75d	update submodule and patches to QEMU 8.1.5 Most notable fixes from a Proxmox VE perspective are: * "virtio-net: correctly copy vnet header when flushing TX" To prevent a stack overflow that could lead to leaking parts of the QEMU process's memory. * "hw/pflash: implement update buffer for block writes" To prevent an edge case for half-completed writes. This potentially affected EFI disks. * Fixes to i386 emulation and ARM emulation. No changes for patches were necessary (all are just automatic context changes). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2024-02-02 19:06:29 +01:00
Fiona Ebner	10e1093325	update submodule and patches to QEMU 8.1.2 Bigger notable changes: * Commit 1a30b0f5d7 ("block: .bdrv_open is non-coroutine and unlocked") broke the PVE backup patches, in particular setting up the backup dump block driver, because bdrv_new_open_driver() cannot be called from a coroutine. To fix it, bdrv_co_open() is used instead, and while it's a much more involved function, the result should be essentially the same. The only difference I noticed is that the BDRV_O_ALLOW_RDWR flag is also set in the resulting bds (block driver state), but that shouldn't hurt. Smaller notable changes: * aio_set_fd_handler() dropped its 'is_external' parameter stating that all callers now pass false in 60f782b6b7 ("aio: remove aio_disable_external() API"). The calls in the PVE patches also passed false, so just drop the parameter too. * global_state_store() does not have a return value anymore, so the user in the PVE savevm-async patch was adapted. For context, see c33f1829f8 ("migration: never fail in global_state_store()"). * Renames affecting the PVE savevm-async patch: migrate_use_block() -> migrate_block() and ram_counters -> mig_stats 9d4b1e5f22 ("migration: Move migrate_use_block() to options.c") aff3f6606d ("migration: Rename ram_counters to mig_stats") Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2023-10-24 15:01:23 +02:00
Filip Schauer	0ff45eb23e	backup: Fix spelling error in function name Signed-off-by: Filip Schauer <f.schauer@proxmox.com> [FE: fixup patch context] Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2023-09-08 11:13:04 +02:00
Fiona Ebner	9e0186f289	backup: drop broken BACKUP_FORMAT_DIR Since upstream QEMU 8.0, it's no longer possible to call bdrv_img_create() from a coroutine anymore, meaning a backup with the directory format would crash the QEMU instance. The feature is only exposed via the monitor and was intended to be experimental. There were no user reports about the breakage and it only was noticed during the rebase for QEMU 8.1, because other parts of the backup code needed adaptation and I decided to check the BACKUP_FORMAT_DIR case too. It should not stay in a broken state of course, but avoid the maintenance cost and just make it a removed feature for Proxmox VE 8 retroactively. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2023-09-06 16:59:12 +02:00
Fiona Ebner	0cffb504e7	backup: create jobs in a drained section With the drive-backup QMP command, upstream QEMU uses a drained section for the source drive when creating the backup job. Do the same here to avoid subtle bugs. There, the drained section extends until after the job is started, but this cannot be done here for multi-disk backups (could at most start the first job). The important thing is that the cbw (copy-before-write) node is in place and the bcs (block-copy-state) bitmap is initialized, which both happen during job creation (ensured by the "block/backup: move bcs bitmap initialization to job creation" PVE patch). One such bug is one reported in the community forum [0], where using a drive with iothread can lead to an overlapping block-copy request and consequently an assertion failure. The block-copy code relies on the bcs bitmap to determine if a request for a certain range can be created. Each time a request is created, it resets the bcs bitmap at that range to indicate that it's being handled. The duplicate request can happen as follows: Thread A attaches the cbw node Thread B creates a request and resets the bitmap at that range Thread A clears the bitmap and merges it with the PBS bitmap The merging can lead to the bitmap being set again at the range of the previous request, so the block-copy code thinks it's fine to create a request there. Thread B creates another requests at an overlapping range before the other request is finished. The drained section ensures that nothing else can interfere with the bcs bitmap between attaching the copy-before-write block node and initialization of the bitmap. [0]: https://forum.proxmox.com/threads/133149/ Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2023-09-06 16:59:12 +02:00
Fiona Ebner	5f9cb29c3a	backup: trim heap after finishing Reported in the community forum [0]. By default, there can be large amounts of memory left assigned to the QEMU process after backup. Likely because of fragmentation, it's necessary to explicitly call malloc_trim() to tell glibc that it shouldn't keep all that memory resident for the process. QEMU itself already does a malloc_trim() in the RCU thread, but that code path might not be reached (or not for a long time) under usual operation. The value of 4 MiB for the argument was also copied from there. Example with the following configuration: > agent: 1 > boot: order=scsi0 > cores: 4 > cpu: x86-64-v2-AES > ide2: none,media=cdrom > memory: 1024 > name: backup-mem > net0: virtio=DA:58:18:26:59:9F,bridge=vmbr0,firewall=1 > numa: 0 > ostype: l26 > scsi0: rbd:base-107-disk-0/vm-106-disk-1,size=4302M > scsihw: virtio-scsi-pci > smbios1: uuid=b2d4511e-8d01-44f1-afd6-9581b30c24a6 > sockets: 2 > startup: order=2 > virtio0: lvmthin:vm-106-disk-1,iothread=1,size=1G > virtio1: lvmthin:vm-106-disk-2,iothread=1,size=1G > virtio2: lvmthin:vm-106-disk-3,iothread=1,size=1G > vmgenid: 0a1d8751-5e02-449d-977e-c0160e900231 Before the change: > root@pve8a1 ~ # grep VmRSS /proc/$(cat /var/run/qemu-server/106.pid)/status > VmRSS: 370948 kB > root@pve8a1 ~ # vzdump 106 --storage pbs > (...) > INFO: Backup job finished successfully > root@pve8a1 ~ # grep VmRSS /proc/$(cat /var/run/qemu-server/106.pid)/status > VmRSS: 2114964 kB After the change: > root@pve8a1 ~ # grep VmRSS /proc/$(cat /var/run/qemu-server/106.pid)/status > VmRSS: 398788 kB > root@pve8a1 ~ # vzdump 106 --storage pbs > (...) > INFO: Backup job finished successfully > root@pve8a1 ~ # grep VmRSS /proc/$(cat /var/run/qemu-server/106.pid)/status > VmRSS: 424356 kB [0]: https://forum.proxmox.com/threads/131339/ Co-diagnosed-by: Friedrich Weber <f.weber@proxmox.com> Co-diagnosed-by: Dominik Csapak <d.csapak@proxmox.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Acked-by: Wolfgang Bumiller <w.bumiller@proxmox.com>	2023-08-16 11:50:12 +02:00
Fiona Ebner	d847446186	regenerate patches There's still some context changes not covered by earlier series. No functional change intended. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2023-06-15 13:55:22 +02:00
Fiona Ebner	a816d2969e	drop patch for custom get_link_status QMP command There doesn't seem to be any Proxmox VE code using this. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>	2023-06-07 19:35:40 +02:00

12 Commits (master)