Ray Of Hope
GPU passthrough under ballooning in Xen, and the role of the PoD (Populate-on-Demand) driver.
This happens only in the case of an HVM guest. I used Windows 7 64-bit for reproducing the crash.
This article is a continuation of the previous one, where I saw a crash in the RMRR region in the case of GPU passthrough. To work around this problem I introduced a memory hole of 2 GB in the guest E820 table which covered the RMRR region, so that allocations don't come from this region.
xe vm-param-add param-name=platform mmio_hole_size=2147483648 uuid=<guest uuid>
Now the crash due to reservations in the RMRR region was resolved.
I continued my testing and observed a scenario wherein, after a series of memory swap operations (dynamic memory swapping of 2 VMs as mentioned in the previous article) and a reboot of the guest with GPU passthrough, the guest crashes with a BSOD.
This time the address was in the valid range as per the guest E820 table, just above the memory hole.
00100000 - 80000000  RAM        // crash address around 0x7d0d6000
80000000 - fc000000  2 GB hole
The error was "DMA PTE write faults at address 0x7d0d6000", and it happens only when a GPU is assigned via passthrough.
Xen implements PoD (Populate-on-Demand), which is similar to demand paging in a bare-metal OS. When using PoD, Xen maintains a separate pool of memory called the PoD cache, from which Xen allocates the actual mfns. Xen then creates the guest's p2m table, but in this table, instead of mfns corresponding to the gpfns, we have PoD entries initialized to an INVALID mfn with the type set to PoD entry.
How PoD works: after the guest boots, whenever it touches a gpfn backed by a PoD entry, it traps into Xen. When Xen sees the PoD entry type in the VM exit reason, it takes an mfn from the PoD pool and installs it in the p2m table for that gpfn.
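The populate-on-fault mechanism can be sketched in Python. Note that PodCache, P2mTable, and the entry encoding below are hypothetical illustrations of the idea, not Xen's actual data structures:

```python
INVALID_MFN = -1

class PodCache:
    """Pool of pre-allocated machine frames (mfns) backing the PoD cache."""
    def __init__(self, mfns):
        self.free = list(mfns)

    def take(self):
        return self.free.pop(0)

class P2mTable:
    """Guest-physical (gpfn) -> machine (mfn) table with PoD entries."""
    def __init__(self, num_gpfns, cache):
        # Every entry starts as a PoD entry: INVALID mfn, type "pod".
        self.entries = {g: (INVALID_MFN, "pod") for g in range(num_gpfns)}
        self.cache = cache

    def cpu_access(self, gpfn):
        # A guest touch of a PoD-backed gpfn traps into Xen, which backs
        # the gpfn with a real mfn taken from the PoD cache.
        mfn, kind = self.entries[gpfn]
        if kind == "pod":
            mfn = self.cache.take()
            self.entries[gpfn] = (mfn, "ram")
        return mfn

p2m = P2mTable(num_gpfns=4, cache=PodCache([100, 101, 102, 103]))
mfn = p2m.cpu_access(2)   # first touch populates the entry from the PoD cache
```

After the access, only gpfn 2 is backed by a real mfn; all other entries remain PoD entries with an invalid mfn, which is exactly the state that matters in the fault analysis below.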
Coming back to the debugging: I instrumented the PoD code and found that the guest addresses (gpfns) touched or accessed by the guest never got close to my faulting address. In fact my faulting address, 0x7d0d6000, and every other address in the nearby range, was never touched or freed. This means that a mapping for this address was never created, because with PoD a mapping is only created once the gpfn is accessed.
To understand the exact reason for the crash, we also need to look at the IOMMU mappings. The IOMMU has its own hardware page table that maps guest addresses to mfns, the CPU has its EPT hardware page table, and there is the p2m table maintained by Xen. All three tables have to be kept in sync, and Xen ensures this by updating them together.
In this case, when the GPU assigned to the guest via passthrough (or any other PCI device) requests a memory region via a gpfn, the request is passed to the IOMMU, which tries to find the mapping for that gpfn. But the mapping doesn't exist. A mapping is only created when the gpfn access is routed through the CPU by the guest and then trapped into Xen. If the gpfn is accessed via the IOMMU without having been accessed earlier via the CPU, that access is bound to fault.
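Here is a minimal Python sketch of why device DMA faults while CPU accesses succeed. The Tables class and its fields are assumptions for illustration only, not Xen's real structures:

```python
class IommuFault(Exception):
    pass

class Tables:
    """p2m + IOMMU page table, kept in sync on CPU-side populate."""
    def __init__(self, cache):
        self.p2m = {}       # gpfn -> mfn, populated on demand (PoD)
        self.iommu = {}     # IOMMU table; only ever filled alongside p2m
        self.cache = cache  # free mfns

    def cpu_access(self, gpfn):
        # A CPU touch traps into Xen, which populates both tables together.
        if gpfn not in self.p2m:
            mfn = self.cache.pop(0)
            self.p2m[gpfn] = mfn
            self.iommu[gpfn] = mfn
        return self.p2m[gpfn]

    def device_dma(self, gpfn):
        # Device DMA goes straight to the IOMMU: there is no VM exit, so
        # Xen gets no chance to populate the entry on demand.
        if gpfn not in self.iommu:
            raise IommuFault(f"DMA PTE write fault at gpfn {gpfn:#x}")
        return self.iommu[gpfn]

t = Tables(cache=[200, 201])
t.cpu_access(0x1000)   # CPU-touched gpfn gets an IOMMU mapping too
```

A subsequent `t.device_dma(0x1000)` succeeds because the CPU touch populated the IOMMU table, while a DMA to a gpfn the CPU never touched raises IommuFault, mirroring the "DMA PTE write fault" seen in the crash.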
Important points to consider:
HVM guest:
In the above explanation I mentioned that Xen maintains a p2m table, but in the actual implementation it directly modifies the EPT table: it stores an invalid mfn for the gpfn in the EPT entry and uses the spare bits of the entry to save the PoD entry type information. There is no separate p2m table.
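The spare-bit trick can be sketched as follows; the bit position and type code here are purely illustrative assumptions, not Xen's actual EPT entry layout:

```python
POD_TYPE_SHIFT = 52   # assumed position of software-available spare bits
TYPE_POD = 0x4        # illustrative type code meaning "PoD entry"

def make_pod_entry():
    # Address bits all clear (invalid mfn), PoD type stored in spare bits.
    return TYPE_POD << POD_TYPE_SHIFT

def entry_type(entry):
    # Recover the software-defined type from the spare bits.
    return (entry >> POD_TYPE_SHIFT) & 0x7
```

Because the hardware ignores these spare bits, the hypervisor can stash its own type information in the same 64-bit entry without needing a parallel table.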
On Haswell, the EPT table and the IOMMU page tables are not shared, so Xen maintains both tables and ensures that they are in sync.
On Broadwell, the EPT table and the IOMMU page tables are shared, and again managed and maintained by Xen.
The EPT and IOMMU page tables are level 1 page tables.
PV guest:
We don't have EPT tables, but Xen maintains its own p2m table. We also have IOMMU page tables, and Xen ensures that they are in sync with the p2m table.
Disclaimer: The views and information shared in this article are based on my personal understanding only.