JustKernel

Ray Of Hope

RMRR region GPU passthrough + Xen

Today I ran into an issue where one VM is assigned a GPU in passthrough mode and another VM has no GPU assignment. If I swap the maximum and minimum dynamic memory assignments of the two VMs (changing the balloon size for each VM), the guest with the GPU crashes after 2-3 iterations.

E820 Map of Host:
(XEN) [ 0.000000] 0000000000000000 – 0000000000099c00 (usable)
(XEN) [ 0.000000] 0000000000099c00 – 00000000000a0000 (reserved)
(XEN) [ 0.000000] 00000000000e0000 – 0000000000100000 (reserved)
(XEN) [ 0.000000] 0000000000100000 – 00000000ab142000 (usable)
(XEN) [ 0.000000] 00000000ab142000 – 00000000ab149000 (ACPI NVS)
(XEN) [ 0.000000] 00000000ab149000 – 00000000bafe8000 (usable)
(XEN) [ 0.000000] 00000000bafe8000 – 00000000bb0cf000 (reserved)
(XEN) [ 0.000000] 00000000bb0cf000 – 00000000bb114000 (usable)
(XEN) [ 0.000000] 00000000bb114000 – 00000000bb256000 (ACPI NVS)
(XEN) [ 0.000000] 00000000bb256000 – 00000000bcfff000 (reserved)
(XEN) [ 0.000000] 00000000bcfff000 – 00000000bd000000 (usable)
(XEN) [ 0.000000] 00000000bd800000 – 00000000bfa00000 (reserved) // the faulting addresses lie in this range
(XEN) [ 0.000000] 00000000f8000000 – 00000000fc000000 (reserved)
(XEN) [ 0.000000] 00000000fec00000 – 00000000fec01000 (reserved)
(XEN) [ 0.000000] 00000000fed00000 – 00000000fed04000 (reserved)
(XEN) [ 0.000000] 00000000fed1c000 – 00000000fed20000 (reserved)
(XEN) [ 0.000000] 00000000fee00000 – 00000000fee01000 (reserved)
(XEN) [ 0.000000] 00000000ff000000 – 0000000100000000 (reserved)
(XEN) [ 0.000000] 0000000100000000 – 0000000840600000 (usable)

Logs:
(XEN) [ 2220.328347] iommu.c:644:d8v1 vtd/iommu dma_clear call iotlb_flush
(XEN) [ 2222.118448] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf9ff000, iommu reg = ffff82c000201000
(XEN) [ 2222.118453] [VT-D]DMAR: reason 05 – PTE Write access is not set
(XEN) [ 2222.118460] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf9aa000, iommu reg = ffff82c000201000
(XEN) [ 2222.118462] [VT-D]DMAR: reason 05 – PTE Write access is not set
(XEN) [ 2222.118467] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf99a000, iommu reg = ffff82c000201000
(XEN) [ 2222.118469] [VT-D]DMAR: reason 05 – PTE Write access is not set
(XEN) [ 2222.118473] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf98e000, iommu reg = ffff82c000201000
(XEN) [ 2222.118475] [VT-D]DMAR: reason 05 – PTE Write access is not set
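To make a long run of these messages easier to read, the fault lines can be pulled apart with a small script. This is only a sketch: the regular expression is an assumption based on the four fault lines shown above, not an official description of the Xen log format, and `parse_dmar_faults` is a hypothetical helper name.

```python
import re

# Pattern assumed from the "[VT-D]DMAR:[DMA Write] Request device ..." lines above.
FAULT_RE = re.compile(
    r"\[DMA (?P<op>Read|Write)\] Request device \[(?P<bdf>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+)"
)

def parse_dmar_faults(log_text):
    """Return a list of (device, operation, fault_address) tuples."""
    faults = []
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            faults.append((m.group("bdf"), m.group("op"), int(m.group("addr"), 16)))
    return faults

# Fault lines copied from the log above (the "reason 05" lines are skipped).
SAMPLE = """\
(XEN) [ 2220.328347] iommu.c:644:d8v1 vtd/iommu dma_clear call iotlb_flush
(XEN) [ 2222.118448] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf9ff000, iommu reg = ffff82c000201000
(XEN) [ 2222.118460] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf9aa000, iommu reg = ffff82c000201000
(XEN) [ 2222.118467] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf99a000, iommu reg = ffff82c000201000
(XEN) [ 2222.118473] [VT-D]DMAR:[DMA Write] Request device [0000:00:02.0] fault addr bf98e000, iommu reg = ffff82c000201000
"""

faults = parse_dmar_faults(SAMPLE)
for bdf, op, addr in faults:
    print(f"device {bdf}: DMA {op} fault at {addr:#x}")
```

All four faults come from the same device (0000:00:02.0) and are DMA writes, which is what the rest of this post tries to explain.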

When I dynamically allocate/deallocate RAM, i.e. balloon RAM out / in, I get a series of IOTLB flush calls followed by a series of the PTE write error messages shown above.

At first I thought the IOTLB flush itself was causing these PTE write error messages. My understanding was that the flush writes to the PTEs or to some IOTLB register to clear some flags, that this is not allowed, and that this leads to the crash.

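But after discussions, the meaning of “PTE write access is not set. Request device fault addr bf84000” became clearer: the device is trying to access a DMA region which is not available to it, i.e. the IOMMU page table entry for that address does not allow the write.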

It implies one of two things. Either the memory has been unmapped due to ballooning out but the device in the guest still thinks the memory is there, or we are in the middle of flushing/unmapping/ballooning out the memory and the guest device accesses it via DMA in between. We have no control over the guest's access to this memory, as it is a DMA operation in which the processor plays no role (so it can't be tracked).

Further, I looked at the e820 memory map and found that the accesses generating the faults fall in a reserved region (an RMRR region). But this is the host e820 map (as seen in the Xen hypervisor log). So there is also the possibility that this RMRR region has not been communicated to the guest through its e820 map, and the guest balloons out this memory unaware of the fact that it is part of an RMRR.
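The lookup against the host e820 map can be sketched as follows. The table is copied by hand from the host map above, the fault addresses are the four from the log, and `find_region` is a hypothetical helper name. It treats each range as start-inclusive and end-exclusive, which is an assumption about how the map should be read.

```python
# Host e820 map from the Xen log above: (start, end, type), end exclusive.
E820 = [
    (0x0000000000000000, 0x0000000000099c00, "usable"),
    (0x0000000000099c00, 0x00000000000a0000, "reserved"),
    (0x00000000000e0000, 0x0000000000100000, "reserved"),
    (0x0000000000100000, 0x00000000ab142000, "usable"),
    (0x00000000ab142000, 0x00000000ab149000, "ACPI NVS"),
    (0x00000000ab149000, 0x00000000bafe8000, "usable"),
    (0x00000000bafe8000, 0x00000000bb0cf000, "reserved"),
    (0x00000000bb0cf000, 0x00000000bb114000, "usable"),
    (0x00000000bb114000, 0x00000000bb256000, "ACPI NVS"),
    (0x00000000bb256000, 0x00000000bcfff000, "reserved"),
    (0x00000000bcfff000, 0x00000000bd000000, "usable"),
    (0x00000000bd800000, 0x00000000bfa00000, "reserved"),
    (0x00000000f8000000, 0x00000000fc000000, "reserved"),
    (0x00000000fec00000, 0x00000000fec01000, "reserved"),
    (0x00000000fed00000, 0x00000000fed04000, "reserved"),
    (0x00000000fed1c000, 0x00000000fed20000, "reserved"),
    (0x00000000fee00000, 0x00000000fee01000, "reserved"),
    (0x00000000ff000000, 0x0000000100000000, "reserved"),
    (0x0000000100000000, 0x0000000840600000, "usable"),
]

# Fault addresses taken from the VT-d messages in the log.
FAULT_ADDRS = [0xbf9ff000, 0xbf9aa000, 0xbf99a000, 0xbf98e000]

def find_region(addr, e820=E820):
    """Return the (start, end, type) entry containing addr, or None for a hole."""
    for start, end, kind in e820:
        if start <= addr < end:
            return (start, end, kind)
    return None

for addr in FAULT_ADDRS:
    region = find_region(addr)
    if region is None:
        print(f"{addr:#x}: not covered by any e820 entry")
    else:
        start, end, kind = region
        print(f"{addr:#x}: {start:#x} - {end:#x} ({kind})")
```

Note that e820 only labels the range as "reserved". Whether it is really an RMRR has to be confirmed from the ACPI DMAR table, which is not shown in this post.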

Thanks
anshul_makkar@justkernel.com
