
Random host crash when running Xen + Braindump pvMMU.

Kernel crash call trace:
Apr 6 02:50:06 localhost kernel: [ 3107.388117] [] dump_stack+0x19/0x20
Apr 6 02:50:06 localhost kernel: [ 3107.388119] [] warn_slowpath_common+0x70/0xa0
Apr 6 02:50:06 localhost kernel: [ 3107.388122] [] warn_slowpath_null+0x1a/0x20
Apr 6 02:50:06 localhost kernel: [ 3107.388125] [] xen_mc_flush+0x177/0x190
Apr 6 02:50:06 localhost kernel: [ 3107.388128] [] __xen_pgd_pin+0x23c/0x290
Apr 6 02:50:06 localhost kernel: [ 3107.388131] [] xen_pgd_pin+0x12/0x20
Apr 6 02:50:06 localhost kernel: [ 3107.388133] [] xen_dup_mmap+0x2d/0x50
Apr 6 02:50:06 localhost kernel: [ 3107.388137] [] dup_mm+0x364/0x4b0
Apr 6 02:50:06 localhost kernel: [ 3107.388140] [] copy_process+0x966/0x1190
Apr 6 02:50:06 localhost kernel: [ 3107.388143] [] ? do_page_fault+0x48c/0x4b0
Apr 6 02:50:06 localhost kernel: [ 3107.388146] [] do_fork+0x94/0x290
Apr 6 02:50:06 localhost kernel: [ 3107.388150] [] ? init_peercred+0x1d/0x80
Apr 6 02:50:06 localhost kernel: [ 3107.388153] [] SyS_clone+0x16/0x20
Apr 6 02:50:06 localhost kernel: [ 3107.388156] [] stub_clone+0x69/0x90
Apr 6 02:50:06 localhost kernel: [ 3107.388159] [] ? system_call_fastpath+0x16/0x1b
Apr 6 02:50:06 localhost kernel: [ 3107.388161] ---[ end trace 809c34cdd3101d9d ]---
Apr 6 02:50:06 localhost kernel: [ 3107.388232] ------------[ cut here ]------------
Apr 6 02:50:06 localhost kernel: [ 3107.388238] WARNING: at arch/x86/xen/multicalls.c:129 xen_mc_flush+0x177/0x190()
Apr 6 02:50:06 localhost kernel: [ 3107.388240] Modules linked in: tun nfsv3 nfs_acl nfs fscache lockd sunrpc openvswitch(O) gre libcrc32c ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dm_multipath nls_utf8 isofs video backlight sbs sbshc acpi_ipmi ipmi_msghandler nvram hid_generic usbhid hid sr_mod cdrom sg dcdbas nvidia(PO) hed wmi tpm_tis tpm tpm_bios tg3(O) sb_edac ptp edac_core pps_core lpc_ich mfd_core ahci ehci_pci libahci libata microcode crc32_pclmul scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp megaraid_sas(O) sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
Apr 6 02:50:06 localhost kernel: [ 3107.388286] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P W O 3.10.0+2 #1
Apr 6 02:50:06 localhost kernel: [ 3107.388288] Hardware name: Dell Inc. PowerEdge R720/0XH7F2, BIOS 1.6.0 03/07/2013
Apr 6 02:50:06 localhost kernel: [ 3107.388289] ffffffff817bbab3 ffff880186ca9e38 ffffffff812ce2a9 ffff880186ca9e78
Apr 6 02:50:06 localhost kernel: [ 3107.388293] ffffffff810532b0 ffff880186c705f8 ffff8801cd6abb20 0000000000000001
Apr 6 02:50:06 localhost kernel: [ 3107.388296] ffff8801cd6ab100 0000000000000001 0000000000000002 ffff880186ca9e88
Apr 6 02:50:06 localhost kernel: [ 3107.388299] Call Trace:
Apr 6 02:50:06 localhost kernel: [ 3107.388301] ---[ end trace 809c34cdd3101d9e ]---
Apr 6 02:50:12 localhost kernel: [ 3113.488650] device tap16.0 left promiscuous mode
Apr 6 02:50:12 localhost kernel: [ 3113.490264] ------------[ cut here ]------------
Apr 6 02:50:12 localhost kernel: [ 3113.490274] WARNING: at arch/x86/xen/multicalls.c:129 xen_mc_flush+0x177/0x190()
Apr 6 02:50:12 localhost kernel: [ 3113.490276] Modules linked in: tun nfsv3 nfs_acl nfs fscache lockd sunrpc openvswitch(O) gre libcrc32c ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dm_multipath nls_utf8 isofs video backlight sbs sbshc acpi_ipmi ipmi_msghandler nvram hid_generic usbhid hid sr_mod cdrom sg dcdbas nvidia(PO) hed wmi tpm_tis tpm tpm_bios tg3(O) sb_edac ptp edac_core pps_core lpc_ich mfd_core ahci ehci_pci libahci libata microcode crc32_pclmul scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp megaraid_sas(O) sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
Apr 6 02:50:12 localhost kernel: [ 3113.490342] CPU: 5 PID: 3991 Comm: xenopsd Tainted: P W O 3.10.0+2 #1
Apr 6 02:50:12 localhost kernel: [ 3113.490344] Hardware name: Dell Inc. PowerEdge R720/0XH7F2, BIOS 1.6.0 03/07/2013
Apr 6 02:50:12 localhost kernel: [ 3113.490346] ffffffff817bbab3 ffff880186ca9af8 ffffffff812ce2a9 ffff880186ca9b38
Apr 6 02:50:12 localhost kernel: [ 3113.490350] ffffffff810532b0 0000000000000000 ffff8801cd6abb20 0000000000000001
Apr 6 02:50:12 localhost kernel: [ 3113.490354] ffff8801cd6ab100 0000000000000001 0000000000000002 ffff880186ca9b48
Apr 6 02:50:12 localhost kernel: [ 3113.490357] Call Trace:
Apr 6 02:50:12 localhost kernel: [ 3113.490359] ---[ end trace 809c34cdd3101d9f ]---
Apr 6 02:50:12 localhost kernel: [ 3113.490388] ------------[ cut here ]------------
Apr 6 02:50:12 localhost kernel: [ 3113.490392] WARNING: at arch/x86/xen/multicalls.c:129 xen_mc_flush+0x177/0x190()
Apr 6 02:50:12 localhost kernel: [ 3113.490393] Modules linked in: tun nfsv3 nfs_acl nfs fscache lockd sunrpc openvswitch(O) gre libcrc32c ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dm_multipath nls_utf8 isofs video backlight sbs sbshc acpi_ipmi ipmi_msghandler nvram hid_generic usbhid hid sr_mod cdrom sg dcdbas nvidia(PO) hed wmi tpm_tis tpm tpm_bios tg3(O) sb_edac ptp edac_core pps_core lpc_ich mfd_core ahci ehci_pci libahci libata microcode crc32_pclmul scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp megaraid_sas(O) sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
Apr 6 02:50:12 localhost kernel: [ 3113.490440] CPU: 5 PID: 3991 Comm: xenopsd Tainted: P W O 3.10.0+2 #1
Apr 6 02:50:12 localhost kernel: [ 3113.490441] Hardware name: Dell Inc. PowerEdge R720/0XH7F2, BIOS 1.6.0 03/07/2013
Apr 6 02:50:12 localhost kernel: [ 3113.490443] ffffffff817bbab3 ffff880186ca9af8 ffffffff812ce2a9 ffff880186ca9b38
Apr 6 02:50:12 localhost kernel: [ 3113.490446] ffffffff810532b0 0000000000000000 ffff8801cd6abb20 0000000000000001
Apr 6 02:50:12 localhost kernel: [ 3113.490449] ffff8801cd6ab100 0000000000000001 0000000000000002 ffff880186ca9b48
Apr 6 02:50:12 localhost kernel: [ 3113.490453] Call Trace:
Apr 6 02:50:12 localhost kernel: [ 3113.490454] ---[ end trace 809c34cdd3101da0 ]---
Apr 6 02:50:12 localhost kernel: [ 3113.491163] ------------[ cut here ]------------
Apr 6 02:50:12 localhost kernel: [ 3113.491167] WARNING: at arch/x86/xen/multicalls.c:129 xen_mc_flush+0x177/0x190()
Apr 6 02:50:12 localhost kernel: [ 3113.491169] Modules linked in: tun nfsv3 nfs_acl nfs fscache lockd sunrpc openvswitch(O) gre libcrc32c ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dm_multipath nls_utf8 isofs video backlight sbs sbshc acpi_ipmi ipmi_msghandler nvram hid_generic usbhid hid sr_mod cdrom sg dcdbas nvidia(PO) hed wmi tpm_tis tpm tpm_bios tg3(O) sb_edac ptp edac_core pps_core lpc_ich mfd_core ahci ehci_pci libahci libata microcode crc32_pclmul scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp megaraid_sas(O) sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
Apr 6 02:50:12 localhost kernel: [ 3113.491215] CPU: 5 PID: 3142 Comm: klogd Tainted: P W O 3.10.0+2 #1
Apr 6 02:50:12 localhost kernel: [ 3113.491217] Hardware name: Dell Inc. PowerEdge R720/0XH7F2, BIOS 1.6.0 03/07/2013
Apr 6 02:50:12 localhost kernel: [ 3113.491218] ffffffff817bbab3 ffff880186ca9af8 ffffffff812ce2a9 ffff880186ca9b38
Apr 6 02:50:12 localhost kernel: [ 3113.491222] ffffffff810532b0 0000000000000000 ffff8801cd6abb20 0000000000000001
Apr 6 02:50:12 localhost kernel: [ 3113.491225] ffff8801cd6ab100 0000000000000001 0000000000000002 ffff880186ca9b4

At the same time, the hypervisor log shows:
[2016-04-06 01:59:07] (XEN) [ 8.400003] __csched_vcpu_acct_start: setting dom 0 as the privileged domain
[2016-04-06 02:50:04] (XEN) [ 3105.715355] mm.c:2359:d0v1 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn 8df82b (pfn 1a99dd)
[2016-04-06 02:50:04] (XEN) [ 3105.715361] mm.c:3009:d0v1 Error while pinning mfn 8df82b
[2016-04-06 02:50:04] (XEN) [ 3105.715892] mm.c:2359:d0v1 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn 8df82b (pfn 1a99dd)
[2016-04-06 02:50:04] (XEN) [ 3105.715897] mm.c:910:d0v1 Attempt to create linear p.t. with write perms
[2016-04-06 02:50:04] (XEN) [ 3105.715901] mm.c:1301:d0v1 Failure in alloc_l2_table: entry 390
[2016-04-06 02:50:04] (XEN) [ 3105.715912] mm.c:2106:d0v1 Error while validating mfn 10373af (pfn 15cf40) for type 2000000000000000: caf=8000000000000003 taf=2000000000000001
[2016-04-06 02:50:04] (XEN) [ 3105.715918] mm.c:952:d0v1 Attempt to create linear p.t. with write perms
[2016-04-06 02:50:04] (XEN) [ 3105.715922] mm.c:1383:d0v1 Failure in alloc_l3_table: entry 295
[2016-04-06 02:50:04] (XEN) [ 3105.715929] mm.c:2106:d0v1 Error while validating mfn 90e857 (pfn 1ac08e) for type 3000000000000000: caf=8000000000000003 taf=3000000000000001
[2016-04-06 02:50:04] (XEN) [ 3105.715934] mm.c:976:d0v1 Attempt to create linear p.t. with write perms
[2016-04-06 02:50:12] (XEN) [ 3113.490251] printk: 40 messages suppressed.
[2016-04-06 02:50:12] (XEN) [ 3113.490253] mm.c:2359:d0v5 Bad type (saw 7400000000000001 != exp 3000000000000000) for mfn 8df835 (pfn 1a99dc)
[2016-04-06 02:50:22] (XEN) [ 3123.488918] printk: 14 messages suppressed.
[2016-04-06 02:50:22] (XEN) [ 3123.488920] mm.c:2359:d0v3 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn 8df82b (pfn 1a99dd)
[2016-04-06 02:50:22] (XEN) [ 3123.488923] mm.c:910:d0v3 Attempt to create linear p.t. with write perms
[2016-04-06 02:50:36] (XEN) [ 3137.314912] printk: 20 messages suppressed.
[2016-04-06 02:50:36] (XEN) [ 3137.314913] mm.c:2359:d0v3 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn 8df82b (pfn 1a99dd)
[2016-04-06 02:50:36] (XEN) [ 3137.314917] mm.c:910:d0v3 Attempt to create linear p.t. with write perms
[2016-04-06 02:50:36] (XEN) [ 3137.314919] mm.c:1301:d0v3 Failure in alloc_l2_table: entry 390
[2016-04-06 02:50:41] (XEN) [ 3142.454200] printk: 8 messages suppressed.
[2016-04-06 02:50:41] (XEN) [ 3142.454202] mm.c:2359:d0v3 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn 8df82b (pfn 1a99d

hypervisor.log contains the e820 table that the hypervisor gets from the BIOS.

kern.log contains the e820 table that Xen provides to dom0.

dom0 constructs another e820 table, taking the reserved regions from the kernel e820 table and ignoring the rest. It keeps the reserved regions so that it doesn't touch them.

In the above crash log, the guest is either trying to modify a PTE for a page that is already mapped by the guest (its reference count is not 0), or trying to create a writable mapping for a page that is itself a pagetable, which Xen doesn't allow.
To debug it, some points to focus on (for points 1 and 2, see the sketch after this list):
1) Verify that the P2M table maintained by dom0 (it is a host crash, so dom0 is the suspect) is consistent with the M2P table.
2) When dom0 requests a page from Xen for a given pfn, confirm that the mfn returned is consistent with the M2P table.
3) Check that dom0 is not freeing any foreign page back to the allocator. Trace the Linux kernel page-freeing logic and check whether the page being freed is a foreign page. The Linux page allocator is not aware of the Xen-specific foreign-domain concept, but I have to bring this concept into the Linux kernel to debug this issue.
4) Another angle is that the APIs we have shared with NVIDIA to get guest pages are not working properly. We cannot see the NVIDIA vGPU driver in the call trace because it has trapped into kernel calls, so it may be at the bottom of the stack.
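A minimal debugging sketch for points 1) and 2), assuming we only want to spot-check a pfn range from inside dom0 (the helper name, the range and the call site are hypothetical; get_phys_to_machine(), mfn_to_pfn(), INVALID_P2M_ENTRY and FOREIGN_FRAME_BIT are the standard helpers from asm/xen/page.h):

#include <linux/kernel.h>
#include <asm/xen/page.h>

/* Spot-check that the P2M and M2P tables agree for a pfn range. */
static void check_p2m_m2p(unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		unsigned long mfn = get_phys_to_machine(pfn);

		/* Skip holes and frames granted by other (foreign) domains. */
		if (mfn == INVALID_P2M_ENTRY || (mfn & FOREIGN_FRAME_BIT))
			continue;

		/* The M2P entry for this mfn should point straight back at pfn. */
		if (mfn_to_pfn(mfn) != pfn)
			pr_warn("p2m/m2p mismatch: pfn %lx -> mfn %lx -> pfn %lx\n",
				pfn, mfn, mfn_to_pfn(mfn));
	}
}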

In the call trace we can see a simple call flow: fork -> copy_process -> dup_mm() { allocate_mm(), mm_init_cpumask(), mm_init(), init_new_context(), dup_mmap() -> arch_dup_mmap() -> xen_dup_mmap() -> xen_pgd_pin() }. When we do allocate_mm(), I traced it and found that the memory comes via kmem_cache_alloc() from a slab cache (a preallocated pool); generally, for commonly used data structures we have preallocated memory in slab caches, and slab in turn gets its memory from the Linux page allocator. So my question was: in this simple call flow, where are the mfns getting corrupted? It was then clarified that in this call trace we are trying to pin a pagetable, and the memory for the pagetable comes from the Linux page allocator, not from slab, so it is the page allocator path that should be debugged.
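A minimal sketch of the two allocation paths mentioned above (mm_cachep is private to kernel/fork.c, so it is passed in here purely for illustration; treat this as a simplification of allocate_mm() and pagetable allocation, not the real kernel code):

#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/slab.h>

/* The mm_struct itself comes from a preallocated slab cache... */
static struct mm_struct *sketch_allocate_mm(struct kmem_cache *mm_cachep)
{
	return kmem_cache_alloc(mm_cachep, GFP_KERNEL);
}

/*
 * ...while the pagetable pages that later get pinned come straight from
 * the page (buddy) allocator, which is why the corruption hunt moves there.
 */
static unsigned long sketch_alloc_pgtable_page(void)
{
	return __get_free_page(GFP_KERNEL | __GFP_ZERO);
}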

On a bare-metal system, creating a writable mapping of a page table is fine, as the kernel needs to write to its page tables directly.

BrainDump PVMMU.
For a PV guest we virtualise the MMU operations (for PV this is the only supported mode). Here the guest is trusted with maintaining the P2M table (pseudo-physical frame number, pfn, to machine frame number, mfn), and Xen maintains the M2P table.

The M2P is a global table; all guests have read permission for this table, and updates to an entry (made via hypercall) are only allowed for the guest that owns the mfn. How can we know who owns an mfn? Each page associated with an mfn (a page-related structure inside Xen) has a pointer to its parent domain, so for every mfn we know its owner.
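A hedged Xen-side sketch of that ownership lookup (mfn_to_page() and page_get_owner() are Xen's own helpers; the wrapper is hypothetical and the exact signatures vary across Xen versions):

#include <xen/sched.h>
#include <asm/mm.h>

/* Every machine frame has a struct page_info whose owner field records
 * the domain the frame currently belongs to. */
static struct domain *sketch_mfn_owner(unsigned long mfn)
{
	struct page_info *pg = mfn_to_page(mfn);

	return page_get_owner(pg);	/* NULL if the frame is unowned */
}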

P2M: each guest maintains its own P2M table, and the guest is trusted by Xen to maintain the integrity of this table. When a guest boots, the toolstack requests a set of pages from Xen. Xen hands the pages over to the toolstack, the toolstack builds the P2M table and hands it over to the guest. The pages given by Xen to the toolstack are actual physical pages with mfns. Afterwards, the guest kernel is responsible for maintaining this P2M table.

Now, whenever the guest tries to modify the P2M, the change is audited by Xen (the guest can manipulate the P2M table only by making hypercalls to Xen): Xen checks whether the mfn belongs to the correct domain and whether the page mapped by the mfn is of the relevant type, and only then allows the guest to proceed.

If the guest needs more pages, the guest kernel makes a hypercall saying it wants a new page for a given pfn. Xen returns the mfn to the guest and updates the M2P table with the new mfn and the guest-provided pfn, and then the guest kernel updates its P2M table.
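A minimal sketch of that request, modeled on the Linux balloon driver's increase-reservation path (the function is hypothetical and error handling is trimmed; XENMEM_populate_physmap and set_phys_to_machine() are the real interfaces):

#include <linux/errno.h>
#include <xen/interface/memory.h>
#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/page.h>

static int sketch_populate_one_pfn(unsigned long pfn)
{
	xen_pfn_t frame = pfn;			/* in: the pfn we want backed */
	struct xen_memory_reservation res = {
		.nr_extents   = 1,
		.extent_order = 0,
		.domid        = DOMID_SELF,
	};

	set_xen_guest_handle(res.extent_start, &frame);

	/* Xen picks an mfn, records pfn<->mfn in its M2P, and writes the
	 * mfn back into 'frame'. */
	if (HYPERVISOR_memory_op(XENMEM_populate_physmap, &res) != 1)
		return -ENOMEM;

	/* The guest kernel is responsible for updating its own P2M. */
	if (!set_phys_to_machine(pfn, frame))
		return -EINVAL;

	return 0;
}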

The guest also maintains its own set of regular page tables: L4, L3, L2 and L1. The PTEs in L1 point to actual mfns. Yes, that is the whole purpose of PVMMU: the guest's L1 page table entries point directly to MFNs.

P2M is a software construct that the CPU never accesses; the guest keeps it to record which mfns the guest has access to. The regular page tables (L1, L2, L3 and L4) are maintained by the guest and contain the actual mfns. These page tables are accessed directly by the CPU, which loads the page table base address into the CR3 register (page tables are defined by the Intel architecture while the P2M table is not, which is why page tables are walked directly by the CPU and the P2M is not).
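A small sketch of what that means when the PV guest builds a PTE, using the standard Linux Xen helpers (the surrounding function is hypothetical):

#include <asm/pgtable.h>
#include <asm/xen/page.h>

static pte_t sketch_make_kernel_pte(unsigned long pfn)
{
	/* The P2M lookup is pure software; the CPU never reads the P2M. */
	unsigned long mfn = pfn_to_mfn(pfn);

	/* The PTE that lands in the real pagetable carries the mfn, and this
	 * is what the CPU walks once the pagetable is reachable from %cr3. */
	return mfn_pte(mfn, PAGE_KERNEL);
}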

Another important point to remember is that the guest maintaining the page tables does not mean the guest can modify them without the hypervisor's knowledge. Pages which are used as page tables are mapped read-only in the guest; the guest modifies them using hypercalls, and Xen audits every page table update made by the guest. (There are some RW pagetable mappings exposed to the guest, but for those the hypervisor uses a trap-and-emulate approach, which is slow.)
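A hedged sketch of such an audited update through the generic mmu_update hypercall (the machine address of the PTE slot and the new PTE value are placeholders supplied by the caller):

#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>

static int sketch_set_pte(uint64_t pte_machine_addr, uint64_t new_pte_val)
{
	struct mmu_update u = {
		/* The low bits of ptr select the command; the rest is the
		 * machine address of the PTE to update. */
		.ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE,
		.val = new_pte_val,
	};

	/* Xen audits the update and refuses, for example, to map a
	 * pagetable page writable. */
	return HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF);
}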

From linux/arch/x86/xen/mmu.c: "
* Xen allows guests to directly update the pagetable, in a controlled
* fashion. In other words, the guest modifies the same pagetable
* that the CPU actually uses, which eliminates the overhead of having
* a separate shadow pagetable.
*
* In order to allow this, it falls on the guest domain to map its
* notion of a "physical" pfn - which is just a domain-local linear
* address - into a real "machine address" which the CPU's MMU can
* use.
*
* A pgd_t/pmd_t/pte_t will typically contain an mfn, and so can be
* inserted directly into the pagetable. When creating a new
* pte/pmd/pgd, it converts the passed pfn into an mfn. Conversely,
* when reading the content back with __(pgd|pmd|pte)_val, it converts
* the mfn back into a pfn.
*
* The other constraint is that all pages which make up a pagetable
* must be mapped read-only in the guest. This prevents uncontrolled
* guest updates to the pagetable. Xen strictly enforces this, and
* will disallow any pagetable update which will end up mapping a
* pagetable page RW, and will disallow using any writable page as a
* pagetable.
*
* Naively, when loading %cr3 with the base of a new pagetable, Xen
* would need to validate the whole pagetable before going on.
* Naturally, this is quite slow. The solution is to "pin" a
* pagetable, which enforces all the constraints on the pagetable even
* when it is not actively in use. This means that Xen can be assured
* that it is still valid when you do load it into %cr3, and doesn't
* need to revalidate it."
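A hedged sketch of what the "pin" boils down to, roughly the operation xen_pgd_pin() issues through the multicall machinery (the multicall flush is exactly where the WARNING at multicalls.c:129 in the crash log fires when Xen rejects the pin; the wrapper function is hypothetical):

#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/page.h>

static int sketch_pin_l4_pagetable(unsigned long pgd_pfn)
{
	struct mmuext_op op = {
		.cmd      = MMUEXT_PIN_L4_TABLE,
		.arg1.mfn = pfn_to_mfn(pgd_pfn),	/* Xen wants the machine frame */
	};

	/* Fails if any entry would leave a pagetable page mapped writable,
	 * or if a page's type/reference counts are inconsistent. */
	return HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
}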

Also, a reference count is maintained for each page type, and the type of a page can only be changed once that reference count reaches 0.

Another point to remember is that for a PV guest, GFN maps directly to MFN (GFN = MFN) only for a single page, not for a contiguous range that spans multiple page boundaries. The guest can have a contiguous virtual address space that maps to GFNs via its page tables, but contiguity in GFN space does not imply contiguity in MFN space. If you write a program in the guest kernel requiring a contiguous address space of 8 KB (2 pages) or 12 KB (3 pages), the guest will successfully allocate this space, since it only needs to be contiguous in virtual address space. The problem is with hardware drivers that want contiguous physical (MFN) address space. For this, the guest makes a hypercall to Xen asking for a region that is contiguous in machine address space and also contiguous in GFN space. This region is called a bounce buffer (swiotlb).
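From a driver's point of view this usually stays hidden behind the DMA API; a minimal sketch, assuming dev is the device being probed and that xen-swiotlb is the active DMA backend in the PV domain:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Returns memory that is contiguous in both GFN and MFN space, suitable
 * for devices that need machine-contiguous buffers (2 pages = 8 KB, as in
 * the example above). */
static void *sketch_alloc_machine_contiguous(struct device *dev,
					     dma_addr_t *bus_addr)
{
	return dma_alloc_coherent(dev, 2 * PAGE_SIZE, bus_addr, GFP_KERNEL);
}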

Queries: anshul_makkar@justkernel.com
