JustKernel

Ray Of Hope

Broadwell + Xen + Classic guest kernel + WRITE_FAULT + SMAP Violation.

I recently hit an issue where classic guest kernels (2.6.*) were faulting on Broadwell hardware.
From day one I suspected SMAP, a feature that was new with Broadwell, but I had to prove it.

Approach:
I compiled the guest kernel with instrumentation, but could not install it in the usual way because the guest on Broadwell hangs during reboot. (I built the 32-bit kernel inside another 32-bit guest on a Haswell system.) So I had no way to boot the guest with my customized kernel.

Then I learned that xl lets you create a domain (domU) directly from your own customized kernel.

xl create -c vm32.cfg

vm32.cfg
name="example32"
loader="generic"
memory=2048
kernel="/root/bzImage"  # customized image with additional printks
ramdisk="/root/initramfs-2.6.32-504.el6.i686.img"  # the initramfs of any working OS image will do

Using the above command I was able to boot the domain with my customized kernel and the initramfs of a working 32-bit guest. The guest booted up to the point where it hung or faulted.

I printed the CR2 register in do_page_fault to get the faulting address, and regs->eip to get the address of the faulting instruction. The guest log, printed by my customized kernel, showed:

page fault = 0x3 addr=0x805f3a0 and ip = c0617233 gs=e0
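
The error code 0x3 in that log line can be decoded bit by bit. The following is a minimal sketch, assuming the standard x86 page-fault error-code layout; the bit names mirror the kernel's PF_* flags, while the helper `decode_pf_error` is a hypothetical name introduced only for illustration.

```python
# Page-fault error-code bits on x86. Names follow the kernel's PF_* flags.
# Each entry: (bit, name, meaning when set, meaning when clear or None).
PF_BITS = [
    (0, "PF_PROT", "protection violation on a present page", "page not present"),
    (1, "PF_WRITE", "write access", "read access"),
    (2, "PF_USER", "access made in user mode", "access made in supervisor mode"),
    (3, "PF_RSVD", "reserved bit set in a paging entry", None),
    (4, "PF_INSTR", "instruction fetch", None),
]


def decode_pf_error(code):
    """Return one human-readable line per meaningful error-code bit."""
    lines = []
    for bit, name, if_set, if_clear in PF_BITS:
        is_set = bool(code & (1 << bit))
        meaning = if_set if is_set else if_clear
        if meaning is not None:
            lines.append(f"{name}={int(is_set)}: {meaning}")
    return lines


if __name__ == "__main__":
    # 0x3 is the value from the guest log above.
    for line in decode_pf_error(0x3):
        print(line)
```

Read this way, 0x3 describes a write to a page that is present, performed in supervisor mode, which is consistent with a kernel write being refused on permission grounds.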

I used the xenctx command to get the stack trace of the hung guest. Run the following command on the host:
/usr/libexec/xen/bin/xenctx -s System.map-2.6.32-504.el6.i686 20  # 20 is the domain ID of the guest

Got the following trace.
ec83fc70: [] do_page_fault+0x48
ec83fc80: [] clear_user+0x43
ec83fc8c: [] irq_exit+0x35
ec83fc98: [] do_page_fault
ec83fca0: [] error_code+0x73
ec83fcd4: [] clear_user+0x43 <------------- clears user space, i.e. writes to a user-space address, which faults. It is reached from load_elf_binary.
ec83fcf0: [] padzero+0x24
ec83fcf8: [] load_elf_binary+0x5ad
ec83fd24: [] check_events+0x8
ec83fd34: [] xen_restore_fl_direct_end
ec83fd38: [] _spin_unlock_irqrestore+0x10

if (unlikely(fault_in_kernel_space(address))) // Not my case: the fault was not on a kernel-space address. If it had been, the kernel would simply have crashed.
if (user_mode_vm(regs)) {
    printk(KERN_INFO "__do_page_fault: user_mode_vm reg \n");
    local_irq_enable();
    error_code |= PF_USER;
}
else
{
    // For me this branch was taken.
    printk(KERN_INFO "__do_page_fault: !user_mode_vm reg \n");
    if (regs->flags & X86_EFLAGS_IF)
    {
So it became clear that the guest kernel faults when it tries to access user-space memory.
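
The two addresses from the guest log make the same point. The following is a small sketch, assuming the default 3G/1G split of a 32-bit kernel (PAGE_OFFSET = 0xC0000000); that constant and the helper `is_kernel_address` are assumptions for illustration, not values read from this guest.

```python
# Assumption: default 3G/1G address-space split of a 32-bit Linux kernel.
PAGE_OFFSET = 0xC0000000


def is_kernel_address(addr, page_offset=PAGE_OFFSET):
    """True if addr lies in the kernel half of the address space."""
    return addr >= page_offset


if __name__ == "__main__":
    fault_addr = 0x0805F3A0  # addr= from the guest log
    fault_ip = 0xC0617233    # ip= from the guest log
    print("faulting address in kernel space:", is_kernel_address(fault_addr))
    print("faulting instruction in kernel space:", is_kernel_address(fault_ip))
```

Under that assumption the faulting instruction is kernel code and the address it touched is a user-space address.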

It looked like a classic case of SMAP violation. But SMAP prevents ring 0 from accessing ring 3 memory, whereas a PV guest kernel executes in ring 1 and user space in ring 3, so SMAP should have no control over it. That was the open question.

Further, I also found:
if (!pte_write(entry)) // Is the write flag missing from the PTE? Mine had it set, so this branch was skipped.
{
    printk(KERN_INFO "handle_pte_fault: call do_wp_page \n");
    return do_wp_page(mm, vma, address,
                      pte, pmd, ptl, entry);
}
printk(KERN_INFO "handle_pte_fault: mkdirty \n");
entry = pte_mkdirty(entry); // Control came here. Sets the dirty bit in software: the page has been written to, so its contents must eventually be written back to backing storage.
}
printk(KERN_INFO "handle_pte_fault: mkyoung \n");
entry = pte_mkyoung(entry); // Sets the accessed (young) bit in software, for hardware that does not set it on its own.
if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
    printk(KERN_INFO "handle_pte_fault: update mmu cache \n");
    update_mmu_cache(vma, address, entry);
} else {
    if (flags & FAULT_FLAG_WRITE)
    {
        printk(KERN_INFO "handle_pte_fault: tlb_flush \n"); // Then control came here. The fault can be spurious: the PTE is correct and has the right permissions, but the TLB holds stale permissions. So flush the TLB entry.
        flush_tlb_page(vma, address);
    }
}
unlock:
}

According to the path taken above, the PTE already granted write permission on the user-space page, yet the kernel was still faulting when it tried to access it.

Then I traced the exact PTE and decoded it.
0x800000007fffe067: the faulting PTE in the guest kernel
Bit 0: present; the entry maps a 4 KB page.
Bit 1: writes are allowed to the page it references.
Bit 2: user-mode accesses are allowed.
Bit 3: page-level write-through is clear.
Bit 5: accessed bit is set.
Bit 6: dirty bit is set.
Bit 8: clear, so the translation is not global.
Bits 62:59: all clear.
Bit 63: set (execute-disable).
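
The PTE value 0x800000007fffe067 can also be decoded mechanically. This is a minimal sketch, assuming the standard x86 layout for a 64-bit-wide (PAE-style) entry that maps a 4 KB page; the helper `decode_pte` and the short flag names are hypothetical, chosen for illustration.

```python
# Flag bits of a 64-bit x86 page-table entry that maps a 4 KB page.
PTE_FLAGS = {
    0: "P",     # present
    1: "RW",    # writable
    2: "US",    # user-mode accessible
    3: "PWT",   # page-level write-through
    4: "PCD",   # page-level cache disable
    5: "A",     # accessed
    6: "D",     # dirty
    7: "PAT",   # page attribute table
    8: "G",     # global
    63: "NX",   # execute-disable
}

# Bits 51:12 hold the page-frame number.
FRAME_MASK = 0x000FFFFFFFFFF000


def decode_pte(pte):
    """Return (set of flag names that are set, page-frame number)."""
    flags = {name for bit, name in PTE_FLAGS.items() if (pte >> bit) & 1}
    frame = (pte & FRAME_MASK) >> 12
    return flags, frame


if __name__ == "__main__":
    flags, frame = decode_pte(0x800000007FFFE067)
    print("flags set:", sorted(flags))
    print("frame number:", hex(frame))
```

The entry is present, writable, and user-accessible, so nothing in the PTE itself explains a write fault.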

page fault code: 0x03

CR4 register: 0x2660. The SMAP bit is 0 here, although in Xen it is enabled. The guest kernel gets its view of SMAP from Xen, and Xen has not exposed it because the guest is a classic kernel that cannot handle SMAP.
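
The CR4 value can be checked the same way. The sketch below assumes the value is hexadecimal 0x2660 and uses the architectural CR4 bit positions (SMEP is bit 20, SMAP is bit 21); `decode_cr4` is a hypothetical helper and the table lists only a subset of the defined bits.

```python
# Subset of architectural CR4 bit positions on x86.
CR4_BITS = {
    0: "VME", 1: "PVI", 2: "TSD", 3: "DE", 4: "PSE", 5: "PAE", 6: "MCE",
    7: "PGE", 8: "PCE", 9: "OSFXSR", 10: "OSXMMEXCPT", 13: "VMXE",
    14: "SMXE", 16: "FSGSBASE", 17: "PCIDE", 18: "OSXSAVE",
    20: "SMEP", 21: "SMAP",
}


def decode_cr4(value):
    """Return the names of the known CR4 bits set in value, lowest bit first."""
    return [name for bit, name in sorted(CR4_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    enabled = decode_cr4(0x2660)
    print("CR4 bits set:", enabled)
    print("SMAP visible to the guest:", "SMAP" in enabled)
```

Neither SMEP nor SMAP appears in the guest's view of CR4, which matches the observation that Xen does not expose SMAP to this guest.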

Then I noted the following in the Intel Software Developer's Manual, Section 4.6.1, "Determination of Access Rights":

Every access to a linear address is either a supervisor-mode access or a user-mode access. For all instruction fetches and most data accesses, this distinction is determined by the current privilege level (CPL): accesses made while CPL < 3 are supervisor-mode accesses, while accesses made while CPL = 3 are user-mode accesses.

A 32-bit PV guest kernel executing in ring 1 (supervisor mode) tries to access user-space memory (ring 3 memory) and thus faults due to an SMAP violation. From Xen's point of view it is a perfectly legal access, because when creating the page tables for the domU, Xen sets the user bit, which allows the guest kernel to access user-mode memory.

Thanks,
Anshul Makkar
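
As a postscript, the access-rights rule quoted from the manual can be modeled in a few lines. This is a simplified sketch, not the full architectural check: it ignores implicit supervisor accesses and every other protection mechanism, and the function names are hypothetical. Note that `cr4_smap` here means the value in the real hardware CR4, which Xen controls, not the guest's virtualized view.

```python
def is_supervisor_access(cpl):
    """Per SDM 4.6.1: accesses made while CPL < 3 are supervisor-mode accesses."""
    return cpl < 3


def smap_blocks(cpl, page_user_bit, cr4_smap, eflags_ac=False):
    """True if SMAP would make this explicit data access fault.

    Simplified model: a supervisor-mode data access to a page marked
    user-accessible faults when SMAP is on and EFLAGS.AC is clear.
    """
    return (
        cr4_smap
        and is_supervisor_access(cpl)
        and page_user_bit
        and not eflags_ac
    )


if __name__ == "__main__":
    # 32-bit PV guest kernel: ring 1, user bit set in the PTE, SMAP on in hardware.
    print("ring 1 kernel blocked:", smap_blocks(cpl=1, page_user_bit=True, cr4_smap=True))
    # Guest user space itself runs in ring 3 and is unaffected.
    print("ring 3 process blocked:", smap_blocks(cpl=3, page_user_bit=True, cr4_smap=True))
```

In this model ring 1 is treated exactly like ring 0, which is why the PV guest kernel trips over SMAP even though it never runs in ring 0.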
