Ray Of Hope

interrupt path xen

An interrupt is generated by the hardware.
Based on the static interrupt routing configuration provided by the ACPI tables at boot time, the corresponding link is selected.

The interrupt is delivered to the IO-APIC. Most modern IO-APICs convert the interrupt to an MSI, i.e. a memory write to a specific address in the LAPIC address space. Each interrupt is mapped to a specific IO-APIC link. (The LAPIC is mapped at a fixed address at 0xfe0000.., and the memory writes go to this address.) The LAPIC is basically inside the CPU chip nowadays.
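To make the "interrupt as a memory write" idea concrete, here is a minimal sketch of how an MSI message could be composed. It assumes the x86 MSI layout as I understand it from the Intel SDM (upper address bits 0xFEE, destination APIC ID in bits 19:12, vector in the low byte of the data). The function and constant names are mine, not Xen's.

```python
# Sketch of the x86 MSI message format. Layout assumed from the Intel SDM;
# the names below are illustrative and not taken from Xen.

MSI_ADDR_BASE = 0xFEE00000  # upper 12 bits select the LAPIC message window


def msi_address(dest_apic_id, redirection_hint=0, logical_mode=0):
    """Build the 32-bit address that the device (or IO-APIC) writes to."""
    if not 0 <= dest_apic_id <= 0xFF:
        raise ValueError("xAPIC destination ID must fit in 8 bits")
    return (MSI_ADDR_BASE
            | (dest_apic_id << 12)            # bits 19:12: destination APIC ID
            | ((redirection_hint & 1) << 3)   # bit 3: redirection hint
            | ((logical_mode & 1) << 2))      # bit 2: destination mode


def msi_data(vector, delivery_mode=0):
    """Build the data payload: vector in bits 7:0, delivery mode in bits 10:8."""
    if not 0x10 <= vector <= 0xFE:
        raise ValueError("vector outside the range usable for MSI")
    return ((delivery_mode & 0x7) << 8) | vector


addr = msi_address(dest_apic_id=2)
data = msi_data(vector=0x30)
print(hex(addr), hex(data))
```

The point is only that the target CPU and the vector are both encoded in an ordinary memory write, so no dedicated interrupt wire is needed.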

When the memory write happens, i.e. the MSI is raised, the CPU saves the current context (registers etc.), changes context and executes the interrupt handler.

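Somewhere in this picture the IDT (Interrupt Descriptor Table) also comes in: the CPU uses the interrupt vector to index the IDT, and each IDT entry names the code segment and offset of the handler to run.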
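The save / look up / run / restore sequence can be sketched as a toy model. This is not how hardware really stores things (a real CPU pushes the context onto a stack, and the IDT holds gate descriptors, not Python functions). The class and handler names are hypothetical.

```python
# Toy model of vectored interrupt delivery. All names are hypothetical;
# real hardware pushes the context onto a stack rather than copying a dict.

class ToyCpu:
    def __init__(self):
        self.regs = {"rip": 0x1000, "rsp": 0x8000, "rflags": 0x202}
        self.idt = {}    # vector -> handler, standing in for IDT gate entries
        self.log = []    # records which handlers ran

    def set_gate(self, vector, handler):
        """Install a handler for a vector, like writing an IDT entry."""
        self.idt[vector] = handler

    def deliver(self, vector):
        """Save context, look the vector up, run the handler, restore."""
        saved = dict(self.regs)       # save the interrupted context
        handler = self.idt[vector]    # the vector indexes the table
        handler(self)                 # execute the interrupt handler
        self.regs = saved             # return to the interrupted code


def timer_handler(cpu):
    cpu.regs["rip"] = 0xFFFF0000      # the handler runs at its own address
    cpu.log.append("timer")


cpu = ToyCpu()
cpu.set_gate(0x30, timer_handler)
cpu.deliver(0x30)
print(cpu.log, hex(cpu.regs["rip"]))
```

After `deliver` returns, the interrupted code sees its registers exactly as they were before the interrupt.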

Xen has separate stacks (different stack pointers) for NMI, MCE, double fault and other exceptions (8 pages are allocated for these stacks). What’s the reason?

Whenever there is a context switch (syscall or sysret, or a move from non-root mode to root mode, i.e. a vmexit), the hypervisor continues to run with the stack pointer still pointing to the user-mode stack. This is done to reduce the number of instructions during the context switch. It leaves a short “race window”: while the hypervisor is executing in root mode on the user-mode stack, an NMI, MCE or double fault can occur. At that point the guest can change the return address on the guest stack, and then the hypervisor may never return, or can be made to perform an illegal action.

To prevent this scenario, what is generally done is that on an NMI, MCE or double fault the stack pointer is moved to a new stack. The task state segment (TSS) struct holds pointers to the stacks for NMI, MCE, double fault and the other such exceptions. The new stack is always fresh. (This mechanism is defined in the Intel and AMD reference manuals.)
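A small sketch of the stack selection may help here. It models only the case where the interrupt arrives while the hypervisor is already in its most privileged mode, and it assumes the x86-64 Interrupt Stack Table convention as I understand it: each IDT gate carries an index into the TSS stack pointers, and index 0 means "stay on the current stack". The addresses and the index numbering are made up for illustration.

```python
# Toy model of stack selection on interrupt entry, for an event taken while
# already running in the most privileged mode. Addresses and index numbers
# are invented for illustration.

GUEST_RSP = 0x00007FFF0000  # stack pointer still under guest control

IST_NONE, IST_DF, IST_NMI, IST_MCE = 0, 1, 2, 3  # illustrative numbering


class Tss:
    def __init__(self, stack_tops):
        # Slot 0 is unused: a gate index of 0 means "do not switch stacks".
        self.ist = [None] + list(stack_tops)


tss = Tss([0xFFFF800000001000,   # double fault stack
           0xFFFF800000002000,   # NMI stack
           0xFFFF800000003000])  # MCE stack

# Which stack index each event's gate requests (hypothetical gate table).
gates = {
    "double_fault": IST_DF,
    "nmi": IST_NMI,
    "mce": IST_MCE,
    "device_irq": IST_NONE,
}


def stack_on_entry(current_rsp, event):
    """Return the stack pointer the handler for `event` starts running on."""
    index = gates[event]
    if index == IST_NONE:
        return current_rsp        # handler inherits whatever stack was live
    return tss.ist[index]         # handler gets a known-good stack


print(hex(stack_on_entry(GUEST_RSP, "nmi")))
print(hex(stack_on_entry(GUEST_RSP, "device_irq")))
```

In this model an NMI that lands inside the race window starts on the dedicated stack, while an ordinary interrupt without a stack index would have inherited the guest-controlled pointer.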

I then raised this question: in case of an NMI/MCE/double fault, XenServer always faults and raises the exception, so what is the use of switching to a new stack? To this, Andrew mentioned that we continuously take watchdog NMIs (we fire a watchdog NMI intermittently to get performance counters) and we never fault on those. In case of a hardware NMI we do fault, but switching to a new stack is a general practice that we follow to avoid the above security issue.

Note: Bare metal also behaves in a similar way and has separate stacks for each of these exceptions.

David informed me that we don’t support PCI passthrough in PV mode, as it involves a security issue. But in HVM mode we use QEMU to do the passthrough, and we don’t have the security issue there. Why? I need to ask Andrew.

I never believed that talk about theoretical concepts and designs alone is sufficient. Everything I say is backed by practical experience, even though it may be based on just small samples.

Corrupt the stack trace and make the processor perform a wrong action.

